Dominic Szablewski, @phoboslab

— Thursday, February 2nd 2017

Decode It Like It's 1999

A few years ago I started to work on an MPEG1 Video decoder, completely written in JavaScript. Now, I finally found the time to clean up the library, improve its performance, make it more error resilient and modular and add an MP2 Audio decoder and MPEG-TS demuxer. This makes this library not just an MPEG decoder, but a full video player.

In this blog post I want to talk a bit about the challenges and various interesting bits I discovered during the development of this library. You'll find a demo, the source and documentation, and reasons to use JSMpeg over on the official website:

jsmpeg.com - Decode it like it's 1999

Refactoring

Recently, I needed to implement audio streaming into JSMpeg for a client and only then realized what a pitiful state the library was in. It has grown quite a bit since its first release. A WebGL renderer, WebSocket client, progressive loading, benchmarking facilities and much more have been tacked on in the last few years. All kept in a single, monolithic class with conditionals bursting at the seams.

I decided to clean up this mess first by separating its logical components. I also sketched out what would be needed for the sound implementation: a Demuxer, MP2 decoder and Audio Output:

Sources: AJAX, progressive AJAX and WebSocket
Demuxers: MPEG-TS (Transport Stream)
Decoders: MPEG1 Video & MP2 Audio
Renderers: Canvas2D & WebGL
Audio Output: WebAudio

Plus some auxiliary classes:

A Bit Buffer, managing the raw data
A Player, orchestrating the other components

Each of the components (apart from the Sources) has a .write(buffer) method to feed it with data. These components can then "connect" to a destination that receives the processed result. The complete flow through the library looks like this:

                 / -> MPEG1 Video Decoder -> Renderer
Source -> Demuxer  
                 \ -> MP2 Audio Decoder -> Audio Output

JSMpeg currently has 3 different implementations for the Source (AJAX, AJAX progressive and WebSocket) and there are 2 different Renderers (Canvas2D and WebGL). The rest of the library is agnostic to these – i.e. the Video Decoder doesn't care about the Renderer's internals. With this approach it's easy to add new components: further Sources, Demuxers, Decoders or Outputs.

I'm not completely happy with how these connections work in the library. Each component can only have one destination (apart from the Demuxer, which has one destination per stream). It's a tradeoff. In the end, I felt that anything else would be over-engineering and complicating the library for no good reason.
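To make the write/connect idea concrete, here's a minimal sketch of one branch of that flow. The Component class and everything in it is made up for illustration; only the .write(buffer) method and the idea of connecting to a single destination come from the description above, so don't read this as JSMpeg's actual API.

```javascript
// Hypothetical stand-in for a JSMpeg component (not the library's real class)
function Component(name, process) {
    this.name = name;
    this.process = process;   // turns incoming data into a result
    this.destination = null;  // exactly one destination per component
}

Component.prototype.connect = function(destination) {
    this.destination = destination;
};

Component.prototype.write = function(buffer) {
    var result = this.process(buffer);
    if (this.destination) {
        this.destination.write(result);
    }
};

// Wire up one branch: Decoder -> Renderer
var rendered = [];
var renderer = new Component('renderer', function(frame) {
    rendered.push(frame);
    return frame;
});
var decoder = new Component('decoder', function(buffer) {
    return 'frame(' + buffer + ')';
});

decoder.connect(renderer);
decoder.write('packet-1');
// rendered now holds the single entry 'frame(packet-1)'
```

Because the decoder only ever calls write() on whatever it's connected to, swapping in a different renderer means changing one connect() call.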

WebGL Rendering

One of the most computationally intensive tasks for an MPEG1 decoder is the color conversion from MPEG's internal YUV format (Y'CbCr to be precise) into RGBA so that the browser can display it. Somewhat simplified, the conversion looks like this:

for (var i = 0; i < pixels.length; i += 4) {
    var y, cb, cr; /* fetch these from the YUV buffers */

    pixels[i + 0 /* R */] = y + (cb + ((cb * 103) >> 8)) - 179;
    // Both chroma terms are subtracted, so they are grouped in parentheses
    pixels[i + 1 /* G */] = y - (((cr * 88) >> 8) - 44 + ((cb * 183) >> 8) - 91);
    pixels[i + 2 /* B */] = y + (cr + ((cr * 198) >> 8)) - 227;
    pixels[i + 3 /* A */] = 255;
}

For a single 1280x720 video frame that loop has to run 921600 times to convert all pixels from YUV to RGBA. Each pixel needs 3 writes to the destination RGB array (we can pre-populate the alpha component since it's always 255). That's 2.7 million writes per frame, each needing 5-8 adds, subtracts, multiplies and bit shifts. For a 60fps video, we end up with more than 1 billion operations per second. Plus the overhead for JavaScript. The fact that JavaScript can do this, that a computer can do this, still boggles my mind.
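For anyone who wants to play with this, here's a runnable sketch of the whole conversion for a tiny frame. It makes a few assumptions that aren't spelled out above: the function and plane names are invented, the planes are laid out as 4:2:0 (one chroma sample per 2x2 block of luma samples), and the green term is grouped so that both chroma contributions are subtracted. Alpha is pre-populated once, so the loop only writes R, G and B.

```javascript
// Sketch only: names and plane layout are assumptions for this example
function yuvToRgba(planeY, planeCb, planeCr, width, height) {
    var rgba = new Uint8ClampedArray(width * height * 4);
    rgba.fill(255); // pre-populate alpha; the loop below never touches it

    var chromaWidth = (width + 1) >> 1;
    for (var row = 0; row < height; row++) {
        for (var col = 0; col < width; col++) {
            var y = planeY[row * width + col];
            var c = (row >> 1) * chromaWidth + (col >> 1);
            var cb = planeCb[c];
            var cr = planeCr[c];

            var i = (row * width + col) * 4;
            rgba[i + 0] = y + (cb + ((cb * 103) >> 8)) - 179;
            rgba[i + 1] = y - (((cr * 88) >> 8) - 44 + ((cb * 183) >> 8) - 91);
            rgba[i + 2] = y + (cr + ((cr * 198) >> 8)) - 227;
        }
    }
    return rgba;
}

// A 2x2 mid-gray frame: neutral chroma (128) leaves the luma value untouched
var gray = yuvToRgba(
    new Uint8Array([128, 128, 128, 128]),
    new Uint8Array([128]),
    new Uint8Array([128]),
    2, 2
);
// every pixel in gray is [128, 128, 128, 255]
```

The gray frame is a handy sanity check: if a sign or an offset is wrong in any of the three formulas, neutral chroma no longer maps back to R = G = B = Y.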

With WebGL, this color conversion (and subsequent displaying on the screen) can be sped up tremendously. A few operations for each pixel is the bread and butter of GPUs. GPUs can process many pixels in parallel, because each pixel is independent of every other pixel. The WebGL shader that's run on the GPU doesn't even need these pesky bit shifts – GPUs like floating point numbers:

void main() {
    float y = texture2D(textureY, texCoord).r;
    float cb = texture2D(textureCb, texCoord).r - 0.5;
    float cr = texture2D(textureCr, texCoord).r - 0.5;

    gl_FragColor = vec4(
        y + 1.4 * cb,
        y + -0.343 * cr - 0.711 * cb,
        y + 1.765 * cr,
        1.0
    );
}

With WebGL, the time needed for the color conversion dropped from 50% of the total JS time to just about 1% for the YUV texture upload.

There was one minor issue I stumbled over with the WebGL renderer. JSMpeg's video decoder does not produce a Uint8Array for each of the three color planes, but a Uint8ClampedArray. It does this because the MPEG1 standard mandates that decoded color values must be clamped, not wrapped around. Letting the browser do the clamping through the ClampedArray works out faster than doing it in JavaScript.
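The difference between the two array types is easy to see with a couple of out-of-range values. This is plain typed-array behavior, nothing specific to JSMpeg:

```javascript
var wrapped = new Uint8Array(2);
var clamped = new Uint8ClampedArray(2);

wrapped[0] = 300; wrapped[1] = -5;
clamped[0] = 300; clamped[1] = -5;

// wrapped is [44, 251]: values wrap around modulo 256
// clamped is [255, 0]:  values are clamped to the 0..255 range
```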

A bug that still stands in some browsers (Chrome and Safari) prevents WebGL from using the Uint8ClampedArray directly. Instead, for these browsers we have to create a Uint8Array view for each array for each frame. This operation is pretty fast since nothing needs to be copied, but I'd still like to do without it.
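Creating such a view looks roughly like this. The variable names are placeholders; the point is that both arrays share the same underlying ArrayBuffer, so no pixel data is copied:

```javascript
var clampedPlane = new Uint8ClampedArray([10, 20, 30, 40]);

// A Uint8Array over the same memory: no bytes are copied
var view = new Uint8Array(
    clampedPlane.buffer, clampedPlane.byteOffset, clampedPlane.byteLength
);

clampedPlane[0] = 99;
// view[0] is now 99 as well, since both arrays share memory
```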

JSMpeg detects this bug and only uses the workaround if needed. We simply try to upload a clamped array and check whether WebGL reports an error. This detection sadly triggers an unsilenceable warning in the console, but it's better than nothing.

WebGLRenderer.prototype.allowsClampedTextureData = function() {
    var gl = this.gl;
    var texture = gl.createTexture();

    gl.bindTexture(gl.TEXTURE_2D, texture);
    gl.texImage2D(
        gl.TEXTURE_2D, 0, gl.LUMINANCE, 1, 1, 0,
        gl.LUMINANCE, gl.UNSIGNED_BYTE, new Uint8ClampedArray([0])
    );
    return (gl.getError() === 0);
};

WebAudio for Live Streaming

For the longest time I assumed that in order to feed WebAudio with raw PCM sample data without much latency or pops and cracks, you'd have to use a ScriptProcessorNode. You'd copy your decoded sample data just in time whenever you get the callback from the script processor. It works. I tried it. It needs quite a bit of code to function properly and of course it's computationally intensive and inelegant.

Luckily, my initial assumption was wrong.

The WebAudio Context maintains its own timer that's separate from JavaScript's Date.now() or performance.now(). Further, you can instruct your WebAudio sources to start() at a precise time in the future based on the context's time. With this, you can string very short PCM buffers together without any artefacts.

You only have to calculate the start time for the next buffer by continuously adding the duration of all previous ones. It's important to always use the WebAudio Context's own time for this.

var currentStartTime = 0;

function playBuffer(buffer) {
    var source = context.createBufferSource();
    /* load buffer, set destination etc. */

    var now = context.currentTime;
    if (currentStartTime < now) {
        currentStartTime = now;
    }

    source.start(currentStartTime);
    currentStartTime += buffer.duration;
}

There's a caveat though: I needed to get the precise remaining duration of the enqueued audio. I implemented it simply as the difference between the current time and the next start time:

// Don't do that!
var enqueuedTime = (currentStartTime - context.currentTime);

It took me a while to figure it out, but this doesn't work. You see, the context's currentTime is only updated every so often. It's not a precise real time value.

var t1 = context.currentTime;
doSomethingForAWhile();
var t2 = context.currentTime;

t1 === t2; // true

So, if you need the precise audio play position (or anything based on it), you have to fall back to JavaScript's performance.now().
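One way to do that is to keep a second, wall-clock based end time alongside the WebAudio start time. The sketch below is an assumption about how this could look, not a copy of the library's code; the EnqueuedTimeTracker name is invented, and the clock is passed in so the example runs without a browser. In a page you'd pass something like function() { return performance.now() / 1000; }.

```javascript
// Hypothetical helper; `now` must return the current time in seconds
function EnqueuedTimeTracker(now) {
    this.now = now;
    this.wallclockEndTime = 0; // when the queued audio will run out
}

// Call this whenever a buffer is scheduled with source.start()
EnqueuedTimeTracker.prototype.enqueue = function(duration) {
    var t = this.now();
    if (this.wallclockEndTime < t) {
        this.wallclockEndTime = t; // the queue ran dry; start from "now"
    }
    this.wallclockEndTime += duration;
};

EnqueuedTimeTracker.prototype.getEnqueuedTime = function() {
    return Math.max(this.wallclockEndTime - this.now(), 0);
};

// Usage with a fake clock
var fakeTime = 10;
var tracker = new EnqueuedTimeTracker(function() { return fakeTime; });
tracker.enqueue(0.5);
tracker.enqueue(0.5);
fakeTime = 10.25;
var remaining = tracker.getEnqueuedTime(); // 0.75 seconds still queued
```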

Audio Unlocking on iOS

You gotta love the shit that Apple throws into web devs' faces from time to time. One of those things is the need to unlock audio on a page before you can play anything. Basically, audio playback can only be started as a response to a user action. You click on a button, audio plays.

This makes sense. I won't argue against it. You don't want to have audio blaring at you unannounced when you visit a page.

What makes it shitty is that Apple provided neither a way to cleanly unlock audio nor a way to ask the WebAudio Context if it's unlocked already. What you do instead is play an audio source and continually check if it's progressing. You can't check immediately after playing, though. No, no. You have to wait a bit!

WebAudioOut.prototype.unlock = function(callback) {
    // This needs to be called in an onclick or ontouchstart handler!
    this.unlockCallback = callback;

    // Create empty buffer and play it
    var buffer = this.context.createBuffer(1, 1, 22050);
    var source = this.context.createBufferSource();
    source.buffer = buffer;
    source.connect(this.destination);
    source.start(0);

    setTimeout(this.checkIfUnlocked.bind(this, source, 0), 0);
};

WebAudioOut.prototype.checkIfUnlocked = function(source, attempt) {
    if (
        source.playbackState === source.PLAYING_STATE || 
        source.playbackState === source.FINISHED_STATE
    ) {
        this.unlocked = true;
        this.unlockCallback();
    }
    else if (attempt < 10) {
        // Jeez, what a shit show. Thanks iOS!
        setTimeout(this.checkIfUnlocked.bind(this, source, attempt+1), 100);
    }
};

Progressive Loading via AJAX

Say you have a 50MB video file that you load via AJAX. The video starts loading no problem. You can even check the current progress (downloaded vs. total bytes) and display a nice loading animation. What you cannot do is access the already downloaded data while the rest of the file is still loading.

There have been some proposals for adding chunked ArrayBuffers into XMLHttpRequest, but nothing has been implemented across browsers. The newer fetch API (that I still don't understand the purpose of) proposed some similar features, but again: no cross browser support. However, we can still do the chunked downloading in JavaScript using Range-Requests.

The HTTP standard defines a Range header that allows you to grab only part of a resource. If you just need the first 1024 bytes of a big file, you set the header Range: bytes=0-1023 in your request (byte ranges are inclusive on both ends). Before we can start though, we have to figure out how large the file is. We can do this with a HEAD request, instead of a GET. This returns only the HTTP headers for the resource, but none of the body bytes. Range-Requests are supported by almost all HTTP servers. The one exception I know of is PHP's built-in development server.

JSMpeg's default chunk size for downloading via AJAX is 1MB. JSMpeg also appends a custom GET parameter to the URL (e.g. video.ts?0-1024) for each request, so that each chunk essentially gets its own URL and plays nicely with bad caching proxies.
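Splitting a file into Range headers is just a bit of arithmetic. Here's a small sketch; chunkRanges is a made-up helper, not part of JSMpeg. In a real loader the file size would come from the Content-Length header of the HEAD response, and each string would be passed to xhr.setRequestHeader('Range', ...).

```javascript
// Hypothetical helper: split `fileSize` bytes into Range header values.
// HTTP byte ranges are inclusive on both ends.
function chunkRanges(fileSize, chunkSize) {
    var ranges = [];
    for (var start = 0; start < fileSize; start += chunkSize) {
        var end = Math.min(start + chunkSize, fileSize) - 1;
        ranges.push('bytes=' + start + '-' + end);
    }
    return ranges;
}

var ranges = chunkRanges(2500, 1024);
// ['bytes=0-1023', 'bytes=1024-2047', 'bytes=2048-2499']
```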

With this in place, you can start playing the file as soon as the first chunk has arrived. Also, further chunks will only be downloaded when they're needed. If someone only watches the first few seconds of a video, only those first few seconds will get downloaded. JSMpeg does this by measuring the time it took to load a chunk, adding a lot of safety margin and comparing this to the remaining duration of the already loaded chunks.

In JSMpeg, the Demuxer splits streams as fast as it can. It also decodes the presentation time stamp (PTS) for each packet. The video and audio decoders however only advance their play position in real-time increments. The difference between the last demuxed PTS and the decoder's current PTS is the remaining play time for the downloaded chunks. The Player periodically calls the Source's resume() method with this headroom time:

// It's a silly estimate, but it works
var worstCaseLoadingTime = lastChunkLoadingTime * 8 + 2;
if (worstCaseLoadingTime > secondsHeadroom) {
    loadNextChunk();
}
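Putting the headroom calculation and the estimate together, the decision could be sketched as one function. The function and parameter names here are invented for the example; all values are in seconds.

```javascript
// Hypothetical sketch combining the headroom and the loading estimate
function shouldLoadNextChunk(lastDemuxedPts, decoderPts, lastChunkLoadingTime) {
    var secondsHeadroom = lastDemuxedPts - decoderPts;
    var worstCaseLoadingTime = lastChunkLoadingTime * 8 + 2;
    return worstCaseLoadingTime > secondsHeadroom;
}

shouldLoadNextChunk(30, 12, 0.5); // false: 18s buffered, 6s worst case
shouldLoadNextChunk(30, 27, 0.5); // true:  3s buffered, 6s worst case
```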

Audio & Video Sync

JSMpeg tries to play audio as smoothly as possible. It doesn't introduce any gaps or compressions when queuing up samples. Video playback orients itself on the audio playback position. It's done this way, because even the tiniest gaps or discontinuities are far more perceptible in audio than in video. It's far less jarring if a video frame is a few milliseconds late or dropped.

For the most part, JSMpeg relies on the presentation time stamp (PTS) of the MPEG-TS container for playback, instead of calculating the playback time itself. This means the PTS values in the MPEG-TS file have to be consistent and accurate. From what I gathered from the internet, this is not always the case. But modern encoders seem to have figured this out.

One complication was that the PTS doesn't always start at 0. For instance, if you have a WebCam connected and running for a while, the PTS may be the start time when the WebCam was turned on, not when recording started. Therefore, JSMpeg searches for the first PTS it can find and uses that as the global start time for all streams.
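The idea of a global start time boils down to remembering the first PTS and subtracting it from everything that follows. A minimal sketch, with PtsClock being an invented name rather than anything from the library:

```javascript
// Hypothetical sketch: the first PTS seen on any stream becomes time zero
function PtsClock() {
    this.startTime = null;
}

PtsClock.prototype.toPlaybackTime = function(pts) {
    if (this.startTime === null) {
        this.startTime = pts;
    }
    return pts - this.startTime;
};

var clock = new PtsClock();
clock.toPlaybackTime(3600.5);  // 0
clock.toPlaybackTime(3600.75); // 0.25
```

Sharing one clock between the audio and video streams keeps their relative offset intact, which is what matters for sync.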

The MPEG1 and MP2 decoders also keep track of every PTS they receive, along with the buffer position of each PTS. With this, we can seek through the audio and video streams to a specific time.
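A lookup over such a list might look like the following. The data layout and the findBufferIndex name are assumptions made for this example; a linear scan is used for clarity, though a binary search would do the same job on long streams.

```javascript
// Hypothetical index: one entry per PTS, in stream order
var timestamps = [
    { time: 0.00, index: 0 },
    { time: 0.04, index: 4096 },
    { time: 0.08, index: 9216 },
    { time: 0.12, index: 13312 }
];

// Return the buffer position of the last entry at or before `time`
function findBufferIndex(timestamps, time) {
    var result = timestamps.length ? timestamps[0].index : 0;
    for (var i = 0; i < timestamps.length; i++) {
        if (timestamps[i].time > time) {
            break;
        }
        result = timestamps[i].index;
    }
    return result;
}

findBufferIndex(timestamps, 0.09); // 9216
```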

Currently, JSMpeg will happily seek to an inter-frame and decode it on top of the previously decoded frame. The correct way to handle this, would be to rewind to the last intra-frame before the one we seek to and decode all frames in between. This is something I still need to fix.
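For illustration, the rewind could be sketched like this. It's a simplification under stated assumptions: the frame list and the framesToDecode name are made up, and only intra ('I') and predicted ('P') frames are considered, so B-frames and their reordering are ignored.

```javascript
// Hypothetical frame list; 'I' marks intra-frames, 'P' inter-frames
var frames = [
    { time: 0.00, type: 'I' },
    { time: 0.04, type: 'P' },
    { time: 0.08, type: 'P' },
    { time: 0.12, type: 'I' },
    { time: 0.16, type: 'P' },
    { time: 0.20, type: 'P' }
];

// Returns the frames that have to be decoded so that the frame at index
// `target` is built on a complete picture instead of stale data.
function framesToDecode(frames, target) {
    var start = target;
    while (start > 0 && frames[start].type !== 'I') {
        start--;
    }
    return frames.slice(start, target + 1);
}

var needed = framesToDecode(frames, 5);
// needed covers frames 3, 4 and 5: the intra-frame plus two inter-frames
```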

Build Tools & The JavaScript Ecosystem

I avoid build tools wherever I can. Chances are, your shiny toolset that automates everything for you, will stop working in a year or two. Setting up your build environment is never as easy as "just call webpack", or use grunt or whatever task runner is the hot shit today. It always ends up like

(...) Where do I get webpack from? Oh, I need npm.
Where do I get npm from? Oh, I need nodejs.
Where do I get nodejs from? Oh, I need homebrew.
What's that? gyp build error? Oh, sure, I need to install Xcode.
Oh, webpack needs the babel plugin?
What? The left-pad dependency could not be resolved?
...

And suddenly you spent two hours of your life and downloaded several GB of tools. All to build a 20kb library, for a language that doesn't even need compiling. How do I build this library 2 years from now? 5 years?

I had a thorough look at webpack and hated it. It's way too complex for my taste. I like to understand what's going on. That's part of the reason I wrote this library instead of diving into WebRTC.

So, the build step for JSMpeg is a shell script with a single call to uglifyjs that can be altered to use cat (or copy on Windows) in 2 seconds. Or you simply load the source files separately in your HTML while you're working on it. Done.

Quality, Bitrates And The Future

The quality of MPEG1 at reasonable bitrates is, much to my surprise, not bad at all. Have a look at the demo video on jsmpeg.com - granted, it's a favorable case for compression: slow movement and not too many cuts. Still, this video weighs in at 50MB for its 4 minutes, and provides a quality comparable to most YouTube videos that are "only" 30% smaller.

In my tests, I could always get video that I'd consider "high quality" at max 2Mbit/s. Depending on your use-case (want a coffee cam?), you can go to 100Kbit/s or even lower. There's no bottom limit for the bitrate/framerate.

You could get a cheap cell phone contract with a 1GB/month data limit, put a 3G dongle and a webcam on a Raspberry Pi, attach it to a 12 V automotive battery, throw it on your crop field and get a live weather cam that doesn't need any infrastructure or maintenance for a few years and is viewable in your smartphone's browser without installing anything.

The simplicity of MPEG1, compared to modern codecs, makes it very attractive in my opinion. It's well understood and there's a ton of tools that can work with it. All patents relating to MPEG1/MP2 have expired now. It's a free format.

Do you remember the GIF revival after its patents expired?