PHOBOSLAB

Underrun – Making Of

I participated in this year's js13kGames, a JavaScript game development competition with a file size limit of 13kb, including code, assets and everything else. My entry was Underrun, a twin stick shooter using WebGL.

Play Underrun – A WebGL shooter in 13kb of JavaScript

For this competition I set out to produce something with a dense atmosphere – which is inherently difficult to do with so little space. In my experience a dense atmosphere is most easily achieved with two things: moody lighting and rich sound & music. The assets for sound & music usually take up a lot of space, and lighting for 2D games requires more and bigger graphic assets. With only 13kb of space I had to look for different solutions for these things.

Graphic Assets

Since I was very limited in the amount of graphic assets I could fit in this game, the decision to implement a 3D perspective came naturally: atmospheric lighting is easier to do in 3D than in 2D and requires fewer assets to look neat. In contrast, producing interesting light effects in 2D typically requires you to implement a third dimension anyway. This can be done through normal maps, as some 2D pixel art games do, or by explicitly separating your environment into different layers, as Teleglitch does for example.

My approach for the graphics was to render a simple 2D tilemap with a bunch of textured quads. All walls are rendered as cubes and entities (the player, enemies, projectiles, particles and health pickups) are just simple sprites. All textures for the game fit in a single tile sheet, 2.12kb in size.

All graphic assets for the game, carefully tuned to 16 colors

I opted to render the player character as a sprite as well, even though it needs to be freely rotated. It was far easier to produce 8 different sprites – one for each direction – than to implement a loader and renderer for complex 3D models. This allowed me to omit any 3D vector or matrix math operations. The game never needs to rotate any geometry. Everything is rendered with axis aligned quads.

To make the player character rotation look convincing I built a 3D model and simply took screenshots of each of the different perspectives. I'm a total doofus when it comes to 3D modeling software, but that doesn't stop me from using the ingenious Wings3D. The result doesn't look like much, but scaled down to 16px it doesn't matter.

The player character model built in Wings3D

The final game features 3 different levels, each 64×64 tiles in size. When I started to work on the game I considered using Run Length Encoding to compress the level maps. However, as simple as RLE is, a decompressor for it would have taken up some valuable space. So my solution was to just let the browser handle the decompression by using PNG images.

Underrun's levels are stored as PNG images

With the PNG compression, each level image is just about 300 bytes in size. A naive approach would have needed 64×64 bytes = 4kb per level (assuming storage as raw binary, 1 byte per tile).

When loading the level a random number generator (RNG) is used to determine exactly which tile graphic to use for each floor and wall. The RNG also decides where to place enemies and powerups, as these are not encoded in the level images. To make this consistent between playthroughs, so that each player plays exactly the same game, I implemented a tiny seedable RNG (source code) and re-seeded it with the same constant before loading each level.
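A generator in the spirit of what's described might look like this. This is an illustrative sketch (a mulberry32-style mixer), not Underrun's actual implementation – the linked source has that:

```javascript
// A minimal seedable RNG (mulberry32-style) - small enough for a 13kb game.
// Illustrative sketch, not Underrun's actual generator.
var rng_state = 0;

function rand_seed(seed) {
    rng_state = seed;
}

// Returns a float in [0, 1); the same seed always yields the same sequence.
function rand() {
    rng_state = (rng_state + 0x6D2B79F5) | 0;
    var t = rng_state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
}

// Returns an integer in [min, max]
function rand_int(min, max) {
    return min + Math.floor(rand() * (max - min + 1));
}
```

Re-seeding with the same constant before each level load makes tile variations and enemy placement deterministic across playthroughs.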

I also hardcoded some of the wall tiles to be rendered with twice the normal height to make it look more interesting. You can see the complete level loading routines in the load_level() function.

Rendering

The renderer follows an extremely simple design: a single buffer is used to store all vertices, texture coordinates and normals. A second buffer is used to store all light sources (position, color, attenuation). The game can write to these buffers using the push_quad() and push_light() functions. The renderer clears these buffers at the beginning of each frame and draws the (filled) buffers at the end of the frame.

There's one small optimization to this. Since all the level geometry is static, the game sets a reset mark for the geometry buffer during level load, after all the level geometry has been pushed. Before each frame the renderer then resets the write position for the geometry buffer to this reset mark instead of to 0. This clears all the entities, but we don't have to re-push the level data.

Since the game's graphics are quite simple, there's no need for any occlusion culling or other optimizations to reduce the amount of geometry being drawn. Underrun draws the whole level – all floor and ceiling tiles and all sprites – every single frame, in a single draw call.

The gist of the renderer looks like this:

var 
    num_verts = 0,
    level_num_verts = 0,
    max_verts = 1024 * 64,
    buffer_data = new Float32Array(max_verts * 8); // 8 floats per vert


// Push a new quad (2 triangles, 6 vertices) into the buffer at the
// current num_verts position.
function push_quad(x1, y1, z1, ...) {
    buffer_data.set([x1, y1, z1, ...], num_verts * 8);
    num_verts += 6;
}

// Load a level, push some quads.
function load_level(data) {
    num_verts = 0;

    for (var y = 0; y < 64; y++) {
        for (var x = 0; x < 64; x++) {
            // lots of push_quad()...
        }
    }

    level_num_verts = num_verts; // set reset pos to current write pos
}

// Resets the buffer write position to just after the level geometry.
function renderer_prepare_frame() {
    num_verts = level_num_verts;
}

// Hand over the buffer data, up to num_verts, to webgl.
function renderer_end_frame() {
    gl.bufferData(gl.ARRAY_BUFFER, buffer_data, gl.DYNAMIC_DRAW);
    gl.drawArrays(gl.TRIANGLES, 0, num_verts);
};

// Run a frame
function tick() {
    renderer_prepare_frame();

    for (var i = 0; i < entities.length; i++) {
        entities[i].update();
        entities[i].draw();
    }

    renderer_end_frame();

    requestAnimationFrame(tick);
}

For the lighting I opted for some very simple vertex lights. Since the game only renders relatively small quads, it doesn't matter much that the light computation is handled in the vertex shader instead of per-pixel in the fragment shader. This also allowed for many more light sources. The game currently allows for 32 lights, each of which is considered for every single vertex. Current GPUs are so stupidly fast that no optimizations are needed for such simple cases.

The light calculation in Underrun's Vertex Shader looks somewhat like this:

// 7 values for each light:  position (x, y, z), color (r, g, b), attenuation
uniform float lights[7 * max_lights];
varying vec3 vertex_light;

void main(void){
    vertex_light = vec3(0.3, 0.3, 0.6); // ambient color

    // apply each light source to this vertex
    for(int i = 0; i < max_lights; i++ ) {
        vec3 light_position = vec3(lights[i*7], lights[i*7+1], lights[i*7+2]);
        vec3 light_color = vec3(lights[i*7+3], lights[i*7+4], lights[i*7+5]);
        vec3 light_vector = light_position - vertex_position;
        float dist = length(light_vector);
        float attenuation = 1.0 / (lights[i*7+6] * dist);

        vertex_light += 
            light_color
            * max(dot(normal, normalize(light_vector)), 0.0) // diffuse
            * attenuation;
    }

    // ...
}

For some extra scary atmosphere the fragment shader calculates a simple black fog based on the distance to the camera. An exception is made to the enemies' eyes: all full red texture pixels are always rendered unchanged – no light, no fog. This makes them essentially glow in the dark.

The fragment shader also reduces the final image to 256 colors. I experimented with some color tinting but ultimately settled on a uniform RGB palette. Rendering in 256 colors at a screen size of just 320×180 pixels not only gives the game an old-school vibe, but also hides a lot of the graphical shortcomings.
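Reducing the output to 256 colors amounts to quantizing each channel to a handful of discrete levels – for example 8 levels for red and green and 4 for blue gives 8×8×4 = 256 colors. A sketch of the idea in JavaScript (the game does this in its fragment shader; the exact level counts here are assumptions):

```javascript
// Quantize a [0, 1] color channel to n discrete levels.
function quantize(v, levels) {
    return Math.round(v * (levels - 1)) / (levels - 1);
}

// 8 * 8 * 4 = 256 possible colors (assumed split; the real palette
// in the game may distribute bits differently)
function quantize_color(r, g, b) {
    return [quantize(r, 8), quantize(g, 8), quantize(b, 4)];
}
```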

Have a look at the source code for the renderer.js and game.js - there's really not that much to it.

Music & Sounds

The sound and music for all my previous games made up the bulk of the data size. E.g. Xibalba has about 800kb of graphic assets (tile sheets, sprites, title screen etc.) but more than 4mb of music and 2.5mb of sound effects (actually 8mb and 5mb respectively, since all assets need to be provided as .ogg and .mp3). This was of course infeasible when you only have 13kb total to work with.

So for Underrun all music and sound effects are generated in JavaScript when the game starts. The game uses the brilliant Sonant-X library to render instruments from a bunch of parameters into sound waves.

My dear friend Andreas Lösch of no-fate.net produced the instruments and sequences for the music in the tracker Sonant-X Live. The sound effects were produced in Sonant-X Live as well; each effect being a single instrument.

Underrun's music, produced in the Sonant-X Live tracker

To save some more space I began to remove all the parts from Sonant-X that were not needed for my game, leaving only the WebAudio generator. In doing so I noticed that the library was unnecessarily slow, taking half a second to generate an empty buffer object. The culprit was the use of a single, interleaved Uint8 buffer for the left and right audio channel, storing unsigned 16bit values with the implicit zero amplitude at 32767.

The library was spending a lot of time loading from and storing 16bit values into this 8bit buffer and later converting it to a signed float suitable for WebAudio. I believe this is an artifact from the original use case of this library: producing a WAV data URI.

I refactored the library to use two Float32Arrays (one for each channel) that can be directly used by the WebAudio API in the browser. This not only simplified the code and reduced the file size by 30% but also made the music generation twice as fast.
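To illustrate the work the refactoring removed: the interleaved unsigned 16-bit samples (zero amplitude at 32767) have to be split and rescaled into the signed [-1, 1] floats that WebAudio expects. A rough sketch, not Sonant-X's actual code:

```javascript
// Convert interleaved unsigned 16-bit samples (zero at 32767) into two
// Float32Arrays in [-1, 1], one per channel - roughly the conversion the
// old code path had to perform for every generated buffer.
function deinterleave_to_float(interleaved) {
    var n = interleaved.length / 2;
    var left = new Float32Array(n);
    var right = new Float32Array(n);

    for (var i = 0; i < n; i++) {
        left[i]  = (interleaved[i * 2]     - 32767) / 32768;
        right[i] = (interleaved[i * 2 + 1] - 32767) / 32768;
    }
    return [left, right];
}
```

With two Float32Arrays as the native format, this whole pass disappears: the buffers can be handed to the WebAudio API directly.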

As an example, have a look at the original applyDelay() function and my modified one - granted, I also removed the pausing/re-scheduling functions here that originally prevented blocking the event loop with a long running computation, but it wasn't really needed anymore.

All in all, minified and gzipped, the music and sound for Underrun along with the Sonant-X library now occupy only about 2kb. That's a pretty big deal, considering all my previous games' biggest asset were sound and music. So even if your game is not size restricted as much, it may make sense to generate audio directly in your game, instead of adding big files to the download.

Minification

When I started this project I took great care to write as little code as possible and to make sure all my code could be minified effectively by UglifyJS. The source has quite an unusual style compared to my other JS projects. It's using a lot of global variables and functions, instead of abstracting it more neatly into classes. I also wrote the code in the more C-like snake_case style to force my brain into this different mode.

In hindsight, this was totally unnecessary. Minifying and zipping would have gotten rid of most of the overhead of using more abstractions.

One trick that made a small difference is the "inlining" of all WebGL constants at build-time. E.g. gl.ONE_MINUS_SRC_ALPHA is replaced with just 771 - the constant's value. I also shortened all the WebGL function calls by producing aliases that just contained the first two letters and all subsequent upper-case letters of the function's original name:

for (var name in gl) {
    if (gl[name].length !== undefined) { // is function?
        gl[name.match(/(^..|[A-Z]|\d.|v$)/g).join('')] = gl[name];
    }
}

This allowed me to use gl.getUniformLocation() as just gl.geUL(). Note though that this approach is quite hack-ish and produces some name collisions. But for my game it worked out nicely.
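The build-time constant inlining can be sketched as a plain text replacement over the source. The lookup table here is a tiny hypothetical excerpt; a real build step would enumerate the constants from an actual WebGL context:

```javascript
// Build-time sketch: replace WebGL constant accesses with their numeric
// values. This table is a tiny hypothetical excerpt.
var GL_CONSTANTS = {
    ONE_MINUS_SRC_ALPHA: 771,
    TRIANGLES: 4,
    FLOAT: 5126
};

function inline_gl_constants(source) {
    return source.replace(/gl\.([A-Z_]+)/g, function(match, name) {
        return name in GL_CONSTANTS ? String(GL_CONSTANTS[name]) : match;
    });
}
```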

I also took a lot of shortcuts with the game logic and especially collision detection. Collision detection between entities just uses two nested loops, checking each entity against every other entity. The number of checks grows quadratically with the entity count (e.g. 100 entities = 10,000 comparisons) – it's amazing what you can get away with these days.
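The brute-force collision check is just this kind of doubly-nested loop; the names and the circle-overlap test here are illustrative assumptions, not Underrun's exact code:

```javascript
// Brute-force O(n^2) collision check between circular entities.
// Illustrative sketch; names and the radius-based test are assumptions.
function check_collisions(entities) {
    var pairs = [];
    for (var i = 0; i < entities.length; i++) {
        for (var j = i + 1; j < entities.length; j++) {
            var a = entities[i], b = entities[j];
            var dx = a.x - b.x,
                dy = a.y - b.y,
                r = a.radius + b.radius;
            if (dx * dx + dy * dy < r * r) {
                pairs.push([a, b]);
            }
        }
    }
    return pairs;
}
```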

Other than that I didn't do anything particularly clever or exciting to further minify the game.

The full source for Underrun is on github: github.com/phoboslab/underrun

Be sure to have a look at some of the other js13kGames entries this year. Some of my favorites are The Chroma Incident, 1024 Moves, ISP, Wander and Envisionator.

Impact Is Now Free & Open Source

My HTML5 Game Engine Impact launched almost 8 years ago. The last update was published in 2014. While Impact still works nicely in modern browsers, it lacks support for better graphic and sound APIs that are now available. I felt increasingly bad for selling a product that is hardly maintained or improved.

So as of today Impact will be available completely for free, published under the permissive MIT License.

Impact's source is available on github.com/phoboslab/impact

Thanks to everyone who bought a license in the last few years!

Decode It Like It's 1999

A few years ago I started to work on an MPEG1 Video decoder, completely written in JavaScript. Now, I finally found the time to clean up the library, improve its performance, make it more error resilient and modular and add an MP2 Audio decoder and MPEG-TS demuxer. This makes this library not just an MPEG decoder, but a full video player.

In this blog post I want to talk a bit about the challenges and various interesting bits I discovered during the development of this library. You'll find a demo, the source and documentation and reasons why to use JSMpeg over on the official website:

jsmpeg.com - Decode it like it's 1999

Refactoring

Recently, I needed to implement audio streaming into JSMpeg for a client and only then realized what a pitiful state the library was in. It has grown quite a bit since its first release. A WebGL renderer, WebSocket client, progressive loading, benchmarking facilities and much more have been tacked on over the last few years. All kept in a single, monolithic class with conditionals bursting at the seams.

I decided to clean up this mess first by separating its logical components. I also sketched out what would be needed for the sound implementation: a Demuxer, an MP2 decoder and an Audio Output.

Plus some auxiliary classes.

Each of the components (apart from the Sources) has a .write(buffer) method to feed it with data. These components can then "connect" to a destination that receives the processed result. The complete flow through the library looks like this:

                 / -> MPEG1 Video Decoder -> Renderer
Source -> Demuxer  
                 \ -> MP2 Audio Decoder -> Audio Output

JSMpeg currently has 3 different implementations for the Source (AJAX, AJAX progressive and WebSocket) and there are 2 different Renderers (Canvas2D and WebGL). The rest of the library is agnostic to these – i.e. the Video Decoder doesn't care about the Renderer's internals. With this approach it's easy to add new components: further Sources, Demuxers, Decoders or Outputs.

I'm not completely happy with how these connections work in the library. Each component can only have one destination (apart from the Demuxer, that has one destination per stream). It's a tradeoff. In the end, I felt that anything else would be over engineering and complicating the library for no good reason.

WebGL Rendering

One of the most computationally intensive tasks for an MPEG1 decoder is the color conversion from MPEG's internal YUV format (Y'Cr'Cb to be precise) into RGBA so that the browser can display it. Somewhat simplified, the conversion looks like this:

for (var i = 0; i < pixels.length; i+=4 ) {
    var y, cb, cr = /* fetch this from the YUV buffers */;

    pixels[i + 0 /* R */] = y + (cb + ((cb * 103) >> 8)) - 179;
    pixels[i + 1 /* G */] = y - ((cr * 88) >> 8) - 44 + ((cb * 183) >> 8) - 91;
    pixels[i + 2 /* B */] = y + (cr + ((cr * 198) >> 8)) - 227;
    pixels[i + 3 /* A */] = 255;
}

For a single 1280x720 video frame that loop has to run 921600 times to convert all pixels from YUV to RGBA. Each pixel needs 3 writes to the destination RGB array (we can pre-populate the alpha component since it's always 255). That's 2.7 million writes per frame, each needing 5-8 adds, subtracts, multiplies and bit shifts. For a 60fps video, we end up with more than 1 billion operations per second. Plus the overhead for JavaScript. The fact that JavaScript can do this, that a computer can do this, still boggles my mind.

With WebGL, this color conversion (and the subsequent display on the screen) can be sped up tremendously. A few operations for each pixel is the bread and butter of GPUs, and GPUs can process many pixels in parallel because each pixel is independent of the others. The WebGL shader that's run on the GPU doesn't even need these pesky bit shifts – GPUs like floating point numbers:

void main() {
    float y = texture2D(textureY, texCoord).r;
    float cb = texture2D(textureCb, texCoord).r - 0.5;
    float cr = texture2D(textureCr, texCoord).r - 0.5;

    gl_FragColor = vec4(
        y + 1.4 * cr,
        y - 0.343 * cb - 0.711 * cr,
        y + 1.765 * cb,
        1.0
    );
}

With WebGL, the time needed for the color conversion dropped from 50% of the total JS time to just about 1% for the YUV texture upload.

There was one minor issue I stumbled over with the WebGL renderer. JSMpeg's video decoder does not produce three Uint8Arrays for each color plane, but Uint8ClampedArrays. It's doing this, because the MPEG1 standard mandates that decoded color values must be clamped, not wrap around. Letting the browser do the clamping through the ClampedArray works out faster than doing it in JavaScript.

A bug that still stands in some Browsers (Chrome and Safari) prevents WebGL from using the Uint8ClampedArray directly. Instead, for these browsers we have to create a Uint8Array view for each array for each frame. This operation is pretty fast since nothing needs to be copied, but I'd still like to do without it.

JSMpeg detects this bug and only uses the workaround if needed. We simply try to upload a clamped array and catch the error. This detection sadly triggers an un-silenceable warning in the console, but it's better than nothing.

WebGLRenderer.prototype.allowsClampedTextureData = function() {
    var gl = this.gl;
    var texture = gl.createTexture();

    gl.bindTexture(gl.TEXTURE_2D, texture);
    gl.texImage2D(
        gl.TEXTURE_2D, 0, gl.LUMINANCE, 1, 1, 0,
        gl.LUMINANCE, gl.UNSIGNED_BYTE, new Uint8ClampedArray([0])
    );
    return (gl.getError() === 0);
};

WebAudio for Live Streaming

For the longest time I assumed that in order to feed WebAudio with raw PCM sample data without much latency or pops and cracks, you'd have to use a ScriptProcessorNode. You'd copy your decoded sample data just in time whenever you get the callback from the script processor. It works. I tried it. It needs quite a bit of code to function properly and of course it's computationally intensive and inelegant.

Luckily, my initial assumption was wrong.

The WebAudio Context maintains its own timer that's separate from JavaScript's Date.now() or performance.now(). Further, you can instruct your WebAudio sources to start() at a precise time in the future based on the context's time. With this, you can string very short PCM buffers together without any artefacts.

You only have to calculate the start time for the next buffer by continuously adding the duration of all previous ones. It's important to always use the WebAudio Context's own time for this.

var currentStartTime = 0;

function playBuffer(buffer) {
    var source = context.createBufferSource();
    /* load buffer, set destination etc. */

    var now = context.currentTime;
    if (currentStartTime < now) {
        currentStartTime = now;
    }

    source.start(currentStartTime);
    currentStartTime += buffer.duration;
}

There's a caveat though: I needed to get the precise remaining duration of the enqueued audio. I implemented it simply as the difference between the current time and the next start time:

// Don't do that!
var enqueuedTime = (currentStartTime - context.currentTime);

It took me a while to figure it out, but this doesn't work. You see, the context's currentTime is only updated every so often. It's not a precise real time value.

var t1 = context.currentTime;
doSomethingForAWhile();
var t2 = context.currentTime;

t1 === t2; // true

So, if you need the precise audio play position (or anything based on it), you have to resort to JavaScript's performance.now().
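One way to get a finer-grained value is to anchor the context's coarse currentTime to performance.now() once and interpolate from there – a sketch of the idea, not JSMpeg's exact code:

```javascript
// Track a precise play position by anchoring WebAudio's coarse
// currentTime to a high-resolution wall clock. Illustrative sketch;
// both timestamps are passed in explicitly to keep it testable.
var startWallClock = 0;   // performance.now() when playback began, in ms
var startContextTime = 0; // context.currentTime at the same moment, in s

function anchorPlayback(contextTime, wallClockMs) {
    startContextTime = contextTime;
    startWallClock = wallClockMs;
}

// Interpolated context time, in seconds
function preciseCurrentTime(wallClockMs) {
    return startContextTime + (wallClockMs - startWallClock) / 1000;
}
```

In a real player you'd call anchorPlayback(context.currentTime, performance.now()) when scheduling starts, and preciseCurrentTime(performance.now()) whenever you need the position.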

Audio Unlocking on iOS

You gotta love the shit that Apple throws into Web devs' faces from time to time. One of those things is the need to unlock audio on a page before you can play anything. Basically, audio playback can only be started as a response to a user action. You click on a button, audio plays.

This makes sense. I won't argue against it. You don't want to have audio blaring at you unannounced when you visit a page.

What makes it shitty is that Apple neither provided a way to cleanly unlock audio nor a way to ask the WebAudio Context if it's unlocked already. What you do instead is play an audio source and continually check if it's progressing. You can't check immediately after playing, though. No, no. You have to wait a bit!

WebAudioOut.prototype.unlock = function(callback) {
    // This needs to be called in an onclick or ontouchstart handler!
    this.unlockCallback = callback;

    // Create empty buffer and play it
    var buffer = this.context.createBuffer(1, 1, 22050);
    var source = this.context.createBufferSource();
    source.buffer = buffer;
    source.connect(this.destination);
    source.start(0);

    setTimeout(this.checkIfUnlocked.bind(this, source, 0), 0);
};

WebAudioOut.prototype.checkIfUnlocked = function(source, attempt) {
    if (
        source.playbackState === source.PLAYING_STATE || 
        source.playbackState === source.FINISHED_STATE
    ) {
        this.unlocked = true;
        this.unlockCallback();
    }
    else if (attempt < 10) {
        // Jeez, what a shit show. Thanks iOS!
        setTimeout(this.checkIfUnlocked.bind(this, source, attempt+1), 100);
    }
};

Progressive Loading via AJAX

Say you have a 50mb video file that you load via AJAX. The video starts loading no problem. You can even check the current progress (downloaded vs. total bytes) and display a nice loading animation. What you cannot do is access the already downloaded data while the rest of the file is still loading.

There have been some proposals for adding chunked ArrayBuffers into XMLHttpRequest, but nothing has been implemented across browsers. The newer fetch API (that I still don't understand the purpose of) proposed some similar features, but again: no cross browser support. However, we can still do the chunked downloading in JavaScript using Range-Requests.

The HTTP standard defines a Range header that allows you to grab only part of a resource. If you just need the first 1024 bytes of a big file, you set the header Range: bytes=0-1023 in your request (the end index is inclusive). Before we can start though, we have to figure out how large the file is. We can do this with a HEAD request instead of a GET. This returns only the HTTP headers for the resource, but none of the body bytes. Range requests are supported by almost all HTTP servers. The one exception I know of is PHP's built-in development server.
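Computing the Range header for a given chunk is then just a bit of arithmetic (remember the end index is inclusive); a small sketch with assumed parameter names:

```javascript
// Compute the Range header value for chunk number `index`, given a chunk
// size and the total file size (learned from a prior HEAD request).
// Illustrative sketch; names are assumptions.
function rangeHeaderForChunk(index, chunkSize, fileSize) {
    var start = index * chunkSize;
    var end = Math.min(start + chunkSize - 1, fileSize - 1);
    return 'bytes=' + start + '-' + end;
}
```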

JSMpeg's default chunk size for downloading via AJAX is 1mb. JSMpeg also appends a custom GET parameter to the URL (e.g. video.ts?0-1024) for each request, so that each chunk essentially gets its own URL and plays nice with bad caching proxies.

With this in place, you can start playing the file as soon as the first chunk has arrived. Also, further chunks will only be downloaded when they're needed. If someone only watches the first few seconds of a video, only those first few seconds will get downloaded. JSMpeg does this by measuring the time it took to load a chunk, adding a lot of safety margin and comparing this to the remaining duration of the already loaded chunks.

In JSMpeg, the Demuxer splits streams as fast as it can. It also decodes the presentation time stamp (PTS) for each packet. The video and audio decoders, however, only advance their play position in real-time increments. The difference between the last demuxed PTS and the decoder's current PTS is the remaining play time for the downloaded chunks. The Player periodically calls the Source's resume() method with this headroom time:

// It's a silly estimate, but it works
var worstCaseLoadingTime = lastChunkLoadingTime * 8 + 2;
if (worstCaseLoadingTime > secondsHeadroom) {
    loadNextChunk();
}

Audio & Video Sync

JSMpeg tries to play audio as smoothly as possible. It doesn't introduce any gaps or compressions when queuing up samples. Video playback orients itself on the audio playback position. It's done this way, because even the tiniest gaps or discontinuities are far more perceptible in audio than in video. It's far less jarring if a video frame is a few milliseconds late or dropped.

For the most part, JSMpeg relies on the presentation time stamp (PTS) of the MPEG-TS container for playback, instead of calculating the playback time itself. This means the PTS in the MPEG-TS file have to be consistent and accurate. From what I gathered from the internet, this is not always the case. But modern encoders seem to have figured this out.

One complication was that the PTS doesn't always start at 0. For instance, if you have a WebCam connected and running for a while, the PTS may be the time when the WebCam was turned on, not when recording started. Therefore, JSMpeg searches for the first PTS it can find and uses that as the global start time for all streams.

The MPEG1 and MP2 decoders also keep track of all PTS they received alongside with the buffer position of each PTS. With this, we can seek through the audio and video streams to a specific time.
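Finding the right buffer position for a seek target then becomes a search over the recorded (PTS, offset) pairs. A sketch with assumed names; JSMpeg's actual bookkeeping differs in detail:

```javascript
// Given a list of {pts, offset} entries, sorted by pts and recorded
// during demuxing, find the buffer offset of the last entry at or
// before the seek target. Names are illustrative assumptions.
function findSeekOffset(ptsIndex, targetPts) {
    var lo = 0, hi = ptsIndex.length - 1, best = 0;
    while (lo <= hi) {
        var mid = (lo + hi) >> 1;
        if (ptsIndex[mid].pts <= targetPts) {
            best = mid;     // candidate; look for a later one
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return ptsIndex[best].offset;
}
```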

Currently, JSMpeg will happily seek to an inter-frame and decode it on top of the previously decoded frame. The correct way to handle this, would be to rewind to the last intra-frame before the one we seek to and decode all frames in between. This is something I still need to fix.

Build Tools & The JavaScript Ecosystem

I avoid build tools wherever I can. Chances are, your shiny toolset that automates everything for you, will stop working in a year or two. Setting up your build environment is never as easy as "just call webpack", or use grunt or whatever task runner is the hot shit today. It always ends up like

(...) Where do I get webpack from? Oh, I need npm.
Where do I get npm from? Oh, I need nodejs.
Where do I get nodejs from? Oh, I need homebrew.
What's that? gyp build error? Oh, sure, I need to install XCode.
Oh, webpack needs the babel plugin?
What? The left-pad dependency could not be resolved?
...

And suddenly you spent two hours of your life and downloaded several GB of tools. All to build a 20kb library, for a language that doesn't even need compiling. How do I build this library 2 years from now? 5 years?

I had a thorough look at webpack and hated it. It's way too complex for my taste. I like to understand what's going on. That's part of the reason I wrote this library instead of diving into WebRTC.

So, the build step for JSMpeg is a shell script with a single call to uglifyjs that can be altered to use cat (or copy on Windows) in 2 seconds. Or you simply load the source files separately in your HTML while you're working on it. Done.

Quality, Bitrates And The Future

The quality of MPEG1 at reasonable bitrates is, much to my surprise, not bad at all. Have a look at the demo video on jsmpeg.com – granted, it's a favorable case for compression: slow movement and not too many cuts. Still, this video weighs in at 50mb for its 4 minutes, and provides a quality comparable to most Youtube videos that are "only" 30% smaller.

In my tests, I could always get video that I'd consider "high quality" at max 2Mbit/s. Depending on your use-case (want a coffee cam?), you can go to 100Kbit/s or even lower. There's no bottom limit for the bitrate/framerate.

You could get a cheap cell phone contract with a 1GB/month data limit, put a 3G dongle and a webcam on a Raspberry Pi, attach it to a 12 V automotive battery, throw it on your crop field and get a live weather cam that doesn't need any infrastructure or maintenance for a few years and is viewable in your smartphone's browser without installing anything.

The simplicity of MPEG1, compared to modern codecs, makes it very attractive in my opinion. It's well understood and there's a ton of tools that can work with it. All patents relating to MPEG1/MP2 have expired now. It's a free format.

Do you remember the GIF revival after its patents expired?

Quake for Oculus Rift

My Oculus Rift CV1 finally arrived last week, so of course I had to update my fork of the excellent Quakespasm engine for the new Oculus 1.4 API.

Quake VR

While I was at it, I also fixed a number of bugs, included support for the Rift's headphones and the XBox One controller – I'd still recommend you play with mouse & keyboard, though. Controls are set up with reasonable defaults, but you can of course change everything in the options menu. There's also a new VR/HMD options menu where you can configure your preferred aim mode (e.g. decouple it from the view) and crosshair style.

The game starts directly in VR Mode, disables view bobbing and sets the texture mode to NEAREST for the proper, pixelated oldschool vibe.

Once you're in the game, press END or the start button on your gamepad to recenter the view.

Short playthrough of the first few levels of Quake in VR

Download and Source

The Absolute Worst Way To Read Typed Array Data with JavaScriptCore

Edit: as of September 13, 2016 Apple introduced a native API to manipulate Typed Arrays as part of iOS 10. This makes this whole article obsolete, if still technically interesting.

Oh man, where do I even begin.

Back when the HTML5 Canvas element was introduced by Apple to power some Dashboard widgets, the 2D Drawing Context supported a method to read the raw pixel data. This method was called getImageData() and it returned an ImageData object with a data array. This array however was not a plain JavaScript Array, but a new type called CanvasPixelArray.

This CanvasPixelArray behaved like a normal array for the most part, with some important exceptions: each element was clamped to values between 0-255 and the array was not resizable. This was done for performance reasons. With this restriction, the array could be allocated with a single memory region in the JavaScript engine.

I.e. the JavaScript engine's C++ source for getImageData() looked somewhat like this:

JSObjectRef GetImageData(int width, int height) {
    // Allocate memory for the ImageData, 4 bytes for each pixel (RGBA)
    uint8_t *backingStore = malloc(width * height * 4);

    // Read the pixel data from the canvas and store in the backingStore
    // ...
}

In JavaScript you could then read, manipulate and write image data like this:

// Read a portion of the canvas
var imageData = ctx.getImageData(0, 0, 320, 240);
var pixels = imageData.data;

// Set the Red channel for each pixel to a random value
for (var i = 0; i < pixels.length; i += 4) {
    pixels[i] = Math.random() * 255;
}

// Write it back to the canvas
ctx.putImageData(imageData, 0, 0);

Manipulating pixels on the CPU is generally a bad idea, but using a plain old JavaScript Array would have made it prohibitively slow and memory inefficient. With the CanvasPixelArray it was at least feasible – this new array type made a lot of sense.

Later, with the arrival of WebGL, the W3C realized that we needed better facilities to deal with low-level binary data. CanvasPixelArray was a good start at making raw byte buffers faster, but other data types would be needed. So the WebGL draft proposed some new types in the same vein. Collectively, these were called Typed Arrays because, unlike plain JavaScript arrays, they have a fixed value type.

Modern JavaScript engines now support Typed Arrays in signed and unsigned flavors, with 8, 16 or 32 bit Integer values as well as 32 bit float and 64 bit float versions. E.g. Int32Array, Uint16Array and Float32Array.
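
Each flavor fixes the element size and how assigned values are converted. A quick illustration:

```javascript
// Element sizes are fixed per Typed Array flavor
var i32 = new Int32Array(4);
var u16 = new Uint16Array(4);
var f64 = new Float64Array(4);

i32.BYTES_PER_ELEMENT; // 4
u16.BYTES_PER_ELEMENT; // 2
f64.BYTES_PER_ELEMENT; // 8

// Assigned values are converted to the element type
var i8 = new Int8Array(1);
i8[0] = 200; // 200 doesn't fit into a signed 8 bit value; it wraps around
i8[0]; // -56
```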

At the same time, the good old CanvasPixelArray was renamed to Uint8ClampedArray. It now supports all the same methods as all the other Typed Arrays.
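
The clamping behavior is the one thing that still sets it apart from a plain Uint8Array – out-of-range values are clamped instead of wrapped:

```javascript
var clamped = new Uint8ClampedArray(2);
var plain = new Uint8Array(2);

clamped[0] = 300; // clamped to the 0-255 range
plain[0] = 300;   // wraps modulo 256

clamped[0]; // 255
plain[0];   // 44

clamped[1] = -20;
plain[1] = -20;

clamped[1]; // 0
plain[1];   // 236
```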

A very important addition introduced with these Typed Arrays is the fact that the actual data is not stored in the array but in a separate ArrayBuffer. The Typed Array itself is then only a View of this buffer. Furthermore, this buffer can be shared between different Views.

// Create a Uint32Array with 64 values. This allocates an ArrayBuffer with
// 256 bytes (64 values × 4 bytes) behind the scenes.
var u32 = new Uint32Array(64);
u32.length; // 64
u32.byteLength; // 256

// Create a Uint8Array, sharing the same buffer
var u8 = new Uint8Array(u32.buffer);
u8.length; // 256
u8.byteLength; // 256

// Modifying data on either View will modify it in all other views sharing
// this buffer
u8[1] = 1;
u32[0]; // 256
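
Note that the u32[0] result above assumes a little-endian platform – the first byte of the buffer is the least significant byte of the 32 bit value – which holds for practically every device running JavaScript today. If you need an explicit byte order, e.g. when parsing binary file formats, a DataView reads and writes the same ArrayBuffer with the endianness spelled out:

```javascript
var buf = new ArrayBuffer(4);
var dv = new DataView(buf);

// The last argument selects the byte order: true = little-endian
dv.setUint32(0, 0x12345678, true);
dv.getUint8(0); // 0x78 – least significant byte comes first

dv.setUint32(0, 0x12345678, false); // big-endian
dv.getUint8(0); // 0x12
```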

Every browser supporting WebGL or the Canvas2D context needs to be able to read and write the data of these Typed Arrays in native (C++) code as fast as possible. Say for instance you create a Float32Array with the vertex data of a 3D model in JavaScript. The WebGL implementation of your Browser needs to hand this data over to the GPU (via OpenGL or Direct3D) and - if possible - do it without copying.

Fortunately, as we established earlier, JavaScript engines only need to allocate a single lump of data for each array. A WebGL implementation can simply hand over the address of this data to the GPU and be done with it. E.g. an implementation of WebGL's gl.bufferData() might look like this:

void WebGLBufferData(JSObjectRef array) {
    void *ptr = JSTypedArrayGetDataPtr(array);
    size_t length = JSTypedArrayGetLength(array);

    // Call the OpenGL API with the data pointer we got
    glBufferData(GL_ARRAY_BUFFER, length, ptr, GL_STATIC_DRAW); 
}

There. Problem solved. Fast access to binary data without copying or conversion of some sort.

So, if all is good and well, what's with the title of this post then? Well, this is where the fun part starts. Buckle up.

Back in 2010 I started an Open Source project for iOS called Ejecta. It implements the Canvas2D and WebGL APIs in fast, native code and exposes it to a JavaScript runtime. Ejecta then lets you execute JavaScript code to draw to the screen using Canvas2D or WebGL. This is way faster than a browser, because it doesn't carry around all the bloat of a browser. It's a Canvas implementation without the HTML cruft around it.

As with all the Canvas implementations of modern browsers, Ejecta needs to read and write Typed Arrays as fast as possible. Ejecta uses JavaScriptCore (JSC), Apple's JavaScript engine to execute JavaScript code. I didn't specifically choose JSC. Rather, JSC is the only modern JavaScript engine that Apple allows you to run on iOS.

On modern iOS versions, JSC is available as a public API, meaning anyone who writes an iOS App in native code can use the API of JSC to execute JavaScript in their App. The JSC API is a pretty minimal layer that sits on top of JSC. It used to expose everything you'd want to do with JSC for your native code, but has hopelessly fallen behind in the last few years.

You probably know already where this is going: while you can create JavaScript objects, expose native C functions to JavaScript and do all kinds of object manipulation, you cannot access Typed Array data through JSC's public API.

This is an absolute show stopper for a WebGL implementation that attempts to use the JSC API. Note that Apple doesn't have this problem with their WebGL implementation in Safari/Webkit, because they are not limited to this public JSC API. They can dive deep into the JSC internals and can get that data pointer directly.

We can not.

So I wrote my own API for JSC to deal with Typed Arrays. You don't need much, actually: a function to get the data pointer and the length, a function to construct a Typed Array with a given type and a function to get the type of an existing Typed Array.

You can see my Typed Array API in the JSTypedArray.h on Github. It's pretty straight forward. Likewise the implementation for this (JSTypedArray.cpp) is only about 130 lines of code.

Now, adding your own API to a library provided by iOS is a big no-no. Remember, these functions need to access the internals of JSC and Apple hates that. So I forked JavaScriptCore and provide my own JSC lib as part of Ejecta. The compiled version is about 7mb in size, which isn't all too bad. Still, it's a shame that we have to provide our own library just for these 130 lines of code.

Of course I tried to get my API into the official JSC library. I opened a bug report on Webkit.org in 2012 and proposed my Typed Array API design. In the following 3 years not much has happened, other than some bike shedding and lots of silence, despite my best efforts to improve the proposal. Sadly, it seems I'm pretty alone with this problem.

Let me use this opportunity to rant about the Open Source-ness of JSC for a bit. You see, JSC is part of the Webkit project, which touts itself as Open Source. And it's true, you can access most of the source code; it's licensed under the LGPL. This license states that if you want to modify JSC, as Apple does, you have to provide:

(...) all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the library.

(emphasis mine)

Yet, the project files or scripts to compile JSC for iOS are curiously missing from the source code.

This wouldn't be a big deal if compiling JSC was easy, but it's absolutely not. I and many others have spent countless days building Xcode project files for iOS. JSC is a huge project, composed of many different parts.

Just for kicks, there's one step that runs an assembler (written in Ruby, no less) which compiles a custom assembly language at build time. The result is compiled into the final lib and used internally by JSC.

Another step compiles an executable with an empty main() function. This executable links in some other functions. A script then runs over this executable file and gathers the addresses of these functions. Again, the result of this is compiled into the final lib.

You can see how this may get tricky to set up for a new platform.

Every time we want to update our JSC library, we have to set up this build process anew. We also need to conform to Apple's platform policies with each new iOS version. Namely 64bit support and Bitcode compilation.

The last one is where it gets funky. To use Ejecta on Apple's new tvOS platform, you need to compile your App with Bitcode enabled. This also means that every library you use needs to be compiled with Bitcode enabled.

Some brave people actually figured out how to compile JSC with Bitcode enabled, but we have no way to verify its correctness. With all the funky custom Assembly language and function pointer addresses used in JSC, I wouldn't be surprised if the compiled Bitcode is indeed broken.

And lo and behold, it probably is broken. When I tried to upload my tvOS App to Apple's AppStore, the "Processing" step failed. Apple's support doesn't know why.

AppStore Screenshot

So, with no hope of seeing a Typed Array API in the official JSC anytime soon and my custom JSC lib "rejected" by Apple, there's only one option left and it's an ugly one.

Ejecta has to use the JSC lib provided by the OS and work around the missing Typed Array API somehow.

Ok, so we can't access a data pointer directly. Iterating over all values of the array separately in native code and storing them in a buffer should work too.

In addition to a GetProperty() function that reads a property with a given name of a JavaScript object, the JSC API actually provides a faster GetPropertyAtIndex() function. This function takes an integer and returns a JSValue.

Let's copy all the values from a Uint8Array.

uint8_t *buffer = malloc(length);
JSValueRef value;
for (int i = 0; i < length; i++ ) {
    value = JSObjectGetPropertyAtIndex(ctx, jsArray, i, NULL);
    buffer[i] = JSValueToNumber(ctx, value, NULL);
}

Simple. And painfully slow. This takes about 1200ms per Megabyte on an iPhone 5S.

Consider what the JSC API has to do behind the scenes for this GetPropertyAtIndex function: for each call, JSC needs to get the uint8 from the buffer and construct a new JSValue using that uint8.

Much worse though, the JSC API is designed to provide robust access from multiple threads. In order to make this work, each API call acquires an exclusive lock on the JavaScript Context. Imagine doing this in a loop, 2 million times per Megabyte and you'll see why this approach is a bad idea.

If we want to have some realtime animations, we only have 16ms per frame to do everything. And extracting data byte by byte from a Typed Array where I could previously simply get a pointer in constant time, is not something I planned for.

So we have to get the data out with fewer API calls. In the past though, before the arrival of Typed Arrays, JavaScript has been notoriously bad at dealing with big lumps of binary data. And the JSC API is stuck with the JavaScript of the past. The only other way to represent continuous, raw bytes is a String.

So I wrote a function to convert a Typed Array into a String.

Strings in JavaScript are stored somewhat like UTF16. It's not exactly UTF16, but the important thing is that each character of the string is internally stored in a 16 bit value. So we can store 16 bits or 2 bytes of our array in the String at a time.
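
To make this concrete: packing two bytes into one character code and getting them back out is just a bit of shifting and masking.

```javascript
// Pack two bytes into a single 16 bit character code, low byte first –
// the same order a Uint16Array read produces on a little-endian machine
var lo = 0xCD, hi = 0xAB;
var ch = String.fromCharCode(lo | (hi << 8));

ch.charCodeAt(0);        // 0xABCD
ch.charCodeAt(0) & 0xff; // 0xCD – the low byte
ch.charCodeAt(0) >> 8;   // 0xAB – the high byte
```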

The main trick here is to not iterate over the Typed Array in JavaScript, but instead call apply() on the String.fromCharCode() function. JavaScript's apply() calls a function in such a way that the 2nd argument – an Array or array-like – is spread out, with each element treated as a separate argument to the function being invoked.

var func = function(a, b, c) { 
    console.log(a, b, c);
};
var array = new Uint8Array([1, 2, 3, 4]);
func.apply(null, array); // 1 2 3

There's a big caveat with this approach, though. In JSC function arguments are allocated on the stack and the number of arguments a function can take is hard-limited to 65,535 (0xffff).

So for big Typed Arrays we can't do the String conversion in one step, but rather have to subdivide the Array into smaller chunks. This however isn't much of a problem. If you remember from the beginning of the article, Typed Arrays are just a View on an underlying ArrayBuffer. Creating Sub-Arrays is pretty fast and will not copy the underlying data.
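
A quick demonstration that subarray() really just creates another View on the same buffer:

```javascript
var data = new Uint16Array([10, 20, 30, 40]);
var sub = data.subarray(1, 3); // elements at index 1 and 2, no copy

sub.length;                 // 2
sub[0];                     // 20
sub.buffer === data.buffer; // true – same underlying ArrayBuffer

// Writes through the sub-array show up in the original
sub[0] = 99;
data[1]; // 99
```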

var TypedArrayToString = function(array) {
    "use strict";

    var chunkSize = 0x4000;
    var u16Count = array.byteLength >> 1;

    // Create a Uint16 View from the TypedArray or ArrayBuffer
    var u16 = array instanceof ArrayBuffer
        ? new Uint16Array(array, 0, u16Count)
        : new Uint16Array(array.buffer, array.byteOffset, u16Count);

    // If this array has an odd byte length, we have to append the last
    // byte separately, instead of reading it from the Uint16Array.
    var lastByte = "";
    if (array.byteLength % 2 !== 0) {
        var u8 = new Uint8Array(u16.buffer, u16.byteOffset);
        lastByte = String.fromCharCode(u8[u16Count * 2]);
    }

    if (u16Count < chunkSize) {
        // Fast case - data is smaller than chunk size and the conversion
        // can be done in one step.
        return String.fromCharCode.apply(null, u16) + lastByte;
    }
    else {
        // Slow case - we need to split the data into smaller chunks,
        // collect them in an array and finally join them together.
        var chunks = [];
        for (var i = 0; i < u16Count; i += chunkSize) {
            var u16Sub = u16.subarray(i, i + chunkSize);
            chunks.push(String.fromCharCode.apply(null, u16Sub));
        }
        chunks.push(lastByte);
        return chunks.join("");
    }
};

Combined with the native C function that invokes this JavaScript function, takes the returned string and copies it into our own native buffer, this takes about 60ms per Megabyte – about 55ms of which are spent in the JavaScript function. Much better than 1200ms, but still extremely slow.
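
For completeness, the reverse direction – turning such a string back into bytes – is straightforward. This is only a sketch; the helper name StringToUint8Array is mine, not Ejecta API, and the byte order matches the little-endian Uint16 packing used above.

```javascript
// Hypothetical inverse of TypedArrayToString: unpack each 16 bit
// character code into two bytes (low byte first). A trailing odd
// byte, if any, sits alone in the last character.
var StringToUint8Array = function(str, byteLength) {
    var u8 = new Uint8Array(byteLength);
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        u8[i * 2] = code & 0xff;
        if (i * 2 + 1 < byteLength) {
            u8[i * 2 + 1] = code >> 8;
        }
    }
    return u8;
};
```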

After a while of pondering, it dawned on me. If String.fromCharCode can take the whole Array as function arguments - and is pretty fast at doing so - so could I with a custom function. Better yet, instead of just encoding 16 bits at a time in a character, I could use full 32 bit integers.

int32_t *CurrentDataPointer;

JSValueRef AppendDataCallback(
    JSContextRef ctx, JSObjectRef function, JSObjectRef thisObject,
    size_t argumentCount, const JSValueRef argumentValues[],
    JSValueRef* exception
) {
    for (int i = 0; i < argumentCount; i++) {
        CurrentDataPointer[i] = JSValueToNumber(ctx, argumentValues[i], NULL);
    }

    CurrentDataPointer += argumentCount;
    return NULL;
}

This native function uses a global CurrentDataPointer (for the sake of this example) to copy the Array data into. It then simply iterates over the argumentValues, converts them to integers and stores them. The CurrentDataPointer is then incremented by the argument count - the amount of data already written.

This function is called (via apply()) from JavaScript, with the Typed Array spread out as separate argumentValues. For big Arrays this function needs to be invoked a number of times, because of the stack size limit. So it's important to keep track of the current write position.

What have we gained? Not much. Actually this performs worse than String.fromCharCode.

You see, String.fromCharCode doesn't need to acquire exclusive locks on the JavaScript Context when converting JSValues to native integers. It cheats by not using the JSC API's JSValueToNumber() function, but rather just casts the JSValue to an integer.

Excursion: How JavaScript Values are stored in JSC

On 64 bit systems, all JSValues are "pointers" with a 64 bit width. These pointers point to the actual JavaScript Object (a C++ class) in memory. However, for Numbers, true, false, null and undefined the pointer is not a pointer at all. Instead, the value is directly encoded in this 64 bit pointer.

How do we know the pointer is not a pointer but an immediate value? We use a few bits of this 64 bit value as tags. Current CPUs actually only use 48 bits for addressing memory (that's still enough for 256 Terabyte!), so we can be sure the upper 16 bits of the value are not used by pointers.

We can use some of these upper 16 bits to tag the type of this pointer. If some of these bits are set, the pointer is actually an immediate value.

Now, Numbers in JavaScript are stored either as a 32 bit integer or as a 64 bit double precision float. A 32 bit integer easily fits into our 64 bit value, but a double uses the whole 64 bit space. Well, not exactly: some floating point bit patterns are "invalid" – NaN, or Not a Number, such as the result of dividing zero by zero.

JSC uses one of these NaN values, one that is not produced by any other operation, to indicate that the 64 bit value is indeed not a double, but an integer. This is called NaN-boxing and is used by many different implementations of JavaScript, Python and other languages.

0x00000001 1846bf80 – None of the upper 16 bits are set; this is an actual
                      pointer to an Object in memory.

0x400a21ca c083126f – Some of the upper 16 bits are set and this is not a
                      NaN value, so it has to be a double. (3.1415)

0xffff0000 0000007b – All upper 16 bits are set to indicate an int value.
                      Interpreted as a double, setting the upper 16 bits
                      would always produce a NaN. The int is stored directly
                      in the lower 32 bits. (123)
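
The example values in this table can be checked with a few lines of JavaScript using BigInt. This is purely a model of the encoding described above (it skips the true/false/null cases), not actual JSC code:

```javascript
// Decode a NaN-boxed 64 bit JSC value following the tagging scheme
// described above. Handles only ints, doubles and object pointers.
function decodeBoxed(v) {
    const TAG = 0xffff000000000000n;
    if ((v & TAG) === TAG) {
        // All upper 16 bits set: an int32 lives in the lower 32 bits
        return Number(BigInt.asIntN(32, v));
    }
    if ((v & TAG) !== 0n) {
        // Some upper bits set: a double, stored with an offset of 2^48
        const dv = new DataView(new ArrayBuffer(8));
        dv.setBigUint64(0, v - 0x1000000000000n);
        return dv.getFloat64(0);
    }
    return null; // no upper bits set: an actual pointer to an Object
}

decodeBoxed(0xffff00000000007bn); // 123
decodeBoxed(0x400a21cac083126fn); // 3.1415
```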

So, if we only care about Numbers – which we do, since Typed Arrays only store numbers – we can circumvent the JSC API and decode this JSValueRef pointer ourselves. There's a bit more going on with NaN boxing than I outlined here, but that's the gist of it.

Fun fact: 64 bit doubles can represent any 32 bit integer with exact precision. So we're safe to return a double in any case.

double JSValueToNumberFast(JSValueRef v) {
    union {
        int64_t asInt64;
        double asDouble;
        struct { int32_t asInt; int32_t tag; } asBits;
    } taggedValue = { .asInt64 = (int64_t)v };

    #define DoubleEncodeOffset 0x1000000000000ll
    #define TagTypeNumber 0xffff0000
    #define ValueTrue 0x7

    if((taggedValue.asBits.tag & TagTypeNumber) == TagTypeNumber) {
        return taggedValue.asBits.asInt;
    }
    else if (taggedValue.asBits.tag & TagTypeNumber) {
        taggedValue.asInt64 -= DoubleEncodeOffset;
        return taggedValue.asDouble;
    }
    else if (taggedValue.asBits.asInt == ValueTrue) {
        return 1.0;
    }
    else {
        return 0; // false, undefined, null, object
    }
}

With our own JSValueToNumberFast function in the loop of the AppendDataCallback, the time to extract 1 Megabyte is about 15ms. Not too bad. But we can do better.

The AppendDataCallback is only ever called with JSValue arguments storing 32 bit integers. So we know each JSValueRef is an integer. We don't have to test for it, and we don't need to care about bools or doubles either. We can simply extract the lower 32 bits and treat them as an int. So how about this?

inline int32_t JSValueToInt32Faster(JSValueRef v) {
    return (int32_t)v; // lower 32 bit of the 64 bit JSValueRef
}

Remember how the AppendDataCallback just gets a plain C Array of JSValueRefs? It's essentially a big buffer of 64 bit values.

Contents of JSValueRef argumentValues[]

0xffff0000 00000000 - integer 0
0xffff0000 00000001 - integer 1
0xffff0000 00000002 - integer 2
0xffff0000 00000003 - integer 3
...

Notice how we only care about the lower 32 bit of each JSValueRef. Essentially the lower 32 bit lane. There has to be something we can do to extract these integers faster instead of iterating one by one.

And for ARM CPUs there is. SIMD Intrinsics. Operations that are Single Instruction Multiple Data.

There are explicit SIMD operations that deal with this notion of lanes. Specifically we can use the following:

const int32x4x2_t lanes32 = vld2q_s32(argumentValues);

This loads 2 lanes of 4 × 32 bit values each from our argumentValues. We can then extract the lower lane with lanes32.val[0], storing 4 × 32 bit values at once.

If we iterate 4 values at a time, we still have to care about any leftovers. For instance with 11 arguments, we can use this ARM intrinsic 2 times and have to extract the remaining 3 integers separately.

static JSValueRef AppendDataCallback(
    JSContextRef ctx, JSObjectRef function, JSObjectRef thisObject,
    size_t argc, const JSValueRef argv[], JSValueRef* exception
) {
    AppendDataCallbackState *state = JSObjectGetPrivate(thisObject);
    int32_t *dst = state->currentDataPtr;
    int remainderStart = 0;

    // On 64bit systems where ARM_NEON instructions are available we can
    // use some SIMD intrinsics to extract the lower 32bit of each
    // JSValueRef. Hopefully the JSValueRef encodes an Int32 so the lower
    // 32bit corresponds to that Int32 exactly.

    #if __LP64__ && defined __ARM_NEON__
        // Iterate over the arguments in packs of 4. Load arguments as 
        // 2 lanes of 4 int32 values and store the first lane (the lower
        // 32bit) into the dst buffer.

        int argPacks4 = (int)argc/4;
        int32_t *src = (int32_t*)argv;

        for (int i = 0; i < argPacks4; i++) {
            const int32x4x2_t lanes32 = vld2q_s32(src);
            vst1q_s32(dst, lanes32.val[0]);
            src += 8;
            dst += 4;
        }
        remainderStart = argPacks4 * 4;
    #endif

    for (int i = remainderStart; i < argc; i++) {
        *(dst++) = GetInt32(ctx, argv[i]);
    }

    state->currentDataPtr += argc;
    return MakeInt32(ctx, (int)argc);
}

Down to 7ms per Megabyte.

This is as far as it goes. Most of the time needed to extract the data is now spent inside JSC itself, which I have no access to.

7ms/MB is still disappointingly slow when you want to do real time 3D stuff, with large, changing buffers. However, not having to maintain and compile my own fork of JavaScriptCore simplifies things quite a lot.

Performance is still good enough to bring Xibalba to the AppleTV in 60 FPS. It's still in Review for the AppleTV, but you can play it in your browser or download it from the AppStore on your iOS device.

Xibalba Xibalba – A WebGL First Person Shooter

I have published all my changes to Ejecta in the typed-array-clusterfuck Branch.

You can see the new API for Typed Arrays in the TypedArray.h (which mimics my original proposal quite closely). All the dirty implementation details and tricks are in the pedantically commented TypedArray.m.

As fascinating as this all was for me, keep in mind that the end result - 7ms/MB - is still 7ms slower than what we would have with a better API.

Please help me to get a real Typed Array API into JavaScriptCore: Comment on this Bug Report on Webkit.org.
