Generic Syntax Highlighting with Regular Expressions

Ever tried to display syntax-highlighted program code with PHP? The existing solutions are either totally overblown, produce horrible markup, or need an external program.

I obviously didn’t like any of these, so I wrote my own syntax highlighting function for PHP. It works well for a whole range of C-style languages, but can also be used for SQL and many others. Read on for some examples and the highlighting code itself.

Examples

A short example with a MySQL query:

SELECT
    postId, created, keyword, title, teaser
FROM pn_blog_posts
WHERE 
    keyword = 'hello-world' AND
    status >= 2
ORDER BY created DESC

And another one with a short C++ code snippet:

Def * locate( string index="" ) {
    int start = 0, stop = 0;
    index = trim( index, "\t\n\r /" );
    if( index.empty() ) return this;

    // Descend into the tree
    Def * d = this;
    do {    
        stop = index.find_first_of( "/", start );
        string name = index.substr( start, stop - start);
        start = stop + 1;
        d = d->children[name];
    } while( stop != string::npos && d );

    return d;
}

Source

Here’s the complete PHP source of the syntax highlighter. Just use it like this:

echo SyntaxHighlight::process( $myCode );

The mere fact that this function can highlight its own source (which makes extensive use of escaped characters, comments in strings etc.) without getting confused should be demonstration enough of its robustness.

class SyntaxHighlight {
    public static function process( $s ) {
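        // ENT_COMPAT leaves single quotes untouched; the string expression
        // below matches them literally (PHP 8.1+ would otherwise turn them
        // into &#039;)
        $s = htmlspecialchars( $s, ENT_COMPAT );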

        // Workaround for escaped backslashes
        $s = str_replace( '\\\\','\\\\<e>', $s ); 

        // Comments/Strings: replace each match with a placeholder token
        // (see replaceId() below) so the expressions that follow never see
        // their contents. The /e modifier is not available in PHP 7 and
        // later, hence preg_replace_callback().
        $tokens = array(); // This array will be filled from the regexp-callback
        $s = preg_replace_callback(
            '/(
                \/\*.*?\*\/|
                \/\/.*?\n|
                \#.*?\n|
                (?<!\\\)&quot;.*?(?<!\\\)&quot;|
                (?<!\\\)\'(.*?)(?<!\\\)\'
            )/isx',
            function( $m ) use ( &$tokens ) {
                return self::replaceId( $tokens, $m[1] );
            },
            $s
        );

        $regexp = array(
            // Numbers (also look for Hex)
            '/(?<!\w)(
                0x[\da-f]+|
                \d+
            )(?!\w)/ix'
            => '<span class="N">$1</span>',

            // Make the bold assumption that an all uppercase word has a 
            // special meaning
            '/(?<!\w|>)(
                [A-Z_0-9]{2,}
            )(?!\w)/x'
            => '<span class="D">$1</span>', 

            // Keywords
            '/(?<!\w|\$|\%|\@|>)(
                and|or|xor|for|do|while|foreach|as|return|die|exit|if|then|else|
                elseif|new|delete|try|throw|catch|finally|class|function|string|
                array|object|resource|var|bool|boolean|int|integer|float|double|
                real|global|const|static|public|private|protected|
                published|extends|switch|true|false|null|void|this|self|struct|
                char|signed|unsigned|short|long
            )(?!\w|=")/ix'
            => '<span class="K">$1</span>', 

            // PHP/Perl-Style Vars: $var, %var, @var
            '/(?<!\w)(
                (\$|\%|\@)(\-&gt;|\w)+
            )(?!\w)/ix'
            => '<span class="V">$1</span>'
        );

        $s = preg_replace( array_keys($regexp), array_values($regexp), $s );

        // Paste the comments and strings back in again
        $s = str_replace( array_keys($tokens), array_values($tokens), $s );

        // Delete the "Escaped Backslash Workaround Token" (TM) and replace 
        // tabs with four spaces.
        $s = str_replace( array( '<e>', "\t" ), array( '', '    ' ), $s );

        return '<pre>'.$s.'</pre>';
    }

    // Regexp-Callback to replace every comment or string with a uniqid and save 
    // the matched text in an array
    // This way, strings and comments will be stripped out and won't be processed 
    // by the other expressions searching for keywords etc.
    private static function replaceId( &$a, $match ) {
        $id = "##r".uniqid()."##";

        // String or Comment?
        if( $match[0] == '/' || $match[0] == '#' ) {
            $a[$id] = '<span class="C">'.$match.'</span>';
        } else {
            $a[$id] = '<span class="S">'.$match.'</span>';
        }
        return $id;
    }
}
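
The placeholder trick behind replaceId() may be easier to follow in isolation. The following is only a minimal sketch, not part of the class above: the function name maskAndHighlight, the counter-based token format and the single keyword rule are assumptions made for illustration. It masks double-quoted strings, highlights one keyword, and then pastes the strings back in.

```php
<?php
// Hypothetical helper, for illustration only.
function maskAndHighlight( $s ) {
    $s = htmlspecialchars( $s, ENT_COMPAT );

    // Step 1: replace every double-quoted string with a placeholder and
    // remember the original text. A simple counter stands in for uniqid().
    $tokens = array();
    $s = preg_replace_callback(
        '/&quot;.*?&quot;/s',
        function( $m ) use ( &$tokens ) {
            $id = '##r' . count( $tokens ) . '##';
            $tokens[$id] = '<span class="S">' . $m[0] . '</span>';
            return $id;
        },
        $s
    );

    // Step 2: highlight a keyword. The strings are gone at this point, so
    // a keyword inside a string cannot match.
    $s = preg_replace(
        '/(?<!\w)(return)(?!\w)/',
        '<span class="K">$1</span>',
        $s
    );

    // Step 3: paste the strings back in.
    return str_replace( array_keys( $tokens ), array_values( $tokens ), $s );
}

echo maskAndHighlight( 'return "return";' );
// <span class="K">return</span> <span class="S">&quot;return&quot;</span>;
```

Only the first "return" is wrapped as a keyword. The one inside the string comes back untouched, wrapped in the string span.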

You also need to define a CSS class for each type of highlighted text. Here are the colors I used on this page:

pre { 
    font-family: "Courier New", "Bitstream Vera Sans Mono", monospace; 
    font-size: 9pt;
    border-top: 1px solid #333;
    border-bottom: 1px solid #333;
    padding: 0.4em;
    color: #fff;
}
pre span.N{ color:#f2c47f; } /* Numbers */
pre span.S{ color:#42ff00; } /* Strings */
pre span.C{ color:#838383; } /* Comments */
pre span.K{ color:#ff0078; } /* Keywords */
pre span.V{ color:#70d6ff; } /* Vars */
pre span.D{ color:#ff9a5d; } /* Defines */
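
To see how these class names line up with the generated markup, here is a small sketch that runs the number rule on its own. The expression is the one from the highlighter; the wrapper function highlightNumbers is made up for this example.

```php
<?php
// Hypothetical wrapper around the highlighter's number rule.
function highlightNumbers( $s ) {
    return preg_replace(
        '/(?<!\w)(
            0x[\da-f]+|
            \d+
        )(?!\w)/ix',
        '<span class="N">$1</span>',
        htmlspecialchars( $s, ENT_COMPAT )
    );
}

echo highlightNumbers( 'x1 = 0xFF + 42;' );
// x1 = <span class="N">0xFF</span> + <span class="N">42</span>;
```

The digit in x1 is left alone because a word character precedes it. The two literals get the N class and so pick up the color defined for numbers above.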