PHOBOSLAB

Generic Syntax Highlighting with Regular Expressions

Ever tried to display syntax-highlighted program code with PHP? There are some solutions, but they are either totally overblown, produce horrible markup, or need an external program.

I obviously didn’t like any of these, so I wrote my own syntax highlighting function for PHP. It works well for a whole range of C-style languages, but can also be used for SQL and many others. Read on for some examples and the highlighting code itself.

Examples

A short example with a MySQL query:

SELECT
    postId, created, keyword, title, teaser
FROM pn_blog_posts
WHERE 
    keyword = 'hello-world' AND
    status >= 2
ORDER BY created DESC

And another one with a short C++ code snippet:

Def * locate( string index="" ) {
    int start = 0, stop = 0;
    index = trim( index, "\t\n\r /" );
    if( index.empty() ) return this;

    // Descend into the tree
    Def * d = this;
    do {    
        stop = index.find_first_of( "/", start );
        string name = index.substr( start, stop - start);
        start = stop + 1;
        d = d->children[name];
    } while( stop != string::npos && d );

    return d;
}

Source

Here’s the complete PHP source of the syntax highlighter. Just use it like this:

echo SyntaxHighlight::process( $myCode );

The mere fact that this function can highlight its own source (which makes extensive use of escaped characters, comments in strings, etc.) without getting confused should be demonstration enough of its robustness.

class SyntaxHighlight {
    public static function process( $s ) {
        $s = htmlspecialchars( $s );
        
        // Workaround for escaped backslashes
        $s = str_replace( '\\\\','\\\\<e>', $s ); 
        
        $regexp = array(
            // Comments/Strings
            '/(
                \/\*.*?\*\/|
                \/\/.*?\n|
                \#.*?\n|
                (?<!\\\)&quot;.*?(?<!\\\)&quot;|
                (?<!\\\)\'(.*?)(?<!\\\)\'
            )/isex' 
            => 'self::replaceId($tokens,\'$1\')',
            
            // Numbers (also look for Hex)
            '/(?<!\w)(
                0x[\da-f]+|
                \d+
            )(?!\w)/ix'
            => '<span class="N">$1</span>',
            
            // Make the bold assumption that an all uppercase word has a 
            // special meaning
            '/(?<!\w|>)(
                [A-Z_0-9]{2,}
            )(?!\w)/x'
            => '<span class="D">$1</span>', 
            
            // Keywords
            '/(?<!\w|\$|\%|\@|>)(
                and|or|xor|for|do|while|foreach|as|return|die|exit|if|then|else|
                elseif|new|delete|try|throw|catch|finally|class|function|string|
                array|object|resource|var|bool|boolean|int|integer|float|double|
                real|string|array|global|const|static|public|private|protected|
                published|extends|switch|true|false|null|void|this|self|struct|
                char|signed|unsigned|short|long
            )(?!\w|=")/ix'
            => '<span class="K">$1</span>', 
            
            // PHP/Perl-Style Vars: $var, %var, @var
            '/(?<!\w)(
                (\$|\%|\@)(\-&gt;|\w)+
            )(?!\w)/ix'
            => '<span class="V">$1</span>'
        );
        
        $tokens = array(); // This array will be filled from the regexp-callback
        $s = preg_replace( array_keys($regexp), array_values($regexp), $s );
        
        // Paste the comments and strings back in again
        $s = str_replace( array_keys($tokens), array_values($tokens), $s );
        
        // Delete the "Escaped Backslash Workaround Token" (TM) and replace 
        // tabs with four spaces.
        $s = str_replace( array( '<e>', "\t" ), array( '', '    ' ), $s );
        
        return '<pre>'.$s.'</pre>';
    }
    
    // Regexp-Callback to replace every comment or string with a uniqid and save 
    // the matched text in an array
    // This way, strings and comments will be stripped out and won't be processed 
    // by the other expressions searching for keywords etc.
    private static function replaceId( &$a, $match ) {
        $id = "##r".uniqid()."##";
        
        // String or Comment?
        if( $match[0] == '/' || $match[0] == '#' ) {
            $a[$id] = '<span class="C">'.$match.'</span>';
        } else {
            $a[$id] = '<span class="S">'.$match.'</span>';
        }
        return $id;
    }
}
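
The stash-and-restore idea behind `replaceId` can also be sketched for current PHP versions, where the `/e` modifier used above is no longer available (it was removed in PHP 7.0). What follows is a minimal sketch under a few assumptions, not a drop-in replacement: `highlightSketch` and `TOKEN_PATTERN` are hypothetical names, a simple counter stands in for `uniqid()`, only the number rule is carried over, and `ENT_COMPAT` is passed explicitly on the assumption that single quotes should stay unescaped so the string pattern can still match them.

```php
<?php
// Sketch only: highlightSketch() and TOKEN_PATTERN are hypothetical names,
// not part of the original SyntaxHighlight class.

// Comments and strings, in the same order as the original expression.
// The /x flag allows the inline regex comments; /s lets "." span newlines.
const TOKEN_PATTERN = <<<'RE'
~(
    /\*.*?\*/                        # block comment
  | //.*?\n                          # line comment
  | \#.*?\n                          # hash comment
  | (?<!\\)&quot;.*?(?<!\\)&quot;    # double-quoted string, already HTML-escaped
  | (?<!\\)'.*?(?<!\\)'              # single-quoted string
)~sx
RE;

function highlightSketch(string $code): string
{
    // Assumption: ENT_COMPAT, so single quotes are left as-is.
    $s = htmlspecialchars($code, ENT_COMPAT);

    // Same workaround as the original: tag escaped backslashes so the
    // quote that follows them is not mistaken for an escaped quote.
    $s = str_replace('\\\\', '\\\\<e>', $s);

    // Step 1: swap every comment or string for a placeholder and remember
    // the highlighted replacement.
    $tokens = [];
    $s = preg_replace_callback(TOKEN_PATTERN, function (array $m) use (&$tokens): string {
        $id = '##r' . count($tokens) . '##';
        $isComment = $m[0][0] === '/' || $m[0][0] === '#';
        $class = $isComment ? 'C' : 'S';
        $tokens[$id] = '<span class="' . $class . '">' . $m[0] . '</span>';
        return $id;
    }, $s);

    // Step 2: highlight what is left. Only the number rule is shown here.
    $s = preg_replace('~(?<!\w)(0x[\da-f]+|\d+)(?!\w)~i', '<span class="N">$1</span>', $s);

    // Step 3: paste comments and strings back in and drop the marker.
    $s = str_replace(array_keys($tokens), array_values($tokens), $s);
    return '<pre>' . str_replace('<e>', '', $s) . '</pre>';
}

echo highlightSketch("x = 42; // answer is 42\n");
```

Because the number rule only ever sees placeholders where the comments and strings used to be, a number inside a comment or string is left alone. The keyword and variable rules from the original array could be added as further `preg_replace` calls between steps 1 and 3.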

You also need to define a CSS class for each type of highlighted text. Here are the colors I used on this page:

pre { 
	font-family: Courier New, Bitstream Vera Sans Mono, monospace; 
	font-size: 9pt;
	border-top: 1px solid #333;
	border-bottom: 1px solid #333;
	padding: 0.4em;
	color: #fff;
}
pre span.N{ color:#f2c47f; } /* Numbers */
pre span.S{ color:#42ff00; } /* Strings */
pre span.C{ color:#838383; } /* Comments */
pre span.K{ color:#ff0078; } /* Keywords */
pre span.V{ color:#70d6ff; } /* Vars */
pre span.D{ color:#ff9a5d; } /* Defines */
Thursday, August 2nd 2007

7 Comments:

#1 – nick – Thursday, March 6th 2008, 21:42

fantastic little bit of code. I completely agree with "totally overblown, produce horrible markup...". I was on the same path but got stuck with numbers showing up highlighted in comments and such. thanks.

#2 – annoymouse – Saturday, March 29th 2008, 03:00

I prefer softwaremaniacs.org/soft/highlight/

#3 – jb – Thursday, August 21st 2008, 21:42

most excellent! thanks for putting it up :)

#4 – MOin – Wednesday, September 24th 2008, 20:50

oh very nice, easiest walkthrough I've ever seen for syntax highlighting. thanks a lot

#5 – David – Tuesday, April 5th 2011, 21:25

Very nice! That bit of comment-parsing regex is exactly what I was just working on. However, yours seems to do much better.

Thank you for sharing!

#6 – Jelle_S – Tuesday, August 28th 2012, 20:42

Hi,

I was wondering: Is this piece of code hosted somewhere? And what is the license for this code?

Sometimes, when I'm bored, I work on side projects. And since I work a lot with Drupal, I was thinking of maybe creating a module for this syntax highlighter. But if I want to do that, I need to know it's GPL-compatible :-)

Thanks in advance

#7 – Dominic – Friday, August 31st 2012, 02:22

Consider this code as under the MIT License.

A similar, JavaScript-only version is on GitHub: github.com/phoboslab/jQuery-JSH
