Generic Syntax Highlighting with Regular Expressions
Ever tried to display syntax-highlighted program code with PHP? There are some solutions, but they are either totally overblown, produce horrible markup, or need an external program.
I obviously didn’t like any of these, so I wrote my own syntax highlighting function for PHP. It works well for a whole range of C-style languages, and can also be used for SQL and many others. Read on for some examples and the highlighting code itself.
Examples
A short example with a MySQL query
SELECT postId, created, keyword, title, teaser
FROM pn_blog_posts
WHERE keyword = 'hello-world' AND status >= 2
ORDER BY created DESC
And another one with a short C++ code snippet
Def * locate( string index="" ) {
    int start = 0, stop = 0;
    index = trim( index, "\t\n\r /" );

    if( index.empty() )
        return this;

    // Descent into the tree
    Def * d = this;
    do {
        stop = index.find_first_of( "/", start );
        string name = index.substr( start, stop - start);
        start = stop + 1;
        d = d->children[name];
    } while( stop != string::npos && d );

    return d;
}
Source
Here’s the complete PHP source of the syntax highlighter. Just use it like this:
echo SyntaxHighlight::process( $myCode );
The mere fact that this function can highlight its own source (which makes extensive use of escaped characters, comments in strings, etc.) without getting confused should be demonstration enough of its robustness.
class SyntaxHighlight {
    public static function process( $s ) {
        // ENT_COMPAT escapes double quotes only, so single-quoted strings
        // can still be matched by the expression below
        $s = htmlspecialchars( $s, ENT_COMPAT );

        // Workaround for escaped backslashes
        $s = str_replace( '\\\\','\\\\<e>', $s );

        $tokens = array(); // This array will be filled from the regexp-callback

        // Comments/Strings (double quotes are already escaped to &quot; here)
        $s = preg_replace_callback(
            '/(
                \/\*.*?\*\/|
                \/\/.*?\n|
                \#.*?\n|
                (?<!\\\)&quot;.*?(?<!\\\)&quot;|
                (?<!\\\)\'(.*?)(?<!\\\)\'
            )/isx',
            function( $m ) use ( &$tokens ) {
                return self::replaceId( $tokens, $m[1] );
            },
            $s
        );

        $regexp = array(
            // Numbers (also look for Hex)
            '/(?<!\w)(
                0x[\da-f]+|
                \d+
            )(?!\w)/ix'
            => '<span class="N">$1</span>',

            // Make the bold assumption that an all uppercase word has a
            // special meaning
            '/(?<!\w|>)(
                [A-Z_0-9]{2,}
            )(?!\w)/x'
            => '<span class="D">$1</span>',

            // Keywords
            '/(?<!\w|\$|\%|\@|>)(
                and|or|xor|for|do|while|foreach|as|return|die|exit|if|then|else|
                elseif|new|delete|try|throw|catch|finally|class|function|string|
                array|object|resource|var|bool|boolean|int|integer|float|double|
                real|string|array|global|const|static|public|private|protected|
                published|extends|switch|true|false|null|void|this|self|struct|
                char|signed|unsigned|short|long
            )(?!\w|=")/ix'
            => '<span class="K">$1</span>',

            // PHP/Perl-Style Vars: $var, %var, @var
            '/(?<!\w)(
                (\$|\%|\@)(\-&gt;|\w)+
            )(?!\w)/ix'
            => '<span class="V">$1</span>'
        );

        $s = preg_replace( array_keys($regexp), array_values($regexp), $s );

        // Paste the comments and strings back in again
        $s = str_replace( array_keys($tokens), array_values($tokens), $s );

        // Delete the "Escaped Backslash Workaround Token" (TM) and replace
        // tabs with four spaces.
        $s = str_replace( array( '<e>', "\t" ), array( '', '    ' ), $s );

        return '<pre>'.$s.'</pre>';
    }

    // Regexp-Callback to replace every comment or string with a uniqid and save
    // the matched text in an array
    // This way, strings and comments will be stripped out and won't be processed
    // by the other expressions searching for keywords etc.
    private static function replaceId( &$a, $match ) {
        $id = "##r".uniqid()."##";

        // String or Comment?
        if( $match[0] == '/' || $match[0] == '#' ) {
            $a[$id] = '<span class="C">'.$match.'</span>';
        }
        else {
            $a[$id] = '<span class="S">'.$match.'</span>';
        }
        return $id;
    }
}
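To see the placeholder idea behind replaceId() in isolation, here is a minimal sketch. It is not part of the class above: the function name highlightNumbersOutsideStrings() and the counter-based placeholder IDs are made up for this illustration, and it only knows about double-quoted strings and decimal numbers.

```php
<?php
// Hypothetical helper for illustration only; not part of SyntaxHighlight.
function highlightNumbersOutsideStrings( $s ) {
    // Escape &, < and > but leave quotes alone, so the string pattern
    // below can match them directly
    $s = htmlspecialchars( $s, ENT_NOQUOTES );

    $tokens = array();

    // Step 1: replace each double-quoted string (backslash escapes allowed)
    // with a placeholder and remember the highlighted version
    $s = preg_replace_callback(
        '/"(?:[^"\\\\]|\\\\.)*"/s',
        function( $m ) use ( &$tokens ) {
            $id = '##r' . count( $tokens ) . '##';
            $tokens[$id] = '<span class="S">' . $m[0] . '</span>';
            return $id;
        },
        $s
    );

    // Step 2: highlight numbers; the strings are out of reach by now
    $s = preg_replace( '/(?<!\w)(\d+)(?!\w)/', '<span class="N">$1</span>', $s );

    // Step 3: paste the stored strings back in
    return str_replace( array_keys( $tokens ), array_values( $tokens ), $s );
}

echo highlightNumbersOutsideStrings( 'x = 42; s = "route 66";' );
```

The 42 ends up in a number span, while the 66 stays untouched inside the string span, because the string was swapped out before the number expression ran. The sketch assumes the input never contains text that looks like a placeholder (such as ##r0##); the class above sidesteps that by using uniqid().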
You also need to define some CSS classes for each type of highlighted text. Here are the colors I used on this page:
pre {
    font-family: "Courier New", "Bitstream Vera Sans Mono", monospace;
    font-size: 9pt;
    border-top: 1px solid #333;
    border-bottom: 1px solid #333;
    padding: 0.4em;
    color: #fff;
}
pre span.N{ color:#f2c47f; } /* Numbers */
pre span.S{ color:#42ff00; } /* Strings */
pre span.C{ color:#838383; } /* Comments */
pre span.K{ color:#ff0078; } /* Keywords */
pre span.V{ color:#70d6ff; } /* Vars */
pre span.D{ color:#ff9a5d; } /* Defines */
7 Comments:
fantastic little bit of code. I completely agree with "totally overblown, produce horrible markup...". I was on the same path but got stuck with numbers showing up highlighted in comments and such. thanks.
I prefer softwaremaniacs.org/soft/highlight/
most excellent! thanks for putting it up :)
oh very nice, easiest walkthrough i've ever seen for syntax highlighting. thanks a lot
Very nice! That bit of comment-parsing regex is exactly what I was just working on. However, yours seems to do much better.
Thank you for sharing!
Hi,
I was wondering: Is this piece of code hosted somewhere? And what is the license for this code?
Sometimes, when I'm bored, I work on side projects. And since I work a lot with Drupal, I was thinking of maybe creating a module for this syntax highlighter. But if I want to do that, I need to know it's GPL-compatible :-)
Thanks in advance
Consider this code as under the MIT License.
A similar, JavaScript-only version is on GitHub: github.com/phoboslab/jQuery-JSH