Generic Syntax Highlighting with Regular Expressions
Ever tried to display syntax-highlighted program code with PHP? There are some solutions, but they are either totally overblown, produce horrible markup, or need an external program.
I didn’t like any of these, so I wrote my own syntax highlighting function for PHP. It works well for a whole range of C-style languages, and can also be used for SQL and many others. Read on for some examples and the highlighting code itself.
Examples
A short example with a MySQL query
SELECT
    postId, created, keyword, title, teaser
FROM pn_blog_posts
WHERE
    keyword = 'hello-world' AND
    status >= 2
ORDER BY created DESC
And another one with a short C++ code snippet
Def * locate( string index="" ) {
    int start = 0, stop = 0;
    index = trim( index, "\t\n\r /" );
    if( index.empty() ) return this;
    // Descend into the tree
    Def * d = this;
    do {
        stop = index.find_first_of( "/", start );
        string name = index.substr( start, stop - start);
        start = stop + 1;
        d = d->children[name];
    } while( stop != string::npos && d );
    return d;
}
Source
Here’s the complete PHP source of the syntax highlighter. Just use it like this:
echo SyntaxHighlight::process( $myCode );
The mere fact that this function can highlight its own source (which makes extensive use of escaped characters, comments in strings, etc.) without getting confused should be demonstration enough of its robustness.
class SyntaxHighlight {
    public static function process( $s ) {
        // Escape <, > and &. ENT_NOQUOTES leaves quotes untouched, so the
        // string patterns below can still match them
        $s = htmlspecialchars( $s, ENT_NOQUOTES );

        // Workaround for escaped backslashes
        $s = str_replace( '\\\\','\\\\<e>', $s );

        // This array will be filled from the regexp-callback
        $tokens = array();

        // Comments/Strings
        // (uses preg_replace_callback, because the /e modifier no longer
        // exists in PHP 7 and later)
        $s = preg_replace_callback(
            '/(
                \/\*.*?\*\/|
                \/\/.*?\n|
                \#.*?\n|
                (?<!\\\)".*?(?<!\\\)"|
                (?<!\\\)\'(.*?)(?<!\\\)\'
            )/isx',
            function( $m ) use ( &$tokens ) {
                return self::replaceId( $tokens, $m[1] );
            },
            $s
        );

        $regexp = array(
            // Numbers (also look for Hex)
            '/(?<!\w)(
                0x[\da-f]+|
                \d+
            )(?!\w)/ix'
            => '<span class="N">$1</span>',

            // Make the bold assumption that an all uppercase word has a
            // special meaning
            '/(?<!\w|>)(
                [A-Z_0-9]{2,}
            )(?!\w)/x'
            => '<span class="D">$1</span>',

            // Keywords
            '/(?<!\w|\$|\%|\@|>)(
                and|or|xor|for|do|while|foreach|as|return|die|exit|if|then|else|
                elseif|new|delete|try|throw|catch|finally|class|function|string|
                array|object|resource|var|bool|boolean|int|integer|float|double|
                real|global|const|static|public|private|protected|
                published|extends|switch|true|false|null|void|this|self|struct|
                char|signed|unsigned|short|long
            )(?!\w|=")/ix'
            => '<span class="K">$1</span>',

            // PHP/Perl-Style Vars: $var, %var, @var
            // (-> has already been escaped to -&gt; by htmlspecialchars)
            '/(?<!\w)(
                (\$|\%|\@)(-&gt;|\w)+
            )(?!\w)/ix'
            => '<span class="V">$1</span>'
        );
        $s = preg_replace( array_keys($regexp), array_values($regexp), $s );

        // Paste the comments and strings back in again
        $s = str_replace( array_keys($tokens), array_values($tokens), $s );

        // Delete the "Escaped Backslash Workaround Token" (TM) and replace
        // tabs with four spaces.
        $s = str_replace( array( '<e>', "\t" ), array( '', '    ' ), $s );
        return '<pre>'.$s.'</pre>';
    }

    // Regexp-Callback to replace every comment or string with a uniqid and save
    // the matched text in an array
    // This way, strings and comments will be stripped out and won't be processed
    // by the other expressions searching for keywords etc.
    private static function replaceId( &$a, $match ) {
        $id = "##r".uniqid()."##";
        // String or Comment?
        if( $match[0] == '/' || $match[0] == '#' ) {
            $a[$id] = '<span class="C">'.$match.'</span>';
        } else {
            $a[$id] = '<span class="S">'.$match.'</span>';
        }
        return $id;
    }
}
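The placeholder trick behind replaceId() is easier to see in isolation. The following is only a minimal sketch, not part of the class: the function name highlightKeywords(), the @@n@@ placeholder format and the tiny keyword list are made up for illustration, and HTML escaping is left out to keep it short. It stashes double-quoted strings behind placeholders, highlights keywords in what is left, and then pastes the strings back in.

```php
<?php
// Hypothetical helper, for illustration only.
function highlightKeywords( $code, array $keywords ) {
    $stash = array();

    // 1. Swap every double-quoted string (escaped quotes included) for a
    //    placeholder and remember the highlighted original
    $code = preg_replace_callback(
        '/"(?:\\\\.|[^"\\\\])*"/s',
        function( $m ) use ( &$stash ) {
            $id = '@@' . count( $stash ) . '@@';
            $stash[$id] = '<span class="S">' . $m[0] . '</span>';
            return $id;
        },
        $code
    );

    // 2. Highlight keywords; the string contents are out of reach now
    $alternatives = implode( '|', array_map(
        function( $k ) { return preg_quote( $k, '/' ); },
        $keywords
    ) );
    $code = preg_replace(
        '/\b(' . $alternatives . ')\b/',
        '<span class="K">$1</span>',
        $code
    );

    // 3. Paste the stashed strings back in
    return str_replace( array_keys( $stash ), array_values( $stash ), $code );
}

$out = highlightKeywords( 'if( $x ) return "if only";', array( 'if', 'return' ) );
echo $out, "\n";
// <span class="K">if</span>( $x ) <span class="K">return</span> <span class="S">"if only"</span>;
```

Note that the "if" inside the string is left alone, which is the whole point of stashing strings and comments before any other pattern runs.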
You also need to define a CSS class for each type of highlighted text. Here are the colors I used on this page:
pre {
    font-family: "Courier New", "Bitstream Vera Sans Mono", monospace;
    font-size: 9pt;
    border-top: 1px solid #333;
    border-bottom: 1px solid #333;
    padding: 0.4em;
    color: #fff;
}
pre span.N{ color:#f2c47f; } /* Numbers */
pre span.S{ color:#42ff00; } /* Strings */
pre span.C{ color:#838383; } /* Comments */
pre span.K{ color:#ff0078; } /* Keywords */
pre span.V{ color:#70d6ff; } /* Vars */
pre span.D{ color:#ff9a5d; } /* Defines */
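To try these rules without touching a site stylesheet, the highlighted markup can be wrapped in a bare standalone page. This is just a sketch under a few assumptions: wrapInPage() is a hypothetical helper, the shortened CSS string stands in for the full rules above, and the dark body background is my own guess, added because the rules above set white text.

```php
<?php
// Hypothetical helper: wrap already highlighted markup in a minimal HTML page.
function wrapInPage( $highlighted, $css, $title = 'Highlighted source' ) {
    return "<!DOCTYPE html>\n"
        . "<html>\n<head>\n<meta charset=\"utf-8\">\n"
        . '<title>' . htmlspecialchars( $title ) . "</title>\n"
        . "<style>\n" . $css . "\n</style>\n"
        . "</head>\n<body>\n" . $highlighted . "\n</body>\n</html>\n";
}

// Assumption: a dark background, so the white default text stays readable
$css = 'body { background: #222; } '
     . 'pre { color: #fff; } '
     . 'pre span.K { color: #ff0078; }';

// Stand-in for the markup returned by the highlighter
$page = wrapInPage( '<pre><span class="K">return</span> 1;</pre>', $css );
echo $page;
```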