Parse a WordPress Plugin’s README.txt With Regular Expressions

I’ve been working on a neat enhancement for my Tweetable WordPress plugin. Already I have a handy “Documentation” link on the plugin’s pages in the WordPress admin. When clicked, it opens a ThickBox dialog pointing to the README.txt file.

Not bad, but it had a few rough edges. Raw markdown doesn’t look stellar, and loading a plain text file into the ThickBox made horizontal scrollbars appear. So I wrote a new script that loads the README.txt file and uses regular expressions to turn some of the more basic markdown syntax into good old HTML.

README.txt, parsed into HTML

As I write this, the changes haven’t been released to the public quite yet, as I have a few more things to finish up before putting out a new patch to the plugin, but they’re on their way.

How do you pull off something like this? It’s not too hard.

First, dump a basic HTML page wrapper into your new PHP file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Documentation</title>
</head>
<body>

</body>
</html>

Now, between the two body tags, we’ll put the beginnings of our script. We need to reference wp-load.php, so we can access a few WordPress-related functions later.

<?php
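// Build the path from this file's location; assumes the standard wp-content/plugins/your-plugin/ layout.
require_once(dirname(__FILE__) . '/../../../wp-load.php');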

Now it’s time to load the README.txt file. Once we dump the contents into a variable, we run them through a series of functions: wp_specialchars() to escape HTML special characters such as <, > and & (so any markup in the file is displayed as text instead of being interpreted by the browser), nl2br() to insert a <br /> tag before each newline (which keeps the text nice and readable, instead of a jumbled mess), and finally make_clickable() to turn any URLs into clickable links. In WordPress 2.8 and later, wp_specialchars() is deprecated in favor of esc_html().

$readme = file_get_contents('readme.txt');
$readme = make_clickable(nl2br(wp_specialchars($readme)));
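If you want to see what the escaping and line-break steps do without loading WordPress, here is a minimal sketch. It assumes PHP’s built-in htmlspecialchars() as a stand-in for wp_specialchars(), leaves out make_clickable() (which only exists inside WordPress), and uses a made-up sample string.

```php
<?php
// Made-up sample text standing in for the contents of the readme file.
$sample = "Use <b>bold</b> & enjoy\nSecond line";

// Assumption: htmlspecialchars() plays the role of wp_specialchars() here,
// so the snippet runs outside WordPress.
$escaped = htmlspecialchars($sample, ENT_QUOTES, 'UTF-8');

// nl2br() inserts a <br /> before each newline; the newline itself stays.
$html = nl2br($escaped);

echo $html;
```

The tags in the sample come out as visible text (&lt;b&gt; and so on), and the line break survives as a <br /> tag.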

With that out of the way, we move on to actually parsing some of the markdown formatting. Let’s start with turning backticks (`) into HTML <code> and </code> tags.

$readme = preg_replace('/`(.*?)`/', '<code>\\1</code>', $readme);

It may look a bit…strange, but that line does just as advertised. The / characters mark the start and end of the regular expression, and the middle part isn’t too hard to guess at. The backticks are the markdown formatting we see wrapping a section of code (e.g. `echo $this;`). The part between the backticks, enclosed by the parentheses, is a capture group that means “as few characters of any sort as possible (possibly none), up to the next backtick.” The dot doesn’t match newlines, so a code span has to sit on a single line. The second argument of preg_replace() is what we replace each match with: code tags wrapped around the content that was inside the backticks (represented as \\1, a reference to the first capture group).
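The question mark in (.*?) matters more than it looks. As a quick sketch (the sample sentence is made up), here is the same replacement with and without it on a line that contains two code spans:

```php
<?php
// Made-up sample line with two separate code spans.
$line = 'Run `foo()` and then `bar()` to finish.';

// Lazy: each match stops at the nearest closing backtick.
$lazy = preg_replace('/`(.*?)`/', '<code>\\1</code>', $line);

// Greedy: one match runs from the first backtick to the last one.
$greedy = preg_replace('/`(.*)`/', '<code>\\1</code>', $line);

echo $lazy, "\n";   // Run <code>foo()</code> and then <code>bar()</code> to finish.
echo $greedy, "\n"; // Run <code>foo()` and then `bar()</code> to finish.
```

The greedy version swallows everything between the first and last backtick on the line, which is why the lazy form is the one to use here.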

Now we do a similar thing for *italics* and **bold text**. It’s important to put the line for the boldface formatting before the one for the italics. Otherwise the italics pattern grabs the first two asterisks of every ** pair, and you’ll have some Unexpected Results happening.

$readme = preg_replace('/[\040]\*\*(.*?)\*\*/', ' <strong>\\1</strong>', $readme);
$readme = preg_replace('/[\040]\*(.*?)\*/', ' <em>\\1</em>', $readme);

This one looks like more of a mess, doesn’t it? That’s because we have to escape the asterisks with backslashes (i.e. \*), as the asterisk has meaning in a regular expression otherwise. The [\040], which represents a space character, is added so the expression will only match instances where the first asterisk has a space in front of it. This is mainly a safety feature, so no code snippets break anything…
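To make the ordering issue concrete, here is a small sketch that runs the same two patterns in both orders on a made-up sentence:

```php
<?php
// Made-up sample sentence with one bold and one italic phrase.
$line = 'This is **bold** and this is *italic* text.';

$bold   = '/[\040]\*\*(.*?)\*\*/';
$italic = '/[\040]\*(.*?)\*/';

// Bold first, then italics (the order used in the article).
$right = preg_replace($bold, ' <strong>\\1</strong>', $line);
$right = preg_replace($italic, ' <em>\\1</em>', $right);

// Italics first: the italics pattern matches the two leading asterisks of
// **bold** as an empty italic span, so the bold pattern finds nothing later.
$wrong = preg_replace($italic, ' <em>\\1</em>', $line);
$wrong = preg_replace($bold, ' <strong>\\1</strong>', $wrong);

echo $right, "\n"; // This is <strong>bold</strong> and this is <em>italic</em> text.
echo $wrong, "\n"; // This is <em></em>bold** and this is <em>italic</em> text.
```

Because of the leading-space requirement, neither pattern fires when the first asterisk sits at the very start of a line, which is a trade-off of the safety feature described above.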

Next we handle headings, which are marked up with one to three equals signs on either side of a line of text.

$readme = preg_replace('/=== (.*?) ===/', '<h2>\\1</h2>', $readme);
$readme = preg_replace('/== (.*?) ==/', '<h3>\\1</h3>', $readme);
$readme = preg_replace('/= (.*?) =/', '<h4>\\1</h4>', $readme);

Once again, the order of the lines matters: the === pattern has to run first, or the shorter patterns would match inside the longer markers.
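Here is a rough sketch of why the heading order matters. The convert_headings() helper and the sample text are hypothetical, made up for this illustration; the patterns are the three from the article.

```php
<?php
// Hypothetical helper (not part of the plugin): applies each
// pattern => replacement rule in the order given.
function convert_headings($text, array $rules) {
    foreach ($rules as $pattern => $replacement) {
        $text = preg_replace($pattern, $replacement, $text);
    }
    return $text;
}

// The three rules from the article, longest marker first.
$rules = array(
    '/=== (.*?) ===/' => '<h2>\\1</h2>',
    '/== (.*?) ==/'   => '<h3>\\1</h3>',
    '/= (.*?) =/'     => '<h4>\\1</h4>',
);

// Made-up sample headings.
$sample = "=== Tweetable ===\n== Description ==\n= Usage =";

$right = convert_headings($sample, $rules);
$wrong = convert_headings($sample, array_reverse($rules, true));

echo $right, "\n\n";
echo $wrong, "\n";
```

With the shortest pattern first, every heading becomes an <h4> with stray equals signs left around it. One more thing to watch for: the single-sign pattern will also match ordinary text such as "x = 1 and y = 2". If that bites you, anchoring the patterns to the start of a line (for example '/^= (.*?) =/m') is one possible fix, though that is my suggestion and not something the plugin does.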

Now all that needs to be done is to echo out the text and close our PHP block:

echo $readme;
?>

That wasn’t too hard, was it? It’s only the most basic markdown syntax that’s being parsed, but it’s light-years better than plain text.

  • http://gwynethllewelyn.net/ Gwyneth Llewelyn

    That’s so useful that I wonder if you didn’t do a plugin out of it! :) Really! One thing that pissed me off all the time is to have to deal with the manual formatting on my website every time the readme.txt changes, to keep the WordPress developers happy…

    A shortcode to do everything automatically would just be awesome; I imagine it could even retrieve the latest readme.txt from trunk and deal with that automagically…

  • http://gwynethllewelyn.net/ Gwyneth Llewelyn

    There is even a plugin that does this automatically: http://wordpress.org/extend/pl.....me-parser/

    That comes in handy for me, so I don’t need to manually edit my own plugin pages any longer :)