Regular Expression to Get All Image URLs From a Page

Matt — Thu, 30 Jul 2009 11:59:22 +0000

Have you ever wanted, while working on some sort of PHP project, to get an array listing all the images used in a chunk of HTML? I’ve been planning out a web app over the past couple of months, which will be doing a bit of RSS parsing, and I thought it would be nice to do just that when it came time to start coding. Suppose you were going to show a summary of an article from a feed, with a link to the original source. Wouldn’t it look better if you pulled an image from that article, scaled it down if necessary, and displayed it next to the summary? (Caching it, of course. Hotlinking == bad.)

I was reading an article from Cats Who Code and, lo and behold, there was a code snippet that did just that with a regular expression. (I decided to file it away to save time in the future.)

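// $data holds the chunk of HTML to scan.
$images = array();
// Grab every img= or src= attribute, up to (not including) the closing quote.
preg_match_all('/(img|src)=("|\')[^"\'>]+/i', $data, $media);
unset($data);
// Strip the attribute name and opening quote, leaving just the URL.
$data = preg_replace('/(img|src)("|\'|="|=\')(.*)/i', "$3", $media[0]);
foreach ($data as $url)
{
    $info = pathinfo($url);
    if (isset($info['extension']))
    {
        // Keep only URLs that end in a common image extension.
        if (($info['extension'] == 'jpg') ||
            ($info['extension'] == 'jpeg') ||
            ($info['extension'] == 'gif') ||
            ($info['extension'] == 'png'))
            array_push($images, $url);
    }
}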

Source: 15 PHP regular expressions for web developers
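
To try the snippet on a known input, it can be wrapped in a function. The sketch below is only one way to package it: the function name extract_image_urls() and the sample HTML are made up for illustration, and the extension check is folded into an in_array() call with strtolower() so that an upper-case .PNG is accepted too.

```php
<?php
// Hypothetical wrapper around the snippet above; the name is illustrative.
function extract_image_urls($html)
{
    $images = array();

    // Collect every img=/src= attribute, up to the closing quote.
    preg_match_all('/(img|src)=("|\')[^"\'>]+/i', $html, $media);

    // Drop the attribute name and opening quote, keeping only the URL.
    $urls = preg_replace('/(img|src)("|\'|="|=\')(.*)/i', '$3', $media[0]);

    $allowed = array('jpg', 'jpeg', 'gif', 'png');
    foreach ($urls as $url) {
        $info = pathinfo($url);
        if (isset($info['extension']) && in_array(strtolower($info['extension']), $allowed)) {
            $images[] = $url;
        }
    }
    return $images;
}

// Made-up sample markup: two images and one script tag.
$sample = '<p><img src="http://example.com/a.jpg" alt="x">'
        . "<img src='/img/b.PNG'>"
        . '<script src="app.js"></script></p>';

print_r(extract_image_urls($sample));
```

The script tag’s src is picked up by the first expression but dropped by the extension check, so only the two image URLs come back. A URL with a query string after the file name would also be dropped, since pathinfo() treats everything after the last dot as the extension.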

Feed parsing is an excellent use for this, as you have just the article, with none of the layout-related imagery you would pick up if you were screen-scraping a web page to obtain the image URLs. Though I imagine Digg takes the latter route when they dig up (Freudian pun unintended, honest) the thumbnails that go along with their news links.

Parse a WordPress Plugin’s README.txt With Regular Expressions

Matt — Wed, 08 Jul 2009 11:08:40 +0000

I’ve been working on a neat enhancement for my Tweetable WordPress plugin. Already I have a handy “Documentation” link on the plugin’s pages in the WordPress admin. When clicked, it opens a ThickBox dialog pointing to the README.txt file.

Not bad, but it had a few rough edges. Raw markdown doesn’t look stellar, and then there was the problem with the horizontal scrollbars that would appear from loading a plain text file into the ThickBox. So I made a new script that would load up the README.txt file and use Regular Expressions to parse some of the more basic markdown syntax into good old HTML.

As I write this, the changes haven’t been released to the public quite yet, as I have a few more things to finish up before putting out a new patch to the plugin, but they’re on their way.

How do you pull off something like this? It’s not too hard.

First, dump a basic HTML page wrapper into your new PHP file:

<html>
<head>
<title>Documentation</title>
</head>
<body>
</body>
</html>

Now, between the two body tags, we’ll put the beginnings of our script. We need to reference wp-load.php, so we can access a few WordPress-related functions later.
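
As a rough sketch of that step: rather than hard-coding a relative path, one option is to walk up the directory tree until wp-load.php turns up. This assumes the script lives somewhere beneath the WordPress root (as a plugin file normally would); the helper name find_wp_load() and its $max_levels parameter are hypothetical, not part of WordPress.

```php
<?php
// Hypothetical helper: climb from $start_dir towards the filesystem root
// and return the first wp-load.php found, or null if there is none.
function find_wp_load($start_dir, $max_levels = 6)
{
    $dir = $start_dir;
    for ($i = 0; $i <= $max_levels; $i++) {
        $candidate = $dir . '/wp-load.php';
        if (is_file($candidate)) {
            return $candidate;
        }
        $parent = dirname($dir);
        if ($parent === $dir) {
            break; // reached the top of the filesystem
        }
        $dir = $parent;
    }
    return null;
}

// Load WordPress only if the file was actually found.
$wp_load = find_wp_load(__DIR__);
if ($wp_load !== null) {
    require_once $wp_load;
}
```

If you know exactly where the plugin sits, a plain require_once with the relative path does the same job in one line.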


Now it’s time to load the README.txt file. Once we dump the contents into a variable, we run them through a series of functions: wp_specialchars() to escape HTML special characters (so any markup or PHP code in the file is displayed as text rather than interpreted), nl2br() to insert a <br /> tag before each newline character (which makes the text nice and readable, instead of a jumbled mess), and finally make_clickable() to turn any URLs into clickable links.
$readme = file_get_contents('readme.txt');
$readme = make_clickable(nl2br(wp_specialchars($readme)));
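
To see what the first two steps do without a WordPress install, here is a small stand-alone sketch. It swaps in PHP’s own htmlspecialchars() as a stand-in for wp_specialchars() (an assumption made for the demo, not the plugin’s code) and leaves out make_clickable(), which only exists inside WordPress. The sample text is made up.

```php
<?php
// Stand-ins for illustration only: htmlspecialchars() plays the role of
// wp_specialchars(); make_clickable() is WordPress-only and is left out.
$raw = "Use <b>bold</b> sparingly.\nSecond line.";

$escaped = htmlspecialchars($raw); // < and > become &lt; and &gt;
$html    = nl2br($escaped);        // "<br />" is inserted before each newline

echo $html, "\n";
```

The escaping has to happen first. If nl2br() ran before it, the <br /> tags it adds would be escaped as well and show up as literal text.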

With that out of the way, we move on to actually parsing some of the markdown formatting. Let’s start with turning backticks (`) into HTML <code> and </code> tags.
$readme = preg_replace('/`(.*?)`/', '<code>\\1</code>', $readme);

It may look a bit…strange, but that line does just as advertised. The / characters signify the start and end of a Regular Expression, and the middle part isn’t too hard to guess at. The backticks are the markdown formatting we see wrapping a section of code (e.g. `echo $this;`). The part between the backticks, enclosed by the parentheses, means “any run of characters, as short as possible”: the .* matches zero or more characters, and the ? makes it stop at the next backtick instead of the last one. The second argument of preg_replace() is the part we’ll be replacing the matches with, code tags with the content inside the backticks (represented as \\1) inside them.

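The difference the ? makes is easiest to see on a line with two code spans. This is a small stand-alone demonstration with a made-up input string; the greedy variant is shown only for comparison.

```php
<?php
$text = 'Run `foo` then `bar`.';

// Lazy: each backtick pair becomes its own <code> element.
$lazy = preg_replace('/`(.*?)`/', '<code>\\1</code>', $text);

// Greedy, for comparison: one match runs from the first backtick to the last.
$greedy = preg_replace('/`(.*)`/', '<code>\\1</code>', $text);

echo $lazy, "\n";   // Run <code>foo</code> then <code>bar</code>.
echo $greedy, "\n"; // Run <code>foo` then `bar</code>.
```
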
Now we do a similar thing for *italics* and **bold text**. It’s important to put the line for the boldface formatting before the one for the italics, otherwise you’ll have some Unexpected Results happening.
$readme = preg_replace('/[\040]\*\*(.*?)\*\*/', ' <strong>\\1</strong>', $readme);
$readme = preg_replace('/[\040]\*(.*?)\*/', ' <em>\\1</em>', $readme);

This one looks like more of a mess, doesn’t it? That’s because we have to escape the asterisks with backslashes (i.e. \*), as the asterisk otherwise has a special meaning in a regular expression. The [\040], which matches a space character (040 is its octal character code), is added so the expression will only match instances where the first asterisk has a space in front of it. This is mainly a safety feature, so that asterisks inside code snippets don’t break anything…

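Here is a quick way to see both the ordering issue and the space guard in action. The two function names and the sample line are invented for the demo, and the <strong> and <em> tags are an assumption about the intended output.

```php
<?php
// The <strong>/<em> tags are an assumption made for this example.
function bold_then_italic($s)
{
    $s = preg_replace('/[\040]\*\*(.*?)\*\*/', ' <strong>\\1</strong>', $s);
    return preg_replace('/[\040]\*(.*?)\*/', ' <em>\\1</em>', $s);
}

// Same two patterns, applied in the wrong order.
function italic_then_bold($s)
{
    $s = preg_replace('/[\040]\*(.*?)\*/', ' <em>\\1</em>', $s);
    return preg_replace('/[\040]\*\*(.*?)\*\*/', ' <strong>\\1</strong>', $s);
}

$line = 'This is **bold** and *italic* text, but 2*3*4 is left alone.';

echo bold_then_italic($line), "\n";
// This is <strong>bold</strong> and <em>italic</em> text, but 2*3*4 is left alone.

echo italic_then_bold($line), "\n";
// This is <em></em>bold** and <em>italic</em> text, but 2*3*4 is left alone.
```

With the italics pattern first, the opening ** is read as an empty pair of italics markers, which is where the stray <em></em> comes from. The 2*3*4 survives in both cases because none of its asterisks follows a space.
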
Next we handle headings, which are marked up as one to three equals signs on either side of a line of text.
$readme = preg_replace('/=== (.*?) ===/', '<h1>\\1</h1>', $readme);
$readme = preg_replace('/== (.*?) ==/', '<h2>\\1</h2>', $readme);
$readme = preg_replace('/= (.*?) =/', '<h3>\\1</h3>', $readme);

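Once again, the order of the lines matters: the three-sign pattern has to run first, because the shorter patterns would otherwise match inside the longer markers.
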
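A short check of the heading patterns, run in that order. The function name readme_headings() and the sample headings are made up, and the h1/h2/h3 levels are an assumption chosen for illustration.

```php
<?php
// Heading levels (h1/h2/h3) are assumed here for illustration.
function readme_headings($s)
{
    $s = preg_replace('/=== (.*?) ===/', '<h1>\\1</h1>', $s);
    $s = preg_replace('/== (.*?) ==/', '<h2>\\1</h2>', $s);
    return preg_replace('/= (.*?) =/', '<h3>\\1</h3>', $s);
}

echo readme_headings('=== Tweetable ==='), "\n";  // <h1>Tweetable</h1>
echo readme_headings('== Installation =='), "\n"; // <h2>Installation</h2>
echo readme_headings('= Upgrading ='), "\n";      // <h3>Upgrading</h3>

// Running the single-sign pattern first matches inside the longer markers.
$wrong = preg_replace('/= (.*?) =/', '<h3>\\1</h3>', '== Installation ==');
echo $wrong, "\n"; // =<h3>Installation</h3>=
```
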
Now all that needs to be done is to echo-out the text and close our PHP block:
echo $readme;
?>

That wasn’t too hard, was it? It’s only the most basic markdown syntax that’s being parsed, but it’s light-years better than plain text.

Webmaster-Source » Regular Expressions
