Tag Archives: Regular Expressions

Regular Expression to Get All Image URLs From a Page

Have you ever needed, while working on a PHP project, to get an array of all the images used in a chunk of HTML? I’ve been planning a web app over the past couple of months that will do a bit of RSS parsing, and I thought it would be nice to do just that when it came time to start coding. Suppose you were going to show a summary of an article from a feed, with a link to the original source. Wouldn’t it look better if you pulled an image from that article, scaled it down if necessary, and displayed it next to the summary? (Caching it, of course. Hotlinking == bad.)

I was reading an article from Cats Who Code and, lo and behold, there was a code snippet that did just that with a regular expression. (I decided to file it away to save time in the future.)

$images = array();
// Grab every img=/src= attribute along with its opening quote.
preg_match_all('/(img|src)=("|\')[^"\'>]+/i', $data, $media);
unset($data);
// Strip the attribute name and the quote, leaving just the URL.
$data = preg_replace('/(img|src)("|\'|="|=\')(.*)/i', "$3", $media[0]);
foreach ($data as $url)
{
    $info = pathinfo($url);
    if (isset($info['extension']))
    {
        // Keep only URLs that end in a common image extension.
        if (($info['extension'] == 'jpg') ||
            ($info['extension'] == 'jpeg') ||
            ($info['extension'] == 'gif') ||
            ($info['extension'] == 'png'))
            array_push($images, $url);
    }
}

Source: 15 PHP regular expressions for web developers
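
If you want to reuse the idea, here is a rough sketch of the snippet above wrapped in a function. This is my own variation, not the Cats Who Code version: the function name extract_image_urls() and the sample HTML are made up for illustration, it only looks at src attributes, and it assumes you also want to match upper-case extensions and URLs that carry a query string.

```php
<?php
// Hypothetical helper (not from the original article): returns the image
// URLs found in the src attributes of a chunk of HTML.
function extract_image_urls($html)
{
    $allowed = array('jpg', 'jpeg', 'gif', 'png');
    $images  = array();

    // Capture the value of every src attribute, in single or double quotes.
    preg_match_all('/\bsrc=("|\')([^"\'>]+)/i', $html, $matches);

    foreach ($matches[2] as $url) {
        // Look only at the path so a query string does not hide the extension.
        $path = parse_url($url, PHP_URL_PATH);
        if (!is_string($path)) {
            continue;
        }
        $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if (in_array($extension, $allowed, true)) {
            $images[] = $url;
        }
    }

    return $images;
}

// Made-up sample input: two images and one script tag.
$html = '<p><img src="http://example.com/a.JPG?w=200" alt=""> '
      . "<img src='/img/b.png'> <script src=\"app.js\"></script></p>";

print_r(extract_image_urls($html));
```

With the sample input, the two image URLs come back and the script's src is dropped, since js is not in the allowed list. Relative URLs such as /img/b.png are returned as they appear, so you would still need to resolve them against the feed's base URL before fetching anything.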

Feed parsing is an excellent use for this, as you have just the article, with none of the layout-related imagery you would see if you were screen-scraping a web page to obtain the image URLs. I imagine Digg takes the latter route, though, when they dig up (Freudian pun unintended, honest) the thumbnails that go along with their news links.

Parse a WordPress Plugin’s README.txt With Regular Expressions

I’ve been working on a neat enhancement for my Tweetable WordPress plugin. Already I have a handy “Documentation” link on the plugin’s pages in the WordPress admin. When clicked, it opens a ThickBox dialog pointing to the README.txt file.

Not bad, but it had a few rough edges. Raw Markdown doesn’t look stellar, and then there was the problem with the horizontal scrollbars that would appear from loading a plain text file into the ThickBox. So I made a new script that would load up the README.txt file and use regular expressions to parse some of the more basic Markdown syntax into good old HTML.
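
To give a flavor of the approach, here is a minimal sketch of that kind of conversion. It is not the actual Tweetable script: the function name readme_to_html(), the sample text, and the choice of h2/h3/h4 for the heading levels are all assumptions made for illustration. It handles only the WordPress readme headings (=== Title ===, == Section ==, = Subsection =) plus bold, italic, inline code and links, and it does not wrap paragraphs or build lists.

```php
<?php
// Hypothetical sketch: convert a few basic readme/Markdown constructs to HTML.
function readme_to_html($text)
{
    // Escape first so any raw HTML in the README is displayed, not rendered.
    $html = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Headings, longest marker first so "===" is not mistaken for "==".
    $html = preg_replace('/^===[ \t]*(.+?)[ \t]*===[ \t]*$/m', '<h2>$1</h2>', $html);
    $html = preg_replace('/^==[ \t]*(.+?)[ \t]*==[ \t]*$/m', '<h3>$1</h3>', $html);
    $html = preg_replace('/^=[ \t]*(.+?)[ \t]*=[ \t]*$/m', '<h4>$1</h4>', $html);

    // Inline markup: **bold** before *italic*, then `code` and [text](url).
    $html = preg_replace('/\*\*(.+?)\*\*/', '<strong>$1</strong>', $html);
    // The opening * must be followed by a non-space, so "* list item" is left alone.
    $html = preg_replace('/\*(\S(?:.*?\S)?)\*/', '<em>$1</em>', $html);
    $html = preg_replace('/`(.+?)`/', '<code>$1</code>', $html);
    $html = preg_replace('/\[([^\]]+)\]\(([^)\s]+)\)/', '<a href="$2">$1</a>', $html);

    return $html;
}

// Made-up sample README text.
$readme = "=== Tweetable ===\n\n== Description ==\n\n"
        . "A **Twitter** plugin with *neat* features. "
        . "See [the docs](http://example.com/docs).\n";

echo readme_to_html($readme);
```

The order of the replacements matters: escaping comes first so the tags added afterwards survive, and the longer markers (=== and **) are handled before the shorter ones that would otherwise match inside them.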

README.txt, parsed into HTML
