PHP Simple HTML DOM Parser

It’s always fun to obtain data from REST APIs and parse the XML or JSON response. Twitter, for sure, wouldn’t be what it is today if not for the thriving community of developers building applications that tie in with its API.

But what do you do when you need to obtain information from a site that doesn’t have an API, or even an RSS feed that you could dump into SimpleXML? You scrape the page. There are numerous ways to do that, such as fetching the page with file_get_contents() and passing the resulting HTML to Tidy (to convert everything to strict XHTML) before invoking SimpleXML.
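To make the Tidy route concrete, here is a minimal sketch of it. It assumes the tidy and SimpleXML extensions are installed; the helper name html_to_simplexml() and the sample markup are made up for illustration, and a real script would get its HTML from file_get_contents() instead of a literal string.

```php
<?php
// Sketch only: repair messy HTML with Tidy, then load the result into SimpleXML.
// html_to_simplexml() is a hypothetical helper, not part of any library.
function html_to_simplexml(string $html): ?SimpleXMLElement
{
    $config = [
        'output-xhtml'     => true, // emit well-formed XHTML
        'numeric-entities' => true, // turn &nbsp; into &#160;, which XML parsers accept
    ];
    $xhtml = tidy_repair_string($html, $config, 'utf8');
    if (!is_string($xhtml)) {
        return null;
    }

    libxml_use_internal_errors(true); // collect parse errors instead of printing warnings
    $xml = simplexml_load_string($xhtml);
    return $xml === false ? null : $xml;
}

// Deliberately sloppy sample markup (unclosed <p> tags, a named entity).
$html = '<html><head><title>Demo</title></head>'
      . '<body><p>Unclosed paragraph<p>Second&nbsp;one</body></html>';

$xml = html_to_simplexml($html);
if ($xml !== null) {
    // Tidy puts the document in the XHTML namespace, so XPath needs a prefix for it.
    $xml->registerXPathNamespace('x', 'http://www.w3.org/1999/xhtml');
    foreach ($xml->xpath('//x:p') as $p) {
        echo trim((string) $p), "\n";
    }
}
```

It works, but the namespace handling and entity juggling are exactly the kind of ceremony that makes a purpose-built parser attractive.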

One of the simplest options is S.C. Chen’s PHP Simple HTML DOM Parser. Once you include the PHP library, you gain access to a set of functions that lets you read and modify HTML content with jQuery-like selectors.

Here is an example of scraping Slashdot headlines:

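// Load the library (adjust the path to wherever simple_html_dom.php lives)
include 'simple_html_dom.php';

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks and print each headline on its own line
foreach ($html->find('div.article') as $article) {
    $title = $article->find('div.title', 0);
    if ($title) {
        echo $title->plaintext, "\n";
    }
}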

As usual, with great power comes great responsibility. There are certain ethical guidelines to data scraping: don’t steal articles for republication, use caching so you don’t make redundant requests to the target server, credit your source, and so on. If you do some Googling, you’ll probably find some relevant articles.
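The caching advice is easy to follow with a few lines of plain PHP. The sketch below is one possible approach, not something the parser library provides: cached_fetch(), its parameters, and the cache file layout are all hypothetical. The download step is passed in as a callable so that the demo can use a stand-in fetcher and make no network request.

```php
<?php
// Hypothetical helper: return the cached copy of $url if it is younger than
// $ttl seconds; otherwise call $fetch($url) and store the result on disk.
function cached_fetch(string $url, int $ttl, string $cacheDir, callable $fetch): string
{
    $path = $cacheDir . '/' . sha1($url) . '.html';

    if (is_file($path) && (time() - filemtime($path)) < $ttl) {
        return file_get_contents($path); // still fresh, skip the request
    }

    $body = $fetch($url);
    file_put_contents($path, $body, LOCK_EX);
    return $body;
}

// Stand-in fetcher that counts how often it is called.
$calls = 0;
$fakeFetch = function (string $url) use (&$calls): string {
    $calls++;
    return "<html><body>fetched $url</body></html>";
};

// Throwaway cache directory for the demo.
$dir = sys_get_temp_dir() . '/scrape_cache_' . uniqid();
mkdir($dir);

$first  = cached_fetch('http://example.com/', 3600, $dir, $fakeFetch);
$second = cached_fetch('http://example.com/', 3600, $dir, $fakeFetch);

echo $calls, "\n"; // the fetcher ran once; the second call was served from disk
```

In a real scraper you could pass 'file_get_contents' as the fetcher and hand the returned string to the library’s str_get_html() instead of calling file_get_html() on the URL each time.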