Monthly Archives: May 2010

PHP Simple HTML DOM Parser

It’s always fun to obtain data from REST APIs and parse the XML or JSON response. Twitter, for sure, wouldn’t be what it is today if not for the thriving community of developers building applications that tie in with the API.

But what do you do when you need to obtain information from a site that doesn’t have an API, or at least an RSS feed that you could dump into SimpleXML? You scrape the page. There are numerous ways to do that, such as fetching the page with file_get_contents() and passing the resulting HTML to Tidy (to convert everything to strict XHTML) before invoking SimpleXML.
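As a rough sketch of that Tidy-then-SimpleXML route, something like the following could work. It assumes PHP's tidy and SimpleXML extensions are installed; html_to_simplexml() is a made-up helper name, and the inline sample HTML stands in for whatever file_get_contents() would return from a real page.

```php
<?php
// Sketch only: requires PHP's tidy and SimpleXML extensions.

/**
 * Repair messy HTML into XHTML with Tidy, then parse it with SimpleXML.
 * html_to_simplexml() is a hypothetical helper, not a library function.
 */
function html_to_simplexml(string $html): SimpleXMLElement
{
    $xhtml = tidy_repair_string($html, [
        'output-xhtml'     => true, // emit well-formed XHTML
        'numeric-entities' => true, // named entities become numeric, which XML accepts
    ], 'utf8');

    $xml = simplexml_load_string($xhtml);
    if ($xml === false) {
        throw new RuntimeException('Tidy output could not be parsed as XML');
    }

    // Tidy places elements in the XHTML namespace, so XPath needs a prefix.
    $xml->registerXPathNamespace('x', 'http://www.w3.org/1999/xhtml');
    return $xml;
}

// Inline sample; in practice $html would come from file_get_contents($url).
$html = '<h2>First headline</h2><p>Unclosed paragraph';
$xml  = html_to_simplexml($html);

foreach ($xml->xpath('//x:h2') as $heading) {
    echo trim((string) $heading), "\n";
}
```

The namespace registration is the step that is easiest to forget: without it, an XPath query such as //h2 matches nothing in Tidy's XHTML output.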

One of the simplest options is S.C. Chen’s PHP Simple HTML DOM Parser. Once you include the PHP library, you gain access to a set of functions that lets you read and modify HTML content with jQuery-like selectors.

Here is an example of scraping Slashdot headlines:

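// Load the library (adjust the path to wherever simple_html_dom.php lives)
include 'simple_html_dom.php';

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks
foreach ($html->find('div.article') as $article) {
    // find() with an index returns null when nothing matches
    $title = $article->find('div.title', 0);
    if ($title) {
        echo $title->plaintext . "\n";
    }
}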

As usual, with great power comes great responsibility. There are certain ethical guidelines for data scraping: don’t steal articles for republication, use caching so you don’t make redundant requests to the target server, credit your source, etc. If you do some Googling, you’ll probably find some relevant articles.
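The caching advice could be followed with a small file-based cache along these lines. This is only a sketch: cached_fetch(), its parameters, the one-hour default lifetime, and the cache directory are all assumptions for illustration, and the demo uses a stub fetcher so it runs without touching the network.

```php
<?php
/**
 * Return the body of $url, downloading it again only when the cached copy
 * is older than $ttl seconds. cached_fetch() and its parameters are
 * hypothetical names for this sketch.
 */
function cached_fetch(string $url, string $cacheDir, int $ttl = 3600, ?callable $fetch = null): string
{
    $fetch = $fetch ?? 'file_get_contents';
    $path  = rtrim($cacheDir, '/') . '/' . md5($url) . '.html';

    // Serve the cached copy while it is still fresh.
    if (is_file($path) && time() - filemtime($path) < $ttl) {
        return file_get_contents($path);
    }

    $body = $fetch($url);
    if ($body === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    file_put_contents($path, $body, LOCK_EX);
    return $body;
}

// Demo with a stub fetcher that counts how often it is called.
$calls = 0;
$stub  = function (string $url) use (&$calls) {
    $calls++;
    return '<html><body>stub page</body></html>';
};
$dir = sys_get_temp_dir() . '/scrape-cache-' . uniqid();
mkdir($dir);

$first  = cached_fetch('http://example.com/', $dir, 3600, $stub);
$second = cached_fetch('http://example.com/', $dir, 3600, $stub); // served from cache
echo "Fetches made: $calls\n";
```

With the parser loaded, the cached body could then be handed to the library's str_get_html() instead of calling file_get_html() on every request.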

BlogBuzz May 29, 2010

14 Mac Applications I Use Every Day

I made the switch to Mac OS X a little over two years ago when I bought my first MacBook (which is still working fine as my main computer, I might add). I find that my workflow has improved, and I’m more efficient in…

Facebook Privacy Scanning Bookmarklet

Unless you’ve been living under an internet-devoid rock, you have probably noticed the recent uproar over Facebook “privacy.” The social media giant made some changes, with various confusing privacy implications that have everyone panicking. By default applications can access some personal details (that you…

Apple, Amazon, Napster, Netflix Sued Over Online Music Distribution Patent

Ars Technica is reporting on a new development in the tech patent lawsuit war. Apple, along with Amazon, Netflix, Napster, Microsoft, Rhapsody and a few others, is being sued by Sharing Sound LLC over their infringement of the patent “Distribution of musical products by…

CloudApp: Share Files Fast

CloudApp is a nifty application that lets you quickly and easily share files. All you have to do is drag a file onto the menu bar. Once the file is uploaded to the “Cloud,” a short URL to the file is automatically copied to…

BlogBuzz May 22, 2010

Coming in WordPress 3.0: Custom Editor Stylesheets

I generally prefer to write my posts with WordPress’s visual editor, as it gives me a better idea of what the post will look like as I write it than the HTML view does. But it still looks different than the final post will…

Google Font API

Google is taking on projects like sIFR and Cufón with their new Google Font API. A simple line of JavaScript lets you load a font family from their directory of open source fonts, allowing you to safely reference it within your CSS. Here is…

A Standard to Specify a Canonical Short Link

There has been a small push to create a standard way for a web page to specify a preferred short link for use in places like Twitter. Something like the rel="canonical" trick that tells search engines which page on your domain is the one…
