If you type a few letters into the search field over at the Internet Movie Database, you might notice how fast it is. That’s because the suggestions aren’t served dynamically from IMDb’s primary servers. Instead, IMDb serves the JSON data for search suggestions from a CDN, as pregenerated static files, resulting in a significant speed boost.
For example, if you visit this URL, you’ll get a JSON file of results for Harry Potter films:
The “h” directory means the query starts with an “h,” as they group their result sets alphabetically, and the “harry” part is what was typed into the search box. So if you wanted results that would match Doctor Who, you could use /d/doct.json. (Spaces are replaced with underscores.)
They only seem to have result sets for 4-5 character inputs, though. So you can query “ince” but not “inception.” The latter will result in an error. I guess most searches common enough to be matched in the suggestion box are covered within that limitation.
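The path scheme is easy to reproduce. Here’s a quick sketch that builds those static-file paths; note the base URL below is a placeholder, not IMDb’s actual CDN hostname, and the five-character clamp is just my guess based on the behavior described above:

```python
MAX_PREFIX = 5  # only ~4-5 character prefixes seem to have files

def suggestion_path(query: str) -> str:
    """Build an IMDb-style static suggestion path: lowercase the query,
    replace spaces with underscores, bucket by first letter."""
    q = query.lower().replace(" ", "_")[:MAX_PREFIX]
    return f"/{q[0]}/{q}.json"

# Hypothetical base URL for illustration only -- not the real CDN host.
BASE = "https://example-cdn.invalid/suggests"

def suggestion_url(query: str) -> str:
    return BASE + suggestion_path(query)
```

So `suggestion_path("harry")` gives `/h/harry.json`, matching the scheme above.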
It’s a clever implementation, and it has to save a lot of computing power on a site that large, in addition to being fast.
(Note that this is not a public API, and IMDb/Amazon probably wouldn’t be happy about you scraping it or anything like that. But it’s a nice thing to learn from.)
Google recently made some tweaks to their algorithm in order to penalize content farms, which create massive amounts of low-quality content tuned to rank well in Google. If you’ve ever run a search, looking for a solution to a problem, and found the SERP to be full of not-really-helpful results from places like eHow and Squidoo, you know what they’re trying to fix.
Unfortunately, Google’s changes have been affecting legit blogs. One noteworthy example is Cult of Mac, a blog that aims to provide “timely news, insightful analysis, helpful how-tos and honest product reviews about Apple and Apple-related products.”
Cult of Mac has experienced the opposite of Google’s goal: their content has largely disappeared from Google’s SERPs, while content farms and spam-blogs scraping Cult of Mac posts have been pushed to the top.
A lot of our traffic came from Google, which is why the changes are so serious. I’m already seeing a big drop-off in traffic. Over the weekend and today, the traffic is half what it normally would be.
I’m pissed because we’ve worked our asses off over the last two years to make this a successful site. Cult of Mac is an independently owned small business. We’re a startup. We have a small but talented team, and I’m the only full timer. We’re busting our chops to produce high-quality, original content on a shoestring budget.
Indeed, Cult of Mac does break a lot of stories. Together with Boy Genius Report, MacRumors, and 9to5Mac, they account for the lion’s share of Apple-related reporting. It’s strange that Google’s algorithm would red-flag them as a content farm. Perhaps it’s a result of “splogs” scraping their content; maybe a glitch in Google’s secret algorithm is causing one of the spam blogs to be marked as the original source.
Since Google rolled out Google Instant, there has been an explosion of search mashups created by enterprising developers. First there was YouTube Instant, for which the creator was offered a job at Google. Next came iTunes Instant, which lets you quickly search the iTunes store for music, netting the 15-year-old developer a 5% commission if you buy anything. (I was kicking myself for not thinking of that first…)
This past Wednesday, Google started rolling out the latest evolution of their search engine. “Google Instant,” as they call it, enables you to execute your searches significantly faster. It’s like the existing Suggest feature, only more so. The search results appear as you type, updating as you go.
This video demonstrates it well:
It takes a little time to get used to, but it actually does seem to be quicker. There isn’t a noticeable lag while results load, even on a 1 Mbps DSL line. The feature does disable itself on slower connections, though.
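Under the hood, “search as you type” interfaces generally debounce keystrokes, so that only the last query in a quick burst of typing actually triggers a search. This is not Google’s actual implementation, just a minimal sketch of the pattern using Python’s asyncio:

```python
import asyncio

class InstantSearch:
    """Debounce keystrokes: run the search only after a short pause."""

    def __init__(self, search_fn, delay=0.15):
        self.search_fn = search_fn  # called with the final query string
        self.delay = delay          # seconds to wait for more typing
        self._pending = None

    def on_keystroke(self, query):
        # Cancel any search still waiting from an earlier keystroke.
        if self._pending is not None:
            self._pending.cancel()
        self._pending = asyncio.create_task(self._fire(query))
        return self._pending

    async def _fire(self, query):
        await asyncio.sleep(self.delay)  # wait out the typing burst
        self.search_fn(query)
```

Typing “g”, “go”, “goo” in quick succession fires a single search, for “goo”.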
Microsoft recently launched their new search engine “Bing,” in an attempt to compete in the arena that Google has pretty much already won. There was a bit of talk leading up to the relatively quiet launch, which promptly disappointed me.
Yahoo is currently readying the next major update to its BOSS Search API. With it they will bring access to SearchMonkey data, optional longer abstracts, and greater flexibility for monetization. They will also be tracking API usage, and charging nominally for daily usage greater than 10,000 queries.
Since launch, the BOSS API has been provided entirely for free. Now Yahoo is putting in place a freemium model where it’ll be free only for developers who generate fewer than 10,000 queries per day. After that, a tiered pricing model will kick in that charges for BOSS as if it were a utility (think AWS). Rates will vary depending on the type of query (web result vs. spelling correction, for example), how many results the developer wants returned per query (with a new maximum of 1000 results), and just how far the developer goes over the free queries cap. (Source: TechCrunch)
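To make the utility-style model concrete, here’s a toy cost calculator. The 10,000-queries-per-day free tier comes from the announcement, but the per-query rates are purely hypothetical placeholders, since Yahoo hasn’t published actual prices here:

```python
FREE_QUERIES_PER_DAY = 10_000  # free tier from the BOSS announcement

# Hypothetical per-query rates by query type -- NOT Yahoo's actual prices.
HYPOTHETICAL_RATES = {
    "web": 0.0005,       # dollars per web-result query
    "spelling": 0.0001,  # auxiliary queries billed cheaper
}

def daily_cost(queries_by_type):
    """Charge only for usage beyond the free daily cap, at a rate
    blended according to the day's mix of query types."""
    total = sum(queries_by_type.values())
    overage = max(0, total - FREE_QUERIES_PER_DAY)
    if overage == 0:
        return 0.0
    blended = sum(HYPOTHETICAL_RATES[t] * n
                  for t, n in queries_by_type.items()) / total
    return overage * blended
```

With these made-up rates, a day of 20,000 web queries would bill 10,000 of them, about five dollars.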
It seems reasonable, and Yahoo certainly is improving the service. Ten thousand queries per day is a pretty fair ceiling in my opinion. It’s plenty for experimentation or development, or even a web app of reasonably small size.
If Yahoo can leverage BOSS to save themselves from possible bankruptcy, and still have a unique and powerful service, that’s definitely fine by me.
Now pay attention, Google: you dumped your API back in 2006, and it’s going to come back to haunt you unless you follow Yahoo’s lead. After a long drought, Yahoo is getting their game back on, and they’re out to win developers over with all of their tools.
Remember Yahoo’s BOSS Search API? Well, they’re not done with it yet. A new expansion to the service has just launched, though it is currently available only to “certain Yahoo! partners,” with a promise that it will eventually become part of the main BOSS API.
The new service is called “Vertical Lens Technology.” It allows topical search engines to be created. These search engines, as seen on TechCrunch, the only site to use the service so far, can:

- Index the partner site in real-time. Whenever a post or comment is added, it’s sent to Yahoo to be instantly indexed.
- Allow ranking tweaks. When you search on TechCrunch, you are shown a carefully adjusted mix of web and Crunch Network results, ordered in a way that hopefully fits their audience.
Since we publish using WordPress, supplying our data using the first API essentially required that we design and deploy a plugin that would send information to Yahoo’s servers every time there was a new post or comment on any of our blogs. We also needed to create a similar data indexing system for CrunchBase so that contributions there would show up in the results as well. To ensure that all of our archived content was incorporated in the search index, we supplied Yahoo with historical data dumps from all 10 sites. Perhaps needless to say, this took a considerable amount of time just to ensure that the data we indexed at Yahoo was accurate and complete.
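The “ping Yahoo on every new post” flow TechCrunch describes could be sketched roughly like this. The endpoint and payload shape below are invented for illustration; the real integration was private to Yahoo’s partners:

```python
import json
import urllib.request

# Hypothetical indexing endpoint -- the real partner API was not public.
INDEX_ENDPOINT = "https://example.invalid/boss/index"

def build_index_payload(post_id, title, body, url):
    """Package a new post or comment for real-time indexing."""
    return {
        "id": post_id,
        "title": title,
        "content": body,
        "url": url,
    }

def ping_indexer(payload):
    """POST the document to the indexer as JSON; a WordPress plugin
    would call this from its publish/comment hooks."""
    req = urllib.request.Request(
        INDEX_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)  # fails here: endpoint is fake
```

The interesting part is simply that indexing is push-based, triggered by the publisher, rather than waiting for a crawler.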
We always hear about how Google doesn’t like duplicate content, and will penalize a page that has the same content as another. There are plenty of articles on optimizing sites to avoid having duplicate content internally, and articles ranting about scrapers.
What I want to know is what Google thinks about duplicate content cases such as Reference.com or the Associated Press.
Head over to Reference.com, the encyclopedia branch of the Ask.com network of reference sites. Enter a search term. Now go over to Wikipedia and enter the same search term. They’re the same! Reference.com is pulling Wikipedia articles onto their site and throwing in a few ads. (How are they doing this? Does Wikipedia have some sort of API?) What does Google think of this?
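(For what it’s worth, Wikipedia’s content is freely licensed, and it can be pulled programmatically through the public MediaWiki API, or via full database dumps, so a mirror like this is easy to build. A sketch of the kind of request involved, just constructing the standard `action=query` URL rather than actually fetching it:)

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"  # public MediaWiki API endpoint

def article_request_url(title):
    """Build a MediaWiki API URL that fetches an article's wikitext."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)
```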
“Live Search” is a term that people started using somewhere along the line to refer to AJAX-y search forms that display results as you type, rather than taking you to a results page. Kind of like Apple’s Spotlight search in OS X, which I have to say works great.
Wouldn’t it be cool to have something like that on your blog? A search form that, as you type, displays the results in a dropdown instead of a results page? That’s where Sikbox comes in.
Sikbox is a Yahoo BOSS app that lets you create a cut-and-paste live search solution. It’s free, easy to install, and you can even apply your own CSS styles to it (or pick from one of the pre-made themes). You can search the entire web with it, or limit it to your own website, as most of us would probably do.
Here’s a screenshot of it in action:
If you click on one of the results, you jump right to it. It works pretty well, and it seems to work a lot better than the default WordPress search too (provided your blog is in Yahoo’s index).
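If you’d rather roll your own than embed Sikbox, the server side of a live-search box boils down to returning a few JSON-encoded matches per keystroke for the dropdown to render. A minimal sketch, with made-up post titles standing in for a real search index:

```python
import json

# Toy index of post titles -- stand-ins for a real blog's search index.
POSTS = [
    "Live Search With Sikbox",
    "Yahoo BOSS Pricing Explained",
    "Google Instant First Impressions",
]

def live_search(query, limit=5):
    """Return up to `limit` post titles containing the query,
    case-insensitively, as a JSON string for the dropdown."""
    q = query.lower()
    matches = [title for title in POSTS if q in title.lower()]
    return json.dumps(matches[:limit])
```

A real version would query the blog’s database (or an index like Yahoo’s), but the request/response shape is the same.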