The Anatomy of Search Technology: Crawling using Combinators

This is the second guest post (part 1) of a series by Greg Lindahl, CTO of blekko, the spam-free search engine. Previously, Greg was Founder and Distinguished Engineer at PathScale, where he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters.

What’s so hard about crawling the web?

Web crawlers have been around as long as the Web has, and before the Web there were crawlers for Gopher and FTP. You would think that 25 years of experience would render crawling a solved problem, but the vast growth of the Web and new inventions in webspam and other unsavory content result in a constant supply of new challenges. The general difficulty of tightly-coupled parallel programming also rears its head, as the Web has scaled from millions to hundreds of billions of pages.

Existing Open-Source Crawlers and Crawls



Read the full article on HighScalability.com: The Anatomy of Search Technology: Crawling using Combinators

