VentureBeat: Wikia to launch new social search engine, more on Monday

  • Kevin Burton · 1 year ago
    The biggest issue with building a crawler is not the bandwidth; it's scaling and having fast access to local IO.

    With Spinn3r:

    http://spinn3r.com

    We have a distributed crawler but we run it within our own cluster because having 10k clients wouldn't really buy us anything.

    Of course maybe from Wikia's perspective this is just a blind HTTP fetch task and they then aggregate it locally within their cluster.

    The comparison to Wikipedia might fall down here. Wikipedia has about 1.5M English pages. The net has billions.

    You can't just rely on humans for this stuff.

    Kevin
  • Saumil Mehta · 1 year ago
    Actually, we should clarify "distributed search crawler" some more. I would assume that anyone who wants to crawl splits the crawl across n machines. I wasn't aware of your point about local IO being a bottleneck.

    I am unable to find the Grub source code or a whole lot of technical literature on the system but that is going to be corrected very soon, I'm told.
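
As an illustration of splitting a crawl across n machines, here is a minimal sketch that assigns URLs to machines by hashing the host name, so that all pages of one site land on the same machine. This is not Grub's or Spinn3r's method (neither is documented in this thread); the function names `shard_for_url` and `partition` are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def shard_for_url(url: str, num_machines: int) -> int:
    """Map a URL to a machine index by hashing its host name.

    Hypothetical helper: hashing the host (not the full URL) keeps every
    page of a site on one machine, which makes per-site politeness easier.
    """
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    return int.from_bytes(digest[:8], "big") % num_machines


def partition(urls, num_machines: int):
    """Split a list of URLs into num_machines lists, one per machine."""
    shards = [[] for _ in range(num_machines)]
    for url in urls:
        shards[shard_for_url(url, num_machines)].append(url)
    return shards
```

A stable hash such as SHA-1 is used instead of Python's built-in `hash()`, which is randomized per process for strings and so would give different assignments on different machines.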
  • TS · 1 year ago
    Actually, access to local I/O is usually NOT the main bottleneck in crawling, assuming of course that one does not write out each retrieved page in a separate random write; this depends on the crawler architecture that is used. Of course, once you crawl tens of thousands of URLs per second, almost everything becomes a bottleneck, but there are only a few dozen players that have a need for that kind of speed.

    The main bottlenecks in large-scale crawling (in this order) are probably crawl management (i.e., the human/software complexity side of managing and scaling a crawl to millions of hosts) and the bandwidth. The CPU power is not a major issue. Grub basically harvests bandwidth from clients.

    But I would be concerned about the crawl management part of the Grub approach: are they using a fairly brute-force approach to recrawling that wastes (other people's) bandwidth, as opposed to the smarter recrawling strategies used by the major engines? How do they deal, e.g., with requests by sites to immediately cease crawling a domain (due to possible or perceived misbehavior of the crawler or local problems at the site)? And it is not clear how Grub really fits into the whole Wikia approach.
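
To make the two crawl management questions above concrete, here is a minimal sketch of a scheduler that (a) honors a request to cease crawling a host immediately and (b) recrawls adaptively instead of by brute force, halving a page's interval when it changed and doubling it when it did not. This is one simple heuristic chosen for illustration, not a description of what Grub or the major engines do; the class name `RecrawlScheduler` and all its parameters are hypothetical.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


def _host(url: str) -> str:
    return (urlsplit(url).hostname or "").lower()


@dataclass
class RecrawlScheduler:
    """Hypothetical in-memory scheduler; times are plain seconds."""
    min_interval: float = 3600.0               # never recrawl more often than this
    max_interval: float = 30 * 24 * 3600.0     # never wait longer than this
    blocked_hosts: set = field(default_factory=set)
    intervals: dict = field(default_factory=dict)  # url -> current interval
    next_due: dict = field(default_factory=dict)   # url -> next fetch time

    def cease_crawling(self, host: str) -> None:
        """Stop crawling a host at once and drop its queued URLs.

        Matches the exact host name only (an assumption of this sketch).
        """
        host = host.lower()
        self.blocked_hosts.add(host)
        for url in [u for u in self.next_due if _host(u) == host]:
            del self.next_due[url]
            self.intervals.pop(url, None)

    def record_fetch(self, url: str, now: float, changed: bool) -> None:
        """Update the recrawl interval after a fetch completed at time now."""
        if _host(url) in self.blocked_hosts:
            return
        interval = self.intervals.get(url, self.min_interval)
        # Changed pages are revisited sooner, unchanged pages later.
        interval = interval / 2 if changed else interval * 2
        interval = max(self.min_interval, min(self.max_interval, interval))
        self.intervals[url] = interval
        self.next_due[url] = now + interval

    def due(self, now: float) -> list:
        """Return the URLs whose next fetch time has arrived, sorted."""
        return sorted(
            u for u, t in self.next_due.items()
            if t <= now and _host(u) not in self.blocked_hosts
        )
```

In a distributed setting the blocked-host set would also have to reach every client quickly, which is part of the management complexity raised above.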
  • Saumil Mehta · 1 year ago
    Well, I do understand how Grub fits into Wikia from a high-level standpoint. They need to get a substantial crawl going to build a real search engine and they don't have Microsoft's and Yahoo's millions to start from scratch.

    As to your other question, the grub website is fairly light on docs about how they actually do the crawl and doesn't have the source code posted for people to tinker around. I expect that to change relatively soon.
  • Chris · 1 year ago
    How many people search beyond the first 2-5 pages? What is this need to index everything?
  • Saumil Mehta · 1 year ago
    Chris,

    The need to index the entire web stems not from the number of results, but from the ability to serve an essentially infinite number of queries. For example, you can index 100 million pages and just serve queries related to Entertainment or Sports. But then you can't compete with Google anymore!