Website: http://venturebeat.com/
Original page: http://venturebeat.com/2008/01/04/wikia-to-launch-new-search-engine-jan-7/
With Spinn3r (http://spinn3r.com) we have a distributed crawler, but we run it within our own cluster because having 10k clients wouldn't really buy us anything.
Of course maybe from Wikia's perspective this is just a blind HTTP fetch task and they then aggregate it locally within their cluster.
The comparison to Wikipedia might fall down here. Wikipedia has about 1.5M English pages. The net has billions.
You can't just rely on humans for this stuff.
Kevin
I am unable to find the Grub source code or much technical literature on the system, but I'm told that is going to be corrected very soon.
The main bottlenecks in large-scale crawling (in this order) are probably crawl management (i.e., the human/software complexity of managing and scaling a crawl to millions of hosts) and bandwidth. CPU power is not a major issue. Grub basically harvests bandwidth from clients.
But I would be concerned about the crawl management part of the Grub approach: are they using a fairly brute-force approach to recrawling that wastes (other people's) bandwidth, as opposed to the smarter recrawling strategies used by the major engines? How do they deal, e.g., with requests by sites to immediately cease crawling a domain (due to possible or perceived misbehavior of the crawler or local problems at the site)? And it is not clear how Grub really fits into the whole Wikia approach.
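To make the recrawling concern concrete, here is a minimal sketch of one "smarter" strategy: back off on pages that have not changed, revisit changed pages sooner, and honor a per-host stop list. This is not how Grub or any major engine actually works; the class names, the halving/doubling rule, and the interval bounds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit

# Assumed bounds on the revisit interval, in seconds (illustrative only).
MIN_INTERVAL = 3600.0          # one hour
MAX_INTERVAL = 30 * 86400.0    # thirty days


@dataclass
class PageState:
    url: str
    interval: float = 86400.0  # current revisit interval, starts at one day
    next_due: float = 0.0      # timestamp at which the page is due again


@dataclass
class RecrawlScheduler:
    """Hypothetical scheduler: adaptive revisit intervals plus a host stop list."""
    pages: dict = field(default_factory=dict)
    blocked_hosts: set = field(default_factory=set)

    def add(self, url):
        self.pages.setdefault(url, PageState(url))

    def block_host(self, host):
        # A site asked us to stop: takes effect on the very next scheduling pass.
        self.blocked_hosts.add(host.lower())

    def due(self, now):
        """Return URLs that are due at time `now`, skipping blocked hosts."""
        result = []
        for page in self.pages.values():
            host = (urlsplit(page.url).hostname or "").lower()
            if host in self.blocked_hosts:
                continue
            if page.next_due <= now:
                result.append(page.url)
        return result

    def record_fetch(self, url, changed, now):
        """Halve the interval if the page changed, double it if it did not."""
        page = self.pages[url]
        if changed:
            page.interval = max(MIN_INTERVAL, page.interval / 2)
        else:
            page.interval = min(MAX_INTERVAL, page.interval * 2)
        page.next_due = now + page.interval
```

A brute-force crawler would refetch every page on a fixed cycle; with a rule like this, pages that never change drift toward the maximum interval and stop consuming volunteers' bandwidth. In a client-based setup the stop list would also have to reach tasks already handed out, which this sketch does not cover.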
As to your other question, the Grub website is fairly light on docs about how they actually do the crawl and doesn't have the source code posted for people to tinker with. I expect that to change relatively soon.
The need to index the entire web stems not from the number of results, but from the ability to serve an essentially infinite number of queries. For example, you can index 100 million pages and just serve queries related to Entertainment or Sports. But then you can't compete with Google anymore!