
October 6, 2003

Nutch: Are some social secrets necessary?


Posted by Clay Shirky

I just checked in on Nutch, the attempt at an open source search engine. There's not much visible movement on the project, or at least not much that can be gleaned by browsing the SourceForge project page. Though projects go inactive for all sorts of reasons, I have always wondered if Nutch was doomed from the start. After seeing an unbelievable /. thread where a proud father was looking for advice on skewing Google searches to improve the PageRank of his baby pictures (no, really), I wondered if this might be a place where the 'special sauce' that drives search ranking might have to be closed, in order to prevent exactly this sort of social gaming. People only complain about Google searches when their placement falls, but the fact that they have no recourse may be an advantage. The incentive to manipulate search rankings is very high, and if the algorithm behind such searching were published, I wonder if publication would be self-defeating. It may be that to manage a world of selfish actors, there needs to be the online equivalent of the Muslim family that keeps the key to the Church of the Holy Sepulchre in Jerusalem, since none of the Christian sects can be trusted not to lock one another out. Do we need someone to keep the key to search away from the people who have the most interest in abusing it?

Comments (4) + TrackBacks (0) | Category: social software


1. Abe on October 6, 2003 10:40 PM writes...

Is it possible that the reverse could be true? That keeping it secret actually increases the incentive to "game" the system?

One thing seems certain: keeping it secret increases the value of discovering some of the mechanisms. The lucky or diligent few that can uncover a trick that boosts their results have a valuable piece of info that gives them an advantage over those who do not know. When the info is kept secret, it is pretty much a sure thing that people will try to dig it out.

Now is it any better when it's open? I'm not sure. My guess is that the effort to game the system broadens, but also flattens. It becomes harder to get enough of an advantage to really make it worth the time to seriously work the system. But a whole lot of people will deploy mild gaming tactics, which might lead to a bit of an elitist bifurcation. Those with enough knowledge will get better search placement, while those without it get pushed down.

Ultimately I'm not sure it makes a huge difference; a really good algorithm will route around most hack attempts anyway.


2. Francis Hwang on October 6, 2003 11:56 PM writes...

I'm reminded of cryptography: New algorithms in the field aren't considered valid until they've been publicly analyzed and deconstructed by competitors. It's not a completely analogous situation, of course, since secure communication between two agents is a much simpler problem in many ways than search relevance. Search relevance is, like most issues of value to human beings, really bloody hard to define in a way that machines can make use of.

It would be fascinating to see if anybody on this planet is smart enough to come up with a relevance algorithm that resists attempts to hack it even if the algorithm itself were completely open. But this problem may not gain much from an open-source model.

Some problems are more amenable to the open source process than others. Most programming is about incremental improvement: A good word processor is built on 1000 small features organized in such a way that they complement one another. But a small portion of programming is based on the sort of algorithmic inspiration that cannot be easily broken up into small pieces and solved by committee: artificial intelligence, 3d game engines, collaborative filtering. If somebody comes up with this holy grail of a relevance engine, she will do so largely on her own, regardless of how many open-source programmers sign up to actually implement it.

The pessimistic way to spin this is to say that this problem requires a particular genius, and whether or not that genius signs up to the open source movement or takes a job at Google is anybody's guess.


3. Doug Cutting on October 7, 2003 2:10 PM writes...

These are valid musings, but they only serve to point out what we don't know. Can Nutch work? We'll only find out if we try. Wouldn't it be great if it did work? And doesn't that possibility warrant the effort? I (obviously) think so.


4. Dez Blanchfield on January 22, 2004 9:43 PM writes...

As one who has just "joined" the Nutch effort, I would like to comment that as of Jan 23rd 2004 the Nutch project has come a long way. I've easily implemented a quick trial index of around 26 million *.au ( Australia ) URLs, which I simply extracted from my own work of almost ten years ( http://WebSearch.COM.AU ). I was blown away when, with a few days' worth of effort and a few quick hacks to fix some bugs, I managed to replicate ten years of personal development by running up Nutch, simply "inject"ing ( Nutch speak ) my current active catalogue of Australian domain name space URLs, and kicking off a crawl ( fetch ). Around two days later I'd built an almost-mirror of http://WebSearch.COM.AU from this incredible open source effort! Wow!

When I sent a "how can I help!?" email, the response was quick, personal and very positive.

This is in contrast to the naturally secretive nature of commercial efforts like Google ( albeit they continue to act the lean, mean searchin' machine from their early days of college, some Lego hard disk enclosures and some spare kit lying around the computer room - peh! ).

I am now running flat out to build a second index of around 200+ million URLs ( that's as big as I've estimated my current development box will let me go before I have to buy another couple of terabytes of disk ).

Your comments about social secrets being "best kept" and "necessary" I don't get, to be honest. Certainly there will always be issues like national security, for all nations, that must be kept secret at the time of concern. But the Nazi codes of WWII were eventually cracked and opened, and once the threat was gone, Enigma for example posed no further threat; the bunch of crossword gods who cracked them at Bletchley Park all went home, wrote books and got on with their lives ( if such a thing is possible after what they did ).

But with data being king ( well, information that is ), the key to the value of that data for the great unwashed masses has to be their ability to a) get access to it, b) find it, and surely most importantly in the end c) search it - to locate what they want accurately and in a timely, affordable fashion!

Take the White Pages and Yellow Pages directories: once a source of power, control, and revenue for the phone companies, they are now just another pile of junk paper, and are used through the net. But the point is that everyone had access to them, and the public face of them was, and still is, generally a copy literally hanging from the wall of your nearest corner phone booth!

If you take the opposite side of the coin and consider public documents buried in the basements of public office buildings run by our respective governments, who among us has ever looked at them, unless through absolute necessity such as land titles ( and even then we usually pay search companies or lawyers to go and find them )? I wonder what is buried there in all that data; what deep, dark hidden secrets must lie within?

How wonderful if you could, as we do now, jump on the likes of AllTheWeb.COM and search for curious and interesting things like the history of, say, "crown chain and land titles", and have it dig out the jewels of information that are currently hidden, inadvertently, in hardcopy documents around the world.

Recently I did some research, and was astounded to find approx 850,000 sites that lay some claim, albeit some were rather lame or half baked, to being an internet search engine, or internet directory!

850,000 web sites supposedly claiming to be search engines? Isn't that wild!

Some of the sites I did review were very valid, high-quality "niche" market search portals, but on the whole 99% were junk: also-rans trying to represent themselves as mini Googles or mini Yahoos, when all they were in effect was kids running the likes of Chatalogica's Links, Gossamer Threads, HTdig, ASPseek, or some home-cooked Perl scripts, even powered by MySQL or Postgres. I don't like to say that in a horrible and negative way, as there's no better learning process than building something and running it live on the internet. Where else could you get free user testing from hundreds of millions of potential crazy key whackers!?

But what I do have an issue with is that the average internet user is not necessarily as discerning online as they might be offline. Perhaps that's a fault of the authoritarian position that technology, and computers as a whole, have achieved through being the source of, and controlling point for, much of our modern lives.

But imagine granny logging on, finally working out how to get Windows on a PC to dial up, and finding a "very interesting stuff dot com" search directory, only to end up frustrated by the lack of data and good info there, and concluding the internet is a waste of time. Or, more dangerously, imagine her getting sent to a bogus or misleading web site or page. That is often the case with the flood ( still ) of porn sites that post to these tiny directories or search sites under supposedly valid URLs and pages, wait till they get indexed and see their listing added or the spider crawl, and then quickly update the page and fill it with porn, porn links or redirects - old game, new risk!

Now Nutch can't fix the sorts of risks I'm talking about, but what it can do is provide a more level playing field for those who would build public search sites, and allow them to scale and grow to Google-, Yahoo- and FAST-challenging levels at the cost of the average gaming PC these days ( an AUD$4,000 gaming PC will get you a box capable of 200 million URLs, phew! That's the whole of Gigablast's efforts - sorry Matt - in a desktop PC, and any 14 year old with some Java, Tomcat, Linux and time can do it now ).

I really do believe that the more Nutch-based sites we see, the more open the access to data will eventually be. The holders of public data, like state and federal government agencies, can now get their turks ( aka IT geeks ) to install the likes of Nutch where they might previously not have been able to afford a yellow box from Google ( which is useless to anyone outside the USA of course, and $65,000 for 150,000 URLs, please, that's just a joke! ). Finally we might begin to see search tools that let me find the stuff that currently the likes of CSIRO's mini Google in a box ( well, three 1RU Linux PCs that is ) try to provide, but at a cost that most can't easily justify.

If I said that I could, for around AUD$10,000, build you a Nutch-based system that could index 100 million HTML documents, and that you could then dump it on your LAN and start sucking your intranet for documents, or index that CD-ROM farm with the public document archive, and let it face the internet so Jane Average could search it, wouldn't that blow your mind?

AUD$10,000 will buy you a 2RU, 3 GHz Xeon, 2 GB of RAM, 8 x 250 GB IDE, two-terabyte RAID 5 ( 3ware 8-channel controller ) monster with dual gigabit LAN interfaces.
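
A rough sizing sketch for the box described above, assuming standard RAID 5 (one drive's worth of capacity is given over to parity) and decimal gigabytes. The function name is hypothetical and nothing here comes from the 3ware controller itself. Under those assumptions, the "two terabyte" figure looks like raw capacity, with somewhat less left for the index:

```python
def raid5_usable_gb(drive_count: int, drive_size_gb: int) -> int:
    """Usable capacity of a RAID 5 array, assuming equal-sized drives.

    RAID 5 spreads one drive's worth of parity across the array,
    so usable space is (drive_count - 1) * drive_size_gb.
    """
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drive_count - 1) * drive_size_gb


# Figures taken from the comment: 8 drives of 250 GB each.
raw_gb = 8 * 250
usable_gb = raid5_usable_gb(8, 250)

print(f"raw: {raw_gb} GB, usable under RAID 5: {usable_gb} GB")
```

Filesystem overhead would trim the usable figure a little further, so any estimate of how many documents fit should start from the usable number rather than the raw one.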

Then add a week's effort from an everyday techie with half a Linux / Java / Tomcat clue, and you've got a search engine monster, when compared to the current offerings!

So are social secrets necessary? Yes, but in the context of Google and co, not on your life.

Well, unless the paranoid delusionals ( like me ) are right that the NSA is funding Google's data centre of what, 10,000+ PCs to store the internet and search it for "interesting things" ??



ps: wild ramblings of "thinking aloud" - don't take any of this for anything more than just that, wild ramblings..



