Corante

Many-to-Many

March 18, 2005

Amazon's Statistically Improbable Phrases

Posted by David Weinberger

RageBoy has discovered that Amazon seems to be rolling out a feature that shows, for any particular book, which of its phrases are “statistically improbable.” For example, Chris’ own Gonzo Marketing uses the phrases “public journalism” and “market advocacy.” Obviously those phrases are not unique to Chris’ book, so Amazon is doing some sort of statistical analysis to find phrases that are prominent within a book and distinctive across books. Fascinating. And, as Chris points out, these statistically improbable phrases (SIPs) can serve as machine-generated tags.

Comments (4) + TrackBacks (0) | Category: social software


COMMENTS

1. Ray Schraff on March 19, 2005 2:31 PM writes...

I just stumbled on these today.
Has anybody found documentation anywhere?

2. Ray Schraff on March 24, 2005 8:39 AM writes...

Specifically...
Can these be accessed through their public-facing Web Service?

3. Matt Cook on March 29, 2005 12:45 AM writes...

www.amazon.com/gp/search-inside/sipshelp-dp.html

is the official Amazon help page on the topic.

'Amazon.com's Statistically Improbable Phrases, or "SIPs", show you the interesting, distinctive, or unlikely phrases that occur in the text of books in Search Inside the Book. Our computers scan the text of all books in the Search Inside program. If they find a phrase that occurs a large number of times in a particular book relative to how many times it occurs across all Search Inside books, that phrase is a SIP in that book.'

This is a simple but effective way of finding keywords in a document. I believe the idea is also used in Google News' clustering algorithm.

4. Matt Cook on March 29, 2005 12:50 AM writes...

http://www.amazon.com/gp/search-inside/sipshelp-dp.html

'Amazon.com's Statistically Improbable Phrases, or "SIPs", show you the interesting, distinctive, or unlikely phrases that occur in the text of books in Search Inside the Book. Our computers scan the text of all books in the Search Inside program. If they find a phrase that occurs a large number of times in a particular book relative to how many times it occurs across all Search Inside books, that phrase is a SIP in that book.'

This is a simple but effective way of finding keywords in a document, if you have a large enough corpus. I believe the idea is also used in Google News' clustering algorithm.
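
To make the help-page description concrete, here is a rough Python sketch of the general idea: count how often a phrase occurs in one book and compare that with how often it occurs across the whole collection. Amazon has not published its actual method, so everything here is an assumption for illustration: the scoring rule (a plain ratio of rates), the restriction to two-word phrases, the `min_count` cutoff, the function names, and the toy corpus.

```python
from collections import Counter


def bigrams(text):
    """Split text into lowercase words and return adjacent two-word phrases."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def improbable_phrases(book_id, corpus, top_n=2, min_count=2):
    """Rank a book's phrases by how concentrated they are in that book.

    score = (phrase rate in this book) / (phrase rate across all books)
    This ratio is an illustrative assumption, not Amazon's published formula.
    """
    book_counts = Counter(bigrams(corpus[book_id]))
    corpus_counts = Counter()
    for text in corpus.values():
        corpus_counts.update(bigrams(text))

    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for phrase, count in book_counts.items():
        if count < min_count:
            continue  # ignore phrases that barely occur in the book
        book_rate = count / book_total
        corpus_rate = corpus_counts[phrase] / corpus_total
        scores[phrase] = book_rate / corpus_rate

    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]


# Made-up miniature "books", purely for demonstration.
toy_corpus = {
    "gonzo": "public journalism and market advocacy in the market "
             "public journalism and market advocacy",
    "other1": "in the market for a new car in the market",
    "other2": "journalism and the market and market news in the market",
}

print(improbable_phrases("gonzo", toy_corpus))
```

On this toy data the two top-ranked phrases are "market advocacy" and "public journalism", because they recur in the one book and nowhere else, while common phrases such as "the market" are spread across every book. As the comment above notes, a real system would need a far larger corpus for the ratios to mean anything.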
