March 18, 2005
Amazon's Statistically Improbable Phrases
Posted by David Weinberger
RageBoy has discovered that Amazon seems to be rolling out a feature that shows, for any particular book, which of its phrases are “statistically improbable.” For example, the phrases listed for Chris’ own Gonzo Marketing include “public journalism” and “market advocacy.” Obviously those phrases are not unique to Chris’ book, so Amazon must be doing some sort of statistical analysis to find phrases that are prominent within a book and distinctive relative to other books. Fascinating. And, as Chris points out, these SIPs can serve as machine-generated tags. [Technorati tag: tags]
Comments (4) | Category: social software
1. Ray Schraff on March 19, 2005 2:31 PM writes...
I just stumbled on these today. Anybody find documentation anywhere?
2. Ray Schraff on March 24, 2005 8:39 AM writes...
Specifically... Can these be accessed through their public-facing Web Service?
3. Matt Cook on March 29, 2005 12:45 AM writes...
www.amazon.com/gp/search-inside/sipshelp-dp.html
is the official Amazon help page on the topic.
'Amazon.com's Statistically Improbable Phrases, or "SIPs", show you the interesting, distinctive, or unlikely phrases that occur in the text of books in Search Inside the Book. Our computers scan the text of all books in the Search Inside program. If they find a phrase that occurs a large number of times in a particular book relative to how many times it occurs across all Search Inside books, that phrase is a SIP in that book.'
This is a simple but effective way of finding keywords in a document. I believe the idea is also used in Google News' clustering algorithm.
4. Matt Cook on March 29, 2005 12:50 AM writes...
http://www.amazon.com/gp/search-inside/sipshelp-dp.html
'Amazon.com's Statistically Improbable Phrases, or "SIPs", show you the interesting, distinctive, or unlikely phrases that occur in the text of books in Search Inside the Book. Our computers scan the text of all books in the Search Inside program. If they find a phrase that occurs a large number of times in a particular book relative to how many times it occurs across all Search Inside books, that phrase is a SIP in that book.'
A simple but effective way of finding keywords in a document - if you have a large enough corpus. This idea is used in Google News' clustering algorithm I believe.
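The comparison described in the help text (how often a phrase occurs in one book relative to how often it occurs across all books) can be sketched in a few lines of Python. This is only a rough illustration and not Amazon's actual method, which is not documented here. The function names `bigrams` and `improbable_phrases`, the restriction to two-word phrases, the `min_count` cutoff, and the sample texts are all assumptions made for the example.

```python
from collections import Counter


def bigrams(text):
    """Lowercase the text and return its adjacent word pairs."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


def improbable_phrases(book, other_books, top_n=3, min_count=2):
    """Rank two-word phrases in `book` by how over-represented they are.

    The score is the phrase's rate in the book divided by its rate in the
    whole corpus (the book plus `other_books`). Hypothetical scoring,
    chosen for illustration only.
    """
    book_counts = Counter(bigrams(book))

    # The corpus includes the book itself, so every phrase we score has a
    # nonzero corpus count and the division below is always defined.
    corpus_counts = Counter(book_counts)
    for text in other_books:
        corpus_counts.update(bigrams(text))

    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for phrase, count in book_counts.items():
        if count < min_count:
            continue  # ignore phrases that are not prominent in the book
        book_rate = count / book_total
        corpus_rate = corpus_counts[phrase] / corpus_total
        scores[phrase] = book_rate / corpus_rate

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [" ".join(p) for p in ranked[:top_n]]


# Made-up sample texts standing in for whole books.
book = ("public journalism matters because public journalism "
        "serves the market and the market listens")
others = [
    "the market and the market listens to the market",
    "the cat and the dog and the market",
]
print(improbable_phrases(book, others, top_n=1))  # ['public journalism']
```

With only three tiny texts the ranking is fragile, which is the point about needing a large enough corpus: "the market" is frequent in the sample book too, but it is just as common everywhere else, so it scores low.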