HOME WEB NEWS IMAGES CLASSIFIEDS YELLOW PAGESPOLLS - SURVEYS WIKI COUNTRIES PHOTOS US UK INDIA
Avoo.com provides meta search results from various sources

W-shingling


Google





A w-shingling is a set of unique "shingles"—contiguous subsequences of tokens in a document—that can be used to gauge the similarity of two documents. The w denotes the number of tokens in each shingle in the set.

The document, "a rose is a rose is a rose" can be tokenized as follows:

(a,rose,is,a,rose,is,a,rose)

The set of all contiguous sequences of 4 tokens (N-grams, here: 4-grams) is

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) }

By removing duplicate elements from this set, a 4-shingling is obtained:

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }

Resemblance

For a given shingle size, the degree to which two documents A and B resemble each other can be expressed as the ratio of the magnitudes of their shinglings\' intersection and union, or

r(A,B)={{|S(A)\cap S(B)|}\over {|S(A)\cup S(B)|}}

where |A| is the size of set A. The resemblance is a number in the range [0,1], where 1 indicates that two documents are identical. This definition is identical with the Jaccard coefficient describing similarity and diversity of sample sets.

See also

  • Concept Mining offers an alternative method for document similarity calculation with more computational complexity, but where the measure more closely models a human\'s perception of document similarity.

References

  • (Manber 1993) Finding Similar Files in a Large File System. Does not yet use the term "shingling". Available as PDF
  • (Broder, Glassman, Manasse, and Zweig 1997) Syntactic Clustering of the Web. SRC Technical Note #1997-015. Available as HTML

This article is licensed under the GNU Free Documentation License. It uses material from Wikipedia


Advertise with Us | Search Marketing | Help | Suggest a Site | Privacy Policy
© 2008 www.avoo.com. All rights reserved.