You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
LSH: Locality-Sensitive Hashing
a way to find “similiar” sets.
upside: only a small faction of points are ever examined
downside: exists false negatives
Steps for similiar docs
fisrt. Shingling
Convert a document into a set
hash shingles to a few bytes
Compute Sim(C1, C2)
How to compute?
Fisrt define the Jaccard Similarity.
JS(not dis) = num of (equal items which not 0) / num of (all items)
dis = 1 - JS.
The smaller the dis, the more similar the two vectors are.
second. Min-hashing
Convert large sets to short signatures, while preserving similarity.
Find a hash function h(x) make that:
if sim(C1, C2) is high, with high prob. h(c1) equal h(C2);
if low, with high prob. h(c1) != h(c2)
The function to our taste is Min-hashing
Then we get signatures.
third. LSH :
Focus on pairs of signatures likely to be from similiar documents
There uses bands and rows to reduce the error.
See the slides page 40.
Use hashing to find candidate pairs of simiarity >= s
The text was updated successfully, but these errors were encountered:
https://ssk015.github.io/dashuju2/
Click here Slide(pdf))
LSH: Locality-Sensitive Hashing
a way to find “similiar” sets.
upside: only a small faction of points are ever examined
downside: exists false negatives
Steps for similiar docs
fisrt. Shingling
Convert a document into a set
hash shingles to a few bytes
Compute Sim(C1, C2)
How to compute?
Fisrt define the Jaccard Similarity.
JS(not dis) = num of (equal items which not 0) / num of (all items)
dis = 1 - JS.
The smaller the dis, the more similar the two vectors are.
second. Min-hashing
Convert large sets to short signatures, while preserving similarity.
Find a hash function h(x) make that:
if sim(C1, C2) is high, with high prob. h(c1) equal h(C2);
if low, with high prob. h(c1) != h(c2)
The function to our taste is Min-hashing
Then we get signatures.
third. LSH :
Focus on pairs of signatures likely to be from similiar documents
There uses bands and rows to reduce the error.
See the slides page 40.
Use hashing to find candidate pairs of simiarity >= s
The text was updated successfully, but these errors were encountered: