Big Data Analysis Notes 2 - LSH #8

SSK015 opened this issue May 29, 2023 · 0 comments

https://ssk015.github.io/dashuju2/

Slides (pdf)

LSH: Locality-Sensitive Hashing
A way to find "similar" sets.
Upside: only a small fraction of points is ever examined.
Downside: false negatives can occur (some truly similar pairs are missed).

Steps for finding similar docs
first. Shingling
Convert a document into a set of k-shingles (runs of k consecutive characters or words).

  1. hash shingles to a few bytes

  2. Compute Sim(C1, C2)
    How to compute?
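
To make the shingling step concrete before turning to similarity, here is a minimal sketch. It assumes character-level shingles of length k = 5 and uses CRC32 from the standard library as the "few bytes" hash; the notes fix neither choice, and the function names `shingles` and `hashed_shingles` are hypothetical.

```python
import zlib


def shingles(text, k=5):
    """Return the set of character k-shingles of `text`.

    Assumption: whitespace runs are collapsed to a single space first.
    """
    text = " ".join(text.split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def hashed_shingles(text, k=5):
    """Map every shingle to a 4-byte integer with CRC32.

    CRC32 is only an illustrative choice of hash, not one named in the notes.
    """
    return {zlib.crc32(s.encode("utf-8")) for s in shingles(text, k)}


doc = "the quick brown fox"
print(sorted(shingles(doc))[:3])
print(len(hashed_shingles(doc)))
```

Storing 4-byte integers instead of the shingle strings keeps the sets compact, at the cost of rare hash collisions.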

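First define the Jaccard similarity.

JS (a similarity, not a distance) = num of (positions where both items are non-zero) / num of (positions where at least one item is non-zero)
For sets this is JS(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|.
dis = 1 - JS.

The smaller the dis, the more similar the two vectors are.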
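
As a small illustration of the definition above, the following sketch computes the Jaccard similarity and distance on Python sets. Returning 1.0 for two empty sets is a convention assumed here, not something stated in the notes.

```python
def jaccard_similarity(a, b):
    """|a ∩ b| / |a ∪ b| for two sets."""
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """dis = 1 - JS."""
    return 1.0 - jaccard_similarity(a, b)


c1 = {1, 2, 3}
c2 = {2, 3, 4}
print(jaccard_similarity(c1, c2))  # 2 shared items / 4 items in total = 0.5
print(jaccard_distance(c1, c2))    # 1 - 0.5 = 0.5
```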
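second. Min-hashing
Convert large sets to short signatures, while preserving similarity.

Find a hash function h(x) such that:
if sim(C1, C2) is high, then with high prob. h(C1) = h(C2);

if sim(C1, C2) is low, then with high prob. h(C1) != h(C2).
The function that fits is min-hashing: under a random permutation of the rows, Pr[h(C1) = h(C2)] = sim(C1, C2).
Applying several independent min-hash functions to each set gives its signature.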
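
The signature construction can be sketched as follows. This is an approximation under stated assumptions: instead of true random permutations it uses random linear hash functions (a*x + b) mod p, a common stand-in, with p = 2^61 - 1. The names `make_hash_params`, `minhash_signature` and `estimate_similarity` are hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime; set elements are assumed smaller than this


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions x -> (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """Signature of a non-empty set of ints: the minimum hash value per function."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimate_similarity(sig1, sig2):
    """Fraction of positions where two signatures agree (estimates Jaccard similarity)."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


params = make_hash_params(200)
s1 = minhash_signature(set(range(0, 80)), params)
s2 = minhash_signature(set(range(40, 120)), params)
# True Jaccard similarity is 40 / 120; the estimate should land near it.
print(estimate_similarity(s1, s2))
```

More hash functions give a tighter estimate but a longer signature.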
third. LSH:
Focus on pairs of signatures likely to be from similar documents.
Each signature is split into b bands of r rows; tuning b and r controls the error (false positives and false negatives).
See page 40 of the slides.
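
The effect of bands and rows can be illustrated with the standard analysis of the banding technique, which may or may not match the slides exactly: with b bands of r rows each, two documents with similarity s become a candidate pair with probability 1 - (1 - s^r)^b. The function names and the values b = 20, r = 5 below are assumptions chosen for illustration.

```python
def candidate_probability(s, b, r):
    """Probability that two sets with similarity s share at least one identical band."""
    return 1.0 - (1.0 - s ** r) ** b


def approx_threshold(b, r):
    """Similarity near which the S-curve rises most steeply: roughly (1/b)^(1/r)."""
    return (1.0 / b) ** (1.0 / r)


b, r = 20, 5  # hypothetical setting: 100-row signatures
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, b, r), 4))
print("threshold ~", round(approx_threshold(b, r), 3))
```

Raising r makes a single band harder to match (fewer false positives), while raising b gives more chances to match (fewer false negatives).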
Use hashing to find candidate pairs with similarity >= s.
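
A minimal sketch of the candidate-pair search, assuming each signature is a list of b * r integers and that a Python dict keyed by the band's tuple stands in for the hash buckets. The function name `lsh_candidate_pairs` and the toy signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands, rows):
    """Return pairs of doc ids whose signatures agree on every row of at least one band.

    signatures: dict mapping doc id -> list of ints of length bands * rows.
    """
    candidates = set()
    for band in range(bands):
        start = band * rows
        buckets = defaultdict(list)  # separate buckets for each band
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[start:start + rows])].append(doc_id)
        for docs in buckets.values():
            # Every pair of docs sharing a bucket becomes a candidate.
            candidates.update(combinations(sorted(docs), 2))
    return candidates


sigs = {
    "a": [1, 2, 3, 4],
    "b": [1, 2, 9, 9],  # matches "a" in the first band only
    "c": [7, 7, 7, 7],
}
print(lsh_candidate_pairs(sigs, bands=2, rows=2))  # {('a', 'b')}
```

Candidate pairs still need their signatures (or the original sets) compared directly, since sharing a bucket does not guarantee similarity >= s.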
