Hash Collision #94

Open
MikPisula opened this issue Aug 17, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@MikPisula

Describe the bug
Hash collision occurs with videos of the same length and with similar colour schemes.

To Reproduce

from videohash import VideoHash

# Two short videos of the same length with similar colour schemes
v1 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008752-da1f09c7-a177-4a46-9c64-230744e998c1.mp4')
v2 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008748-b8922142-37cc-48a0-bad9-1385ba016587.mov')
print(v1 == v2)

Expected behavior
The hashes of the videos should be different.

Screenshots
NA

Please complete the following information:

  • Operating system: NA
  • Python Version: 3.10.5
  • VideoHash version: 3.0.1

Additional context

MikPisula added the bug label Aug 17, 2022
@Demmenie

I have noticed this too, any idea what's causing it?

@dale-wahl

Just learning this library myself, but if you check out the collages, you can see the collage images are virtually identical (located at v1.collage_path and v2.collage_path). Basically the scenes are too short and as far as the video hash is concerned consist of white pixels at the exact same two points and black everywhere else. My guess is that this will not be an effective tool with short videos such as the two in the example. I have been trying to find recommendations on minimum scene lengths.

Just did some testing, and you can increase the number of frames sampled per second with frame_interval. Check out the results of this:

from videohash import VideoHash

v1 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008752-da1f09c7-a177-4a46-9c64-230744e998c1.mp4', frame_interval=5)
v2 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008748-b8922142-37cc-48a0-bad9-1385ba016587.mov', frame_interval=5)
print(v1 == v2)
# Compare these collages to the ones created without frame_interval
print(v1.collage_path)
print(v2.collage_path)
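
To see how close two hashes are, rather than only whether they are equal, one option is to count the bits in which they differ. This is a minimal sketch, assuming each hash is available as a binary string with a `0b` prefix; the helper name and the example values are made up for illustration and are not the hashes of the two videos above.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two equal-length binary hash strings."""
    bits_a = hash_a.removeprefix("0b")
    bits_b = hash_b.removeprefix("0b")
    if len(bits_a) != len(bits_b):
        raise ValueError("hashes must have the same number of bits")
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Made-up 16-bit values for illustration only (hypothetical, not real hashes).
h1 = "0b1010110011110000"
h2 = "0b1010110011110011"
print(hamming_distance(h1, h2))  # 2
```

A distance of 0 corresponds to the collision reported here; a small non-zero distance would suggest the collages differ only slightly.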

@Demmenie

I'll have to look at that example later. I've also had the opposite problem, where the same video produces different hashes. On top of that, it always takes a few seconds to run, which is quite long for real-world applications these days.

I think I'll either have to fork this and see if I can improve it, or switch to using something else. I'd also like to see if I can add partial fingerprinting, where a video that's part of another one can be recognised as such.

@MikPisula
Author

The collages for short videos being almost entirely black seems to stem from the fact that the width of the collage is set to 1024px no matter what. Instead, I tried editing collagemaker.py so that it calculates the width of the collage from the already-existing variable self.images_per_row_in_collage. This resulted in much nicer collages, although I have not tested it extensively. From my limited testing it produces the same hash for a video when:

  1. it is converted to a different format (tested on .mov)
  2. it is compressed
  3. it is downscaled (by 50%)

And, more importantly, it produces different hashes for the two videos I uploaded in the original issue.

Link: MikPisula@b4b8f32
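
The idea of sizing the collage from the number of images per row can be sketched as follows. This is an illustration only, not the code in collagemaker.py or in the linked commit: the function names, the 144x144 frame size, and the fixed 1024x1024 canvas used for comparison are all assumptions.

```python
import math


def collage_size(frame_count: int, frame_width: int, frame_height: int,
                 images_per_row: int) -> tuple[int, int]:
    """Return (width, height) of a collage sized to fit the tiled frames."""
    if frame_count < 1 or images_per_row < 1:
        raise ValueError("frame_count and images_per_row must be positive")
    columns = min(images_per_row, frame_count)
    rows = math.ceil(frame_count / images_per_row)
    return columns * frame_width, rows * frame_height


def fill_ratio(frame_count: int, frame_width: int, frame_height: int,
               collage_width: int, collage_height: int) -> float:
    """Fraction of the collage area that is covered by frames."""
    frame_area = frame_count * frame_width * frame_height
    return frame_area / (collage_width * collage_height)


# Hypothetical short video: 4 frames of 144x144 pixels, 2 per row.
width, height = collage_size(4, 144, 144, images_per_row=2)
print(width, height)
print(fill_ratio(4, 144, 144, width, height))  # derived canvas
print(fill_ratio(4, 144, 144, 1024, 1024))     # hypothetical fixed canvas
```

Under these assumptions the derived canvas is fully covered by frames, while the same four unscaled frames would cover less than a tenth of a fixed 1024x1024 canvas, leaving the rest as background.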

@MikPisula
Author

When it comes to performance, perhaps Python's multiprocessing module could be used to speed up the image-manipulation part?
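
A rough sketch of what that could look like, assuming the per-frame work is independent. The `shrink_frame` function is a hypothetical stand-in for the library's real image manipulation, and frames are represented as plain nested lists rather than image objects. No timing has been measured here.

```python
from multiprocessing import Pool


def shrink_frame(frame):
    """Halve a grayscale frame (list of rows of ints) by averaging 2x2 blocks."""
    height, width = len(frame) // 2, len(frame[0]) // 2
    return [
        [
            (frame[2 * y][2 * x] + frame[2 * y][2 * x + 1]
             + frame[2 * y + 1][2 * x] + frame[2 * y + 1][2 * x + 1]) // 4
            for x in range(width)
        ]
        for y in range(height)
    ]


def shrink_all(frames, workers=1):
    """Shrink every frame, serially or across `workers` processes."""
    if workers <= 1:
        # Serial fallback: same result, no process start-up cost.
        return [shrink_frame(frame) for frame in frames]
    with Pool(processes=workers) as pool:
        # pool.map preserves the input order of the frames.
        return pool.map(shrink_frame, frames)
```

On platforms that start workers by spawning a fresh interpreter, the call with `workers` greater than 1 needs to sit under an `if __name__ == "__main__":` guard. For small frames, the cost of starting processes and pickling data may outweigh any gain, so the serial path is kept as the default.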

@Demmenie

Demmenie commented Oct 17, 2022

It could, but it has to be done in a way that works across devices. I think an algorithm with decent time complexity would be best. I'm also thinking it might be better to start over than to fork. I'd like to see if video fingerprinting might be possible.

Edit: I just found this: https://pypi.org/project/videofingerprint/
Looks like @akamhy was working on it but the repo doesn't exist anymore.
(Gonna start a separate issue for speed)
