Hash Collision #94

MikPisula · 2022-08-17T00:53:25Z

Describe the bug
Hash collision occurs with videos of the same length and with similar colour schemes.

To Reproduce

from videohash import VideoHash

v1 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008752-da1f09c7-a177-4a46-9c64-230744e998c1.mp4')
v2 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008748-b8922142-37cc-48a0-bad9-1385ba016587.mov')
print(v1 == v2)  # reported to print True, although the two videos are different

Expected behavior
The hashes of the videos should be different.

Screenshots
NA

Please complete the following information:

Operating system: NA
Python Version: 3.10.5
VideoHash version: 3.0.1

Additional context

Demmenie · 2022-10-12T16:51:13Z

I have noticed this too, any idea what's causing it?

dale-wahl · 2022-10-13T14:08:07Z

I'm just learning this library myself, but if you check the collages (located at v1.collage_path and v2.collage_path), you can see that they are virtually identical. Basically, the scenes are too short and, as far as the video hash is concerned, consist of white pixels at the exact same two points and black everywhere else. My guess is that this will not be an effective tool for short videos such as the two in the example. I have been trying to find recommendations on minimum scene lengths.
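To illustrate how two mostly black collages can end up with the same hash, here is a minimal sketch using a plain average hash on a tiny grid. This is an assumption for illustration only: it is not necessarily the hashing method videohash uses, and `make_collage`, `average_hash`, and `hamming_distance` are hypothetical helper names, not part of the library.

```python
def make_collage(bright_points, value, size=8):
    """Build a size x size grayscale grid: black except for the given points."""
    grid = [[0] * size for _ in range(size)]
    for row, col in bright_points:
        grid[row][col] = value
    return grid


def average_hash(grid):
    """Set a bit for every pixel brighter than the mean of the whole grid."""
    flat = [pixel for row in grid for pixel in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if pixel > mean else 0 for pixel in flat)


def hamming_distance(hash_a, hash_b):
    """Count the bit positions where two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


# Two different "videos": same bright points, different brightness.
collage_a = make_collage([(1, 1), (5, 6)], value=255)
collage_b = make_collage([(1, 1), (5, 6)], value=200)
# A third one whose bright points sit elsewhere.
collage_c = make_collage([(2, 3), (4, 4)], value=255)

hash_a = average_hash(collage_a)
hash_b = average_hash(collage_b)
hash_c = average_hash(collage_c)

print(hamming_distance(hash_a, hash_b))  # 0: the hashes collide
print(hamming_distance(hash_a, hash_c))  # 4: two bits set in different places
```

When almost every pixel is black, only the positions of the few bright pixels survive thresholding, so different inputs that share those positions become indistinguishable.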

I just did some testing, and you can increase the number of frames extracted per second with frame_interval. Check out the results of this:

v1 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008752-da1f09c7-a177-4a46-9c64-230744e998c1.mp4', frame_interval=5)
v2 = VideoHash(url='https://user-images.githubusercontent.com/47534140/185008748-b8922142-37cc-48a0-bad9-1385ba016587.mov', frame_interval=5)
print(v1 == v2)
# and compare their collages to the ones you created without using frame_interval
print(v1.collage_path)
print(v2.collage_path)

Demmenie · 2022-10-15T17:25:01Z

I'll have to look at that example later. I've also had the opposite problem, where the same video produces different hashes. On top of that, it always takes a few seconds to run, which is quite long for real-world applications these days.

I think I'll either have to fork this and see if I can improve it, or switch to something else. I'd also like to see if I can add partial fingerprinting, where a video that is part of another one can be recognised as such.

MikPisula · 2022-10-17T21:42:59Z

The issue of collages for short videos being almost entirely black seems to stem from the fact that the width of the collage is set to 1024px no matter what. Instead, I tried editing collagemaker.py so that it calculates the width of the collage from the already-existing variable self.images_per_row_in_collage. This resulted in much nicer collages, although I have not tested it extensively. From my limited testing, it produces the same hash for a video when:

it is converted to a different format (tested on .mov)
it is compressed
it is downscaled (by 50%)

And, more importantly, it produces different hashes for the two videos I uploaded in the original issue.

Link: MikPisula@b4b8f32
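A rough sketch of the general idea, deriving the collage size from the number of frames instead of using a fixed width. The function name, the near-square grid layout, and the 144-pixel tile size in the usage line are assumptions for illustration; this is not the code from the linked commit.

```python
import math


def collage_size(num_frames, frame_width, frame_height):
    """Return (width, height) of a grid that holds every frame at full size."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    # Aim for a roughly square grid of tiles.
    images_per_row = math.ceil(math.sqrt(num_frames))
    rows = math.ceil(num_frames / images_per_row)
    # The width follows from the tiles per row rather than a fixed constant.
    return images_per_row * frame_width, rows * frame_height


print(collage_size(5, 144, 144))  # (432, 288)
```

With the size tied to the frame count, a video with only a handful of frames gets a small collage that the frames actually fill.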

MikPisula · 2022-10-17T21:47:53Z

When it comes to performance, perhaps Python's multiprocessing module could be used to speed up the image-manipulation part?

Demmenie · 2022-10-17T22:38:16Z

It could do, but it has to be done in a way that works across devices. I think an algorithm with decent time complexity would be best. I'm also thinking it might be better to start over than to fork. I'd like to see if video fingerprinting might be possible.

Edit: I just found this: https://pypi.org/project/videofingerprint/
Looks like @akamhy was working on it but the repo doesn't exist anymore.
(Gonna start a separate issue for speed)

MikPisula added the bug Something isn't working label Aug 17, 2022

96jaco96 mentioned this issue Jan 10, 2023

Video hashs on vastly different videos yield is_similar() True #100

Open
