[BUG] Deduplication with Ungoliant #93

Hammamwa47 · 2023-02-24T13:35:38Z

Describe the bug
Hi, I tried to deduplicate downloaded web data (after processing it with ungoliant download and ungoliant pipeline), but I get warnings (see screenshot) and the execution ends without deduplicating anything.

The command
ungoliant dedup <source> <destination>

Expected behavior
A deduplicated dataset is generated in the destination directory.

Screenshots

Uinelj · 2023-02-24T15:36:06Z

Hello!

The dedup command is now in oscar-tools, as we try to isolate OSCAR generation and operations on it.

However, we don't have a dedup step for "document" versions of OSCAR, and I'm not sure that the tooling is ready for the latest one. You can always try to run oscar-tools v2 extract-text to extract the text, then run oscar-tools v1 dedup, but you'll have a text corpus similar to OSCAR 2019, with no metadata.

Let me know what you're aiming for (deduplicated lines, document-level deduplication, etc.) so that I can prioritize what needs to be implemented for your use case. An alternative would be to use Python with some performance-oriented modules, such as ujson for JSON processing and a hashing library such as xxhash.
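
A minimal sketch of that Python alternative, for line-level deduplication. It is not part of ungoliant or oscar-tools. It assumes the corpus is JSON Lines with the document text in a `content` field (adjust `text_field` if your files differ); the function name `dedup_lines` is made up for illustration. It uses ujson and xxhash when they are installed and falls back to the standard library otherwise.

```python
import hashlib
import json

try:  # optional fast JSON parser mentioned above
    import ujson as fast_json
except ImportError:
    fast_json = json

try:  # optional fast non-cryptographic hash mentioned above
    import xxhash

    def line_hash(line: str) -> int:
        return xxhash.xxh64(line.encode("utf-8")).intdigest()
except ImportError:
    def line_hash(line: str) -> int:
        # Standard-library fallback: 64-bit BLAKE2b digest.
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")


def dedup_lines(records, text_field="content"):
    """Yield JSON records with every previously seen line removed.

    `records` is an iterable of JSON strings (one document each).
    `text_field` is an assumption about where the text lives.
    Documents left with no lines are dropped; other fields are kept as-is.
    """
    seen = set()  # one 64-bit hash per unique line seen so far
    for raw in records:
        doc = fast_json.loads(raw)
        kept = []
        for line in doc[text_field].split("\n"):
            h = line_hash(line)
            if h not in seen:
                seen.add(h)
                kept.append(line)
        if kept:
            doc[text_field] = "\n".join(kept)
            yield fast_json.dumps(doc, ensure_ascii=False)
```

To run it on a file, pass an open file handle as `records` and write each yielded string followed by a newline to the output. The `seen` set grows with the number of unique lines, so a very large corpus may need to be processed in shards or with a disk-backed set.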

Hammamwa47 added the bug Something isn't working label Feb 24, 2023

Hammamwa47 assigned Uinelj Feb 24, 2023