[BUG] Deduplication with Ungoliant #93

Open
Hammamwa47 opened this issue Feb 24, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@Hammamwa47

Describe the bug
Hi, I tried to deduplicate downloaded web data (after processing it with ungoliant download and ungoliant pipeline), but I get warnings (see screenshot) and the execution ends without any deduplication being performed.

The command
ungoliant dedup <source> <destination>

Expected behavior
A deduplicated dataset is generated in the destination directory.

Screenshots
ungoliant

@Hammamwa47 added the bug (Something isn't working) label Feb 24, 2023
@Uinelj
Member

Uinelj commented Feb 24, 2023

Hello!

The dedup command is now in oscar-tools, as we try to isolate OSCAR generation and operations on it.

However, we don't have a dedup step for the "document" versions of OSCAR, and I'm not sure the tooling is ready for the latest one. You can always run oscar-tools v2 extract-text to extract the text and then run oscar-tools v1 dedup, but you'll end up with a text corpus similar to OSCAR 2019, with no metadata.

Let me know what you're aiming for (deduplicated lines, document-level deduplication, etc.) so that I can prioritize what needs to be implemented for your use case. An alternative would be to use Python with some performance-oriented modules, such as ujson for JSON processing and a hashing library such as xxhash.
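
As a rough illustration of that Python alternative, here is a minimal sketch of document-level deduplication over JSON Lines input. It is not part of ungoliant or oscar-tools. It assumes each line is a JSON object whose text lives in a `content` field (an assumption about the input layout; adjust `text_field` if yours differs), and it uses the standard-library `json` and `hashlib.blake2b` as stand-ins for ujson and xxhash so that it runs without extra dependencies. The function name `dedup_documents` is hypothetical.

```python
import hashlib
import json
from typing import Iterable, Iterator


def dedup_documents(lines: Iterable[str], text_field: str = "content") -> Iterator[str]:
    """Yield each JSON Lines record whose text has not been seen before.

    Assumption: every non-empty line is a JSON object with a string under
    `text_field`. Exact duplicates only; near-duplicates are not detected.
    """
    seen = set()  # digests of texts already emitted (kept in memory)
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)  # stand-in for ujson.loads
        text = record[text_field]
        # 16-byte digest as a stand-in for an xxhash digest
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=16).digest()
        if digest in seen:
            continue  # duplicate document, drop it
        seen.add(digest)
        yield raw


# Small in-memory example; in practice `lines` could be an open file handle.
sample = [
    '{"content": "hello world"}',
    '{"content": "hello world"}',
    '{"content": "something else"}',
]
unique = list(dedup_documents(sample))
print(len(unique))  # 2
```

Swapping in ujson or xxhash should only require changing the `json.loads` call and the digest line. The set of seen digests lives in memory, so a very large corpus may need a different strategy.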
