Describe the bug
Hi, I tried to deduplicate downloaded web data (after processing it with ungoliant download and ungoliant pipeline), but I get warnings (see screenshot) and the execution ends without deduplication.
The command used: ungoliant dedup <source> <destination>
Expected behavior
A deduplicated dataset is generated in the destination directory.
Screenshots
The dedup command is now in oscar-tools, as we try to isolate OSCAR generation and operations on it.
However, we don't have a dedup step for "document" versions of OSCAR, and I'm not sure that the tooling is ready for the latest one. You can always try to run oscar-tools v2 extract-text to extract the text, then run oscar-tools v1 dedup, but you'll have a text corpus similar to OSCAR 2019, with no metadata.
Let me know what you're aiming for (deduplicated lines, document-level deduplication, etc.) so that I can prioritize what needs to be implemented for your use case. An alternative would be to use Python with some performance-oriented modules, such as ujson for JSON processing and a hashing library such as xxhash.
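If you want to try the Python route in the meantime, here is a minimal line-level dedup sketch along those lines (a sketch only, not what ungoliant does internally): it assumes a JSONL corpus under corpus/*.jsonl with the text stored in a "content" field and writes the result to deduplicated.jsonl; the paths and field name are assumptions, so adjust them to your layout.

```python
# Minimal line-level dedup sketch using ujson + xxhash.
# Assumptions (adjust as needed): input files match corpus/*.jsonl and
# each record keeps its text under a "content" key.
import glob

import ujson    # fast JSON parsing
import xxhash   # fast non-cryptographic hashing

seen = set()

def dedup_record(record):
    """Keep only lines whose xxh64 hash has not been seen before."""
    kept = []
    for line in record["content"].splitlines():
        h = xxhash.xxh64(line).intdigest()
        if h not in seen:
            seen.add(h)
            kept.append(line)
    record["content"] = "\n".join(kept)
    return record

with open("deduplicated.jsonl", "w", encoding="utf-8") as out:
    for path in glob.glob("corpus/*.jsonl"):
        with open(path, encoding="utf-8") as src:
            for raw in src:
                record = dedup_record(ujson.loads(raw))
                if record["content"]:  # drop records emptied by dedup
                    out.write(ujson.dumps(record) + "\n")
```

Hashing each line with xxh64 instead of storing the raw strings keeps the seen set relatively small, but for a large crawl you may still want to shard it or switch to a Bloom filter.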