Commit
Add link to heliport
ZJaume committed Aug 14, 2024
1 parent d4392e5 commit 63c6d8f
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -45,7 +45,7 @@ The procedure does the following, in one job per collection:
### Annotation
The annotation step consists of adding multiple metadata fields to each document (using [annotate.py](scripts/annotate.py)):
- `id`: unique id for the document, derived from the WARC file, url and timestamp (`f`, `u`, `ts` fields).
-- `seg-langs`: segment level language identification. An array of size equal to the number of segments in the document (each segment being delimited by a `\n`).
+- `seg-langs`: segment level language identification. An array of size equal to the number of segments in the document (each segment being delimited by a `\n`). The language identification tool for this step was [heliport](https://github.com/ZJaume/heliport), a fast port of HeLI-OTS trained with the same data as the language identifier for documents.
- `robots`: robots.txt compliance (if the document has been disallowed for crawling).
- [monofixer](https://github.com/bitextor/bifixer) to fix encoding issues and remove HTML entities. This step does not add any metadata field; it only fixes the document text.
- `pii`: look for PII with [multilingual-pii-tool](https://github.com/mmanteli/multilingual-PII-tool). If any match is found, the field specifies the Unicode character offsets of every match.
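
The `seg-langs` contract described above (one label per `\n`-delimited segment) can be sketched as follows. This is only an illustration, not the code in `annotate.py`: the `text` field name, the `identify_language` helper and the labels it returns are hypothetical placeholders, and heliport itself is not called here.

```python
def identify_language(segment: str) -> str:
    """Hypothetical stand-in for a segment-level language identifier.

    A real pipeline would call a tool such as heliport here; this
    placeholder only distinguishes empty from non-empty segments.
    """
    return "eng" if segment.strip() else "und"


def annotate_seg_langs(doc: dict) -> dict:
    """Return a copy of `doc` with a `seg-langs` array added.

    Assumes the document text lives in a `text` field (an assumption,
    not confirmed by the README) and that segments are split on "\n".
    """
    segments = doc["text"].split("\n")
    annotated = dict(doc)
    # One label per segment, so the array length matches the segment count.
    annotated["seg-langs"] = [identify_language(s) for s in segments]
    return annotated


doc = {"text": "First segment\n\nThird segment"}
annotated = annotate_seg_langs(doc)
# The invariant stated in the README: one entry per segment.
assert len(annotated["seg-langs"]) == len(doc["text"].split("\n"))
```

Splitting with `str.split("\n")` rather than `str.splitlines()` keeps empty and trailing segments, so the array length always equals the number of `\n` characters plus one.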