Commit

Documentation of merge-collections and clean step.
ZJaume committed Aug 14, 2024
1 parent eef98dd commit 04ff9f6
Showing 2 changed files with 212 additions and 0 deletions.
19 changes: 19 additions & 0 deletions README.md
@@ -147,6 +147,25 @@ When all the deduplication tasks have finished, the annotation can be executed.
```
./20.processing.sh
```
The output will be available at `$WORKSPACE/annotated`.

**Optional**: merge each language into a single directory:
```
source .env
srun -A $SBATCH_ACCOUNT --pty -p small --ntasks 1 --cpus-per-task 128 --mem-per-cpu 1750 -t 24:00:00 bash
```
```
module load parallel
parallel -j64 bash 21.merge-collections ::: `cat langs`
```
The merged collections will be available at `$WORKSPACE/collections-merged`.
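
After the merge finishes, it can be useful to confirm that every language in `langs` produced output. The sketch below is not part of the repository: `missing_langs` is a hypothetical helper, and it assumes the merge step writes one subdirectory per language code under the output root, which the README implies but does not state.

```shell
# Hypothetical helper: print every language code from a list file that has
# no matching directory under the given root.
# ASSUMPTION: one subdirectory per language code (e.g. <root>/eng_Latn).
missing_langs() {
    list="$1"
    root="$2"
    while IFS= read -r lang || [ -n "$lang" ]; do
        # Skip blank lines in the list file.
        [ -z "$lang" ] && continue
        # Report the code when its directory does not exist.
        [ -d "$root/$lang" ] || printf '%s\n' "$lang"
    done < "$list"
}

# Example call (prints nothing when all languages are present):
#   missing_langs langs "$WORKSPACE/collections-merged"
```

Any code it prints would be a candidate for re-running `21.merge-collections` on that language alone.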


**Optional**: run the cleaning step on the merged collections:
```
./30.clean.sh
```
The output will be available at `$WORKSPACE/cleaned`.

## Output format
The output format is JSONL, where each line is a valid JSON value and a full document with all its metadata and text content.
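
Because each line holds one complete document, counting documents reduces to counting non-empty lines. The snippet below is a minimal sketch under the assumption that the file is plain, uncompressed JSONL; `count_docs` is a hypothetical helper name, and the file path in the example call is a placeholder.

```shell
# Hypothetical helper: count the documents in an uncompressed JSONL file.
# Each non-empty line is one document, so this is a line count that
# ignores blank lines.
count_docs() {
    grep -c . "$1"
}

# Example call (placeholder path):
#   count_docs path/to/output.jsonl
```

If the pipeline writes compressed files, the file would need to be decompressed to a stream first; the README excerpt here does not say which is the case.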
193 changes: 193 additions & 0 deletions langs
@@ -0,0 +1,193 @@
ace_Arab
ace_Latn
afr_Latn
als_Latn
amh_Ethi
ara_Arab
asm_Beng
ast_Latn
awa_Deva
ayr_Latn
azb_Arab
azj_Latn
bak_Cyrl
bam_Latn
ban_Latn
bel_Cyrl
bem_Latn
ben_Beng
bho_Deva
bjn_Arab
bjn_Latn
bod_Tibt
bos_Latn
bug_Latn
bul_Cyrl
cat_Latn
ceb_Latn
ces_Latn
cjk_Latn
ckb_Arab
crh_Latn
cym_Latn
dan_Latn
deu_Latn
dik_Latn
dyu_Latn
dzo_Tibt
ell_Grek
eng_Latn
epo_Latn
est_Latn
eus_Latn
ewe_Latn
fao_Latn
fij_Latn
fin_Latn
fon_Latn
fra_Latn
fur_Latn
fuv_Latn
gaz_Latn
gla_Latn
gle_Latn
glg_Latn
grn_Latn
guj_Gujr
hat_Latn
hau_Latn
heb_Hebr
hin_Deva
hne_Deva
hrv_Latn
hun_Latn
hye_Armn
ibo_Latn
ilo_Latn
ind_Latn
isl_Latn
ita_Latn
jav_Latn
jpn_Jpan
kab_Latn
kac_Latn
kam_Latn
kan_Knda
kas_Arab
kas_Deva
kat_Geor
kaz_Cyrl
kbp_Latn
kea_Latn
khk_Cyrl
khm_Khmr
kik_Latn
kin_Latn
kir_Cyrl
kmb_Latn
kmr_Latn
knc_Arab
knc_Latn
kon_Latn
kor_Hang
lao_Laoo
lij_Latn
lim_Latn
lin_Latn
lit_Latn
lmo_Latn
ltg_Latn
ltz_Latn
lua_Latn
lug_Latn
luo_Latn
lus_Latn
lvs_Latn
mag_Deva
mai_Deva
mal_Mlym
mar_Deva
min_Latn
mkd_Cyrl
mlt_Latn
mni_Beng
mos_Latn
mri_Latn
mya_Mymr
nld_Latn
nno_Latn
nob_Latn
npi_Deva
nso_Latn
nus_Latn
nya_Latn
oci_Latn
ory_Orya
pag_Latn
pan_Guru
pap_Latn
pbt_Arab
pes_Arab
plt_Latn
pol_Latn
por_Latn
prs_Arab
quy_Latn
ron_Latn
run_Latn
rus_Cyrl
sag_Latn
san_Deva
sat_Olck
scn_Latn
shn_Mymr
sin_Sinh
slk_Latn
slv_Latn
smo_Latn
sna_Latn
snd_Arab
som_Latn
sot_Latn
spa_Latn
srd_Latn
srp_Cyrl
ssw_Latn
sun_Latn
swe_Latn
swh_Latn
szl_Latn
tam_Taml
taq_Latn
taq_Tfng
tat_Cyrl
tel_Telu
tgk_Cyrl
tgl_Latn
tha_Thai
tir_Ethi
tpi_Latn
tsn_Latn
tso_Latn
tuk_Latn
tum_Latn
tur_Latn
twi_Latn
tzm_Tfng
uig_Arab
ukr_Cyrl
umb_Latn
urd_Arab
uzn_Latn
vec_Latn
vie_Latn
war_Latn
wol_Latn
xho_Latn
ydd_Hebr
yor_Latn
yue_Hant
zho_Hans
zho_Hant
zsm_Latn
zul_Latn
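
Every entry in the list above has the same shape: three lowercase letters, an underscore, and a four-letter script tag with an initial capital (for example `eng_Latn`). Since `parallel` passes each line straight to the merge script, a malformed entry would only surface at run time. The following sketch checks the file up front; `check_langs` is a hypothetical helper, not something shipped with the repository.

```shell
# Hypothetical helper: fail if any non-blank line of the given file does not
# look like xxx_Yyyy (three lowercase letters, "_", capitalised 4-letter tag).
check_langs() {
    # LC_ALL=C keeps the [a-z] and [A-Z] ranges strictly ASCII.
    # The first grep keeps non-matching lines, the second drops blank ones.
    bad=$(LC_ALL=C grep -Ev '^[a-z]{3}_[A-Z][a-z]{3}$' "$1" | grep . || true)
    if [ -n "$bad" ]; then
        printf 'malformed entries:\n%s\n' "$bad" >&2
        return 1
    fi
    return 0
}

# Example call:
#   check_langs langs && echo "langs looks well-formed"
```

The check is purely syntactic: it does not verify that a code corresponds to a real language or script.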
