These experiments aim at identifying the elements necessary for BERT's multilinguality. The goal is to model this in a small, laboratory setting that allows for fast experimentation. The code trains BERT on two languages, English and Fake-English, and examines which architectural properties of BERT and which linguistic properties of the involved languages are required for BERT to create multilingual representations.
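For illustration, here is a minimal sketch of how a Fake-English corpus can be derived from an English one by shifting every token ID by a constant offset, so the two "languages" share structure but no vocabulary. The offset value and helper names below are illustrative assumptions, not the repository's exact implementation:

```python
# Minimal sketch (assumption): create "Fake-English" by shifting every token ID
# by a constant offset so the two "languages" share structure but no vocabulary.

from typing import List

VOCAB_SIZE = 2048  # assumed size of the English vocabulary in the toy setup

def to_fake_english(token_ids: List[int], offset: int = VOCAB_SIZE) -> List[int]:
    """Map an English token-ID sequence to its Fake-English counterpart."""
    return [tid + offset for tid in token_ids]

english_sentence = [12, 57, 300, 4]           # token IDs of an English sentence
fake_sentence = to_fake_english(english_sentence)
print(fake_sentence)                          # [2060, 2105, 2348, 2052]
```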
Among the elements investigated are:
- Number of parameters
- Shifting special tokens
- Language-specific position embeddings (see the sketch after this list)
- Not replacing masked tokens with random tokens
- Inverting the language order
- Avoiding a parallel training corpus
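The sketch below illustrates the idea behind language-specific position embeddings: each language gets its own position-embedding table instead of a shared one. It is not the repository's actual implementation; the class name, sizes, and usage are assumptions.

```python
# Minimal sketch (assumption): one position-embedding table per language,
# selected by a language ID, instead of a single shared table.

import torch
import torch.nn as nn

class LanguageSpecificPositionEmbeddings(nn.Module):
    def __init__(self, max_len: int = 128, hidden_size: int = 64, num_languages: int = 2):
        super().__init__()
        # separate position-embedding table for each language
        self.tables = nn.ModuleList(
            [nn.Embedding(max_len, hidden_size) for _ in range(num_languages)]
        )

    def forward(self, input_ids: torch.Tensor, lang_id: int) -> torch.Tensor:
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        positions = positions.unsqueeze(0).expand(input_ids.size(0), -1)
        return self.tables[lang_id](positions)

# usage: add these embeddings to the token embeddings before the encoder
pos_emb = LanguageSpecificPositionEmbeddings()
dummy_ids = torch.zeros(2, 16, dtype=torch.long)    # batch of 2 sequences, length 16
print(pos_emb(dummy_ids, lang_id=1).shape)          # torch.Size([2, 16, 64])
```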
Language model fit is evaluated with perplexity; multilinguality with word alignment, sentence retrieval, and word translation. See the paper for more details.
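As a rough illustration of the sentence-retrieval evaluation, the sketch below retrieves, for each English sentence, its nearest Fake-English neighbour by cosine similarity of sentence representations. The pooling and similarity choices here are assumptions; see the paper for the exact protocol.

```python
# Minimal sketch of sentence retrieval: for each English sentence, find the most
# similar Fake-English sentence by cosine similarity and check whether it is the
# parallel one. Pooling and similarity choices are assumptions.

import numpy as np

def retrieval_accuracy(en_vecs: np.ndarray, fake_vecs: np.ndarray) -> float:
    """en_vecs[i] and fake_vecs[i] are representations of parallel sentences."""
    # L2-normalise so the dot product equals cosine similarity
    en = en_vecs / np.linalg.norm(en_vecs, axis=1, keepdims=True)
    fake = fake_vecs / np.linalg.norm(fake_vecs, axis=1, keepdims=True)
    nearest = (en @ fake.T).argmax(axis=1)          # index of most similar sentence
    return float((nearest == np.arange(len(en))).mean())

# toy example with random vectors; in practice these are pooled BERT hidden states
rng = np.random.default_rng(0)
en_vecs = rng.normal(size=(100, 64))
fake_vecs = en_vecs + 0.1 * rng.normal(size=(100, 64))   # noisy "translations"
print(retrieval_accuracy(en_vecs, fake_vecs))
```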
Unfortunately, due to copyright restrictions, the Easy-to-read Bible is currently not publicly available. Download links for the other data are provided in the paper.
The code is mostly based on huggingface transformers and their awesome pretraining scripts (thanks!). `setup.sh` preprocesses the data. `run.sh` contains all experiments in the small English-FakeEnglish setup. The folder `real` contains code for experiments on Wikipedia and XNLI data; `real/run.sh` contains all experiments for the real-data setup.
You can find the paper on arXiv. It will appear in the Proceedings of EMNLP 2020.
@article{dufter2020identifying,
title={Identifying Necessary Elements for BERT's Multilinguality},
author={Dufter, Philipp and Sch{\"u}tze, Hinrich},
journal={arXiv preprint arXiv:2005.00396},
year={2020},
comment={to appear in EMNLP 2020}
}
If you use the code, please consider citing the paper.