A set of scripts for building parallel corpora with the Bilingual Sentence Aligner from Microsoft (by R. C. Moore).
${corpus_title}
|-- source
|   |-- ${title}_${source_lang}.snt
|   |-- ${title}_${target_lang}.snt
|-- work
|   |-- ${source_lang}-${target_lang}
|   |   |-- ${title}_${source_lang}.snt
|   |   |-- ${title}_${target_lang}.snt
|   |   |-- ${title}_${source_lang}.snt.aligned
|   |   |-- ${title}_${target_lang}.snt.aligned
|-- aligned_idx
|   |-- ${source_lang}-${target_lang}
|   |   |-- ${title}.${source_lang}.idx
|   |   |-- ${title}.${target_lang}.idx
|-- result
    |-- ${corpus_title}.${source_lang}-${target_lang}.${source_lang}
    |-- ${corpus_title}.${source_lang}-${target_lang}.${target_lang}
    |-- ${corpus_title}.unique.${source_lang}-${target_lang}.${source_lang}
    |-- ${corpus_title}.unique.${source_lang}-${target_lang}.${target_lang}
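The naming scheme of the 'result' directory can be illustrated with a short Python sketch. It only assembles the four file names shown in the tree; the helper `result_paths` is hypothetical and is not part of the scripts.

```python
from pathlib import PurePosixPath


def result_paths(root, corpus_title, source_lang, target_lang):
    """Return the four paths expected under <root>/result (illustrative helper)."""
    pair = f"{source_lang}-{target_lang}"
    result_dir = PurePosixPath(root) / "result"
    paths = []
    # First the full corpus, then the "unique" variant, each in both languages.
    for prefix in (corpus_title, f"{corpus_title}.unique"):
        for lang in (source_lang, target_lang):
            paths.append(result_dir / f"{prefix}.{pair}.{lang}")
    return paths


if __name__ == "__main__":
    for path in result_paths("aligned_corpora", "aligned_corpora", "en", "fr"):
        print(path)
```

For the sample values `en`, `fr` and `aligned_corpora`, the first printed path is `aligned_corpora/result/aligned_corpora.en-fr.en`.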
- Additional Python dependency: PyYAML. If necessary, install it with the command
  python -m pip install PyYAML
- A Perl interpreter must also be installed on your machine.
- Before running the shell script, put your source files in the ${corpus_title}/source directory.
- The content of the source files must be segmented into sentences (one sentence per line).
- Filenames of input files must have the following pattern: ${title}_${lang}.snt (e.g. document_en.snt).
- Parallel files must have identical titles (e.g. article_001_en.snt, article_001_fr.snt).
- Two source data directories are specified in the YAML file: 'original_source_data_directory' and 'preprocessed_source_data_directory'. The 'original_source_data_directory' holds files with sentences in natural language (i.e. unmodified sentences). The 'preprocessed_source_data_directory' holds additionally preprocessed files derived from the 'original_source_data_directory' (e.g. stemmed files, additionally tokenized files etc.). The sentence alignment itself is computed on the content of the 'preprocessed_source_data_directory', whereas the parallel corpora are built from the content of the 'original_source_data_directory'. If the source files have not been preprocessed further, both paths must be equal.
- The 'work', 'aligned_idx' and 'result' directories are created automatically.
- Aligned corpora are placed in the 'result' directory.
Note: It is not necessary to keep all automatically created subdirectories (work, aligned_idx, result) under the same root, but doing so makes the alignment process much easier to track.
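The file naming rules above (the ${title}_${lang}.snt pattern and identical titles for parallel files) can be checked before a run with a small Python sketch. The function `find_pairs` and the regular expression are illustrative assumptions, not part of the scripts; the sketch assumes that language codes contain neither underscores nor dots.

```python
import re

# ${title}_${lang}.snt; the last underscore separates the title from the language.
SNT_NAME = re.compile(r"^(?P<title>.+)_(?P<lang>[^_.]+)\.snt$")


def find_pairs(filenames, source_lang, target_lang):
    """Return (paired_titles, unpaired_titles) for the given language pair."""
    langs_by_title = {}
    for name in filenames:
        match = SNT_NAME.match(name)
        if match is None:
            continue  # name does not follow the ${title}_${lang}.snt pattern
        langs_by_title.setdefault(match.group("title"), set()).add(match.group("lang"))

    wanted = {source_lang, target_lang}
    # A title is paired when both languages are present for it.
    paired = sorted(t for t, langs in langs_by_title.items() if wanted <= langs)
    # A title is unpaired when only one of the two languages is present.
    unpaired = sorted(
        t for t, langs in langs_by_title.items() if langs & wanted and not wanted <= langs
    )
    return paired, unpaired


if __name__ == "__main__":
    names = ["article_001_en.snt", "article_001_fr.snt", "article_002_en.snt"]
    paired, unpaired = find_pairs(names, "en", "fr")
    print("paired:", paired)
    print("missing a counterpart:", unpaired)
```

In practice the list of names could come from `os.listdir` on the source directory; titles reported as unpaired are missing their counterpart in the other language.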
(for running on Windows OS; replace values in square brackets with actual paths; see also io_args.yml.sample)
source_language: en
target_language: fr
corpus_title: aligned_corpora
original_source_data_directory: [...]\aligned_corpora\source
preprocessed_source_data_directory: [...]\aligned_corpora\source
work_directory: [...]\aligned_corpora\work
alignment_index_directory: [...]\aligned_corpora\aligned_idx
output_data_directory: [...]\aligned_corpora\result
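A configuration file of this shape can be sanity-checked with a few lines of Python. The sketch below is not how the scripts read the file; it assumes that all eight keys of the sample are required, and the names `REQUIRED_KEYS`, `missing_keys` and `load_config` are hypothetical. Only `yaml.safe_load` from PyYAML is used for parsing.

```python
# Keys taken from the sample configuration; treating all of them as
# mandatory is an assumption made for this sketch.
REQUIRED_KEYS = (
    "source_language",
    "target_language",
    "corpus_title",
    "original_source_data_directory",
    "preprocessed_source_data_directory",
    "work_directory",
    "alignment_index_directory",
    "output_data_directory",
)


def missing_keys(config):
    """Return the required keys that are absent or empty in the mapping."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


def load_config(path):
    """Parse the YAML file and fail early if a required key is missing."""
    import yaml  # PyYAML, imported here so missing_keys works without it

    with open(path, encoding="utf-8") as handle:
        config = yaml.safe_load(handle) or {}
    missing = missing_keys(config)
    if missing:
        raise ValueError("missing configuration keys: " + ", ".join(missing))
    return config
```

A call such as `load_config("io_args.yml")` would then return the settings as a dictionary, where the file name stands for whatever configuration file is actually used.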
- Enter the actual values for parameters in the configuration YAML file (see above).
- Specify the name of the configuration (YAML) file in the run_bsa.bat file (the value of config_file). The YAML file must reside in the script directory.
- Execute the following command (on Windows):
.\run_bsa.bat
Notes:
- This set of scripts contains a version of the Bilingual Sentence Aligner that is slightly modified from the original source. The modifications were made to reduce memory issues on larger corpora.
- The scripts can also be run under UNIX/Linux. For this purpose, a Bash script similar to run_bsa.bat must be executed.