bsa-wrapper

A set of scripts for building parallel corpora with the Bilingual Sentence Aligner from Microsoft (by R. C. Moore).

Usage

File system structure:


${corpus_title}
|-- source
|   |-- ${title}_${source_lang}.snt
|   |-- ${title}_${target_lang}.snt
|-- work
|   |-- ${source_lang}-${target_lang}
|       |-- ${title}_${source_lang}.snt
|       |-- ${title}_${target_lang}.snt
|       |-- ${title}_${source_lang}.snt.aligned
|       |-- ${title}_${target_lang}.snt.aligned
|-- aligned_idx
|   |-- ${source_lang}-${target_lang}
|       |-- ${title}.${source_lang}.idx
|       |-- ${title}.${target_lang}.idx
|-- result
    |-- ${corpus_title}.${source_lang}-${target_lang}.${source_lang}
    |-- ${corpus_title}.${source_lang}-${target_lang}.${target_lang}
    |-- ${corpus_title}.unique.${source_lang}-${target_lang}.${source_lang}
    |-- ${corpus_title}.unique.${source_lang}-${target_lang}.${target_lang}
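
The naming scheme of the 'result' directory can be spelled out in a few lines of Python. This is only an illustrative sketch derived from the tree above: the function name result_file_names is hypothetical and is not part of the scripts in this repository.

```python
def result_file_names(corpus_title, source_lang, target_lang):
    """List the file names expected in the 'result' directory (per the tree above)."""
    pair = f"{source_lang}-{target_lang}"
    names = []
    # The full corpus comes first, then the variant marked 'unique'.
    for prefix in (corpus_title, f"{corpus_title}.unique"):
        for lang in (source_lang, target_lang):
            names.append(f"{prefix}.{pair}.{lang}")
    return names


for name in result_file_names("aligned_corpora", "en", "fr"):
    print(name)
```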

Additional Python dependency: PyYAML. Install it using the python -m pip install PyYAML command if necessary.
A Perl interpreter must also be installed on your machine.
Before running the shell script, put your source files in the ${corpus_title}/source directory.
The content of the source files must be segmented into sentences (one sentence per line).
Input filenames must follow the pattern ${title}_${lang}.snt (e.g. document_en.snt).
Parallel files must have identical titles (e.g. article_001_en.snt, article_001_fr.snt).
There are two source data directories, 'original_source_data_directory' and 'preprocessed_source_data_directory', both specified in the YAML file. The 'original_source_data_directory' holds files with unmodified natural-language sentences. The 'preprocessed_source_data_directory' holds additionally preprocessed versions of those files (e.g. stemmed or additionally tokenized files). Sentence alignment is computed on the content of the 'preprocessed_source_data_directory', whereas the parallel corpora are built from the content of the 'original_source_data_directory'. If the source files have not been additionally preprocessed, both paths must be equal.
The 'work', 'aligned_idx' and 'result' directories are created automatically.
Aligned corpora are placed in the 'result' directory.
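
The naming rules above can be checked before starting an alignment run. The following is a minimal sketch, not code from this repository: the names SNT_NAME and pair_source_files are hypothetical, and the pattern assumes that the language code contains no underscore, so that the last underscore separates the title from the language.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical pattern for ${title}_${lang}.snt; assumes the language code
# contains no underscore, so the last underscore splits title and language.
SNT_NAME = re.compile(r"^(?P<title>.+)_(?P<lang>[^_]+)\.snt$")


def pair_source_files(filenames, source_lang, target_lang):
    """Return (pairs, unpaired): titles found in both languages, and the rest."""
    by_title = defaultdict(dict)
    for name in filenames:
        match = SNT_NAME.match(Path(name).name)
        # Ignore files that do not follow the pattern or use another language.
        if match and match["lang"] in (source_lang, target_lang):
            by_title[match["title"]][match["lang"]] = name

    pairs, unpaired = {}, []
    for title, langs in sorted(by_title.items()):
        if source_lang in langs and target_lang in langs:
            pairs[title] = (langs[source_lang], langs[target_lang])
        else:
            unpaired.append(title)
    return pairs, unpaired


names = ["article_001_en.snt", "article_001_fr.snt", "document_en.snt"]
pairs, unpaired = pair_source_files(names, "en", "fr")
print(pairs)     # titles with both an 'en' and an 'fr' file
print(unpaired)  # titles missing one side
```

In practice the list of names could come from listing the source directory, for example with Path.iterdir().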

Note: The automatically created subdirectories (work, aligned_idx, result) do not have to share the same root, but keeping them together makes the alignment process much easier to track.

An example of a configuration file (YAML):

(for running on Windows OS; replace values in square brackets with actual paths; see also io_args.yml.sample)


source_language: en
target_language: fr

corpus_title: aligned_corpora

original_source_data_directory: [...]\aligned_corpora\source
preprocessed_source_data_directory: [...]\aligned_corpora\source
work_directory: [...]\aligned_corpora\work
alignment_index_directory: [...]\aligned_corpora\aligned_idx
output_data_directory: [...]\aligned_corpora\result
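
A configuration like the one above can be sanity-checked once it has been parsed into a mapping. The sketch below is illustrative only: check_config and uses_preprocessing are hypothetical helpers, and treating every key of the sample as required is an assumption, not documented behaviour of config.py. The [...] placeholders are kept as they appear in the sample.

```python
# Keys taken from the sample configuration above; assuming all are required.
REQUIRED_KEYS = (
    "source_language",
    "target_language",
    "corpus_title",
    "original_source_data_directory",
    "preprocessed_source_data_directory",
    "work_directory",
    "alignment_index_directory",
    "output_data_directory",
)


def check_config(config):
    """Return the required keys that are missing or empty in the mapping."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


def uses_preprocessing(config):
    """True if alignment reads from a directory other than the original one."""
    return (config["preprocessed_source_data_directory"]
            != config["original_source_data_directory"])


sample = {
    "source_language": "en",
    "target_language": "fr",
    "corpus_title": "aligned_corpora",
    "original_source_data_directory": r"[...]\aligned_corpora\source",
    "preprocessed_source_data_directory": r"[...]\aligned_corpora\source",
    "work_directory": r"[...]\aligned_corpora\work",
    "alignment_index_directory": r"[...]\aligned_corpora\aligned_idx",
    "output_data_directory": r"[...]\aligned_corpora\result",
}
print(check_config(sample))
print(uses_preprocessing(sample))
```

A mapping of this shape can be obtained by parsing the YAML file with PyYAML's yaml.safe_load.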

Running the shell script

Enter the actual values for parameters in the configuration YAML file (see above).
Specify the name of the configuration (YAML) file in the run_bsa.bat file (the value of config_file). The YAML file must reside in the script directory.
Execute the following command (on Windows):
.\run_bsa.bat

Notes:

These scripts include a version of the Bilingual Sentence Aligner that is slightly modified from the original source. The modifications were made to reduce memory problems on larger corpora.
The scripts may also be run under UNIX/Linux. To do so, run an equivalent Bash script in place of run_bsa.bat.
