asr-errors-simulator

ASR Errors Simulator

Install:

pip install opus-fast-mosestokenizer pathos tqdm Bio numpy

Use for rewriting rules:

$ python3 rewrite2.py -h
usage: rewrite2.py [-h] [--punct PUNCT] [--casing CASING] [--cap-start] [--full-stop] [--tag-gaps] [--lang LANG] [--seed [SEED]] rules goal_wer data output

positional arguments:
  rules
  goal_wer         Goal WER, a float between 0.0 and 1.0.
  data             Input file. If -, use stdin.
  output

options:
  -h, --help       show this help message and exit
  --punct PUNCT    Punctuation option: keep random no. Keep: the punctuation is kept from the input sentence. Random: punctuation tokens are treated as normal tokens, e.g. as OOV if not in rules, and randomly transmitted/substituted/inserted/deleted. No: all punctuation is
                   removed from input. It can appear in the output only if it is part of a non-punct-only token, or generated by a rule.
  --casing CASING  Output casing option: keep lower. Keep: keep casing from the input in transmission and substitution. Lower: make everything lowercase. Note that it is assumed that the rules are all in lowercase, so each token is lowercased before searching the rule for it.
  --cap-start      Capitalize the first character of each output line.
  --full-stop      Full-stop: make sure that every line of input is terminated by one punctuation mark, "." by default, or "!", "?" or other if keep or random generates it.
  --tag-gaps       Insert tag for every deleted word. The tag has a form e.g. '<gap:N>', where N is a number of characters.
  --lang LANG      Language option for MosesTokenizer. Default is "en".
  --seed [SEED]    Random seed. If this option is unused, the seed is not set. No argument: the seed is 1234.
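
The `goal_wer` argument is a word error rate (WER): the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The sketch below shows that standard definition so you can estimate the WER of a generated line against its input. It is not taken from `rewrite2.py`: the function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization, whereas the tool itself tokenizes with MosesTokenizer, so the numbers may differ slightly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are split on whitespace (not MosesTokenizer).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Options such as `--punct`, `--casing` and `--tag-gaps` change the output tokens, so normalize both sides the same way before comparing a measured value with `goal_wer`.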