Code and data for Zhou et al. "Cross-Lingual Speaker Identification Using Distant Supervision", arXiv 2022.
We provide the distant supervision data as reported in the paper: data/si_distant_55k.csv
We provide the train/test datasets used in our experiments under data/.
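
For a first look at the distant supervision file, a minimal sketch like the one below may help. It assumes only that data/si_distant_55k.csv is a comma-separated file with a header row. The column names are not documented in this README, so the code prints whatever it finds rather than assuming a schema. The helper name peek_csv is hypothetical and not part of this repository.

```python
import csv
from itertools import islice
from pathlib import Path


def peek_csv(path, n_rows=3):
    """Return (column_names, first n_rows rows) of a CSV file with a header row.

    Hypothetical helper for illustration; not part of this repository.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(islice(reader, n_rows))
        return reader.fieldnames, rows


# Path taken from this README; the header-row assumption is ours.
csv_path = Path("data/si_distant_55k.csv")
if csv_path.exists():
    columns, rows = peek_csv(csv_path)
    print(columns)
    for row in rows:
        print(row)
```

Run it from the repository root so that the relative path resolves.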
Our distant supervision extraction pipeline is provided under code/extraction.
The code we currently provide is for reference; we will clean it up further and package it in a more usable form.
gutenberg_pp_style_sample.txt is a data sample from Project Gutenberg, in a chapter-by-chapter format.
To run the extraction, start with the run_gutenberg_coref() function in extractor_distant.py, which runs a coreference model on the Gutenberg data.
Then, run_coref_gutenberg.py provides a parse_overlap() function that performs rule-based speaker identification on the coreference results, following Section 3 of the paper.
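
To give a rough sense of what rule-based speaker identification can look like, here is a toy sketch. It is not the rule set from Section 3 of the paper and does not reproduce parse_overlap(). It assumes straight double quotes, takes a plain list of character names as a stand-in for coreference output, and attributes each quote to the nearest name mentioned outside any quote. The function name attribute_quotes and its arguments are hypothetical.

```python
import re

# Assumption: quotations are delimited by straight double quotes.
QUOTE_RE = re.compile(r'"([^"]+)"')


def attribute_quotes(paragraph, candidate_names):
    """Toy heuristic: pair each quote with the closest name outside all quotes.

    Returns a list of (quote_text, speaker_or_None) tuples.
    """
    quotes = [(m.start(), m.end(), m.group(1)) for m in QUOTE_RE.finditer(paragraph)]

    # Collect name mentions that do not fall inside a quotation.
    mentions = []
    for name in candidate_names:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", paragraph):
            inside = any(qs <= m.start() < qe for qs, qe, _ in quotes)
            if not inside:
                mentions.append((m.start(), m.end(), name))

    results = []
    for qs, qe, qtext in quotes:
        best_name, best_dist = None, None
        for ms, me, name in mentions:
            # Character distance between the quote and the mention.
            dist = qs - me if me <= qs else ms - qe
            if best_dist is None or dist < best_dist:
                best_name, best_dist = name, dist
        results.append((qtext, best_name))
    return results


text = '"Come here," said Alice. Bob frowned. "Why should I?"'
for quote, speaker in attribute_quotes(text, ["Alice", "Bob"]):
    print(speaker, "->", quote)
```

A nearest-mention rule like this is easy to fool, for example in alternating dialogue without explicit speaker tags, which is one reason the actual pipeline builds on coreference results.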
Finally, format_to_roberta() in extractor_distant.py formats the results into RoBERTa model inputs and labels.
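
As an illustration of what "RoBERTa inputs and labels" might look like, the sketch below turns one attributed quote into sentence-pair examples, one per candidate speaker, with a binary label. This layout is an assumption made for illustration; the actual format produced by format_to_roberta() may differ. The names build_examples and to_roberta_string are hypothetical. The only fixed fact used is RoBERTa's sentence-pair template, `<s> A </s></s> B </s>`, which a tokenizer would normally add for you.

```python
def build_examples(context, candidates, speaker):
    """Create one (text_a, text_b, label) example per candidate speaker.

    Assumed layout, for illustration only: text_a is the passage containing
    the quote, text_b is a candidate name, label is 1 for the true speaker.
    """
    return [
        {"text_a": context, "text_b": cand, "label": int(cand == speaker)}
        for cand in candidates
    ]


def to_roberta_string(example):
    """Render an example using RoBERTa's sentence-pair special tokens."""
    return f"<s>{example['text_a']}</s></s>{example['text_b']}</s>"


context = '"Come here," said Alice.'
examples = build_examples(context, ["Alice", "Bob"], speaker="Alice")
for ex in examples:
    print(ex["label"], to_roberta_string(ex))
```

In practice you would pass text_a and text_b to a RoBERTa tokenizer as a pair instead of building the string by hand; the explicit string is shown only to make the template visible.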
The training script is provided as code/experiments/run.sh. We will update the evaluation scripts soon.