Skip to content

code and data for Zhou et al. "Cross-Lingual Speaker Identification Using Distant Supervision", Arxiv 2022

Notifications You must be signed in to change notification settings

Slash0BZ/speaker-identification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

speaker-identification

Code and data for Zhou et al. "Cross-Lingual Speaker Identification Using Distant Supervision", Arxiv 2022

Data

Distant Supervision

We provide the distant supervision data as reported: data/si_distant_55k.csv

Speaker Identification Data

We provide the datasets (train/test) that we used for experiments under data

Code

Distant Supervision Extraction (WIP)

Our distant supervision extraction pipeline is provided under code/extraction. The current code we provide is for reference, and we will further clean the code into more usable packing. gutenberg_pp_style_sample.txt is a sample data from the Gutenberg project, in a chapter-by-chapter format.

To run the extraction, start with the run_gutenberg_coref() function in extractor_distant.py, which will run a coreference model on the Gutenberg data. Then, run_coref_gutenberg.py provides a parse_overlap() function that runs rule-based speaker identification based on the coreference results, according to Section3 of the paper. Finally, format_to_roberta() in extractor_distant.py formats the results into Roberta modeling inputs and labels.

Experiments (WIP)

The training script is provided as code/experiments/run.sh. We will update evaluation scripts soon.

About

code and data for Zhou et al. "Cross-Lingual Speaker Identification Using Distant Supervision", Arxiv 2022

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published