updated README with Scores #10

Open · wants to merge 10 commits into main
Changes from 9 commits
71 changes: 53 additions & 18 deletions README.rst
@@ -29,8 +29,8 @@ Overview
:width: 800
:alt: Box and arrow overview of the data flow and tasks of the articubench benchmark.

-The benchmarks defines three tasks: semantic only, semantic-acoustic and
-acoustic only.
+The benchmark defines three tasks: acoustic only (copy-synthesis), semantic-acoustic and
+semantic only.

Control Model comparison
------------------------
@@ -53,22 +53,57 @@ Benchmark Results
-----------------
Results will be published when they are available.

=================== ===== ======= =============== ==============
Tiny PAULE Inverse Seg-Model* Baseline-Model
=================== ===== ======= =============== ==============
Total Score
Articulatory Scores
Semantic Scores
Acoustic Scores
Tongue Height
EMA sensors
Max Velocity
Max Jerk
Classification
semantic RMSE
loudness envelope
spectrogram RMSE
=================== ===== ======= =============== ==============
======================== ========== ========== ======= =============== ==============
Tiny / copy-synthesis    PAULE-fast PAULE-full Inverse Seg-Model*      Baseline-Model
======================== ========== ========== ======= =============== ==============
Total Score              324.57     380.34             213.84          114.58
Articulatory Scores                                                    50
Semantic Scores                                                        64.58
Acoustic Scores                                                        0
Tongue Height            51.71      61.67              43.05           0
EMA sensors              30.89      30.30              26.94           0
Max Velocity             (0.0)      (0.43)             (19.80)         (50)
Max Jerk                 (0.0)      (0.43)             (19.80)         (50)
Classification           75.60      100                75.08           64.58
semantic RMSE            67.79      90.67              54.25           0
loudness envelope        49.82      48.32              -5.35           0
spectrogram RMSE         48.77      48.95              0.07            0
======================== ========== ========== ======= =============== ==============

(The Inverse column is left empty: the Inverse model is not yet available.)

======================== ========== ========== ======= =============== ==============
Tiny / semantic-acoustic PAULE-fast PAULE-full Inverse Seg-Model*      Baseline-Model
======================== ========== ========== ======= =============== ==============
Total Score              143.91     319.51             212.08          114.58
Articulatory Scores                                                    50
Semantic Scores                                                        64.58
Acoustic Scores                                                        0
Tongue Height            42.51      23.59              38.72           0
EMA sensors              28.00      28.20              28.29           0
Max Velocity             (0.0)      (0.0)              (19.80)         (50)
Max Jerk                 (0.0)      (0.0)              (19.80)         (50)
Classification           44.93      100                75.08           64.58
semantic RMSE            11.56      80.94              55.74           0
loudness envelope        7.09       43.65              -6.20           0
spectrogram RMSE         9.82       43.13              0.64            0
======================== ========== ========== ======= =============== ==============

======================== ========== ========== ======= =============== ==============
Tiny / semantic-only     PAULE-fast PAULE-full Inverse Seg-Model*      Baseline-Model
======================== ========== ========== ======= =============== ==============
Total Score              195.3      250.90             259.65          114.58
Articulatory Scores                                                    50
Semantic Scores                                                        64.58
Acoustic Scores                                                        0
Tongue Height            41.23      47.31              20.75           0
EMA sensors              28.84      28.74              28.62           0
Max Velocity             (0.0)      (0.0)              (22.60)         (50)
Max Jerk                 (0.0)      (0.0)              (22.60)         (50)
Classification           74.72      99.98              95.47           64.58
semantic RMSE            39.27      87.78              100             0
loudness envelope        2.76       -10.41             -5.54           0
spectrogram RMSE         8.53       -2.31              -2.25           0
======================== ========== ========== ======= =============== ==============


=================== ===== ======= =============== ==============
Small PAULE Inverse Seg-Model* Baseline-Model
63 changes: 0 additions & 63 deletions docs/source/_static/copybutton.js

This file was deleted.

4 changes: 2 additions & 2 deletions docs/source/about.rst
@@ -19,7 +19,7 @@ Contact

In case you want to contact the project maintainers, please send an email to

-konstantin [dot] sering [at] uni [minus] tuebingen [dot de
+konstantin [dot] sering [at] uni [minus] tuebingen [dot] de


Citation
@@ -29,7 +29,7 @@ TODO see pyndl.

Funding
-------
-*articubench* was partially funded by an ERC Advanced Grant (no. 742545)..
+*articubench* was partially funded by an ERC Advanced Grant (no. 742545).


Acknowledgements
22 changes: 20 additions & 2 deletions docs/source/articubench.rst
@@ -3,10 +3,28 @@ Package overview

.. automodule:: articubench

-Submodules
-----------
+Submodules: score
+-----------------

-.. automodule:: articubench.evaluate
+.. automodule:: articubench.score
:members:
:undoc-members:
:show-inheritance:


Submodules: control_models
--------------------------

.. automodule:: articubench.control_models
:members:
:undoc-members:
:show-inheritance:


Submodules: util
----------------

.. automodule:: articubench.util
:members:
:undoc-members:
:show-inheritance:
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -21,6 +21,7 @@

 extensions = [
     "sphinx.ext.autodoc",
+    "sphinx.ext.mathjax",
 ]

templates_path = ["_templates"]
48 changes: 48 additions & 0 deletions docs/source/control_models.rst
@@ -0,0 +1,48 @@
Control models
==============

Articubench provides multiple control models that generate control parameter trajectories (CPs) for the VocalTractLab. Currently the "Baseline", "Segment-based" and "PAULE" models are implemented, while the "Inverse" model is still in development.

Baseline
--------

The Baseline model is a simple model that always generates the same CPs for the VocalTractLab, which produce a neutral-sounding schwa when synthesized.

This model is used as a reference point for the other models to compare against.

Inputs for all tasks (see the sketch below):
- Sequence length, which equals the length of the target CPs (or half the signal length)
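
A minimal sketch of what such a baseline could look like (the function name, signature, and the all-zero neutral posture are illustrative assumptions, not the actual articubench API):

.. code-block:: python

    import numpy as np

    # Hypothetical neutral (schwa-like) posture; the real values would come
    # from VTL's neutral vocal tract shape rather than zeros.
    NEUTRAL_CP = np.zeros(30)  # 30 control parameters per 2.5 ms timeframe

    def baseline_control_model(seq_length, target_semantic_vector=None,
                               target_sig=None, target_sr=None):
        """Ignore all targets and repeat the same posture for every timeframe."""
        return np.tile(NEUTRAL_CP, (seq_length, 1))  # shape: (seq_length, 30)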


Segment-based
-------------

The Segment-based model uses the Montreal Forced Aligner (MFA) to generate CPs. Starting from the original audio and a text file containing the spoken word, the MFA maps the text to phonemes and aligns those phonemes with the audio; the resulting phoneme segments are then mapped to CPs. This generally yields smooth CPs that are quite a good approximation of the original audio.

Inputs can be:
- target semantic vector
- target audio signal with sample rate

If no target semantic vector is provided, the model uses the target audio signal to generate CPs; otherwise it always produces the CPs from the semantic vector.
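
A sketch of this dispatch logic (all helper functions are hypothetical placeholders for the MFA pipeline described above):

.. code-block:: python

    def segment_based_control_model(seq_length, target_semantic_vector=None,
                                    target_sig=None, target_sr=None):
        if target_semantic_vector is not None:
            # hypothetical: look up the word whose embedding is closest
            word = nearest_word(target_semantic_vector)
        else:
            # hypothetical: recover the spoken word from the target audio
            word = transcribe(target_sig, target_sr)
        # hypothetical: MFA alignment of word and audio, then segment-to-CP mapping
        return mfa_segments_to_cps(word, target_sig, target_sr, seq_length)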

PAULE
-----

PAULE is a machine learning model that generates CPs based on the given task. It uses a forward model to map CPs to audio, an inverse model to map audio to CPs, and an embedder to map audio into a semantic space.
It can also keep learning, updating its model weights on the given inputs.

Inputs can be:
- target semantic vector (for semantic-only task)
- target audio signal with sample rate (for acoustic-only task)
- both (for acoustic-semvec task)

There are currently two PAULE implementations available, the "Fast" and the "Acoustic-Semvec" model. The "Fast" model is simply a PAULE model with very short training runs, while the "Acoustic-Semvec" model goes through multiple full training cycles.
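
The core planning idea can be sketched as gradient-based optimization of the CPs through the forward model and the embedder (a simplified illustration; the actual implementation lives in the `paule` package and differs in many details):

.. code-block:: python

    import torch

    def plan_cps(forward_model, embedder, init_cps, target_semvec,
                 steps=100, lr=0.01):
        """Adjust the CPs until the embedding of the predicted audio
        matches the target semantic vector."""
        cps = init_cps.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([cps], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            pred_audio = forward_model(cps)     # forward model: CPs -> audio features
            pred_semvec = embedder(pred_audio)  # embedder: audio -> semantic space
            loss = torch.nn.functional.mse_loss(pred_semvec, target_semvec)
            loss.backward()                     # gradients flow back into the CPs
            optimizer.step()
        return cps.detach()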

Inverse
-------

The Inverse model is a part of PAULE that generates CPs directly from audio signals. It is currently in development and not yet available for use.

Inputs:
- target audio signal with sample rate
39 changes: 38 additions & 1 deletion docs/source/data.rst
@@ -6,12 +6,49 @@ Overview of the data sets included in articubench.

Rationale
=========
-There is a `tiny`, `small`, and a `normal` data set. The purpose of the data sets is...
+Articubench includes two primary datasets: tiny and small. These datasets are designed to support a range of benchmarking tasks from preliminary checks to statistically significant evaluations. Each dataset serves a specific purpose:

- Tiny Dataset: The tiny dataset is a minimal set designed to ensure that models can correctly process inputs and produce reasonable outputs. It is primarily used for quick validation checks during development.

- Small Dataset: The small dataset provides enough data to conduct meaningful benchmarking with some statistical power. It is used to compare the performance of different models on a controlled, yet manageable, dataset.

Data sets
=========

Tiny
----

The tiny dataset is a minimal subset of the full data intended for quick and basic functionality tests. It includes only four rows, each representing a different data instance. The primary purpose of this dataset is to verify that the provided models can correctly take in data and return reasonable results without errors. The small size allows for rapid iteration and debugging during model development.


Each row in the tiny dataset includes the following columns:

- file: The name of the audio file associated with the data point.
- label: The label or class associated with the data point.
- target_semantic_vector: A vector representing the semantic content of the target audio.
- target_sig: The waveform of the target audio signal.
- target_sr: The sample rate of the target audio signal.
- len_cp: The length of the control parameter (cp), calculated based on the duration of the target signal.
- reference_cp: The reference control parameters, normalized and truncated to match the length of the target signal.
- reference_tongue_height: Placeholder for future data related to the tongue height (currently set to None).
- reference_ema_TT: Placeholder for Electromagnetic Articulography (EMA) data for the tongue tip (TT), currently None.
- reference_ema_TB: Placeholder for EMA data for the tongue body (TB), currently None.
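
A sketch of how the tiny dataset could be inspected once loaded (the path and the pickle format are assumptions made for illustration):

.. code-block:: python

    import pandas as pd

    # hypothetical location/format; articubench bundles its data internally
    tiny = pd.read_pickle("articubench/data/tiny.pkl")
    print(tiny.shape)  # (4, 10): four rows, the ten columns listed above
    print(tiny[["file", "label", "target_sr", "len_cp"]])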

Small
-----

The small dataset is a more comprehensive subset, consisting of approximately 2,000 rows. This dataset is a mixture of a subset of the GECO word corpus and 1,800 pronunciations of the words "ja" and "halt" with their respective EMA points.


The small dataset includes the same columns as the tiny dataset, but with the following additional details:

- file: Names of files from the GECO corpus and specific pronunciations of "ja" and "halt."
- label: Corresponding labels for each file.
- target_semantic_vector: Vectors representing the semantic content of each target.
- target_sig: The waveform of each target audio signal.
- target_sr: Sample rates for each target signal.
- len_cp: Control parameter lengths adjusted for the target signal's duration.
- reference_cp: Normalized control parameters, truncated to the signal's length.
- reference_tongue_height: Placeholder for future tongue height data (currently set to None).
- reference_ema_TT: EMA data for the TT of "ja"/"halt", otherwise None.
- reference_ema_TB: EMA data for the TB of "ja"/"halt", otherwise None.
4 changes: 4 additions & 0 deletions docs/source/index.rst
@@ -8,9 +8,13 @@
:caption: Contents:

quickstart
overview
installation
control_models
data
scores
tasks
usage
development
articubench
about
30 changes: 30 additions & 0 deletions docs/source/overview.rst
@@ -0,0 +1,30 @@
Overview
================

Articubench is a benchmark for evaluating articulatory speech synthesis systems. It uses the VocalTractLab (VTL) as its articulatory speech synthesis simulator.
First, a control model such as PAULE generates control parameter trajectories (CPs) for the VocalTractLab. These CPs are then simulated by the VTL to generate audio.
Finally, the audio is mapped by an embedder onto a semantic embedding space to infer the intended meaning of the synthesized audio, which is compared with the original target meaning.

Originally the benchmark was designed to evaluate the PAULE model on three tasks: acoustic only (copy-synthesis), semantic only, and semantic-acoustic.
As long as certain requirements are met, however, the benchmark can be used to evaluate any control model that generates CPs for the VocalTractLab.
How to use the benchmark is described in the `Usage` section; an illustrative sketch follows below.
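
As an illustration only (the entry point and signature are assumptions; consult the `Usage` section for the real API), plugging in a custom control model could look roughly like this:

.. code-block:: python

    import numpy as np

    def my_control_model(seq_length, target_semantic_vector=None,
                         target_sig=None, target_sr=None):
        # must return VTL control parameters of shape (seq_length, 30)
        return np.zeros((seq_length, 30))

    # hypothetical benchmark call, e.g.:
    # from articubench.score import score
    # results = score(my_control_model, tasks=("copy-synthesis",), size="tiny")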

After the CPs are generated, the VTL 'speaks' them, which gives us a signal with a sample rate of 44,100 Hz. Furthermore, tongue height and EMA data are generated by the VTL from the CPs.

The benchmark then calculates the scores described in the `Scores` section to evaluate the performance of the control model.

Since all scores are calculated by comparison to a baseline model that always produces a schwa no matter the input, the baseline model can unexpectedly outperform the control model in some cases.

Specifically, the jerk and velocity losses of the CPs and the loudness envelope can be quite good whenever a very small dataset contains audio similar to a schwa.



Implementation Notes
--------------------

- All tasks currently operate on word-level inputs
- cp-trajectories must match VocalTractLab requirements:
- 30 control parameters per timeframe
- 2.5ms timeframe resolution (110 samples at 44.1kHz)
- Since CP resolution is higher than our EMA and ultrasound data, the model uses 1d interpolation to match the required resolution (see the sketch after this list)
- Articubench uses multi-processing for all score and data calculations, apart from the initial CP generation, which is done sequentially on the GPU since the model trains for each CP individually
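
A minimal sketch of such a 1d interpolation between sampling grids (illustrative; assumes plain linear interpolation with NumPy):

.. code-block:: python

    import numpy as np

    def match_resolution(values, target_len):
        """Linearly interpolate a 1d series (e.g. EMA or tongue height)
        onto a grid with ``target_len`` points (e.g. the 2.5 ms CP grid)."""
        source_t = np.linspace(0.0, 1.0, num=len(values))
        target_t = np.linspace(0.0, 1.0, num=target_len)
        return np.interp(target_t, source_t, values)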