updated README with Scores #10

Open · wants to merge 10 commits into main
Changes from 9 commits
71 changes: 53 additions & 18 deletions README.rst
@@ -29,8 +29,8 @@ Overview
:width: 800
:alt: Box and arrow overview of the data flow and tasks of the articubench benchmark.

-The benchmarks defines three tasks: semantic only, semantic-acoustic and
-acoustic only.
+The benchmark defines three tasks: acoustic only (copy-synthesis), semantic-acoustic and
+semantic only.

Control Model comparison
------------------------
@@ -53,22 +53,57 @@ Benchmark Results
-----------------
Results will be published when they are available.

=================== ===== ======= =============== ==============
Tiny PAULE Inverse Seg-Model* Baseline-Model
=================== ===== ======= =============== ==============
Total Score
Articulatory Scores
Semantic Scores
Acoustic Scores
Tongue Height
EMA sensors
Max Velocity
Max Jerk
Classification
semantic RMSE
loudness envelope
spectrogram RMSE
=================== ===== ======= =============== ==============
======================== ========== ========== ======= =============== ==============
Tiny / copy-synthesis    PAULE-fast PAULE-full Inverse Seg-Model*      Baseline-Model
======================== ========== ========== ======= =============== ==============
Total Score              324.57     380.34             213.84          114.58
Articulatory Scores                                                    50
Semantic Scores                                                        64.58
Acoustic Scores                                                        0
Tongue Height            51.71      61.67              43.05           0
EMA sensors              30.89      30.30              26.94           0
Max Velocity             (0.0)      (0.43)             (19.80)         (50)
Max Jerk                 (0.0)      (0.43)             (19.80)         (50)
Classification           75.60      100                75.08           64.58
semantic RMSE            67.79      90.67              54.25           0
loudness envelope        49.82      48.32              -5.35           0
spectrogram RMSE         48.77      48.95              0.07            0
======================== ========== ========== ======= =============== ==============

(The Inverse column is left empty: the Inverse model is not yet available.)

======================== ========== ========== ======= =============== ==============
Tiny / semantic-acoustic PAULE-fast PAULE-full Inverse Seg-Model*      Baseline-Model
======================== ========== ========== ======= =============== ==============
Total Score              143.91     319.51             212.08          114.58
Articulatory Scores                                                    50
Semantic Scores                                                        64.58
Acoustic Scores                                                        0
Tongue Height            42.51      23.59              38.72           0
EMA sensors              28.00      28.20              28.29           0
Max Velocity             (0.0)      (0.0)              (19.80)         (50)
Max Jerk                 (0.0)      (0.0)              (19.80)         (50)
Classification           44.93      100                75.08           64.58
semantic RMSE            11.56      80.94              55.74           0
loudness envelope        7.09       43.65              -6.20           0
spectrogram RMSE         9.82       43.13              0.64            0
======================== ========== ========== ======= =============== ==============

======================== ========== ========== ======= =============== ==============
Tiny / semantic-only     PAULE-fast PAULE-full Inverse Seg-Model*      Baseline-Model
======================== ========== ========== ======= =============== ==============
Total Score              195.3      250.90             259.65          114.58
Articulatory Scores                                                    50
Semantic Scores                                                        64.58
Acoustic Scores                                                        0
Tongue Height            41.23      47.31              20.75           0
EMA sensors              28.84      28.74              28.62           0
Max Velocity             (0.0)      (0.0)              (22.60)         (50)
Max Jerk                 (0.0)      (0.0)              (22.60)         (50)
Classification           74.72      99.98              95.47           64.58
semantic RMSE            39.27      87.78              100             0
loudness envelope        2.76       -10.41             -5.54           0
spectrogram RMSE         8.53       -2.31              -2.25           0
======================== ========== ========== ======= =============== ==============


=================== ===== ======= =============== ==============
Small PAULE Inverse Seg-Model* Baseline-Model
63 changes: 0 additions & 63 deletions docs/source/_static/copybutton.js

This file was deleted.

4 changes: 2 additions & 2 deletions docs/source/about.rst
@@ -19,7 +19,7 @@ Contact

In case you want to contact the project maintainers, please send an email to

-konstantin [dot] sering [at] uni [minus] tuebingen [dot de
+konstantin [dot] sering [at] uni [minus] tuebingen [dot] de


Citation
@@ -29,7 +29,7 @@ TODO see pyndl.

Funding
-------
-*articubench* was partially funded by an ERC Advanced Grant (no. 742545)..
+*articubench* was partially funded by an ERC Advanced Grant (no. 742545).


Acknowledgements
22 changes: 20 additions & 2 deletions docs/source/articubench.rst
@@ -3,10 +3,28 @@ Package overview

.. automodule:: articubench

-Submodules
-----------
+Submodules: score
+-----------------

-.. automodule:: articubench.evaluate
+.. automodule:: articubench.score
:members:
:undoc-members:
:show-inheritance:


Submodules: control_models
--------------------------

.. automodule:: articubench.control_models
:members:
:undoc-members:
:show-inheritance:


Submodules: util
----------------

.. automodule:: articubench.util
:members:
:undoc-members:
:show-inheritance:
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -21,6 +21,7 @@

 extensions = [
     "sphinx.ext.autodoc",
+    "sphinx.ext.mathjax",
 ]

templates_path = ["_templates"]
48 changes: 48 additions & 0 deletions docs/source/control_models.rst
@@ -0,0 +1,48 @@
Control models
==============

Articubench provides multiple control models that generate control parameter trajectories (CPs) for the VocalTractLab. Currently the "Baseline", "Segment-based" and "PAULE" models are implemented, while the "Inverse" model is still in development.

Baseline
--------

The Baseline model is a simple model that always generates the same CPs for the VocalTractLab, which produce a neutral-sounding schwa when synthesized.

This model is used as a reference point for the other models to compare against.

Inputs for all tasks (see the sketch below):
- Sequence length, which equals the length of the target CPs (or half the signal length)
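
A minimal sketch of what such a baseline could look like (the function name, signature, and the all-zero neutral posture are illustrative assumptions, not the actual articubench API):

.. code-block:: python

    import numpy as np

    # Hypothetical neutral (schwa-like) posture; the real values would come
    # from VTL's neutral vocal tract shape rather than zeros.
    NEUTRAL_CP = np.zeros(30)  # 30 control parameters per 2.5 ms timeframe

    def baseline_control_model(seq_length, target_semantic_vector=None,
                               target_sig=None, target_sr=None):
        """Ignore all targets and repeat the same posture for every timeframe."""
        return np.tile(NEUTRAL_CP, (seq_length, 1))  # shape: (seq_length, 30)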


Segment-based
-------------

The Segment-based model uses the Montreal Forced Aligner (MFA) to generate CPs. Starting from the original audio and a text file containing the spoken word, the MFA maps the text to phonemes and aligns those phonemes with the audio; the resulting phoneme segments are then mapped to CPs. This generally yields smooth CPs that are quite a good approximation of the original audio.

Inputs can be:
- target semantic vector
- target audio signal with sample rate

If no target semantic vector is provided, the model uses the target audio signal to generate CPs; otherwise it always produces the CPs from the semantic vector.
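
A sketch of this dispatch logic (all helper functions are hypothetical placeholders for the MFA pipeline described above):

.. code-block:: python

    def segment_based_control_model(seq_length, target_semantic_vector=None,
                                    target_sig=None, target_sr=None):
        if target_semantic_vector is not None:
            # hypothetical: look up the word whose embedding is closest
            word = nearest_word(target_semantic_vector)
        else:
            # hypothetical: recover the spoken word from the target audio
            word = transcribe(target_sig, target_sr)
        # hypothetical: MFA alignment of word and audio, then segment-to-CP mapping
        return mfa_segments_to_cps(word, target_sig, target_sr, seq_length)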

PAULE
-----

PAULE is a machine learning model that generates CPs based on the given task. It uses a forward model to map CPs to audio, an inverse model to map audio to CPs, and an embedder to map audio into a semantic space.
It can also keep learning, updating its model weights on the given inputs.

Inputs can be:
- target semantic vector (for semantic-only task)
- target audio signal with sample rate (for acoustic-only task)
- both (for acoustic-semvec task)

There are currently two PAULE implementations available, the "Fast" and the "Acoustic-Semvec" model. The "Fast" model is simply a PAULE model with very short training runs, while the "Acoustic-Semvec" model goes through multiple full training cycles.
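
The core planning idea can be sketched as gradient-based optimization of the CPs through the forward model and the embedder (a simplified illustration; the actual implementation lives in the `paule` package and differs in many details):

.. code-block:: python

    import torch

    def plan_cps(forward_model, embedder, init_cps, target_semvec,
                 steps=100, lr=0.01):
        """Adjust the CPs until the embedding of the predicted audio
        matches the target semantic vector."""
        cps = init_cps.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([cps], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            pred_audio = forward_model(cps)     # forward model: CPs -> audio features
            pred_semvec = embedder(pred_audio)  # embedder: audio -> semantic space
            loss = torch.nn.functional.mse_loss(pred_semvec, target_semvec)
            loss.backward()                     # gradients flow back into the CPs
            optimizer.step()
        return cps.detach()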

Inverse
-------

The Inverse model is a part of PAULE that generates CPs directly from audio signals. It is currently in development and not yet available for use.

Inputs:
- target audio signal with sample rate
39 changes: 38 additions & 1 deletion docs/source/data.rst
@@ -6,12 +6,49 @@ Overview of the data sets included in articubench.

Rationale
=========
-There is a `tiny`, `small`, and a `normal` data set. The purpose of the data sets is...
+Articubench includes two primary datasets: tiny and small. These datasets are designed to support a range of benchmarking tasks from preliminary checks to statistically significant evaluations. Each dataset serves a specific purpose:

- Tiny Dataset: The tiny dataset is a minimal set designed to ensure that models can correctly process inputs and produce reasonable outputs. It is primarily used for quick validation checks during development.

- Small Dataset: The small dataset provides enough data to conduct meaningful benchmarking with some statistical power. It is used to compare the performance of different models on a controlled, yet manageable, dataset.

Data sets
=========

Tiny
----

The tiny dataset is a minimal subset of the full data intended for quick and basic functionality tests. It includes only four rows, each representing a different data instance. The primary purpose of this dataset is to verify that the provided models can correctly take in data and return reasonable results without errors. The small size allows for rapid iteration and debugging during model development.


Each row in the tiny dataset includes the following columns:

- file: The name of the audio file associated with the data point.
- label: The label or class associated with the data point.
- target_semantic_vector: A vector representing the semantic content of the target audio.
- target_sig: The waveform of the target audio signal.
- target_sr: The sample rate of the target audio signal.
- len_cp: The length of the control parameter (cp), calculated based on the duration of the target signal.
- reference_cp: The reference control parameters, normalized and truncated to match the length of the target signal.
- reference_tongue_height: Placeholder for future data related to the tongue height (currently set to None).
- reference_ema_TT: Placeholder for Electromagnetic Articulography (EMA) data for the tongue tip (TT), currently None.
- reference_ema_TB: Placeholder for EMA data for the tongue body (TB), currently None.
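
A sketch of how the tiny dataset could be inspected once loaded (the path and the pickle format are assumptions made for illustration):

.. code-block:: python

    import pandas as pd

    # hypothetical location/format; articubench bundles its data internally
    tiny = pd.read_pickle("articubench/data/tiny.pkl")
    print(tiny.shape)  # (4, 10): four rows, the ten columns listed above
    print(tiny[["file", "label", "target_sr", "len_cp"]])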

Small
-----

The small dataset is a more comprehensive subset, consisting of approximately 2,000 rows. This dataset is a mixture of a subset of the GECO word corpus and 1,800 pronunciations of the words "ja" and "halt" with their respective EMA points.


The small dataset includes the same columns as the tiny dataset, but with the following additional details:

- file: Names of files from the GECO corpus and specific pronunciations of "ja" and "halt."
- label: Corresponding labels for each file.
- target_semantic_vector: Vectors representing the semantic content of each target.
- target_sig: The waveform of each target audio signal.
- target_sr: Sample rates for each target signal.
- len_cp: Control parameter lengths adjusted for the target signal's duration.
- reference_cp: Normalized control parameters, truncated to the signal's length.
- reference_tongue_height: Placeholder for future tongue height data (currently set to None).
- reference_ema_TT: EMA data for the TT of "ja"/"halt", otherwise None.
- reference_ema_TB: EMA data for the TB of "ja"/"halt", otherwise None.
4 changes: 4 additions & 0 deletions docs/source/index.rst
@@ -8,9 +8,13 @@
:caption: Contents:

quickstart
overview
installation
control_models
data
scores
tasks
usage
development
articubench
about
30 changes: 30 additions & 0 deletions docs/source/overview.rst
@@ -0,0 +1,30 @@
Overview
================

Articubench is a benchmark for evaluating articulatory speech synthesis systems. It uses the VocalTractLab (VTL) as its articulatory speech synthesis simulator.
First, a control model such as PAULE generates control parameter trajectories (CPs) for the VocalTractLab. These CPs are then simulated by the VTL to generate audio.
Finally, the audio is mapped by an embedder onto a semantic embedding space to infer the intended meaning of the synthesized audio, which is compared with the original target meaning.

Originally the benchmark was designed to evaluate the PAULE model on three tasks: acoustic only (copy-synthesis), semantic only, and semantic-acoustic.
As long as certain requirements are met, however, the benchmark can be used to evaluate any control model that generates CPs for the VocalTractLab.
How to use the benchmark is described in the `Usage` section; an illustrative sketch follows below.
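
As an illustration only (the entry point and signature are assumptions; consult the `Usage` section for the real API), plugging in a custom control model could look roughly like this:

.. code-block:: python

    import numpy as np

    def my_control_model(seq_length, target_semantic_vector=None,
                         target_sig=None, target_sr=None):
        # must return VTL control parameters of shape (seq_length, 30)
        return np.zeros((seq_length, 30))

    # hypothetical benchmark call, e.g.:
    # from articubench.score import score
    # results = score(my_control_model, tasks=("copy-synthesis",), size="tiny")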

After the CPs are generated, the VTL 'speaks' them, which gives us a signal with a sample rate of 44,100 Hz. Furthermore, tongue height and EMA data are generated by the VTL from the CPs.

The benchmark then calculates the scores described in the `Scores` section to evaluate the performance of the control model.

Since all scores are calculated by comparison to a baseline model that always produces a schwa no matter the input, the baseline model can unexpectedly outperform the control model in some cases.

Specifically, the jerk and velocity losses of the CPs and the loudness envelope can be quite good whenever a very small dataset contains audio similar to a schwa.



Implementation Notes
--------------------

- All tasks currently operate on word-level inputs
- cp-trajectories must match VocalTractLab requirements:
- 30 control parameters per timeframe
- 2.5ms timeframe resolution (110 samples at 44.1kHz)
- Since CP resolution is higher than our EMA and ultrasound data, the model uses 1d interpolation to match the required resolution (see the sketch after this list)
- Articubench uses multi-processing for all score and data calculations, apart from the initial CP generation, which is done sequentially on the GPU since the model trains for each CP individually
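
A minimal sketch of such a 1d interpolation between sampling grids (illustrative; assumes plain linear interpolation with NumPy):

.. code-block:: python

    import numpy as np

    def match_resolution(values, target_len):
        """Linearly interpolate a 1d series (e.g. EMA or tongue height)
        onto a grid with ``target_len`` points (e.g. the 2.5 ms CP grid)."""
        source_t = np.linspace(0.0, 1.0, num=len(values))
        target_t = np.linspace(0.0, 1.0, num=target_len)
        return np.interp(target_t, source_t, values)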