feat(data): add data samples
AmitMY committed Oct 25, 2024
1 parent 828f008 commit 6157727
Showing 4 changed files with 89,129 additions and 762 deletions.
47 changes: 29 additions & 18 deletions data/README.md
@@ -1,47 +1,58 @@
# Data

Data includes MediaPipe poses of videos from multiple sources, transcribed using SignWriting.

Data is retrieved from the database using `get_data.py`.

### Sources

- ChicagoFSWild - about 50,000 fingerspelled signs with low-quality transcriptions. There is no specific file-name indicator, except that the transcriptions use only hand symbols.
- dictio - about 36,000 videos. Pose files start with "dictio" (file-name prefixes can be used to split the data by source, as sketched after this list). Every sign has two videos, one from a direct angle and one from a side angle (unmarked).
- Sign2MINT - about 5,000 isolated signs from Sign2MINT. Pose files start with "s2m".
- SignSuisse - about 4,000 isolated signs from SignSuisse. Pose files start with "ss".
- FLEURS-ASL - about 200 extremely high-quality continuous sign language transcriptions with detailed facial expressions. Pose files start with "fasl".
- `19097be0e2094c4aa6b2fdc208c8231e.pose` comes from [Why SignWriting?](https://www.youtube.com/watch?v=Mtl7dmyHgJU), and demonstrates transcription of continuous sign language.
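
These file-name prefixes make it easy to split the data by source. A minimal sketch, assuming the `data.csv` produced by `get_data.py` has a `pose` column holding the pose file name (as it is used in `collect_poses.py`); the `source_of` helper is hypothetical:

```python
from collections import Counter
from csv import DictReader

# Hypothetical helper: map a pose file name to its source using the prefixes above
def source_of(pose_name: str) -> str:
    prefixes = {"dictio": "dictio", "s2m": "Sign2MINT", "ss": "SignSuisse", "fasl": "FLEURS-ASL"}
    for prefix, source in prefixes.items():
        if pose_name.startswith(prefix):
            return source
    return "other"  # e.g. ChicagoFSWild, which has no file-name prefix

with open("data.csv", "r", encoding="utf-8") as f:
    data = list(DictReader(f))

# Count how many entries each source contributes
print(Counter(source_of(datum["pose"]) for datum in data))
```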

## Poses

Poses are collected using `collect_poses.py` and are available to download from [Google Cloud Storage](https://firebasestorage.googleapis.com/v0/b/sign-language-datasets/o/poses%2Fholistic%2Ftranscription.zip?alt=media).

It is recommended to pre-process the poses when using them for training. For example:
```python
from pose_format import Pose
from pose_format.utils.generic import pose_normalization_info, correct_wrists, reduce_holistic

# Load the full pose video
with open('example.pose', 'rb') as pose_file:
    pose = Pose.read(pose_file.read())

# Or load based on start and end time (faster)
with open('example.pose', 'rb') as pose_file:
    pose = Pose.read(pose_file.read(), start_time=0, end_time=10)

# Ideal for experimentation, but should not be used for the final model
## Remove legs, simplify face
pose = reduce_holistic(pose)
## Align hand wrists with body wrists
correct_wrists(pose)

# This should always be applied
## Adjust pose based on shoulder positions
pose = pose.normalize(pose_normalization_info(pose.header))
```
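
If the pre-processed pose should be reused across runs, it can also be written back to disk (the `.posebody` file name here is just an example):

```python
# Save the pre-processed pose for later reuse
with open('example.posebody', 'wb') as pose_file:
    pose.write(pose_file)
```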

## Issues to be aware of:

- `.pose` files are not normalized, and are not centered around the origin.

----

Not sure if relevant anymore:

## Automatic Segmentation

Most annotations come from single sign videos with the annotation spanning the entire video.
However, in real use cases, we would like to transcribe continuous signing, and training on full single-sign videos might not yield correct results.

We automatically segment the single-sign videos using [sign-language-processing/segmentation](https://github.com/sign-language-processing/segmentation)
to extract the sign boundary. Where successful, we record the new sign segments in `data_segmentation.csv` and use them for additional training data.
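
Where segments are available, a pose can be cropped to the detected boundary at load time. A minimal sketch, assuming hypothetical `pose`, `start`, and `end` columns in `data_segmentation.csv` (file name plus segment boundaries, in whatever time unit `Pose.read` expects):

```python
from csv import DictReader

from pose_format import Pose

with open("data_segmentation.csv", "r", encoding="utf-8") as f:
    segments = list(DictReader(f))

# Crop the first segment's pose file to its detected sign boundary
segment = segments[0]
with open(segment["pose"], 'rb') as pose_file:
    pose = Pose.read(pose_file.read(),
                     start_time=float(segment["start"]),
                     end_time=float(segment["end"]))
```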

## Issues

- `.pose` files do not allow `float` fps values, only `int` fps values.
  Therefore, every annotation that starts at $0$ should be assumed to end at the end of the video.
- `19097be0e2094c4aa6b2fdc208c8231e.pose` comes from [Why SignWriting?](https://www.youtube.com/watch?v=Mtl7dmyHgJU),
and demonstrates transcription of continuous sign language. The actual frame rate is `29.970030`.
Therefore, it should only be used for testing continuous sign language transcription.
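
As a rough illustration of the fps limitation, rounding `29.970030` to an integer shifts frame indices derived from timestamps (a small worked example, not from the repository):

```python
true_fps = 29.970030  # actual frame rate of the video
stored_fps = 30       # `.pose` files can only store an int fps

# The same 60-second timestamp maps to different frame indices:
print(round(60 * true_fps))    # 1798
print(round(60 * stored_fps))  # 1800
# The drift grows with time, so timestamps derived from the stored fps are approximate.
```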
4 changes: 2 additions & 2 deletions data/collect_poses.py
@@ -7,10 +7,10 @@
with open("data.csv", "r", encoding="utf-8") as f:
data = list(DictReader(f))

poses_location = Path("/scratch/amoryo/poses/sign-mt-poses")
poses_location = Path("/Volumes/Echo/GCS/sign-mt-poses")

unique_poses = set(datum["pose"] for datum in data)
# Create a zip file with all the poses
with zipfile.ZipFile("poses.zip", "w") as poses_zip:
with zipfile.ZipFile("/Volumes/Echo/GCS/sign-language-datasets/poses/holistic/transcription.zip", "w") as poses_zip:
for pose_name in tqdm(unique_poses):
poses_zip.write(poses_location / pose_name, pose_name)