Cyber is a model implementation that integrates state-of-the-art (SOTA) world models with the proposed CyberOrigin Dataset.
Follow this document to train the models on our readily available data, or adapt your own data for training.
Our data includes information from home services, the logistics industry, and laboratory scenarios. For more details, please refer to our Official Data Website.
- Format & Description
Currently, the dataset contains image tokens generated by Magvit2. For more information, please refer to the dataset card on Hugging Face.
- Download the Dataset
The dataset is currently available on Hugging Face.
- 🤗 Pipette
- 🤗 Take Item
- 🤗 Twist Tube
- 🤗 Fold Towels
You can download the dataset using the following command:
bash ../scripts/download_dataset.sh
- Visualize the Dataset
You can visualize the dataset using this notebook. Make sure to install Jupyter before running the notebook.
pip install jupyter notebook
The following steps will guide you through training a GENIE dynamic model on the CyberOrigin dataset.
Local Training
To train on your local machine using a single GPU, run:
python models/world/train_dynamic.py --data_dir data/cyber_pipette/data
Note: The model will train on the default configuration provided.
Training on Sagemaker
Using AWS Sagemaker for training allows you to leverage multiple GPUs on the cloud to speed up training. To train on Sagemaker, follow the instructions in the Sagemaker README.
The code is adapted from 1x's implementation of GENIE. The model is based on an ST-transformer architecture that predicts the next frame given the previous frames.
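As a rough illustration of what "ST-transformer" means here, the sketch below lists which tokens a given token can attend to. It assumes the usual spatial-temporal factorization: a spatial layer attends within a single frame, and a causal temporal layer attends to the same spatial position in the current and earlier frames. The function names are hypothetical and are not taken from this repository, whose actual implementation may differ in detail.

```python
def spatial_keys(t, tokens_per_frame):
    """Keys visible to any token of frame t in a spatial layer:
    every token position of that same frame."""
    return [(t, s) for s in range(tokens_per_frame)]


def temporal_keys(t, s):
    """Keys visible to token (t, s) in a causal temporal layer:
    the same spatial position in frames 0..t (no future frames)."""
    return [(t_prev, s) for t_prev in range(t + 1)]


def attention_pairs(num_frames, tokens_per_frame):
    """Unmasked query-key pair counts: full attention over all tokens
    versus one spatial layer plus one temporal layer."""
    n = num_frames * tokens_per_frame
    full = n * n
    factorized = n * (tokens_per_frame + num_frames)
    return full, factorized


# Toy example: frame index 2, three tokens per frame.
print(spatial_keys(2, 3))
print(temporal_keys(2, 1))

# Sizes from the default configuration (T=16, S=256).
full, factorized = attention_pairs(num_frames=16, tokens_per_frame=256)
print(full, factorized)
```

Under these assumptions, the factorized pattern touches far fewer query-key pairs than full attention over all T x S tokens, which is the usual motivation for this design.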
Model parameters tuning
The detailed configuration file is provided in the configs/models/world folder.
{
"num_layers": 32, // number of ST-transformer blocks
"num_heads": 8, // number of heads in multi-head attention
"d_model": 256, // dimension of the model latent
"T": 16, // number of frames in the input sequence
"S": 256, // number of tokens in the input sequence S=16x16
"image_vocab_size": 262144, // codebook size for the image tokens
"use_mup": false, // whether to use muP (maximal update parametrization)
"num_factored_vocabs": 2, // number of factored vocabularies
"qkv_bias": false, // whether to use bias in qkv projection
"proj_bias": true, // whether to use bias in projection
"attn_drop": 0, // dropout rate in attention
"qk_norm": false, // whether to normalize qk
"mlp_ratio": 4, // ratio of hidden size to model latent size in MLP
"mlp_drop": 0, // dropout rate in MLP
"mlp_bias": true // whether to use bias in MLP
}
It is recommended to only modify the first three parameters to adjust model size.
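To get a feel for how these parameters affect model size, here is a back-of-the-envelope sketch. It assumes each ST-transformer block holds one spatial attention, one temporal attention, and one MLP, and it ignores biases, normalization layers, embeddings, and output heads, so the result is only an order-of-magnitude estimate. It also assumes the image vocabulary is split into equally sized factored vocabularies. The helper names are hypothetical and do not come from this repository.

```python
def estimate_block_params(d_model, mlp_ratio=4):
    """Approximate weight count of one block (assumed layout)."""
    attn = 4 * d_model * d_model             # q, k, v and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up and down projections
    return 2 * attn + mlp                    # spatial attn + temporal attn + MLP


def estimate_model_params(num_layers, d_model, mlp_ratio=4):
    """Approximate weight count of the whole block stack."""
    return num_layers * estimate_block_params(d_model, mlp_ratio)


# Values from the default configuration above.
num_layers, num_heads, d_model = 32, 8, 256
image_vocab_size, num_factored_vocabs = 262144, 2

total = estimate_model_params(num_layers, d_model)
print(f"approx. block-stack parameters: {total:,}")

# In this estimate num_heads does not change the weight count; it only sets
# the per-head dimension, so d_model must be divisible by num_heads.
assert d_model % num_heads == 0
head_dim = d_model // num_heads
print("per-head dimension:", head_dim)

# Assumed equal split of the codebook into factored vocabularies.
factored_vocab_size = round(image_vocab_size ** (1 / num_factored_vocabs))
assert factored_vocab_size ** num_factored_vocabs == image_vocab_size
print("entries per factored vocabulary:", factored_vocab_size)
```

In this approximation the weight count grows linearly with `num_layers` and quadratically with `d_model`, so `d_model` is the more sensitive of the two when scaling the model up.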
Training parameters tuning
Please refer to the help message for hyperparameter descriptions:
python models/world/train.py -h
The code is modified from 1XGPT and Open-MAGVIT2, with unnecessary files and code removed.
Pretrained checkpoint
Download the checkpoint HERE, or run the command:
huggingface-cli download TencentARC/Open-MAGVIT2 imagenet_256_L.ckpt --repo-type dataset --local-dir ./experiments/
Try with our provided samples
We provide a notebook for compressing and decompressing your own video. Open autoencoder_demo.ipynb and follow the instructions.
Compress your video data
Use the commands below to encode and decode your data.
Compress videos to tokens:
python experiments/notebooks/compress_and_recon.py --config_file experiments/configs/models/world/openmagvit2.yaml --ckpt_path path/to/ckpt/file --video_path path/to/video/file --save_dir path/to/output/file --mode encode
Reconstruct videos from tokens:
python experiments/notebooks/compress_and_recon.py --config_file experiments/configs/models/world/openmagvit2.yaml --ckpt_path path/to/ckpt/file --tokens_path path/to/tokens/file --save_dir path/to/output/file --mode decode
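When many files need processing, it can be convenient to assemble these two commands programmatically. The sketch below only builds the argument list, using the script path, config path, and flags shown above. `build_command` is a hypothetical helper, and the example paths are placeholders rather than files shipped with the repository.

```python
import shlex

SCRIPT = "experiments/notebooks/compress_and_recon.py"
CONFIG = "experiments/configs/models/world/openmagvit2.yaml"


def build_command(mode, ckpt_path, input_path, save_dir):
    """Return the argument list for one encode or decode run.

    encode expects a video file, decode expects a tokens file.
    """
    if mode == "encode":
        input_flag = "--video_path"
    elif mode == "decode":
        input_flag = "--tokens_path"
    else:
        raise ValueError(f"mode must be 'encode' or 'decode', got {mode!r}")
    return [
        "python", SCRIPT,
        "--config_file", CONFIG,
        "--ckpt_path", ckpt_path,
        input_flag, input_path,
        "--save_dir", save_dir,
        "--mode", mode,
    ]


# Placeholder paths for illustration only.
cmd = build_command("encode", "path/to/ckpt/file", "path/to/video/file",
                    "path/to/output/file")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the run from the repository root.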
Model training
image-folder
├── image_1.png
├── image_2.png
├── image_3.png
├── image_4.png
├── ...
The following command trains the Open-MAGVIT2 tokenizer on your own image dataset. Make sure your data follows the folder structure shown above.
python experiments/models/world/train_openmagvit2.py --config experiments/configs/models/world/openmagvit2.yaml --data_dir path/to/image/folder --output_dir path/to/output/folder
Please refer to openmagvit2.yaml for more hyperparameter descriptions.
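Before starting a long training run, it may be worth checking that the image folder matches the flat layout shown above. The following sketch assumes the folder should contain PNG files directly at its top level; whether the training script accepts other formats or nested folders is not stated here. `scan_image_folder` is a hypothetical helper, not part of this repository.

```python
import tempfile
from pathlib import Path


def scan_image_folder(image_dir, suffix=".png"):
    """Split the top-level entries of image_dir into images and everything else."""
    entries = sorted(Path(image_dir).iterdir())
    images = [p for p in entries if p.is_file() and p.suffix.lower() == suffix]
    skipped = [p for p in entries if p not in images]
    if not images:
        raise ValueError(f"no {suffix} files found directly under {image_dir}")
    return images, skipped


# Demo on a throwaway folder so the sketch runs without real data.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("image_2.png", "image_1.png", "notes.txt"):
        (Path(tmp) / name).touch()
    images, skipped = scan_image_folder(tmp)
    image_names = [p.name for p in images]
    skipped_names = [p.name for p in skipped]

print("images:", image_names)
print("skipped:", skipped_names)
```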
The code is modified from Cosmos-Tokenizer, with unnecessary files and code removed. Currently, Cosmos-Tokenizer is available for inference only.
Pretrained checkpoint
Download the checkpoint HERE, or use the snippet below:
import os

from huggingface_hub import login, snapshot_download

# Authenticate with your Hugging Face access token.
login(token="<YOUR-HF-TOKEN>", add_to_git_credential=True)

model_names = [
    "Cosmos-Tokenizer-CI8x8",
    "Cosmos-Tokenizer-CI16x16",
    "Cosmos-Tokenizer-CV4x8x8",
    "Cosmos-Tokenizer-CV8x8x8",
    "Cosmos-Tokenizer-CV8x16x16",
    "Cosmos-Tokenizer-DI8x8",
    "Cosmos-Tokenizer-DI16x16",
    "Cosmos-Tokenizer-DV4x8x8",
    "Cosmos-Tokenizer-DV8x8x8",
    "Cosmos-Tokenizer-DV8x16x16",
]

# Download each tokenizer into its own folder under pretrained_ckpts/.
for model_name in model_names:
    hf_repo = "nvidia/" + model_name
    local_dir = "pretrained_ckpts/" + model_name
    os.makedirs(local_dir, exist_ok=True)
    print(f"downloading {model_name}...")
    snapshot_download(repo_id=hf_repo, local_dir=local_dir)
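The model names above encode the tokenizer type. As a reading aid, the sketch below parses a name under the Cosmos-Tokenizer naming convention as we understand it: C or D for continuous or discrete latents, I or V for image or video, followed by the compression factors (spatial only for images, temporal then spatial for videos). `parse_tokenizer_name` is a hypothetical helper and is not part of the Cosmos-Tokenizer package.

```python
import re

_NAME = re.compile(r"^Cosmos-Tokenizer-([CD])([IV])(\d+(?:x\d+)+)$")


def parse_tokenizer_name(name):
    """Decode a Cosmos-Tokenizer model name into its properties."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized tokenizer name: {name!r}")
    kind, modality, dims = match.groups()
    factors = [int(x) for x in dims.split("x")]
    expected = 3 if modality == "V" else 2
    if len(factors) != expected:
        raise ValueError(f"expected {expected} factors in {name!r}")
    # Video names lead with the temporal factor; image names have none.
    temporal = factors[0] if modality == "V" else 1
    spatial = factors[-1]
    return {
        "latent": "continuous" if kind == "C" else "discrete",
        "modality": "video" if modality == "V" else "image",
        "temporal_compression": temporal,
        "spatial_compression": spatial,
    }


print(parse_tokenizer_name("Cosmos-Tokenizer-DV4x8x8"))
print(parse_tokenizer_name("Cosmos-Tokenizer-CI16x16"))
```

For example, a discrete video tokenizer is the natural choice when the downstream model consumes token indices, while the continuous variants produce latent vectors.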
Try with our provided samples
We provide a notebook for encoding your data into discrete tokens or a continuous latent space, and for decoding tokens for visualization. Open cosmos_demo.ipynb and follow the instructions.