Under Review
Website | arXiv | Model Checkpoints | Dataset | Model Card
As robots that follow natural language become more capable and prevalent, we need a benchmark to holistically develop and evaluate their ability to solve long-horizon mobile manipulation tasks in large, diverse environments. Robots must use visual and language understanding, navigation, and manipulation capabilities to tackle this challenge. Existing datasets do not integrate all these aspects, restricting their efficacy as benchmarks. To address this gap, we present the Language, Navigation, Manipulation, Perception (LaNMP) dataset and demonstrate the benefits of integrating these four capabilities and various modalities. LaNMP comprises 574 trajectories across eight simulated and real-world environments for long-horizon room-to-room pick-and-place tasks specified by natural language. Every trajectory consists of over 20 attributes, including RGB-D images, segmentations, and the poses of the robot body, end-effector, and grasped objects. We fine-tuned and tested two models in simulation and on a physical robot to demonstrate its efficacy in development and evaluation. The models perform suboptimally compared to humans across various metrics, indicating significant room for developing better multimodal mobile manipulation models using our benchmark.
More detailed dataset information can be found in the dataset card DataCard.md.
Download the dataset from this DropBox.
Code that opens, reads, and displays the dataset contents can be found in this Google Colab notebook.
The simulation dataset comes in a single hdf5 file, and has the following hierarchy:
sim_dataset.hdf5/
├── data_11:11:28/
│ ├── folder_0
│ ├── folder_1
│ └── folder_2
├── data_11:14:08/
│ ├── folder_0
│ └── ...
└── ...
Under each folder, there are three main numpy files: `depth_<num>`, `inst_seg_<num>`, and `rgb_<num>`, which correspond to the depth image, segmentation image, and RGB image, respectively.
Under the metadata for each folder, there is a dumped json describing other metadata of each time step. The detailed metadata can be found in the dataset card.
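A minimal sketch of walking the simulation hierarchy described above. It assumes the file is opened with the h5py package (an assumption; the Colab notebook remains the authoritative reader), and the helper names `iter_steps`, `group_keys`, and `open_sim_dataset` are hypothetical. The traversal only uses the mapping interface (`items()`, `keys()`), so it works on an open HDF5 file as well as on plain nested dictionaries.

```python
def iter_steps(root):
    """Yield (trajectory_name, folder_name, folder) for every folder in the file."""
    for traj_name, traj in root.items():          # e.g. "data_11:11:28"
        for folder_name, folder in traj.items():  # e.g. "folder_0"
            yield traj_name, folder_name, folder


def group_keys(folder):
    """Sort a folder's entries into depth, inst_seg and rgb keys by name prefix."""
    groups = {"depth": [], "inst_seg": [], "rgb": []}
    for key in folder.keys():
        for prefix in groups:
            if key.startswith(prefix + "_"):
                groups[prefix].append(key)
                break
    return groups


def open_sim_dataset(path="sim_dataset.hdf5"):
    """Open the file read-only. Requires h5py (assumed, not confirmed here)."""
    import h5py  # imported lazily so the helpers above work without it
    return h5py.File(path, "r")
```

With h5py installed, `for traj, name, folder in iter_steps(open_sim_dataset()): ...` would visit every time step in turn.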
Similarly, the real dataset also comes in a single hdf5 file, and has the following hierarchy:
real_dataset.hdf5/
└── FloorTrajectories/
├── data_00/
│ ├── folder_10/
│ │ ├── gripper_depth_10
│ │ ├── gripper_image_10
│ │ ├── left_fisheye_depth_10
│ │ ├── left_fisheye_image_10
│ │ ├── right_fisheye_depth_10
│ │ ├── right_fisheye_image_10
│ │ └── metadata
│ └── folder_11/
│ ├── gripper_depth_10
│ ├── gripper_image_10
│ └── ...
├── data_01/
│ └── folder_10/
│ └── ...
└── ...
Note that the right fisheye is located on the right side of the robot, but points towards the left side. So the right fisheye produces the left half of the image, and the left one produces the right half.
The images have the following sizes:
key | shape |
---|---|
gripper_depth_10 | (480, 640) |
gripper_image_10 | (480, 640, 3) |
left_fisheye_depth_10 | (240, 424) |
left_fisheye_image_10 | (640, 480, 3) |
right_fisheye_depth_10 | (240, 424) |
right_fisheye_image_10 | (640, 480, 3) |
The detailed metadata can be found in the dataset card.
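The shapes in the table above can serve as a quick sanity check after loading a folder. The sketch below is illustrative only: `check_shapes` is a hypothetical helper, and it assumes that the trailing `_<num>` in each key is a step index that varies between folders and that each entry exposes a `.shape` attribute (as numpy arrays and HDF5 datasets do).

```python
EXPECTED_SHAPES = {
    "gripper_depth": (480, 640),
    "gripper_image": (480, 640, 3),
    "left_fisheye_depth": (240, 424),
    "left_fisheye_image": (640, 480, 3),
    "right_fisheye_depth": (240, 424),
    "right_fisheye_image": (640, 480, 3),
}


def check_shapes(folder):
    """Return (key, actual_shape, expected_shape) for every mismatching entry."""
    problems = []
    for key, array in folder.items():
        if key == "metadata":
            continue
        # Strip the trailing "_<num>" suffix, e.g. "gripper_depth_10" -> "gripper_depth".
        name, _, _ = key.rpartition("_")
        expected = EXPECTED_SHAPES.get(name)
        if expected is not None and tuple(array.shape) != expected:
            problems.append((key, tuple(array.shape), expected))
    return problems
```

An empty list means every recognized entry in the folder has the documented shape.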
cd collect_sim
pip install -r sim_reqs.txt
cd custom_ai2thor_lib_code
- Move the files to the ai2thor library folder in the virtual environment
- Collect data: `python mani.py --scene "<scene number>" --command "<natural language command>"`. Use the following keys to move in the simulator:
- WASD: move the robot base
- J/L: rotate the robot left/right
- I/K: move the robot head up/down
- G: grasp
- R: release
- Up arrow/down arrow: move the robot shoulder up/down
- 7/4: move the end-effector left/right
- 8/5: move the end-effector up/down
- 9/6: move the end-effector forward/backward
- Q: end collection and save data
- CTRL+C: restart collection without saving
cd collect_real
conda create --name <env> --file spot_env.txt
- Create a map using `python record_env_graph.py`. See this for more details on how to record the map.
- Collect data using the map: `python collect_spot_data.py -u <map folder> -t "<natural language command>"`
The RT-1 model from the paper "RT-1: Robotics Transformer for Real-World Control at Scale" by Brohan et al. was modified and fine-tuned on LaNMP. This model was trained and run on an NVIDIA 3090 GPU.
This is a forked implementation of RT-1 (Robotics Transformer), originally inspired by the Google Research paper.
This implementation of RT-1 was pretrained on the Bridge dataset and further fine-tuned on our LaNMP dataset for evaluation. Details of the repository are given below.
git clone git@github.com:h2r/LaNPM-Dataset.git
cd models/main_models/rt1
pip install -e .
This repository has 7 critical files/folders whose use cases are described below:
- `main.py`: used to pretrain RT-1 on the Bridge dataset. Modifying this file to accommodate different datasets requires changing the `observation_space` and `action_space` according to the dataset being loaded, as well as changing the dataset keys in `rt1_pytorch/tokenizers/action_tokenizer.py`. Running this file saves a series of checkpoints and logs losses using Weights and Biases.
- `main_ft.py`: used to finetune RT-1 on the LaNMP dataset. This file has the `observation_space`, `action_space`, and PyTorch `DataLoader` already modified to accommodate finetuning on the LaNMP dataset (AI2Thor). Running this file saves a series of checkpoints and logs losses using Weights and Biases.
- `main_ft_eval.py`: used to run RT-1 in inference mode on the LaNMP dataset. This file has the `observation_space`, `action_space`, and PyTorch `DataLoader` already modified to accommodate the LaNMP dataset (AI2Thor). The file loads each checkpoint saved during finetuning in turn and runs RT-1 in inference mode on the validation dataset for that checkpoint. The script logs the test losses using Weights and Biases.
- `ai2thor_env.py`: contains a Gym-style environment class to load and take steps in the AI2Thor environment. This file is used to generate real-time trajectories based on the action tokens generated by a finetuned RT-1 model (specific to AI2Thor). The main `step()` function executes the action generated by RT-1 and returns a success message along with information about the environment state, e.g. object or agent metadata, which can be saved to capture the trajectory taken by the agent for a given task.
- `rollout_ai2thor.py`: interfaces between the finetuned RT-1 model (from a checkpoint loaded after finetuning on LaNMP) and the `ai2thor_env.py` Gym environment, in order to send observations from the AI2Thor environment to RT-1 and execute the action tokens proposed by RT-1 in AI2Thor. Note that this file should not be run on a headless machine, since it deploys the AI2Thor simulator GUI.
- `rt1_pytorch/rt1_policy.py`: contains the RT-1 model implementation in PyTorch. The `loss()` function performs the forward pass of RT-1 for training and the `act()` function performs the forward pass during inference.
- `lanmp_dataloader/rt1_dataloader.py`: contains the `DatasetManager` class that extracts trajectories from the LaNMP `sim_data.hdf5` dataset file. The script automatically separates train and validation subsets according to different splits, e.g. k-fold by scene, task-wise, or for diversity ablation. The `DatasetManager` also handles tokenizing/detokenizing the raw trajectory data into 256 discrete buckets, and chunks trajectories into non-overlapping windows of 6 steps.
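The tokenizing and chunking performed by `DatasetManager` can be illustrated with a small sketch. This is not the repository's implementation: it assumes uniform buckets over a known `[low, high]` range with out-of-range values clipped, detokenization to the bucket center, and that a shorter final window is kept rather than dropped. The function names are hypothetical.

```python
def tokenize(value, low, high, num_bins=256):
    """Map a continuous value to an integer bucket in [0, num_bins - 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The upper bound itself would land in bucket num_bins, so cap it.
    return min(int(fraction * num_bins), num_bins - 1)


def detokenize(token, low, high, num_bins=256):
    """Map a bucket index back to the center of its interval."""
    width = (high - low) / num_bins
    return low + (token + 0.5) * width


def chunk(trajectory, window=6):
    """Split a trajectory into consecutive, non-overlapping windows."""
    return [trajectory[i:i + window] for i in range(0, len(trajectory), window)]
```

Under these assumptions a round trip through `tokenize` and `detokenize` changes an in-range value by at most half a bucket width, which is the price of discretizing into 256 buckets.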
Most relevant files in this repository accept the same set of arguments, detailed below:
- `dataset`: only for the `main.py` file, specifies the dataset on which the RT-1 model should be pretrained
- `train-split`: specifies what fraction of the loaded dataset should be used for training vs. evaluation
- `eval-split`: specifies what fraction of the loaded dataset should be used for evaluation vs. training
- `epochs`: total number of passes over all batches of the training set
- `lr`: learning rate for the cross-entropy loss of RT-1
- `train-batch-size`: the number of trajectories from which to sample data for the current training batch
- `eval-batch-size`: the number of trajectories from which to sample data for the current evaluation batch
- `trajectory-length`: the window size (context history of `trajectory-length` previous images) used for each trajectory when feeding data to the RT-1 model; this is set to 6 based on the RT-1 implementation
- `sentence-transformer`: the language embedding to apply to the language-specified task
- `device`: the device to load the model/data onto during training/inference
- `eval-freq`: the interval of batches at which to run evaluation/inference on the validation dataset (currently set to 0 in `main_ft.py`)
- `checkpoint-freq`: the interval of batches at which to save a checkpoint during training
- `checkpoint-dir`: the directory path at which to save a checkpoint during training
- `load-checkpoint`: (optional) path of the pretrained checkpoint to load for further fine-tuning
- `wandb`: boolean determining whether to log to Weights and Biases
- `eval-scene`: the AI2Thor scene number in the dataset that is held out of the training set for evaluation during k-fold cross validation across scenes
- `split-type`: determines the split type (i.e. k-fold by scene, task-wise, or diversity ablation) between train and evaluation used by the `DatasetManager` in `rt1_dataloader.py`
- `num-diversity-scenes`: only if `split-type` is `diversity-ablation`; determines the total number of scenes to perform diversity ablation over, i.e. a maximum of 4 for LaNMP simulation data
- `max-diversity-trajectories`: only if `split-type` is `diversity-ablation`; determines the total number of trajectories, divided evenly across the `num-diversity-scenes` scenes
- `train-subbatch`: the batch size to use during training/finetuning
- `eval-subbatch`: the batch size to use during evaluation
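As a rough illustration, a subset of these arguments could be declared with `argparse` as sketched below. The argument types and any unspecified defaults are assumptions; only the `trajectory-length` value of 6 and the `eval-freq` value of 0 come from the list above, and the scripts' actual parsers may differ.

```python
import argparse


def build_parser():
    """Illustrative subset of the RT-1 script arguments (types are assumed)."""
    parser = argparse.ArgumentParser(description="RT-1 argument sketch")
    parser.add_argument("--train-split", type=float)
    parser.add_argument("--eval-split", type=float)
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--lr", type=float)
    parser.add_argument("--trajectory-length", type=int, default=6)  # 6 per the list
    parser.add_argument("--eval-freq", type=int, default=0)  # 0 in main_ft.py
    parser.add_argument("--checkpoint-freq", type=int)
    parser.add_argument("--checkpoint-dir", type=str)
    parser.add_argument("--load-checkpoint", type=str, default=None)  # optional
    parser.add_argument("--wandb", action="store_true")
    parser.add_argument("--split-type", type=str)
    parser.add_argument("--eval-scene", type=int)
    return parser


# Example invocation; the directory name is a placeholder.
args = build_parser().parse_args(
    ["--split-type", "diversity-ablation", "--checkpoint-dir", "checkpoints/demo"]
)
print(args.split_type, args.trajectory_length)
```

Note that `argparse` exposes dashed names with underscores, so `--split-type` is read as `args.split_type`.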
The following sample checkpoints can be loaded into the RT-1 model. They can be found on the supplementary Google Drive associated with this project:
- `sample_checkpoints/pretrained_bridge`: the final checkpoint saved when pretraining the RT-1 model on the Bridge dataset
- `sample_checkpoints/task_gen`: the final checkpoint saved after finetuning the RT-1 model on the task-wise split for the task generalization experiment
- `sample_checkpoints/kfold_cross_val`: the final checkpoints saved after finetuning the RT-1 model using k-fold cross validation, where each fold holds out one AI2Thor scene
When running any of the finetuning or pretraining scripts, please ensure the following modules are loaded
module load cuda/11.8.0-lpttyok
module load cudnn/8.7.0.84-11.8-lg2dpd5
- Create a Python virtual environment with Python 3.9.16 using `python3.9 -m venv rt1_env`
- Activate the virtual environment using `source rt1_env/bin/activate`
- Install and load the CUDA Toolkit 11.8.0 and cuDNN 8.7.0
cd LaNMP-Dataset/models/main_models/rt1
- Load necessary libraries using `pip install -e .` or directly activate the saved `rt1_env` folder using `source rt1_env/bin/activate` (if Python 3.9 is loaded onto your system)
cd LaNMP-Dataset/models/main_models/rt1
- Open `main.py` and modify the `load-checkpoint` argument to `None` (since we are pretraining from initialization)
- Ensure the `checkpoint-dir` argument is a known and valid local path (where checkpoints during pretraining will be saved every `checkpoint-freq` batches)
- Set all other arguments in `main.py`
- Navigate to `LaNMP-Dataset/models/main_models/rt1/rt1_pytorch/tokenizers/action_tokenizer.py`
- Ensure the `action_order` and `action_space` in lines 61 and 62 of `action_tokenizer.py` fetch from `bridge_keys` defined in line 56
- Run `python3 main.py` with all arguments input as required
- Checkpoints for pretraining should be saved chronologically (by step number) in the `checkpoint-dir` directory
cd LaNMP-Dataset/models/main_models/rt1
- Open `main_ft.py` and modify the `load-checkpoint` argument to the checkpoint path generated from pretraining or the path where the pretrained checkpoint (from Google Drive) is saved
- Ensure the `checkpoint-dir` argument is a known and valid local path (where checkpoints during finetuning will be saved every `checkpoint-freq` batches)
- Set all other arguments in `main_ft.py` (in particular, `split-type` defines the type of experiment to be run, i.e. k-fold across scenes, task generalization, or diversity ablations)
- Navigate to `LaNMP-Dataset/models/main_models/rt1/rt1_pytorch/tokenizers/action_tokenizer.py`
- Ensure the `action_order` and `action_space` in lines 61 and 62 of `action_tokenizer.py` fetch from `lanmp_keys` defined in line 56
- Run `python3 main_ft.py` with all arguments input as required
- Checkpoints for finetuning should be saved chronologically (by step number) in the `checkpoint-dir` directory
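Because checkpoints are saved by step number, picking them up in training order requires a numeric rather than an alphabetical sort. The sketch below shows one way to do that; the `checkpoint_<step>.pt` naming and the `.pt` suffix are hypothetical, since the actual file names written by the scripts are not stated here.

```python
import re
from pathlib import Path

STEP_PATTERN = re.compile(r"(\d+)")


def checkpoints_by_step(checkpoint_dir, suffix=".pt"):
    """Return checkpoint paths sorted by the last integer in their file name."""

    def step_of(path):
        numbers = STEP_PATTERN.findall(path.stem)
        # Files without any number sort first.
        return int(numbers[-1]) if numbers else -1

    return sorted(Path(checkpoint_dir).glob(f"*{suffix}"), key=step_of)
```

A plain string sort would place `checkpoint_100` before `checkpoint_20`; sorting on the parsed step avoids that, and the last element of the returned list is the most recent checkpoint.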
cd LaNMP-Dataset/models/main_models/rt1
- Open `main_ft_eval.py` and modify the `checkpoint-path` argument to the checkpoint path from pretraining, finetuning, or one of the pre-saved checkpoints (from Google Drive)
- Set all other arguments in `main_ft_eval.py` (in particular, `split-type` defines the type of experiment to be run, i.e. k-fold across scenes, task generalization, or diversity ablations)
- Navigate to `LaNMP-Dataset/models/main_models/rt1/rt1_pytorch/tokenizers/action_tokenizer.py`
- Ensure the `action_order` and `action_space` in lines 61 and 62 of `action_tokenizer.py` fetch from `lanmp_keys` defined in line 56
- Run `python3 main_ft_eval.py` with all arguments input as required
- Evaluation loss logs should be reported on Weights and Biases as well as printed (mean ± std dev) on the terminal
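For reference, a "mean ± std dev" summary of a list of losses can be produced as sketched below. This is only an illustration: `summarize_losses` is a hypothetical helper, and whether the evaluation script reports the population or the sample standard deviation is not stated, so the population form is assumed.

```python
import statistics


def summarize_losses(losses):
    """Format a list of per-batch losses as 'mean ± std' (population std assumed)."""
    mean = statistics.fmean(losses)
    std = statistics.pstdev(losses)
    return f"{mean:.3f} ± {std:.3f}"


print(summarize_losses([1.0, 2.0, 3.0]))  # prints "2.000 ± 0.816"
```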
The ALFRED Seq2Seq model from the paper "ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks" by Shridhar et al. was modified and fine-tuned on LaNMP. This model was trained and run on an NVIDIA 3090 GPU, so some of the following instructions assume the use of that GPU.
Preliminary:
- Create a Python virtual environment using Python 3.9:
python3.9 -m venv alfred-env
- Activate the virtual environment
source alfred-env/bin/activate
- Install and load CUDA Toolkit 11.8 and cuDNN 8.7
cd LaNMP-Dataset/models/main_models
export ALFRED_ROOT=$(pwd)/alfred
cd alfred
- Install all dependencies:
pip install -r requirements.txt
- Download the dataset from the DropBox
- Place the zipped dataset files in
LaNMP-Dataset/dataset
- Unzip the datasets
gunzip *.gz
Running training:
The original pretrained model used for fine-tuning can be downloaded from this Google Drive Folder.
- Place the model in
LaNMP-Dataset/models/main_models/alfred/pretrained
cd LaNMP-Dataset/models/main_models/alfred
- Extract the image features using the ResNet and save them to disk:
python models/utils/extract_resnet.py --gpu
- Fine-tune:
python models/train/train_seq2seq.py --model seq2seq_im_mask --dout exp/model:{model}_discrete_relative_fold1 --gpu --batch 8 --pm_aux_loss_wt 0.1 --subgoal_aux_loss_wt 0.1 --pp_data 'data/feats_discrete_relative_fold1' --split_keys 'data/splits/split_keys_discrete_relative_fold1.json' --class_mode --relative --preprocess
- `--class_mode` puts the model into classification mode to use cross-entropy loss and output discrete actions
- `--relative` makes the model produce relative actions (the delta between the current step and the next step) rather than global actions
- `--preprocess` preprocesses the data and saves it to disk for use later in the training pipeline. This only needs to be run once; it can be removed after the first run so that only the training is executed.
- More details on all the command-line arguments can be found at `LaNMP-Dataset/models/main_models/train/train_seq2seq.py`
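The difference between global and relative actions mentioned for `--relative` can be sketched as follows. This is an illustration of the delta idea only, not the model's preprocessing code; the function names and the use of plain lists of numbers as action vectors are assumptions.

```python
def to_relative(global_actions):
    """Convert global action vectors into per-step deltas (next minus current)."""
    return [
        [nxt - cur for cur, nxt in zip(current, following)]
        for current, following in zip(global_actions, global_actions[1:])
    ]


def to_global(start, relative_actions):
    """Rebuild the global sequence by accumulating deltas from a known start."""
    sequence = [list(start)]
    for delta in relative_actions:
        sequence.append([value + change for value, change in zip(sequence[-1], delta)])
    return sequence
```

A sequence of N global vectors yields N - 1 deltas, and the original sequence can be recovered only if the starting vector is kept.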
Running inference:
The simulated fine-tuned models can be downloaded from this Google Drive folder.
The simulated extracted ResNet visual features can be downloaded from this Google Drive folder.
- Place the model pth files in
LaNMP-Dataset/models/main_models/alfred/exp
- Place the zipped vision features file in
LaNMP-Dataset/models/main_models/alfred/data/vis_feats
- Unzip and extract the file
tar -xzvf vis_feats.tar.gz
cd LaNMP-Dataset/models/main_models/alfred
- Run inference using fold1's fine-tuned model:
python models/eval/eval_seq2seq.py --model_path exp/best_test_fold1.pth --gpu --model models.model.seq2seq_im_mask --pp_data data/feats_discrete_relative_fold1 --split_keys 'data/splits/split_keys_discrete_relative_fold1.json'
- The command assumes it is run on a machine with a GUI in order to run the AI2THOR simulator, i.e. not on a headless machine.
- To run other models instead of the "fold1" model, change any part that has "fold1" in the command to the desired model, e.g. "task" for the "best_test_task.pth" model.
- More details on all the command-line arguments can be found at `LaNMP-Dataset/models/main_models/eval/eval_seq2seq.py`.
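The rule for switching models ("change any part that has fold1") can be captured in a small helper that builds the inference command for a given model name. This is a convenience sketch, not part of the repository: `eval_command` is a hypothetical name, and it assumes that the feature folder and split file for every model follow the same naming pattern as the "fold1" example.

```python
import shlex


def eval_command(name="fold1"):
    """Build the inference argv for the fine-tuned model called `name`."""
    return [
        "python", "models/eval/eval_seq2seq.py",
        "--model_path", f"exp/best_test_{name}.pth",
        "--gpu",
        "--model", "models.model.seq2seq_im_mask",
        "--pp_data", f"data/feats_discrete_relative_{name}",
        "--split_keys", f"data/splits/split_keys_discrete_relative_{name}.json",
    ]


# Print the command for the "task" model so it can be pasted into a shell.
print(shlex.join(eval_command("task")))
```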