This file contains instructions for downloading the individual datasets used by Meta-Dataset and converting them into a common format (one TFRecord file per class). See the overview for more context.
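The commands below reference three environment variables: `$DATASRC`, `$SPLITS`, and `$RECORDS`. A minimal setup sketch; the specific paths here are placeholders of our choosing, not prescribed by Meta-Dataset:

```bash
# Placeholder paths -- point these wherever you want the data to live.
export DATASRC=$HOME/meta_dataset/source   # downloaded archives and extracted images
export SPLITS=$HOME/meta_dataset/splits    # generated split files
export RECORDS=$HOME/meta_dataset/records  # converted TFRecord output
mkdir -p $DATASRC $SPLITS $RECORDS
```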
## ilsvrc_2012

- Download `ilsvrc2012_img_train.tar` from the ILSVRC2012 website.
- Extract it into `ILSVRC2012_img_train/`, which should contain 1000 files named `n????????.tar` (expected time: ~30 minutes).
- Extract each of `ILSVRC2012_img_train/n????????.tar` into its own directory (expected time: ~30 minutes), for instance:

  ```bash
  # Run inside ILSVRC2012_img_train/.
  for FILE in *.tar; do
    mkdir ${FILE/.tar/}
    cd ${FILE/.tar/}
    tar xvf ../$FILE
    cd ..
  done
  ```

- Download the following two files into `ILSVRC2012_img_train/`:
- Launch the conversion script (use `--dataset=ilsvrc_2012_v2` for the training-only MetaDataset-v2 version):

  ```bash
  python -m meta_dataset.dataset_conversion.convert_datasets_to_records \
    --dataset=ilsvrc_2012 \
    --ilsvrc_2012_data_root=$DATASRC/ILSVRC2012_img_train \
    --splits_root=$SPLITS \
    --records_root=$RECORDS
  ```

- Expect the conversion to take 4 to 12 hours, depending on the filesystem's latency and bandwidth.
- Find the following outputs in `$RECORDS/ilsvrc_2012/`:
  - 1000 tfrecords files named `[0-999].tfrecords`
  - `dataset_spec.json` (see note 1)
  - `num_leaf_images.json`
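Since the conversion takes hours, a quick sanity check on the extraction steps first can save time. This is a sketch assuming the `$DATASRC/ILSVRC2012_img_train` layout described above:

```bash
# Expect 1000 synset archives and 1000 matching extracted directories.
ls $DATASRC/ILSVRC2012_img_train/*.tar | wc -l
find $DATASRC/ILSVRC2012_img_train -mindepth 1 -maxdepth 1 -type d | wc -l
```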
## omniglot

- Download `images_background.zip` and `images_evaluation.zip`.
- Extract them into the same `omniglot/` directory.
- Launch the conversion script:

  ```bash
  python -m meta_dataset.dataset_conversion.convert_datasets_to_records \
    --dataset=omniglot \
    --omniglot_data_root=$DATASRC/omniglot \
    --splits_root=$SPLITS \
    --records_root=$RECORDS
  ```

- Expect the conversion to take a few seconds.
- Find the following outputs in `$RECORDS/omniglot/`:
  - 1623 tfrecords files named `[0-1622].tfrecords`
  - `dataset_spec.json` (see note 1)
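A minimal extraction sketch, assuming both zip files were downloaded to the current directory; the target matches the `--omniglot_data_root` flag above:

```bash
mkdir -p $DATASRC/omniglot
unzip images_background.zip -d $DATASRC/omniglot
unzip images_evaluation.zip -d $DATASRC/omniglot
```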
## aircraft

- Download `fgvc-aircraft-2013b.tar.gz`.
- Extract it into `fgvc-aircraft-2013b/`.
- Launch the conversion script:

  ```bash
  python -m meta_dataset.dataset_conversion.convert_datasets_to_records \
    --dataset=aircraft \
    --aircraft_data_root=$DATASRC/fgvc-aircraft-2013b \
    --splits_root=$SPLITS \
    --records_root=$RECORDS
  ```

- Expect the conversion to take 5 to 10 minutes.
- Find the following outputs in `$RECORDS/aircraft/`:
  - 100 tfrecords files named `[0-99].tfrecords`
  - `dataset_spec.json` (see note 1)
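A minimal sketch, assuming the archive was downloaded to `$DATASRC` and unpacks to a top-level `fgvc-aircraft-2013b/` directory, as the step above implies:

```bash
tar -xzf $DATASRC/fgvc-aircraft-2013b.tar.gz -C $DATASRC
```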
## cu_birds

- Download `CUB_200_2011.tgz`.
- Extract it into `CUB_200_2011/` (and `attributes.txt`).
- Launch the conversion script:

  ```bash
  python -m meta_dataset.dataset_conversion.convert_datasets_to_records \
    --dataset=cu_birds \
    --cu_birds_data_root=$DATASRC/CUB_200_2011 \
    --splits_root=$SPLITS \
    --records_root=$RECORDS
  ```

- Expect the conversion to take around one minute.
- Find the following outputs in `$RECORDS/cu_birds/`:
  - 200 tfrecords files named `[0-199].tfrecords`
  - `dataset_spec.json` (see note 1)
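A minimal sketch, assuming the archive was downloaded to `$DATASRC`; it unpacks both the `CUB_200_2011/` directory and the top-level `attributes.txt` mentioned above:

```bash
tar -xzf $DATASRC/CUB_200_2011.tgz -C $DATASRC
```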
## dtd

- Download `dtd-r1.0.1.tar.gz`.
- Extract it into `dtd/`.
- Launch the conversion script:

  ```bash
  python -m meta_dataset.dataset_conversion.convert_datasets_to_records \
    --dataset=dtd \
    --dtd_data_root=$DATASRC/dtd \
    --splits_root=$SPLITS \
    --records_root=$RECORDS
  ```

- Expect the conversion to take a few seconds.
- Find the following outputs in `$RECORDS/dtd/`:
  - 47 tfrecords files named `[0-46].tfrecords`
  - `dataset_spec.json` (see note 1)
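A minimal sketch, assuming the archive was downloaded to `$DATASRC`; it unpacks to the `dtd/` directory the flag above expects:

```bash
tar -xzf $DATASRC/dtd-r1.0.1.tar.gz -C $DATASRC
```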
## quickdraw

- Download all 345 `.npy` files hosted on Google Cloud. You can use `gsutil` to download them to `quickdraw/`:

  ```bash
  # The destination directory must exist before copying.
  mkdir -p $DATASRC/quickdraw
  gsutil -m cp gs://quickdraw_dataset/full/numpy_bitmap/*.npy $DATASRC/quickdraw
  ```

- Launch the conversion script:

  ```bash
  python -m meta_dataset.dataset_conversion.convert_datasets_to_records \
    --dataset=quickdraw \
    --quickdraw_data_root=$DATASRC/quickdraw \
    --splits_root=$SPLITS \
    --records_root=$RECORDS
  ```

- Expect the conversion to take 3 to 4 hours.
- Find the following outputs in `$RECORDS/quickdraw/`:
  - 345 tfrecords files named `[0-344].tfrecords`
  - `dataset_spec.json` (see note 1)
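To confirm the download completed before starting the multi-hour conversion, a quick count of the class files (345 is the number of Quick, Draw! classes stated above):

```bash
ls $DATASRC/quickdraw/*.npy | wc -l   # expect 345
```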
## fungi

- Download `fungi_train_val.tgz` and `train_val_annotations.tgz`.
- Extract them into the same `fungi/` directory. It should contain one `images/` directory, as well as `train.json` and `val.json`.
- Launch the conversion script:

  ```bash
  python -m meta_dataset.dataset_conversion.convert_datasets_to_records \
    --dataset=fungi \
    --fungi_data_root=$DATASRC/fungi \
    --splits_root=$SPLITS \
    --records_root=$RECORDS
  ```

- Expect the conversion to take 5 to 15 minutes.
- Find the following outputs in `$RECORDS/fungi/`:
  - 1394 tfrecords files named `[0-1393].tfrecords`
  - `dataset_spec.json` (see note 1)
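A minimal extraction sketch, assuming both archives were downloaded to the current directory:

```bash
mkdir -p $DATASRC/fungi
tar -xzf fungi_train_val.tgz -C $DATASRC/fungi
tar -xzf train_val_annotations.tgz -C $DATASRC/fungi
```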
## vgg_flower

- Download `102flowers.tgz` and `imagelabels.mat`.
- Extract `102flowers.tgz`; it will create a `jpg/` sub-directory.
- Launch the conversion script:

  ```bash
  python -m meta_dataset.dataset_conversion.convert_datasets_to_records \
    --dataset=vgg_flower \
    --vgg_flower_data_root=$DATASRC/vgg_flower \
    --splits_root=$SPLITS \
    --records_root=$RECORDS
  ```

- Expect the conversion to take about one minute.
- Find the following outputs in `$RECORDS/vgg_flower/`:
  - 102 tfrecords files named `[0-101].tfrecords`
  - `dataset_spec.json` (see note 1)
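A minimal sketch, assuming both downloads sit in the current directory; the `--vgg_flower_data_root` flag above expects them under `$DATASRC/vgg_flower`:

```bash
mkdir -p $DATASRC/vgg_flower
tar -xzf 102flowers.tgz -C $DATASRC/vgg_flower   # creates jpg/
cp imagelabels.mat $DATASRC/vgg_flower/
```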
## traffic_sign

- Download `GTSRB_Final_Training_Images.zip`. If the link happens to be broken, browse the GTSRB dataset website for more information.
- Extract it in `$DATASRC`; it will create a `GTSRB/` sub-directory.
- Launch the conversion script:

  ```bash
  python -m meta_dataset.dataset_conversion.convert_datasets_to_records \
    --dataset=traffic_sign \
    --traffic_sign_data_root=$DATASRC/GTSRB \
    --splits_root=$SPLITS \
    --records_root=$RECORDS
  ```

- Expect the conversion to take about one minute.
- Find the following outputs in `$RECORDS/traffic_sign/`:
  - 43 tfrecords files named `[0-42].tfrecords`
  - `dataset_spec.json` (see note 1)
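A minimal sketch, assuming the zip was downloaded to the current directory; it unpacks to the `GTSRB/` sub-directory described above:

```bash
unzip GTSRB_Final_Training_Images.zip -d $DATASRC
```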
## mscoco

- Download the 2017 train images and annotations from http://cocodataset.org/:
  - You can use `gsutil` to download them to `mscoco/`:

    ```bash
    cd $DATASRC/mscoco/
    mkdir train2017
    gsutil -m rsync gs://images.cocodataset.org/train2017 train2017
    gsutil -m cp gs://images.cocodataset.org/annotations/annotations_trainval2017.zip .
    unzip annotations_trainval2017.zip
    ```

  - Otherwise, you can download `train2017.zip` and `annotations_trainval2017.zip` and extract them into `mscoco/`.
- Launch the conversion script:

  ```bash
  python -m meta_dataset.dataset_conversion.convert_datasets_to_records \
    --dataset=mscoco \
    --mscoco_data_root=$DATASRC/mscoco \
    --splits_root=$SPLITS \
    --records_root=$RECORDS
  ```

- Expect the conversion to take about 4 hours.
- Find the following outputs in `$RECORDS/mscoco/`:
  - 80 tfrecords files named `[0-79].tfrecords`
  - `dataset_spec.json` (see note 1)
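A sketch of the manual alternative, assuming both zips were downloaded to `$DATASRC/mscoco/` and unpack to the `train2017/` and `annotations/` layouts the converter expects:

```bash
cd $DATASRC/mscoco/
unzip train2017.zip                  # creates train2017/
unzip annotations_trainval2017.zip   # creates annotations/
```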
## Notes

1. A reference version of each of the `dataset_spec.json` files is part of this repository. You can compare them with the versions generated by the conversion process for troubleshooting.
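One way to run that comparison, as a sketch: normalize both JSON files with `python -m json.tool --sort-keys` (which pretty-prints with a stable key order) and diff them. The path to the repository's reference copy below is a placeholder; adjust it to wherever the checked-in specs live.

```bash
# REFERENCE_SPEC is a hypothetical path to the repository's checked-in copy.
REFERENCE_SPEC=path/to/reference/omniglot/dataset_spec.json
diff <(python -m json.tool --sort-keys "$REFERENCE_SPEC") \
     <(python -m json.tool --sort-keys $RECORDS/omniglot/dataset_spec.json)
```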