name | pretrain | resolution | acc@1 | #params | FLOPs | TP. | Train TP. | configs/logs/ckpts
:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:
Swin-T | ImageNet-1K | 224x224 | 81.2 | 28M | 4.5G | 1244 | 987 | --
Swin-S | ImageNet-1K | 224x224 | 83.2 | 50M | 8.7G | 718 | 642 | --
Swin-B | ImageNet-1K | 224x224 | 83.5 | 88M | 15.4G | 458 | 496 | --
Vanilla-VMamba-T | ImageNet-1K | 224x224 | 82.2 | 23M | -- | 638 | 195 | config/log/ckpt
Vanilla-VMamba-S | ImageNet-1K | 224x224 | 83.5 | 44M | -- | 359 | 111 | config/log/ckpt
Vanilla-VMamba-B | ImageNet-1K | 224x224 | 83.7 | 76M | -- | 268 | 84 | config/log/ckpt
VMamba-T[`s2l5`] | ImageNet-1K | 224x224 | 82.5 | 31M | 4.9G | 1340 | 464 | config/log/ckpt
VMamba-S[`s2l15`] | ImageNet-1K | 224x224 | 83.6 | 50M | 8.7G | 877 | 314 | config/log/ckpt
VMamba-B[`s2l15`] | ImageNet-1K | 224x224 | 83.9 | 89M | 15.4G | 646 | 247 | config/log/ckpt
VMamba-T[`s1l8`] | ImageNet-1K | 224x224 | 82.6 | 30M | 4.9G | 1686 | 571 | config/log/ckpt
VMamba-S[`s1l20`] | ImageNet-1K | 224x224 | 83.3 | 49M | 8.6G | 1106 | 390 | config/log/ckpt
VMamba-B[`s1l20`] | ImageNet-1K | 224x224 | 83.8 | 87M | 15.2G | 827 | 313 | config/log/ckpt
- Models in this subsection are trained from scratch with random or manual initialization. The hyper-parameters are inherited from Swin, except for `drop_path_rate` and `EMA`. All models are trained with EMA except for `Vanilla-VMamba-T` (a sketch of a typical EMA training step is given below).
- `TP.` (throughput) and `Train TP.` (train throughput) are assessed on an A100 GPU paired with an AMD EPYC 7542 CPU, with batch size 128. `Train TP.` is tested with mixed resolutions and excludes the time consumed by the optimizer (see the measurement sketch below).
- `FLOPs` and `parameters` are now gathered with the classification `head` (previous versions counted them without the head, so the numbers rise a little).
- We calculate `FLOPs` with the algorithm @albertgu provides, which yields larger numbers than the previous calculation (based on the `selective_scan_ref` function, which ignores the hardware-aware algorithm); see the counting sketch below.
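For readers unfamiliar with EMA in timm-style recipes: the trainer keeps a shadow copy of the weights, updated as an exponential moving average after every optimizer step, and validation runs on the shadow copy. A minimal sketch, assuming timm's `ModelEmaV2` and a typical decay of 0.9999 (the exact decay lives in the configs and is not restated above):

```python
from timm.utils import ModelEmaV2

def train_with_ema(model, loader, optimizer, criterion, decay=0.9999):
    """One epoch of training with a timm-style EMA shadow model.

    decay=0.9999 is a common default, assumed here rather than taken
    from the VMamba configs.
    """
    model_ema = ModelEmaV2(model, decay=decay)
    model.train()
    for images, targets in loader:
        loss = criterion(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        model_ema.update(model)  # EMA tracks the online weights after each step
    return model_ema  # evaluate with model_ema.module instead of model
```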
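Throughput numbers of this kind are usually measured with a warm-up phase followed by a timed loop under `torch.no_grad()`. The sketch below is illustrative rather than the exact benchmark script used here; only the batch size (128) and resolution (224x224) come from the settings above:

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, batch_size=128, img_size=224, warmup=10, iters=30):
    # Dummy input matching the 224x224 evaluation resolution.
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")
    model = model.cuda().eval()
    for _ in range(warmup):       # warm-up to stabilize clocks and caches
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()      # wait for all kernels before stopping the clock
    return batch_size * iters / (time.time() - start)  # images per second
```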
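For context, the selective-scan FLOPs rule reduces to a closed-form expression over the scan shapes. The sketch below is our reading of that rule, not the repo's exact counter: for batch `B`, sequence length `L`, channels `D`, and state size `N`, the hardware-aware scan costs roughly `9 * B * L * D * N` FLOPs, plus one elementwise term each for the optional skip (`D`) and gate (`z`):

```python
def selective_scan_flops(B, L, D, N, with_D=True, with_Z=False):
    """Approximate FLOPs of one hardware-aware selective scan.

    B: batch size, L: sequence length, D: channels, N: state dimension.
    The 9*B*L*D*N term covers the recurrence (discretization, scan, output
    projection); with_D / with_Z add the elementwise skip and gating terms.
    """
    flops = 9 * B * L * D * N
    if with_D:
        flops += B * D * L   # elementwise skip: y += D * u
    if with_Z:
        flops += B * D * L   # elementwise gate: y *= silu(z)
    return flops

# Example: one scan over a 56x56 feature map with D=96 channels, N=16 states
print(selective_scan_flops(B=1, L=56 * 56, D=96, N=16) / 1e9, "GFLOPs")
```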
Backbone | #params | FLOPs | Detector | bboxAP | bboxAP50 | bboxAP75 | segmAP | segmAP50 | segmAP75 | configs/logs/ckpts
:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:
Swin-T | 48M | 267G | MaskRCNN@1x | 42.7 | 65.2 | 46.8 | 39.3 | 62.2 | 42.2 | --
Swin-S | 69M | 354G | MaskRCNN@1x | 44.8 | 66.6 | 48.9 | 40.9 | 63.4 | 44.2 | --
Swin-B | 107M | 496G | MaskRCNN@1x | 46.9 | -- | -- | 42.3 | -- | -- | --
Vanilla-VMamba-T | 42M | -- | MaskRCNN@1x | 46.5 | 68.5 | 50.7 | 42.1 | 65.5 | 45.3 | config/log/ckpt
Vanilla-VMamba-S | 64M | -- | MaskRCNN@1x | 48.2 | 69.7 | 52.5 | 43.0 | 66.6 | 46.4 | config/log/ckpt
Vanilla-VMamba-B | 96M | -- | MaskRCNN@1x | 48.6 | 70.0 | 53.1 | 43.3 | 67.1 | 46.7 | config/log/ckpt
VMamba-T[`s2l5`] | 50M | 270G | MaskRCNN@1x | 47.4 | 69.5 | 52.0 | 42.7 | 66.3 | 46.0 | config/log/ckpt
VMamba-S[`s2l15`] | 70M | 384G | MaskRCNN@1x | 48.7 | 70.0 | 53.4 | 43.7 | 67.3 | 47.0 | config/log/ckpt
VMamba-B[`s2l15`] | 108M | 485G | MaskRCNN@1x | 49.2 | 71.4 | 54.0 | 44.1 | 68.3 | 47.7 | config/log/ckpt
VMamba-B[`s2l15`] | 108M | 485G | MaskRCNN@1x[`bs8`] | 49.2 | 70.9 | 53.9 | 43.9 | 67.7 | 47.6 | config/log/ckpt
VMamba-T[`s1l8`] | 50M | 271G | MaskRCNN@1x | 47.3 | 69.3 | 52.0 | 42.7 | 66.4 | 45.9 | config/log/ckpt

Backbone | #params | FLOPs | Detector | bboxAP | bboxAP50 | bboxAP75 | segmAP | segmAP50 | segmAP75 | configs/logs/ckpts
:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:
Swin-T | 48M | 267G | MaskRCNN@3x | 46.0 | 68.1 | 50.3 | 41.6 | 65.1 | 44.9 | --
Swin-S | 69M | 354G | MaskRCNN@3x | 48.2 | 69.8 | 52.8 | 43.2 | 67.0 | 46.1 | --
Vanilla-VMamba-T | 42M | -- | MaskRCNN@3x | 48.5 | 70.0 | 52.7 | 43.2 | 66.9 | 46.4 | config/log/ckpt
Vanilla-VMamba-S | 64M | -- | MaskRCNN@3x | 49.7 | 70.4 | 54.2 | 44.0 | 67.6 | 47.3 | config/log/ckpt
VMamba-T[`s2l5`] | 50M | 270G | MaskRCNN@3x | 48.9 | 70.6 | 53.6 | 43.7 | 67.7 | 46.8 | config/log/ckpt
VMamba-S[`s2l15`] | 70M | 384G | MaskRCNN@3x | 49.9 | 70.9 | 54.7 | 44.2 | 68.2 | 47.7 | config/log/ckpt
VMamba-T[`s1l8`] | 50M | 271G | MaskRCNN@3x | 48.8 | 70.4 | 53.5 | 43.7 | 67.4 | 47.0 | config/log/ckpt
- Models in this subsection are initialized from the models trained in classification (a config sketch for loading those checkpoints is given below).
- We now calculate FLOPs with the algorithm @albertgu provides, which yields larger numbers than the previous calculation (based on the `selective_scan_ref` function, which ignores the hardware-aware algorithm).
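In mmdetection-style training, initializing the backbone from a classification checkpoint is typically expressed through the backbone's `init_cfg`. The snippet below is an illustrative sketch only: the backbone type name and checkpoint path are placeholders, not verified entries from this repo's configs.

```python
# Illustrative mmdetection-style config fragment; `MM_VSSM` and the
# checkpoint path are placeholders, not verified names from this repo.
model = dict(
    backbone=dict(
        type='MM_VSSM',
        out_indices=(0, 1, 2, 3),  # feed all four stages to the FPN
        # Load the ImageNet classification weights trained above.
        init_cfg=dict(type='Pretrained', checkpoint='path/to/vmamba_tiny.pth'),
    ),
)
```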
Backbone | Input | #params | FLOPs | Segmentor | mIoU(SS) | mIoU(MS) | configs/logs/logs(ms)/ckpts
:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:
Swin-T | 512x512 | 60M | 945G | UperNet@160k | 44.4 | 45.8 | --
Swin-S | 512x512 | 81M | 1039G | UperNet@160k | 47.6 | 49.5 | --
Swin-B | 512x512 | 121M | 1188G | UperNet@160k | 48.1 | 49.7 | --
Vanilla-VMamba-T | 512x512 | 55M | -- | UperNet@160k | 47.3 | 48.3 | config/log/log(ms)/ckpt
Vanilla-VMamba-S | 512x512 | 76M | -- | UperNet@160k | 49.5 | 50.5 | config/log/log(ms)/ckpt
Vanilla-VMamba-B | 512x512 | 110M | -- | UperNet@160k | 50.0 | 51.3 | config/log/log(ms)/ckpt
VMamba-T[`s2l5`] | 512x512 | 62M | 948G | UperNet@160k | 48.3 | 48.6 | config/log/log(ms)/ckpt
VMamba-S[`s2l15`] | 512x512 | 82M | 1028G | UperNet@160k | 50.6 | 51.2 | config/log/log(ms)/ckpt
VMamba-B[`s2l15`] | 512x512 | 122M | 1170G | UperNet@160k | 51.0 | 51.6 | config/log/log(ms)/ckpt
VMamba-T[`s1l8`] | 512x512 | 62M | 949G | UperNet@160k | 47.9 | 48.8 | config/log/log(ms)/ckpt
- Models in this subsection are initialized from the models trained in classification.
- We now calculate FLOPs with the algorithm @albertgu provides, which yields larger numbers than the previous calculation (based on the `selective_scan_ref` function, which ignores the hardware-aware algorithm).
- `mIoU(SS)` and `mIoU(MS)` denote single-scale and multi-scale testing, respectively (see the pipeline sketch below).
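Multi-scale (MS) numbers are obtained by averaging predictions over several rescaled (and flipped) copies of each image. In an mmsegmentation-style config this lives in the test pipeline; the sketch below assumes the common 0.5-1.75 ratio sweep used with 512x512 ADE20K training, which may differ from the exact ratios used here:

```python
# Illustrative mmsegmentation-style test pipeline for mIoU(MS);
# the img_ratios below are a common choice, not necessarily the exact ones used.
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],  # multi-scale sweep
        flip=True,                                      # add horizontal flips
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', mean=[123.675, 116.28, 103.53],
                 std=[58.395, 57.12, 57.375], to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
```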