Add Vision Transformer from ScaleMAE #422

Draft · anwai98 wants to merge 5 commits into main
Conversation

anwai98 (Collaborator) commented Nov 29, 2024

This PR adds the vision transformer from ScaleMAE and plugs it into the UNETR backbone (thanks to @Mareike79 for the implementation).

A few details to share with @Mareike79:

  • In our previous setup, the dimension mismatch came from the fact that the outputs from the attention heads were flattened; they need to be reshaped back to the target patch grid (see the first sketch below this list). This takes care of the issue we discussed.
  • In addition, loading the model with your setup did not quite work. There are also some parts of the pretrained checkpoint which need to be dropped before loading (e.g. decoder-related parameters, FPN and FCN heads); see the second sketch below.
  • I added the configurations for vit_b, vit_l and vit_h (from what I understand, we have pretrained the vit_b model and now want to look at the downstream task); see the third sketch below.
  • I decoupled the vision transformer class from the ScaleMAE repository, because it does not provide installation support for its module, which makes it difficult to import submodules from the original repo (e.g. functions like CustomCompose, get_2d_sincos_pos_embed_with_resolution, etc.).
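
To illustrate the first point, a minimal sketch of restoring a flattened token sequence to a spatial patch grid (assuming a square grid and no class token; the function name is illustrative):

```python
import torch

def tokens_to_patch_grid(x: torch.Tensor) -> torch.Tensor:
    """Reshape a (B, N, C) token sequence into a (B, C, H, W) patch grid."""
    b, n, c = x.shape
    h = w = int(n ** 0.5)  # assumes a square patch grid and no class token
    assert h * w == n, "token count must form a square grid"
    return x.permute(0, 2, 1).reshape(b, c, h, w)

# e.g. a vit_b feature map with 14 x 14 patches:
feats = torch.randn(2, 196, 768)
print(tokens_to_patch_grid(feats).shape)  # torch.Size([2, 768, 14, 14])
```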
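
For the second point, a rough sketch of the kind of filtering I mean when dropping the decoder-related parameters before loading the pretrained weights (the key prefixes and the nested "model" entry are assumptions about the checkpoint layout, not its exact structure):

```python
import torch

def load_encoder_state(checkpoint_path: str, model: torch.nn.Module):
    state = torch.load(checkpoint_path, map_location="cpu")
    state = state.get("model", state)  # unwrap if the weights are nested
    # Drop everything that is not part of the encoder (decoder, FPN / FCN heads, ...).
    drop_prefixes = ("decoder", "fpn", "fcn", "mask_token")
    encoder_state = {k: v for k, v in state.items() if not k.startswith(drop_prefixes)}
    missing, unexpected = model.load_state_dict(encoder_state, strict=False)
    return missing, unexpected
```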
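
For the third point, the configurations follow the standard ViT sizes; the exact keyword names depend on the ScaleMAE vision transformer class, so this mapping is only a sketch:

```python
# Standard ViT-B / ViT-L / ViT-H settings (embedding dim, depth, attention heads).
VIT_CONFIGS = {
    "vit_b": dict(embed_dim=768, depth=12, num_heads=12),
    "vit_l": dict(embed_dim=1024, depth=24, num_heads=16),
    "vit_h": dict(embed_dim=1280, depth=32, num_heads=16),
}
```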

PS. I'll leave this PR open for now. We are working on this branch at the moment and will merge it later if everything works as desired.

PPS. This is still a work-in-progress.

cc: @constantinpape
