Definition: Transferring knowledge between modalities, usually to help a primary modality that may be noisy or have limited resources.
Definition: Transferring knowledge from large-scale pretrained models to downstream tasks involving the primary modality (see the sketch after the references below).
Tsimpoukelli et al., Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021
Ziegler et al., Encoder-Agnostic Adaptation for Conditional Language Generation. arXiv 2019
Rahman et al., Integrating Multimodal Information in Large Pretrained Transformers. ACL 2020
Liang et al., HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning. arXiv 2022
Reed et al., A Generalist Agent. arXiv 2022
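A minimal, illustrative sketch of this transfer idea, in the spirit of prefix-style adaptation of a frozen language model (Tsimpoukelli et al.): only a small visual-to-prefix mapper and a task head are trained. All names, dimensions, and the tiny stand-in Transformer are placeholders, not a real pretrained checkpoint.

```python
# Sketch only: a stand-in frozen "pretrained" text model; in practice you would
# load an actual large-scale pretrained LM and keep its weights frozen.
import torch
import torch.nn as nn

class FrozenPrefixModel(nn.Module):
    """Maps image features to a few 'prefix' embeddings that are prepended to
    text embeddings and passed through a frozen text encoder."""
    def __init__(self, img_dim=2048, d_model=512, n_prefix=2, vocab=1000):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Freeze the "pretrained" text side; only the prefix mapper and head train.
        for p in list(self.text_emb.parameters()) + list(self.text_encoder.parameters()):
            p.requires_grad = False
        self.prefix_mapper = nn.Linear(img_dim, n_prefix * d_model)
        self.head = nn.Linear(d_model, vocab)
        self.n_prefix, self.d_model = n_prefix, d_model

    def forward(self, img_feat, token_ids):
        prefix = self.prefix_mapper(img_feat).view(-1, self.n_prefix, self.d_model)
        tokens = self.text_emb(token_ids)
        h = self.text_encoder(torch.cat([prefix, tokens], dim=1))
        return self.head(h[:, self.n_prefix:])  # predictions for the text positions

model = FrozenPrefixModel()
img = torch.randn(4, 2048)              # placeholder visual-backbone features
ids = torch.randint(0, 1000, (4, 16))   # placeholder token ids
logits = model(img, ids)                # (4, 16, 1000)
```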
Definition: Transferring information from the secondary to the primary modality by sharing a representation space between both modalities (see the sketch after the references below).
Socher et al., Zero-Shot Learning Through Cross-Modal Transfer. NeurIPS 2013
Zadeh et al., Foundations of Multimodal Co-learning. Information Fusion 2020
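A minimal sketch of sharing a representation space, loosely following the zero-shot setup of Socher et al.: image features are regressed into a word-vector space so unseen classes can be predicted by nearest label vector. The word vectors, features, and dimensions below are toy placeholders.

```python
# Sketch only: random stand-ins for pretrained word vectors and image features.
import torch
import torch.nn as nn

word_dim, img_dim = 50, 128
label_vecs = torch.randn(3, word_dim)   # placeholder label embeddings, e.g., "cat", "dog", "truck"

mapper = nn.Sequential(nn.Linear(img_dim, 100), nn.Tanh(), nn.Linear(100, word_dim))
opt = torch.optim.Adam(mapper.parameters(), lr=1e-3)

img_feats = torch.randn(32, img_dim)    # training images of *seen* classes only
labels = torch.randint(0, 2, (32,))     # classes 0 and 1 are seen; class 2 is unseen

for _ in range(100):
    pred = mapper(img_feats)
    loss = nn.functional.mse_loss(pred, label_vecs[labels])  # pull images into word space
    opt.zero_grad(); loss.backward(); opt.step()

# Zero-shot prediction: nearest label vector, including the unseen class 2.
test = mapper(torch.randn(5, img_dim))
print(torch.cdist(test, label_vecs).argmin(dim=1))
```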
Definition: Transferring information from the secondary to the primary modality by using the secondary modality as a generation target (see the sketch after the references below).
Pham et al., Found in Translation: Learning Robust Joint Representations via Cyclic Translations Between Modalities. AAAI 2019
Tan and Bansal, Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision. EMNLP 2020
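A minimal sketch of using the secondary modality as a generation target: an auxiliary decoder reconstructs (toy) visual features from the primary-modality representation during training, so the secondary modality is not needed at test time. The tensors, loss weighting, and dimensions are placeholders, not the exact objectives of Pham et al. or Tan and Bansal.

```python
# Sketch only: toy paired data; the image features serve purely as a
# training-time generation target for the primary (text) encoder.
import torch
import torch.nn as nn

text_dim, img_dim, n_classes = 64, 128, 5
encoder = nn.Sequential(nn.Linear(text_dim, 64), nn.ReLU())   # primary-modality encoder
classifier = nn.Linear(64, n_classes)                         # downstream task head
decoder = nn.Linear(64, img_dim)                              # generates the secondary modality

params = list(encoder.parameters()) + list(classifier.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

text = torch.randn(32, text_dim)
image = torch.randn(32, img_dim)        # paired secondary modality (training only)
y = torch.randint(0, n_classes, (32,))

for _ in range(100):
    z = encoder(text)
    task_loss = nn.functional.cross_entropy(classifier(z), y)
    gen_loss = nn.functional.mse_loss(decoder(z), image)  # reconstruct the secondary modality
    loss = task_loss + 0.5 * gen_loss                     # auxiliary generation objective
    opt.zero_grad(); loss.backward(); opt.step()

# At test time only the primary modality is needed.
pred = classifier(encoder(torch.randn(4, text_dim))).argmax(dim=1)
```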
Open challenges:
- Low-resource: little downstream data, lack of paired data, robustness (next section)
- Beyond language and vision
- Settings where SOTA unimodal encoders are not deep learning models, e.g., tabular data
- Complexity in data, modeling, and training
- Interpretability (next section)