Multimodal Learning for Medicine & Healthcare: Challenges and Opportunities (IN2107)
Clinicians typically make decisions by combining several modalities, such as images, clinical text, and tabular data. Deep learning offers a powerful framework for integrating these heterogeneous modalities to support automated, data-driven decision-making. Effective multimodal learning remains non-trivial, however: some modalities may be under-optimized during joint training [1], and others may be missing entirely [2]. These challenges are particularly pronounced in real-world clinical settings, where data availability, quality, and alignment across modalities vary substantially. In this seminar, we will discuss recent advances in multimodal learning, covering key paradigms such as fusion and alignment mechanisms, self-supervised and contrastive pretraining across modalities, and the emergence of multimodal foundation models for medical AI. We will also examine strategies that address real-world challenges, including handling missing or noisy modalities, improving cross-modal generalization, and enhancing data efficiency and robustness in clinical applications.
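To make the "fusion with missing modalities" theme concrete, here is a minimal toy sketch (not any specific method from the reading list): masked late fusion, where random linear maps stand in for trained per-modality encoders and absent modalities are simply excluded from the fused average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality encoders: random linear maps projecting each
# modality into a shared 8-dimensional embedding space (stand-ins for
# trained networks).
W_img = rng.normal(size=(16, 8))   # image features: 16 -> 8
W_tab = rng.normal(size=(5, 8))    # tabular features: 5 -> 8

def fuse(img_feat, tab_feat):
    """Masked late fusion: modalities passed as None are excluded from
    the mean, so any non-empty subset of modalities can be fused."""
    embeddings = []
    if img_feat is not None:
        embeddings.append(img_feat @ W_img)
    if tab_feat is not None:
        embeddings.append(tab_feat @ W_tab)
    if not embeddings:
        raise ValueError("at least one modality is required")
    return np.mean(embeddings, axis=0)

# Both the complete case and the missing-tabular case yield a joint embedding.
z_full = fuse(rng.normal(size=16), rng.normal(size=5))
z_img_only = fuse(rng.normal(size=16), None)
print(z_full.shape, z_img_only.shape)  # (8,) (8,)
```

Real systems replace the averaging with learned fusion (e.g. attention or mixture-of-experts routing over available modalities, as in [4]), but the masking idea is the same.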
References

[1] Shicai Wei, Chunbo Luo, and Yang Luo. Boosting multimodal learning via disentangled gradient learning. arXiv preprint arXiv:2507.10213, 2025.
[2] Sijie Li, Chen Chen, and Jungong Han. SimMLM: A simple framework for multi-modal learning with missing modality. arXiv preprint arXiv:2507.19264, 2025.
[3] Zhenbang Wu et al. Multimodal patient representation learning with missing modalities and labels. The Twelfth International Conference on Learning Representations (ICLR), 2024.
[4] Sukwon Yun et al. Flex-MoE: Modeling arbitrary modality combination via the flexible mixture-of-experts. Advances in Neural Information Processing Systems 37 (2024): 98782-98805.
[5] Kai Zhang et al. A generalist vision–language foundation model for diverse biomedical tasks. Nature Medicine 30.11 (2024): 3129-3141.
[6] Alec Radford et al. Learning transferable visual models from natural language supervision. International Conference on Machine Learning, PMLR, 2021.
[7] Jun Ma et al. Segment anything in medical images. Nature Communications 15.1 (2024): 654.
[8] Songtao Li and Hao Tang. Multimodal alignment and fusion: A survey. arXiv preprint arXiv:2411.17040, 2024.
Key topics to be covered include:
- Introduction to multimodal learning in medicine
- Challenges of multimodal learning in clinical applications, including missing and noisy data
- Multimodal pretraining for medicine
- State-of-the-art methods
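As a small taste of the pretraining topic, the sketch below implements the symmetric InfoNCE objective used in CLIP-style contrastive pretraining [6] in plain NumPy. The embeddings here are random placeholders; in practice they come from trained image and text encoders, and matched image-text pairs occupy the diagonal of the similarity matrix.

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss in the style of CLIP: for each image the
    matching text is the positive, all other texts in the batch are
    negatives (and vice versa)."""
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (N, N) similarity matrix
    labels = np.arange(len(logits))          # correct pairs on the diagonal

    def xent(l):
        # Cross-entropy with diagonal targets, numerically stabilized.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 8))   # placeholder image embeddings
txt_emb = rng.normal(size=(4, 8))   # placeholder text embeddings
loss = clip_loss(img_emb, txt_emb)  # high: embeddings are not aligned
```

Perfectly aligned embeddings (`clip_loss(x, x)`) drive this loss toward zero, which is exactly what contrastive pretraining optimizes for across modalities.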
Requirements:
- Background in image processing and machine learning/deep learning
- Interest in medical multimodal learning
- Interest in research
Please register via the TUM matching system: https://matching.in.tum.de
Check the intro slides here: