Uncertainty-Aware Vision-Language Segmentation for Medical Imaging

Aryan Das, Tanishq Rachamalla, Koushik Biswas, Swalpa Kumar Roy, Vinay Kumar Verma
Published: February 16, 2026
Authors: 5
Word count: 6,269

Uncertainty-aware vision-language model combines medical images with clinical text for safer segmentation.

Abstract

We introduce a novel uncertainty-aware multimodal segmentation framework that leverages both radiological images and associated clinical text for precise medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-modal fusion and long-range dependency modelling. To guide learning under ambiguity, we propose the Spectral-Entropic Uncertainty (SEU) Loss, which jointly captures spatial overlap, spectral consistency, and predictive uncertainty in a unified objective. This formulation improves model reliability in complex clinical settings with poor image quality. Extensive experiments on three publicly available medical datasets (QATA-COVID19, MosMed++, and Kvasir-SEG) demonstrate that our method achieves superior segmentation performance while being significantly more computationally efficient than existing State-of-the-Art (SoTA) approaches. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks. Code: https://github.com/arya-domain/UA-VLS
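The abstract describes a loss that jointly captures spatial overlap, spectral consistency, and predictive uncertainty. The following is a minimal sketch of how such a three-term objective could be composed, not the paper's actual SEU formulation: the function name, the FFT-magnitude spectral term, the entropy penalty, and the weights are all illustrative assumptions.

```python
import torch

def seu_style_loss(logits, target, w_dice=1.0, w_spec=0.1, w_ent=0.01, eps=1e-6):
    """Illustrative uncertainty-aware segmentation loss (NOT the paper's SEU):
    soft Dice overlap + spectral consistency + predictive-entropy penalty."""
    prob = torch.sigmoid(logits)

    # Spatial overlap: soft Dice loss on the predicted probabilities.
    inter = (prob * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

    # Spectral consistency: L1 distance between 2-D FFT magnitudes
    # of the prediction and the ground-truth mask.
    spec = (torch.fft.fft2(prob).abs() - torch.fft.fft2(target).abs()).abs().mean()

    # Predictive uncertainty: mean binary entropy of the prediction,
    # discouraging overconfident outputs in ambiguous regions.
    ent = -(prob * (prob + eps).log()
            + (1.0 - prob) * (1.0 - prob + eps).log()).mean()

    return w_dice * dice + w_spec * spec + w_ent * ent
```

In practice the three weights would be tuned per dataset; the entropy term is the piece that makes the objective "uncertainty-aware", penalising confident errors less than a hard cross-entropy would.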

Key Takeaways

  1. Vision-language models can reduce annotation burden by combining medical images with clinical reports for segmentation.

  2. Uncertainty estimation is critical for clinical safety, enabling radiologists to identify ambiguous predictions requiring review.

  3. Cross-modal attention mechanisms effectively align visual and textual features for improved medical image understanding.
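
The cross-modal alignment described above can be sketched as image tokens attending over text tokens. This is a generic cross-attention block under assumed shapes and dimensions, not the paper's MoDAB/SSMix design; the class name and parameters are illustrative.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-attention fusion: image tokens query the
    clinical-text tokens. A generic sketch, not the paper's MoDAB."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # Queries come from the image; keys/values from the text,
        # so each image location gathers relevant report context.
        fused, _ = self.attn(img_tokens, txt_tokens, txt_tokens)
        # Residual connection keeps the visual features intact.
        return self.norm(img_tokens + fused)
```

The output keeps the image-token shape, so the block can be dropped between encoder and decoder stages of a standard segmentation network.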

Limitations

  • Most existing vision-language models treat modalities separately without effective fusion strategies.

  • Traditional approaches ignore uncertainty quantification despite its fundamental importance to clinical safety.

Keywords

Modality Decoding Attention Block, State Space Mixer, cross-modal fusion, long-range dependency modelling, Spectral-Entropic Uncertainty Loss, vision-language medical segmentation, uncertainty modelling, multimodal segmentation
