Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment

  • Soumya Suvra Ghosal
  • Souradip Chakraborty
  • Vaibhav Singh
  • Tianrui Guan
  • Mengdi Wang
  • Ahmad Beirami
  • Furong Huang
  • Alvaro Velasquez
  • Dinesh Manocha
  • Amrit Singh Bedi

Research output: Contribution to journal › Conference article › peer-review

Abstract

With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to jailbreak attacks. In this work, we first highlight an important safety gap: alignment achieved solely through safety training may be insufficient against jailbreak attacks. To address this vulnerability, we propose Immune, an inference-time defense framework that leverages a safety reward model through controlled decoding to defend against jailbreak attacks. Additionally, we provide a mathematical characterization of Immune, offering insights into why it improves safety against jailbreaks. Extensive evaluations on diverse jailbreak benchmarks using recent MLLMs reveal that Immune effectively enhances model safety while preserving the model's original capabilities. For instance, against text-based jailbreak attacks on LLaVA-1.6, Immune reduces the attack success rate by 57.82% and 16.78% compared to the base MLLM and state-of-the-art defense strategy, respectively.
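
The abstract does not spell out the decoding procedure, so the following is only a minimal sketch of reward-guided controlled decoding in general, not the authors' implementation. It assumes a common formulation in which each candidate token is scored by its base-model log-probability plus a safety reward scaled by 1/alpha. All names here (`guided_decode`, `toy_base_logits`, `toy_safety_reward`, `alpha`, `top_k`) are hypothetical, and the "model" and "reward" are toy stand-ins rather than a real MLLM or reward model.

```python
import math
from typing import Callable, Dict, List, Sequence

Token = str


def log_softmax(logits: Dict[Token, float]) -> Dict[Token, float]:
    """Convert raw logits to log-probabilities (numerically stable)."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}


def guided_step(
    prefix: Sequence[Token],
    base_logits: Callable[[Sequence[Token]], Dict[Token, float]],
    safety_reward: Callable[[Sequence[Token]], float],
    top_k: int = 3,
    alpha: float = 1.0,
) -> Token:
    """Pick the next token by combining base log-probability and safety reward."""
    log_probs = log_softmax(base_logits(prefix))
    # Only the base model's top-k tokens are scored by the reward function.
    candidates = sorted(log_probs, key=lambda t: log_probs[t], reverse=True)[:top_k]

    def score(tok: Token) -> float:
        # Smaller alpha means the reward dominates; larger alpha trusts the base model.
        return log_probs[tok] + safety_reward(list(prefix) + [tok]) / alpha

    return max(candidates, key=score)


def guided_decode(
    prompt: Sequence[Token],
    base_logits: Callable[[Sequence[Token]], Dict[Token, float]],
    safety_reward: Callable[[Sequence[Token]], float],
    max_new_tokens: int = 4,
    top_k: int = 3,
    alpha: float = 1.0,
    eos: Token = "<eos>",
) -> List[Token]:
    """Greedy reward-guided decoding; returns only the newly generated tokens."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        tok = guided_step(out, base_logits, safety_reward, top_k, alpha)
        out.append(tok)
        if tok == eos:
            break
    return out[len(prompt):]


# Toy stand-ins (assumptions for illustration only).
def toy_base_logits(prefix: Sequence[Token]) -> Dict[Token, float]:
    """A fake base model that prefers to comply, then stops."""
    if prefix and prefix[-1] in ("comply", "refuse"):
        return {"<eos>": 5.0, "comply": 0.0, "refuse": 0.0}
    return {"comply": 2.0, "refuse": 1.0, "<eos>": 0.0}


def toy_safety_reward(tokens: Sequence[Token]) -> float:
    """A fake safety reward that penalizes any sequence containing 'comply'."""
    return -4.0 if "comply" in tokens else 0.0
```

In this toy setup, `guided_decode(["jailbreak", "prompt"], toy_base_logits, toy_safety_reward)` returns `["refuse", "<eos>"]`, whereas setting `alpha=1e9` effectively switches the reward off and yields `["comply", "<eos>"]`. The point of the sketch is only that the defense acts at decoding time, without retraining the base model.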

Original language: English (US)
Pages (from-to): 25038-25049
Number of pages: 12
Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
State: Published - 2025
Event: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025 - Nashville, United States
Duration: Jun 11 2025 → Jun 15 2025

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition

Keywords

  • ai alignment
  • jailbreak attacks
  • multi-modal large language model
