Multimodal AI

Visual Persuasion: What Influences Decisions of Vision-Language Models?

Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh
Published: February 17, 2026
Authors: 4
Word count: 13,781
Includes code

Vision-language models have exploitable visual preferences that shift decisions by 20-40 percent.

Abstract

The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities: safety concerns that might otherwise be discovered implicitly in the wild. This supports more proactive auditing and governance of image-based AI agents.
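The optimization loop described in the abstract, propose a plausible edit, apply it, and keep it only if it raises the head-to-head selection probability, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `vlm_choice_prob` and edit-application steps are stubs standing in for a real VLM comparison and an image generation model, and the toy utility is purely for demonstration.

```python
import random

# Hypothetical sketch of a visual prompt optimization loop.
# An "image" here is just the set of edit descriptions applied to it;
# a real system would render edits with an image model and query a VLM.

EDITS = ["brighten lighting", "center composition", "blur background",
         "warm color grade", "add soft shadow"]

def vlm_choice_prob(image_a, image_b):
    """Stub: probability the agent selects image_a over image_b.
    Toy utility: images with more edits score higher."""
    sa, sb = len(image_a), len(image_b)
    return sa / (sa + sb) if (sa + sb) else 0.5

def optimize(baseline, n_rounds=5, rng=None):
    """Hill-climb over candidate edits via revealed preference."""
    rng = rng or random.Random(0)
    current = set(baseline)
    for _ in range(n_rounds):
        # Propose a visually plausible modification and apply it.
        candidate = current | {rng.choice(EDITS)}
        # Keep the edit only if it wins the head-to-head comparison.
        if vlm_choice_prob(candidate, current) > 0.5:
            current = candidate
    return current
```

In the paper's setting, the acceptance test would be a repeated pairwise choice query to the VLM under study rather than a deterministic score.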

Key Takeaways

  1. Vision-language models exhibit systematic visual preferences that can be exploited through naturalistic image edits, shifting choices by 20-40%.

  2. Current VLM testing focuses on accuracy and identification rather than behavioral evaluation in consequential decision-making scenarios.

  3. Researchers adapted prompt optimization methods to systematically explore the visual utility functions embedded in vision-language models.
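Inferring a latent visual utility from revealed preferences can be done with a standard pairwise-comparison model. The sketch below uses a Bradley-Terry model fit by the classic minorization-maximization update; this is an assumption for illustration, as the paper may use a different inference procedure.

```python
# Sketch: recover latent utilities from pairwise choice counts,
# assuming a Bradley-Terry model. wins[i][j] = times image i beat j.

def bradley_terry(wins, n_iter=200):
    n = len(wins)
    u = [1.0] * n  # latent utilities (one per image)
    for _ in range(n_iter):
        for i in range(n):
            # MM update: total wins over expected comparisons won.
            num = sum(wins[i][j] for j in range(n) if j != i)
            den = sum((wins[i][j] + wins[j][i]) / (u[i] + u[j])
                      for j in range(n) if j != i)
            if den > 0:
                u[i] = num / den
        s = sum(u)
        u = [x / s for x in u]  # normalize to sum to 1
    return u
```

Under this model, the probability that image i is chosen over image j is u[i] / (u[i] + u[j]), so the fitted utilities directly predict head-to-head choice probabilities.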

Limitations

  • Exhaustive pairwise comparisons on massive datasets with natural variation are expensive, slow, and may miss critical feature combinations.

  • The space of possible visual features is essentially infinite, making comprehensive probing of all dimensions impractical.

Keywords

vision-language models, visual utility, revealed preference, visual prompt optimization, image generation model, choice probability, interpretability pipeline, visual vulnerabilities, safety concerns
