Anonymous authors

Media Coverage

We have hidden our media coverage to maintain anonymity.

Paper Overview

Vision-Language Models (VLMs), with their strong reasoning and planning capabilities, are widely used in embodied decision-making (EDM) tasks such as autonomous driving and robotic manipulation. Recent research has increasingly explored adversarial attacks on VLMs to probe their vulnerabilities. However, existing attacks either rely on overly strong assumptions, requiring full knowledge of the victim VLM, which is impractical for attacking VLM-based EDM systems, or exhibit limited effectiveness. Attacks in the latter category disrupt the system's perception of most semantics in the image, interrupting the VLM's reasoning due to the inconsistency between its perception and the task context defined by the system prompts; the result is invalid outputs that fail to influence interactions with the physical world.

To this end, we propose AdvEDM, a fine-grained adversarial attack framework that uses the vision-text encoder of VLM-based EDM systems as the surrogate model, since this encoder is typically pre-trained and easy for an adversary to access. Specifically, AdvEDM modifies the VLM's perception of only a few key objects while preserving the semantics of the remaining regions. This design greatly reduces conflicts with the task context defined by the system prompts, so the VLM outputs valid but incorrect decisions that alter the actions of physical entities, posing a more substantial safety threat in the physical world. We design two variants based on this framework: AdvEDM-R, which removes the semantics of a specific object from the image, and AdvEDM-A, which injects the semantics of a new object into the image. Experimental results in both general scenarios and EDM tasks demonstrate excellent fine-grained control and attack effectiveness.
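The idea above can be sketched as a projected-gradient attack on a surrogate vision-text encoder: perturb only a masked region so that its embedding moves away from (AdvEDM-R) or toward (AdvEDM-A) a target text embedding, while a regularizer keeps the overall embedding close to the clean one. This is only an illustrative toy, not the paper's exact formulation: the "encoder" here is a random linear map standing in for a CLIP-like model, and the loss weights, mask, and step sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, E = 64, 16                                  # toy "image" and embedding dims
W = rng.standard_normal((E, D)) / np.sqrt(D)   # stand-in for a frozen encoder
encode = lambda x: W @ x

def advedm_attack(img, mask, txt_emb, mode="remove",
                  lr=0.05, eps=0.5, steps=300, lam=0.1):
    """PGD-style perturbation restricted to the region selected by `mask`.

    mode="remove": push the image embedding away from txt_emb (AdvEDM-R style).
    mode="add":    pull the image embedding toward txt_emb (AdvEDM-A style).
    The lam * ||emb - clean||^2 term discourages disturbing other semantics.
    """
    sign = 1.0 if mode == "add" else -1.0
    clean = encode(img)
    delta = np.zeros_like(img)
    for _ in range(steps):
        emb = encode(img + mask * delta)
        # Gradient of  sign * (emb @ txt_emb) - lam * ||emb - clean||^2
        # w.r.t. delta, restricted to the masked region.
        grad = mask * (W.T @ (sign * txt_emb - 2.0 * lam * (emb - clean)))
        delta = np.clip(delta + lr * grad, -eps, eps)   # L_inf projection
    return img + mask * delta

# Hypothetical usage: suppress one object's semantics within a small region.
img = rng.standard_normal(D)
mask = np.zeros(D); mask[:16] = 1.0            # only this region is editable
txt = rng.standard_normal(E); txt /= np.linalg.norm(txt)
adv = advedm_attack(img, mask, txt, mode="remove")
# A matching call with mode="add" would instead pull the embedding toward txt.
```

The mask is what makes the attack fine-grained: pixels outside it are untouched, so the surrounding scene keeps its semantics and the VLM's task context stays consistent.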


Figure 1. (Overview) Comparison of our attack framework with existing works in attacking VLM-based EDM systems. Existing attacks disrupt most of the semantics in the original image, causing the VLM to generate invalid responses. In contrast, our attack selectively alters the VLM’s perception of a specific object while preserving the semantic integrity of other regions. As a result, the VLM produces valid yet incorrect decisions, effectively influencing the system’s interaction with the physical world.


Figure 2. The pipeline of our methods AdvEDM-R and AdvEDM-A.

Hardware setup


Figure 3. We deploy our VLM-based EDM system on a UR3e robotic arm.

Demo of our EDM system's intelligence

(a) "Catch that penguin in my bowl"
(b) "Put that book that is about to fall onto the table"
(c) "Place the blocks onto United States territory"

Our attack demos


Figure 4. The visualization results of our attacks in the autonomous driving decision-making task.


Figure 5. The visualization results of our attacks in the robotic manipulation task.

Demo 1. AdvEDM-R in embodied VQA task. Our attack target is removing the semantics of Black Swan.

Demo 2. AdvEDM-R in robotic manipulation task. Our attack target is removing the semantics of Black Swan.

Demo 3. AdvEDM-A in embodied VQA task. Our attack target is injecting the semantics of Apple.

Demo 4. AdvEDM-A in robotic manipulation task. Our attack target is injecting the semantics of Apple.