This was something I found uncomfortable while trying to understand Vision Language Action (VLA) models: I “knew” all the individual components, CLIP, transformers, diffusion, robot control, but I didn’t actually have a clean mental model of why the combination works.
Given an image, a language instruction, and the robot’s current state, what exactly is fed into the diffusion model, and what is the model learning to predict? I am not trying to propose a detailed architecture here, but to provide a mental map/intuition of what is going on. (PSA: This describes the Diffusion Policy-style architecture. Other VLAs like RT-2 integrate action prediction more tightly into a pretrained Vision Language Model (VLM) backbone.)
At a high level, a VLA model is a policy. It takes in vision, language, and state, and outputs actions.
But these inputs are in different spaces. Images are in pixels, language instructions are text tokens and the robot state is a vector of joint values. Before we can do anything with them, we need to represent them as embeddings.
Each modality answers a different question. Vision answers: “What is where?” Language answers: “What is the task?” State answers: “What can I physically reach or do right now?” For action prediction, we need to be able to represent questions like “Which object mentioned in the instruction is reachable given the current joint configuration?” To this end, all three need to live in a shared latent space where they can influence each other through attention.
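To make this concrete, here is a rough sketch of where those three embeddings might come from; the specific encoders and dimensions (a ResNet-style 2048-d image feature, a CLIP-style 512-d text feature, a small 64-d state MLP) are illustrative assumptions, not a fixed recipe.

# Illustrative encoders (the specific models and dimensions are assumptions)
image_embedding = vision_encoder(image)       # e.g., ResNet-50 pooled features, shape: (2048,)
text_embedding = text_encoder(instruction)    # e.g., CLIP text encoder output, shape: (512,)
state_embedding = state_mlp(robot_state)      # small MLP over joint angles + gripper, shape: (64,)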
# Each embedding is projected into a common dimension, say 256.
proj_image = Linear(2048, 256)(image_embedding)   # vision features -> shared 256-d space
proj_text = Linear(512, 256)(text_embedding)      # language features -> shared 256-d space
proj_state = Linear(64, 256)(state_embedding)     # proprioception -> shared 256-d space
## Representing the input better

Now, before anything reaches the diffusion steps, we need a way to “represent” these three inputs better. The vectors from the different input sources aren’t yet contextualized by each other.
To make them interact, the model projects all of these embeddings using learned linear layers. Now that they have the same dimensionality, we can stack them together and pass them through an attention layer (an over-simplification).
# This is an over-simplification of a single self-attention step
# Stack into a sequence: [image_token, text_token, state_token]
tokens = stack([proj_image, proj_text, proj_state])  # shape: (3, 256)

# W_q, W_k, W_v are learned projection matrices, each of shape (256, 256)
Q = tokens @ W_q
K = tokens @ W_k
V = tokens @ W_v

# Each token attends to every other token (d = 256, the token dimension)
attention = softmax(Q @ K.T / sqrt(d), dim=-1)
fused_tokens = attention @ V  # shape: (3, 256), tokens are now mutually contextualized
After this, the image representation becomes aware of the language instruction and the robot’s current configuration. The language representation becomes grounded in what the robot sees. The robot joint state representation becomes contextualized by both the scene and the task. These vectors are no longer purely visual, linguistic, or proprioceptive (fancy word for robot's "self-awareness" of where its parts are in space).
How does this reach the diffusion model? There are two common approaches:
Cross-attention: The fused tokens are passed as-is to the diffusion model, which attends to them via cross-attention layers. The diffusion model queries these context tokens when deciding how to denoise the action.
Pooled conditioning: The fused tokens are pooled (e.g., averaged) into a single context vector, which is injected into the diffusion model via concatenation (or something fancier).
# Option 1: Pass full sequence to diffusion model (cross-attention)
context_tokens = fused_tokens # shape: (3, 256)
# Option 2: Pool into a single vector (simpler conditioning)
context_vector = mean(fused_tokens, dim=0) # shape: (256,)
This is the conditioning signal for the diffusion model.
In practice, this fused representation is usually processed by a deep transformer backbone, either explicitly before the diffusion model or implicitly inside the diffusion network via cross-attention. This backbone is where multi-step reasoning and grounding occur: language tokens attend to image regions, state tokens inform reachability, and the model builds a task-aware context. The diffusion model then uses this contextualized representation to iteratively denoise actions.
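As a minimal sketch, assuming the explicit-backbone variant and standard PyTorch modules (the depth, head count, and use of nn.TransformerEncoder here are arbitrary illustrative choices, not from any particular paper):

# A deeper fusion backbone: a stack of standard self-attention layers
import torch.nn as nn

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=6,
)
deep_context = backbone(tokens.unsqueeze(0)).squeeze(0)  # shape: (3, 256), task-aware context tokens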
## Generating (or learning) actions

In diffusion-based VLAs, the diffusion model operates over actions, not states. The state is known and fixed for a given decision: it is a measurement that tells the model where the robot is right now. The action is what we are trying to produce: a command that tells the robot what to do next.
In practice, most diffusion-based VLAs do not model a single action, but a short action trajectory or action chunk over a fixed horizon. That is, the random variable being diffused is typically a chunk A = (a_t, a_{t+1}, ..., a_{t+H-1}) of H consecutive actions, and at inference time only the first action (or a small prefix) is executed before replanning.
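As a toy sketch of that receding-horizon pattern (the horizon H = 16, the 8-dimensional action, and the `denoise` helper are hypothetical names for illustration):

# Illustrative shapes for an action chunk
H, action_dim = 16, 8
action_chunk = denoise(noise, context)   # shape: (H, action_dim), produced by the diffusion sampler
next_commands = action_chunk[:4]         # execute only a short prefix, then replan from the new state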
# Robot state: current joint angles (read from encoders)
robot_state = [
joint1_angle, # e.g., 0.5 radians
joint2_angle, # e.g., -0.3 radians
joint3_angle, # e.g., 1.2 radians
joint4_angle, # e.g., 0.0 radians
joint5_angle, # e.g., -0.8 radians
joint6_angle, # e.g., 0.4 radians
joint7_angle, # e.g., 0.1 radians
gripper_pos # e.g., 1.0 (open)
]
# Action: velocity or position delta commands
action = [
delta_joint1, # e.g., +0.1 (move joint 1 forward)
delta_joint2, # e.g., -0.05 (move joint 2 back)
delta_joint3, # e.g., 0.0 (hold position)
delta_joint4, # e.g., 0.0
delta_joint5, # e.g., 0.0
delta_joint6, # e.g., 0.0
delta_joint7, # e.g., 0.0
gripper_cmd # e.g., 1.0 (close gripper)
]
In most implementations, actions are normalized, relative (deltas rather than absolute targets), and often represent velocities or end-effector twists to keep the action distribution smooth and well-behaved under Gaussian noise.
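For example, one common recipe (assumed here; implementations vary) is to rescale each action dimension into roughly [-1, 1] using per-dimension statistics over the demonstration dataset, where `actions` below is a hypothetical (N, action_dim) tensor of all demonstration actions:

# Per-dimension min/max normalization to [-1, 1] (one common convention, not the only one)
a_min = actions.min(dim=0).values
a_max = actions.max(dim=0).values
a_norm = 2 * (a_0 - a_min) / (a_max - a_min) - 1       # normalized action used for diffusion training
a_exec = (a_norm + 1) / 2 * (a_max - a_min) + a_min    # un-normalize before sending commands to the robot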
Diffusion training works by corrupting clean data with noise and teaching a model to reverse that corruption. In this case, the clean data is an expert action from a demonstration. During training, Gaussian noise is added to this action according to a noise schedule (The noise is known. We sample it ourselves). Early in the process, the action is almost clean. Later, it becomes nearly pure noise.
# a_0: clean expert action
# eps: Gaussian noise
# alpha_t: noise schedule (controls how quickly this transition happens)
a_t = sqrt(alpha_t) * a_0 + sqrt(1 - alpha_t) * eps
Note that “time” appears in three distinct ways here: (1) the diffusion timestep (noise level), (2) the control timestep within the action sequence, and (3) the observation history (past images and states).
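To keep them apart, here is how the three might appear as concrete tensors (every size here is an illustrative assumption):

# The three different notions of "time", shown as shapes
obs_history = torch.zeros(2, 3, 224, 224)   # observation history: the last 2 camera frames
action_chunk = torch.zeros(16, 8)           # control timesteps: 16 future actions within the chunk
t = randint(0, T)                           # diffusion timestep: a single scalar noise level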
The training objective is simple: given the noisy action, the timestep, and the context vector, the model is asked to predict the noise that was added.
# Assume 1000 steps in the denoising process
# Clean action (from expert demonstration)
a_0 = [0.1, -0.05, 0.0, 0.0, 0.0, 0.0, 1.0] # 7 numbers
# Random noise (sampled from standard normal)
ε = [0.3, -0.2, 0.5, -0.1, 0.4, -0.3, 0.2] # 7 numbers
# At t=500 (midway through diffusion), α_t ≈ 0.5
a_t = sqrt(0.5) * a_0 + sqrt(0.5) * ε
# = 0.707 * [0.1, -0.05, ...] + 0.707 * [0.3, -0.2, ...]
# = [0.28, -0.18, 0.35, -0.07, 0.28, -0.21, 0.85]
# At t=1000 (end): a_t ≈ ε (pure noise)
# At t=0 (start): a_t = a_0 (clean action)
By learning to do this well across many timesteps and many contexts, the model implicitly learns the conditional distribution of actions given vision, language, and state.
for batch in data:
    image, instruction, robot_state, a_0 = batch

    # Encode and fuse context (the encoders + attention step from earlier)
    context = encode_and_fuse(image, instruction, robot_state)

    # Sample noise
    eps = torch.randn_like(a_0)

    # Sample a diffusion timestep and look up its noise level
    t = randint(0, T)
    alpha_t = noise_schedule[t]

    # Create noisy action
    a_t = sqrt(alpha_t) * a_0 + sqrt(1 - alpha_t) * eps

    # Predict the noise
    eps_pred = model(a_t, context, t)

    # Train to recover the noise
    loss = MSE(eps_pred, eps)
At inference time, the setup is similar but inverted: there is no expert action. We start from pure noise and use the model as a learned denoiser, guided by the context. Once the denoised action (or chunk prefix) has been executed, we have a new state. Rinse and repeat (like in the image up top).
a = torch.randn(action_dim)          # start from pure Gaussian noise
for t in reversed(range(T)):
    eps_pred = model(a, context, t)  # predict the noise at this level, given the context
    a = denoise_step(a, eps_pred, t) # remove a bit of it
The core problem with naive behavior cloning (supervised learning from demonstrations) is multimodality. When different demonstrators push an object left versus right to achieve the same goal, a policy trained with an MSE loss learns the average behavior, which might correspond to pushing straight into the object. This averaging problem is catastrophic for robot control.
Diffusion policies avoid this by modeling an implicit distribution over action trajectories rather than regressing a single action. At inference time, the model samples and commits to one valid mode instead of averaging across modes. This is analogous to image diffusion models, which generate sharp, distinct samples rather than blurry averages.
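A tiny numerical illustration of the averaging failure (the bimodal push-left/push-right setup and the numbers are made up):

# Two equally valid expert actions for the same observation: push left (-1) or push right (+1)
expert_actions = torch.tensor([-1.0, +1.0])

# An MSE-trained regressor converges toward the mean of its targets...
mse_optimal = expert_actions.mean()                    # 0.0 -> push straight into the object

# ...whereas sampling from the learned distribution commits to one mode
sampled = expert_actions[torch.randint(0, 2, (1,))]    # -1.0 or +1.0, never 0.0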
Specific implementations vary on whether they use transformers or other sequence models, DDPM or flow matching, but the core idea remains the same.
tldr: Raw inputs come in as an image, a language instruction, and the robot’s current joint configuration. Each modality is encoded, projected into a shared space, and fused. The resulting context representation conditions a diffusion model that operates over actions (typically short action trajectories rather than single control commands). During training, the model learns to predict known noise added to expert actions. During inference, it starts from noise and iteratively denoises to produce a valid action.
VLA / Robotics:
Diffusion Policy: Chi et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion," RSS 2023. diffusion-policy.cs.columbia.edu | arXiv:2303.04137
RT-2: Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control," 2023. robotics-transformer2.github.io | arXiv:2307.15818
Diffusion Foundations:
Tutorials: