This was something I found uncomfortable while trying to understand Vision Language Action (VLA) models: I “knew” all the individual components, CLIP, transformers, diffusion, robot control, but I didn’t actually have a clean mental model of why the combination works.
Given an image, a language instruction, and the robot’s current state, what exactly is fed into the diffusion model, and what is the model learning to predict? I am not trying to propose a detailed architecture here, but to provide a mental map/intuition of what is going on. (PSA: This describes the Diffusion Policy-style architecture. Other VLAs like RT-2 integrate action prediction more tightly into a pretrained Vision Language Model (VLM) backbone.)
At a high level, a VLA model is a policy. It takes in vision, language, and state, and outputs actions.
But these inputs are in different spaces. Images are in pixels, language instructions are text tokens and the robot state is a vector of joint values. Before we can do anything with them, we need to represent them as embeddings.
Each modality answers a different question. Vision answers: “What is where?” Language answers: “What is the task?” State answers: “What can I physically reach or do right now?” For action prediction, we need to be able to represent questions like “Which object mentioned in the instruction is reachable given the current joint configuration?” To this end, all three need to live in a shared latent space where they can influence each other through attention.
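To make this concrete, here is a rough sketch of where those three embeddings might come from; the specific encoders and dimensions (a ResNet-style 2048-d image feature, a CLIP-style 512-d text feature, a small 64-d state MLP) are illustrative assumptions, not a fixed recipe.

# Illustrative encoders (the specific models and dimensions are assumptions)
image_embedding = vision_encoder(image)       # e.g., ResNet-50 pooled features, shape: (2048,)
text_embedding = text_encoder(instruction)    # e.g., CLIP text encoder output, shape: (512,)
state_embedding = state_mlp(robot_state)      # small MLP over joint angles + gripper, shape: (64,)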
# Each embedding is projected into a common dimension, say 256.
proj_image = Linear(2048, 256)(image_embedding)   # vision features -> shared 256-d space
proj_text = Linear(512, 256)(text_embedding)      # language features -> shared 256-d space
proj_state = Linear(64, 256)(state_embedding)     # proprioception -> shared 256-d space
## Representing the input better

Now, before anything reaches the diffusion steps, we need a way to “represent” these three inputs better. The vectors from the different input sources aren’t yet contextualized by each other.
To make them interact, the model projects all of these embeddings using learned linear layers. Now that they have the same dimensionality, we can stack them together and pass them through an attention layer (an over-simplification).
# This is an over-simplification of a single self-attention step
# Stack into a sequence: [image_token, text_token, state_token]
tokens = stack([proj_image, proj_text, proj_state])  # shape: (3, 256)

# W_q, W_k, W_v are learned projection matrices, each of shape (256, 256)
Q = tokens @ W_q
K = tokens @ W_k
V = tokens @ W_v

# Each token attends to every other token (d = 256, the token dimension)
attention = softmax(Q @ K.T / sqrt(d), dim=-1)
fused_tokens = attention @ V  # shape: (3, 256), tokens are now mutually contextualized
After this, the image representation becomes aware of the language instruction and the robot’s current configuration. The language representation becomes grounded in what the robot sees. The robot joint state representation becomes contextualized by both the scene and the task. These vectors are no longer purely visual, linguistic, or proprioceptive (fancy word for robot's "self-awareness" of where its parts are in space).
How does this reach the diffusion model? There are two common approaches:
Cross-attention: The fused tokens are passed as-is to the diffusion model, which attends to them via cross-attention layers. The diffusion model queries these context tokens when deciding how to denoise the action.
Pooled conditioning: The fused tokens are pooled (e.g., averaged) into a single context vector, which is injected into the diffusion model via concatenation (or something fancier).
# Option 1: Pass full sequence to diffusion model (cross-attention)
context_tokens = fused_tokens # shape: (3, 256)
# Option 2: Pool into a single vector (simpler conditioning)
context_vector = mean(fused_tokens, dim=0) # shape: (256,)
This is the conditioning signal for the diffusion model.
In practice, this fused representation is usually processed by a deep transformer backbone, either explicitly before the diffusion model or implicitly inside the diffusion network via cross-attention. This backbone is where multi-step reasoning and grounding occur: language tokens attend to image regions, state tokens inform reachability, and the model builds a task-aware context. The diffusion model then uses this contextualized representation to iteratively denoise actions.
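As a minimal sketch, assuming the explicit-backbone variant and standard PyTorch modules (the depth, head count, and use of nn.TransformerEncoder here are arbitrary illustrative choices, not from any particular paper):

# A deeper fusion backbone: a stack of standard self-attention layers
import torch.nn as nn

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=6,
)
deep_context = backbone(tokens.unsqueeze(0)).squeeze(0)  # shape: (3, 256), task-aware context tokens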
## Generating (or learning) actions

In diffusion-based VLAs, the diffusion model operates over actions, not states. The state is known and fixed for a given decision: it is a measurement that tells the model where the robot is right now. The action is what we are trying to produce: a command that tells the robot what to do next.
In practice, most diffusion-based VLAs do not model a single action, but a short action trajectory or action chunk over a fixed horizon. That is, the random variable being diffused is typically a chunk A = (a_t, a_{t+1}, ..., a_{t+H-1}) of H consecutive actions, and at inference time only the first action (or a small prefix) is executed before replanning.
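As a toy sketch of that receding-horizon pattern (the horizon H = 16, the 8-dimensional action, and the `denoise` helper are hypothetical names for illustration):

# Illustrative shapes for an action chunk
H, action_dim = 16, 8
action_chunk = denoise(noise, context)   # shape: (H, action_dim), produced by the diffusion sampler
next_commands = action_chunk[:4]         # execute only a short prefix, then replan from the new state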
# Robot state: current joint angles (read from encoders)
robot_state = [
joint1_angle, # e.g., 0.5 radians
joint2_angle, # e.g., -0.3 radians
joint3_angle, # e.g., 1.2 radians
joint4_angle, # e.g., 0.0 radians
joint5_angle, # e.g., -0.8 radians
joint6_angle, # e.g., 0.4 radians
joint7_angle, # e.g., 0.1 radians
gripper_pos # e.g., 1.0 (open)
]
# Action: velocity or position delta commands
action = [
delta_joint1, # e.g., +0.1 (move joint 1 forward)
delta_joint2, # e.g., -0.05 (move joint 2 back)
delta_joint3, # e.g., 0.0 (hold position)
delta_joint4, # e.g., 0.0
delta_joint5, # e.g., 0.0
delta_joint6, # e.g., 0.0
delta_joint7, # e.g., 0.0
gripper_cmd # e.g., 1.0 (close gripper)
]
In most implementations, actions are normalized, relative (deltas rather than absolute targets), and often represent velocities or end-effector twists to keep the action distribution smooth and well-behaved under Gaussian noise.
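For example, one common recipe (assumed here; implementations vary) is to rescale each action dimension into roughly [-1, 1] using per-dimension statistics over the demonstration dataset, where `actions` below is a hypothetical (N, action_dim) tensor of all demonstration actions:

# Per-dimension min/max normalization to [-1, 1] (one common convention, not the only one)
a_min = actions.min(dim=0).values
a_max = actions.max(dim=0).values
a_norm = 2 * (a_0 - a_min) / (a_max - a_min) - 1       # normalized action used for diffusion training
a_exec = (a_norm + 1) / 2 * (a_max - a_min) + a_min    # un-normalize before sending commands to the robot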
Diffusion training works by corrupting clean data with noise and teaching a model to reverse that corruption. In this case, the clean data is an expert action from a demonstration. During training, Gaussian noise is added to this action according to a noise schedule (The noise is known. We sample it ourselves). Early in the process, the action is almost clean. Later, it becomes nearly pure noise.
# a_0: clean expert action
# eps: Gaussian noise
# alpha_t: noise schedule (controls how quickly this transition happens)
a_t = sqrt(alpha_t) * a_0 + sqrt(1 - alpha_t) * eps
Note that “time” appears in three distinct ways here: (1) the diffusion timestep (noise level), (2) the control timestep within the action sequence, and (3) the observation history (past images and states).
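To keep them apart, here is how the three might appear as concrete tensors (every size here is an illustrative assumption):

# The three different notions of "time", shown as shapes
obs_history = torch.zeros(2, 3, 224, 224)   # observation history: the last 2 camera frames
action_chunk = torch.zeros(16, 8)           # control timesteps: 16 future actions within the chunk
t = randint(0, T)                           # diffusion timestep: a single scalar noise level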
The training objective is simple: given the noisy action, the timestep, and the context vector, the model is asked to predict the noise that was added.
# Assume 1000 steps in the denoising process
# Clean action (from expert demonstration)
a_0 = [0.1, -0.05, 0.0, 0.0, 0.0, 0.0, 1.0] # 7 numbers
# Random noise (sampled from standard normal)
ε = [0.3, -0.2, 0.5, -0.1, 0.4, -0.3, 0.2] # 7 numbers
# At t=500 (midway through diffusion), α_t ≈ 0.5
a_t = sqrt(0.5) * a_0 + sqrt(0.5) * ε
# = 0.707 * [0.1, -0.05, ...] + 0.707 * [0.3, -0.2, ...]
# = [0.28, -0.18, 0.35, -0.07, 0.28, -0.21, 0.85]
# At t=1000 (end): a_t ≈ ε (pure noise)
# At t=0 (start): a_t = a_0 (clean action)
By learning to do this well across many timesteps and many contexts, the model implicitly learns the conditional distribution of actions given vision, language, and state.
for batch in data:
    image, instruction, robot_state, a_0 = batch

    # Encode and fuse context (the encoders + attention step from earlier)
    context = encode_and_fuse(image, instruction, robot_state)

    # Sample noise
    eps = torch.randn_like(a_0)

    # Sample a diffusion timestep and look up its noise level
    t = randint(0, T)
    alpha_t = noise_schedule[t]

    # Create noisy action
    a_t = sqrt(alpha_t) * a_0 + sqrt(1 - alpha_t) * eps

    # Predict the noise
    eps_pred = model(a_t, context, t)

    # Train to recover the noise
    loss = MSE(eps_pred, eps)
At inference time, the setup is similar but inverted: there is no expert action. We start from pure noise and use the model as a learned denoiser, guided by the context. Once the denoised action (or chunk prefix) has been executed, we have a new state. Rinse and repeat (like in the image up top).
a = torch.randn(action_dim)          # start from pure Gaussian noise
for t in reversed(range(T)):
    eps_pred = model(a, context, t)  # predict the noise at this level, given the context
    a = denoise_step(a, eps_pred, t) # remove a bit of it
The core problem with naive behavior cloning (supervised learning from demonstrations) is multimodality. When different demonstrators push an object left versus right to achieve the same goal, a policy trained with an MSE loss learns the average behavior, which might correspond to pushing straight into the object. This averaging problem is catastrophic for robot control.
Diffusion policies avoid this by modeling an implicit distribution over action trajectories rather than regressing a single action. At inference time, the model samples and commits to one valid mode instead of averaging across modes. This is analogous to image diffusion models, which generate sharp, distinct samples rather than blurry averages.
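A tiny numerical illustration of the averaging failure (the bimodal push-left/push-right setup and the numbers are made up):

# Two equally valid expert actions for the same observation: push left (-1) or push right (+1)
expert_actions = torch.tensor([-1.0, +1.0])

# An MSE-trained regressor converges toward the mean of its targets...
mse_optimal = expert_actions.mean()                    # 0.0 -> push straight into the object

# ...whereas sampling from the learned distribution commits to one mode
sampled = expert_actions[torch.randint(0, 2, (1,))]    # -1.0 or +1.0, never 0.0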
Specific implementations vary on whether they use transformers or other sequence models, DDPM or flow matching, but the core idea remains the same.
tldr: Raw inputs come in as an image, a language instruction, and the robot’s current joint configuration. Each modality is encoded, projected into a shared space, and fused. The resulting context representation conditions a diffusion model that operates over actions (typically short action trajectories rather than single control commands). During training, the model learns to predict known noise added to expert actions. During inference, it starts from noise and iteratively denoises to produce a valid action.
VLA / Robotics:
Diffusion Policy: Chi et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion," RSS 2023. diffusion-policy.cs.columbia.edu | arXiv:2303.04137
RT-2: Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control," 2023. robotics-transformer2.github.io | arXiv:2307.15818
Diffusion Foundations:
Tutorials: