Learning to See While Learning to Act:
Diffusion Models for Active Perception in Robot Imitation

Anonymous Authors
Under Review

Overview

See2Act teaser

See2Act couples seeing and acting in a single denoising loop. At each step, the camera view (see: highlighted region) and the predicted action (act: arrow) are refined together: the action sets the next viewpoint, and the new viewpoint conditions the next denoising step of the action. The resulting policy repositions the camera to reveal placement positions that are initially hidden from overhead views, achieving 95% success in zero-shot sim-to-real transfer on tasks with occlusion.

Abstract

Most imitation learning methods assume full observability in table-top settings. In practice, objects are often occluded, requiring robots to both search and act, and learning this coupled behavior from limited demonstrations remains challenging. We propose See2Act, an imitation learning approach that conditions action prediction on a sequence of actively inferred viewpoints at test time by coupling action denoising with viewpoint refinement. The policy is trained using camera poses anchored to keyframe actions from offline demonstrations, so it implicitly learns where to see while learning how to act. We show empirically that on Ravens the policy recovers informative viewpoints under severe occlusions, and that on RLBench tasks it improves performance by up to 33% over prior methods. In the real world, we collect 50 demonstrations in a digital twin and achieve zero-shot sim-to-real transfer on pick-and-place tasks using depth observations. The policy handles significant occlusions, showing that learned viewpoint reasoning enables robust manipulation under partial observability.

Method

See2Act method overview

Left: Training. Given demonstrations -- extracted object states s and keyframe actions a0 -- we sample diffusion timesteps t, compute the camera poses C0, …, CT, render observations Ot in a digital twin, and generate noisy actions ât by transforming a0 into each camera frame and adding Gaussian noise. The visual encoder and score network are trained jointly to predict the noise ε̃ with an MSE loss.
Right: Inference. Starting from an overview view (t = T), See2Act iteratively refines both the action and the camera pose: at each denoising step it captures an observation, predicts the noise, updates the action estimate, and computes the next camera pose. The final action is executed by the robot.
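The training and inference loops described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: actions are reduced to 3D positions, the camera is an axis-aligned top-down view with translation only (the paper uses full 6-DoF poses), the noise schedule is a standard DDPM-style linear schedule, and the update is a deterministic DDIM-style step. All function names (`camera_for_step`, `make_training_sample`, `denoise_with_active_view`) and parameters (`near`, `far`, `beta_max`) are hypothetical.

```python
import numpy as np

def top_down_camera(target, height):
    """World-to-camera transform for an axis-aligned camera `height` above `target`."""
    W = np.eye(4)
    W[:3, 3] = -(np.asarray(target, dtype=float) + np.array([0.0, 0.0, height]))
    return W

def to_camera(W, p_world):
    """Express a world-frame point in the camera frame."""
    return (W @ np.append(p_world, 1.0))[:3]

def to_world(W, p_cam):
    """Express a camera-frame point in the world frame."""
    return (np.linalg.inv(W) @ np.append(p_cam, 1.0))[:3]

def camera_for_step(anchor, t, T, center, near=0.2, far=1.0):
    """Assumed schedule: overview of `center` at t = T, close-up of `anchor` at t = 0."""
    w = t / T
    target = w * np.asarray(center, dtype=float) + (1.0 - w) * np.asarray(anchor, dtype=float)
    return top_down_camera(target, near + w * (far - near))

def make_alpha_bar(T, beta_max=0.2):
    """Cumulative signal level of an assumed linear beta schedule; alpha_bar[0] = 1."""
    betas = np.linspace(1e-4, beta_max, T)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def make_training_sample(a0_world, t, T, center, alpha_bar, rng):
    """Training: anchor the camera to the keyframe action, then noise a0 in that frame."""
    cam = camera_for_step(a0_world, t, T, center)
    a0_cam = to_camera(cam, a0_world)
    eps = rng.standard_normal(3)
    a_t = np.sqrt(alpha_bar[t]) * a0_cam + np.sqrt(1.0 - alpha_bar[t]) * eps
    return cam, a_t, eps  # the network would be trained with mean((eps_pred - eps) ** 2)

def denoise_with_active_view(capture, predict_noise, T, center, alpha_bar, rng):
    """Inference: refine the action and the camera pose together, from t = T down to 0."""
    cam = camera_for_step(center, T, T, center)      # overview view at t = T
    a_t = rng.standard_normal(3)                     # pure noise in the camera frame
    for t in range(T, 0, -1):
        obs = capture(cam)                           # new observation at every step
        eps = predict_noise(obs, a_t, t)
        a0_cam = (a_t - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        a0_world = to_world(cam, a0_cam)             # current clean-action estimate
        cam = camera_for_step(a0_world, t - 1, T, center)  # move the camera toward it
        a_t = (np.sqrt(alpha_bar[t - 1]) * to_camera(cam, a0_world)
               + np.sqrt(1.0 - alpha_bar[t - 1]) * eps)
    return to_world(cam, a_t)                        # final action in the world frame

# Demo with an oracle noise predictor standing in for the trained network.
T = 10
alpha_bar = make_alpha_bar(T)
rng = np.random.default_rng(0)
center = np.zeros(3)
target = np.array([0.3, -0.1, 0.05])
capture = lambda cam: to_camera(cam, target)  # stand-in for a rendered depth image
oracle = lambda obs, a, t: (a - np.sqrt(alpha_bar[t]) * obs) / np.sqrt(1.0 - alpha_bar[t])
action = denoise_with_active_view(capture, oracle, T, center, alpha_bar, rng)
```

The oracle is used only to keep the sketch self-contained. In the actual method, the noise is predicted by the jointly trained visual encoder and score network from the captured observation.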

Ravens Benchmark

A diagnostic benchmark for tasks with occlusion. The base task is fully observable; three variants add front-view occlusions.

Place-Red-in-Green
Pick the red block and place it in the green bowl, with colored blocks and bowls as distractors. The base, fully-observable diagnostic task.
Bin-Picking
The red block is initialized inside a single open-topped bin, invisible from the front view -- the policy must look inside to locate it.
Put-Within-Shelf
The red block and green bowl are hidden on the top and bottom levels of a shelf, respectively, requiring views into both occluded levels.
Bin-Search
The red block is placed at random inside one of three open-topped bins, while the green bowl stays visible on the table -- the policy must search the bins.

See2Act achieves the highest success rate on all four tasks (96%, 96%, 72%, 100%), substantially outperforming all baselines, especially on the more heavily occluded variants where prior methods largely fail.

Real Robot Setup

Real-world experimental setup

Real-world tasks, showing the initial, pick, and place states. The blue region marks the area occluded from the camera at its initial overhead pose. Arrows indicate the transition from the initial state to the goal.

Real Robot Results

See2Act -- Clean Up (Walkthrough)

Detailed walkthrough of our flagship task with step-by-step explanations.

Comparison of See2Act (ours) against diffusion-based multiview baselines on real-robot tasks with occlusion.

Shelf-Kitting
A high-precision insertion task: a base object, initially hidden inside a shelf, must be retrieved and inserted into its matching kit receptacle within 1-2 mm positional and 1-2° rotational tolerance. Success is releasing the base within this tolerance.
See2Act (Ours)
Diffusion-MV
Diffusion-MV-Entropy
Clean-Up
The robot returns two wheels and one base from randomized table locations to an occluded shelf, repositioning its camera to identify valid placement regions. To remove pick-order ambiguity, training labels follow the object closest to the view center at each step.
See2Act (Ours)
Diffusion-MV
Diffusion-MV-Entropy
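
The clean-up labeling rule (training labels follow the object closest to the view center) could be sketched as follows. This is only an illustration: the page does not say whether closeness is measured in image space or in 3D, so measuring pixel distance to the image center is an assumption, and the function name, object names, and coordinates are hypothetical.

```python
import numpy as np

def closest_to_view_center(pixel_coords, image_size):
    """Index of the object whose projected pixel (u, v) is nearest the image center."""
    pts = np.asarray(pixel_coords, dtype=float)                  # shape (N, 2)
    center = (np.asarray(image_size, dtype=float) - 1.0) / 2.0   # (width, height) -> center pixel
    return int(np.argmin(np.linalg.norm(pts - center, axis=1)))

# Hypothetical projections of the three clean-up objects in a 640x480 view.
objects = {"wheel_a": (40, 200), "wheel_b": (330, 250), "base": (600, 90)}
names = list(objects)
idx = closest_to_view_center([objects[n] for n in names], image_size=(640, 480))
label_object = names[idx]  # "wheel_b", the object nearest the image center
```

Because the camera moves at every denoising step, the selected object can change from step to step, which matches the "at each step" wording above.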

See2Act reaches 95% success on both shelf-kitting and clean-up from only 50 demonstrations, while Diffusion-MV (25%/50%) and Diffusion-MV-Entropy (50%/50%) need 6,000x more training data. By repositioning its camera close to the target as the action denoises, See2Act obtains the close-up views needed for tight-tolerance insertion and stays robust to sim-to-real calibration error, which fixed-view baselines cannot match.

For more details and insights, please refer to the paper.

Limitations and Conclusion

We presented See2Act, a diffusion-based visuomotor imitation learning framework that jointly denoises actions and full 6-DoF camera trajectories, enabling active perception to resolve occlusions that fixed-view and zoom-only methods cannot handle. Our approach achieves strong performance on tasks requiring active search and precise placement, and transfers zero-shot from simulation to a real robot using depth observations, reaching 95% success with only 50 demonstrations. The main limitation is speed: because each denoising step requires a new observation, execution time depends on camera latency and is slower than with fixed-view methods. Extending See2Act to dense trajectory prediction is promising future work that could bring the benefits of our framework to dexterous and dynamic manipulation.