Why part 4 is about synthetic data
The first three posts in this series moved from the ownership map to teleoperation and human video. If you are joining here, start with Part 1: the data war landscape, Part 2: teleoperation, and Part 3: human video mining. Part 4 answers the next practical question: if real robot data is expensive, slow, and hard to scale, how far can simulation amplify it before the model starts learning the wrong thing?
Synthetic data for humanoid robots is not just pretty rendered images. A useful synthetic data pipeline produces trajectories with actions, robot state, camera observations, object poses, physics, success labels, and a format that can be mixed with real data for policy training. For a task like "pick up a cup and place it into a tray", good synthetic data does not merely show the policy more cup colors. It must teach hand trajectories, gripper timing, contact with the tray, collision avoidance, and recovery from different initial placements.
The appeal is scale. MimicGen reports generating more than 50,000 new demonstrations across 18 tasks from about 200 source human demonstrations. DexMimicGen extends the idea to bimanual dexterous manipulation, generating more than 20,000 demonstrations from 60 source demonstrations across 9 tasks. NVIDIA's GR00T-Mimic blueprint pushes the workflow toward industrial scale: 780,000 synthetic trajectories, equivalent to 6,500 hours of demonstrations, generated in 11 hours; when combined with real data, NVIDIA reports a 40% GR00T N1 performance improvement over real-only training.
The warning is just as important: synthetic data does not replace reality. It amplifies good demonstrations, explores many variations, and reduces experiment cost. It does not automatically fix wrong physics, weak contact modeling, missing actuator delay, overly clean sensors, or force-heavy tasks such as insertion, screwing, zipping, wiping, and cable handling. The practical question is not "sim or real?" but how much simulation, which kind of simulation, and where real data must stay in the loop.
The pipeline from one demo to a dataset
Imagine you have a two-arm humanoid, or a mobile manipulator with two end-effectors. You want to teach it: "pick up a bowl, pour small objects into the bowl, and place the bowl on a tray." A minimum synthetic data pipeline looks like this:
1. Collect source demonstrations
   - sim teleoperation, real robot teleoperation, or retargeted human motion
2. Segment each demo into subtasks
   - reach bowl
   - grasp bowl
   - move to pouring zone
   - pour object
   - place bowl on tray
3. Attach each subtask to an object frame
   - bowl frame, tray frame, cup frame, drawer handle frame
4. Randomize the scene
   - object poses, color, texture, lighting, camera, friction, mass
5. Transform the trajectory
   - preserve end-effector relationships to objects in the new scene
6. Replay in simulation
   - run the controller, check collisions, check task success
7. Filter the data
   - keep successful episodes and discard bad grasps or unsafe collisions
8. Train the policy
   - behavior cloning, diffusion policy, VLA fine-tuning, or sim-real co-training
9. Evaluate on the real robot
   - measure the gap and add real demos around failure modes
For beginners, the core idea is in steps 3 and 5. Synthetic trajectory generation is not "add noise to actions". If the source demo says "move the right hand to absolute position x=0.42, y=-0.10", replaying that in a new scene fails when the bowl is somewhere else. A MimicGen-style pipeline stores relative motion: "move the hand to this pose relative to the bowl." When the bowl is randomized to a new location, the end-effector trajectory is transformed through the new object frame.
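To make the "pose relative to the bowl" idea concrete, here is a minimal sketch. It assumes planar poses written as (x, y, yaw) tuples, which is a simplification of the full 3D poses a real pipeline would use. The helper names `compose`, `invert`, and `retarget_ee_pose` are illustrative and do not come from MimicGen.

```python
import math

def compose(a, b):
    """Return the world pose of b, where b is expressed in the frame of pose a."""
    ax, ay, ath = a
    bx, by, bth = b
    c, s = math.cos(ath), math.sin(ath)
    return (ax + c * bx - s * by, ay + s * bx + c * by, ath + bth)

def invert(p):
    """Return the inverse of a planar pose (x, y, yaw)."""
    x, y, th = p
    c, s = math.cos(th), math.sin(th)
    return (-(c * x + s * y), s * x - c * y, -th)

def retarget_ee_pose(ee_src, obj_src, obj_new):
    """Keep the end-effector pose fixed relative to the object, then re-express it in the new scene."""
    ee_in_obj = compose(invert(obj_src), ee_src)  # hand pose in the bowl frame
    return compose(obj_new, ee_in_obj)            # same relative pose, new bowl location

# Source demo: bowl at (0.40, -0.10), hand 5 cm behind it along the bowl's x axis.
bowl_src = (0.40, -0.10, 0.0)
hand_src = (0.35, -0.10, 0.0)

# New scene: the bowl has moved and rotated by 90 degrees.
bowl_new = (0.55, 0.20, math.pi / 2)
hand_new = retarget_ee_pose(hand_src, bowl_src, bowl_new)
print([round(v, 3) for v in hand_new])
```

The hand ends up 5 cm behind the rotated bowl, at roughly (0.55, 0.15), instead of at the absolute position recorded in the source demo. Replaying the absolute position would have missed the bowl entirely.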
MimicGen: object-centric trajectory transformation
MimicGen is the cleanest example for single-arm manipulation, or tasks that can be described as a clear sequence of subtasks. The paper models each task as a sequence of object-centric subtasks. A source demonstration is split into contiguous segments, and each segment is tied to a reference object. When a new scene is randomized, the system reads the new object pose, computes the transform from the source object frame to the new object frame, applies that transform to the end-effector trajectory, replays it in simulation, and keeps the demonstration only if the task succeeds.
You can read the pipeline as a small program:
    for episode_id in range(target_num_episodes):
        scene = randomize_scene(objects, cameras, physics)
        source_demo = sample(source_demos)
        generated_actions = []
        for subtask in source_demo.subtasks:
            # Map the source segment into the new scene through its reference object.
            ref_object = scene.object(subtask.object_name)
            delta = ref_object.pose @ inverse(subtask.source_object_pose)
            segment = transform_trajectory(subtask.ee_poses, delta)
            # Connect the robot's current pose to the start of the transformed segment.
            segment = bridge_from_current_robot_pose(segment)
            generated_actions.extend(segment)
        # Replay the stitched trajectory and keep it only if the task succeeds.
        rollout = execute_open_loop(scene, generated_actions)
        if success(rollout) and no_bad_collision(rollout):
            save(rollout)
The important point is that MimicGen does not require thousands of people driving robots. It needs a smaller set of clean source demonstrations, correct subtask annotations, reliable object poses, and a simulator good enough to replay the generated motion. According to the OpenReview page, the authors generated more than 50,000 demonstrations for 18 tasks from about 200 human demos, then trained imitation learning agents for long-horizon and high-precision tasks such as assembly and coffee preparation.
For a small team, MimicGen is a good fit when the task has these properties:
| Condition | Good for synthetic data? | Why |
|---|---|---|
| Object pose is available from sim or perception | High | Trajectory transforms need reliable object frames |
| The task can be split into clear subtasks | High | Reach, grasp, move, and place are easy to annotate |
| Contact is short and simple | Medium-high | Pick-place is easier than insertion or screwing |
| Objects are deformable or cable-like | Low | Most simulators struggle with deformation fidelity |
| Success can be checked with rules | High | Automatic filtering removes many bad episodes |
A common mistake is scaling too early. If 10 source demos contain poor grasp poses, 10,000 synthetic demos will simply teach the policy the poor behavior faster. Audit the source demonstrations before generation: replay each subtask, inspect cameras, check gripper state, verify object poses, and label failure cases.
DexMimicGen: when humanoids need two hands
A humanoid is not just a robot arm with legs attached. Many kitchen and warehouse tasks require two hands: one hand stabilizes a tray while the other places a cup; one hand opens a drawer while the other removes an item; both hands lift a box. This is why DexMimicGen exists.
DexMimicGen inherits the core MimicGen idea but adds three subtask types for bimanual dexterous manipulation:
| Subtask type | Example | Main challenge |
|---|---|---|
| Parallel | Each hand picks a different object | The two arms do not need to finish at the same time |
| Coordination | Both hands lift a tray | Relative pose between end-effectors must be preserved |
| Sequential | Pour into a bowl, then place the bowl | One arm may need to wait for the other |
In the paper, DexMimicGen uses per-arm segmentation, asynchronous action queues for parallel subtasks, synchronization for coordination subtasks, and ordering constraints for sequential subtasks. The numbers are important: the team generated about 21,000 demos from 60 source demos across 9 tasks; for a real2sim2real can-sorting task, the final visuomotor policy achieved 90% success, compared with 0% from using only the source demos.
For humanoids, the lesson is not merely "one demo becomes one thousand demos." The real lesson is task structure. If the task needs two arms, a single linear annotation sequence will corrupt the data. You need to know when the arms are independent, when they must stay synchronized, and when one arm must wait. This is also where data ownership becomes interesting: the source demos may be few, but the final value comes from the annotation schema, simulator, controller, and success filter.
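The difference between the three subtask types can be illustrated with a toy scheduler. This is not DexMimicGen's implementation. It is a sketch that assumes each arm's segment is a list of action labels, that a `"hold"` action keeps an arm still, and that in a sequential phase the right arm goes first. All names are hypothetical.

```python
def schedule(plan):
    """Merge per-arm segments into one timeline of (left_action, right_action) steps.

    plan is a list of (kind, left_actions, right_actions) phases.
    """
    timeline = []
    for kind, left, right in plan:
        if kind == "parallel":
            # Arms run independently; the arm that finishes first holds until the phase ends.
            n = max(len(left), len(right))
            left = left + ["hold"] * (n - len(left))
            right = right + ["hold"] * (n - len(right))
        elif kind == "coordination":
            # Arms move in lockstep, so the two segments must have the same length.
            if len(left) != len(right):
                raise ValueError("coordination segments must have equal length")
        elif kind == "sequential":
            # Assumption: the right arm acts first, then the left arm acts while the right holds.
            left, right = ["hold"] * len(right) + left, right + ["hold"] * len(left)
        else:
            raise ValueError(f"unknown subtask kind: {kind}")
        timeline.extend(zip(left, right))
    return timeline

plan = [
    ("parallel", ["L_reach", "L_grasp"], ["R_reach", "R_align", "R_grasp"]),
    ("coordination", ["L_lift", "L_carry"], ["R_lift", "R_carry"]),
    ("sequential", ["L_place"], ["R_pour", "R_retract"]),
]
for step in schedule(plan):
    print(step)
```

A single linear annotation would force one ordering on all eight steps. Keeping the phase type explicit is what lets a generator decide where arms may drift apart in time and where they may not.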
GR00T-Mimic inside Isaac Lab
NVIDIA brings this idea closer to a production workflow with Isaac Lab Mimic and GR00T-Mimic. The Isaac Lab documentation describes collecting or loading recorded demonstrations, annotating subtasks, generating additional demonstrations, and training a policy. The docs also show Isaac Lab Mimic support for robots with multiple end-effectors, including a Fourier GR-1 humanoid pick-and-place example. One very practical detail: generated candidate demonstrations do not always succeed. The pipeline must filter them by success criteria.
The GR00T-Mimic flow can be understood like this:
    source:
      demo_type: teleoperation
      device: Apple Vision Pro, SpaceMouse, keyboard, or prerecorded sample
      format: HDF5 or LeRobot-like episode data
    annotation:
      subtasks:
        - left_hand_pick_object
        - right_hand_wait
        - handover_or_place
      method: heuristic_or_manual
    generation:
      simulator: Isaac Lab on Isaac Sim
      mimic: transform_and_stitch_subtask_segments
      randomization:
        visual: [lighting, material, camera, background]
        physics: [mass, friction, damping, actuator_delay, sensor_noise]
    training:
      policy: BC, diffusion policy, or GR00T-style VLA post-training
      mixture: real + synthetic
GR00T-Mimic should not be understood as a magic "generate data" button. The task environment must expose the right APIs: get end-effector pose, convert target pose to action, get object poses, read gripper actions, and define success. Without those functions, the system cannot transform old subtask segments into valid motions in new scenes.
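One way to keep that requirement visible is to write the hooks down as an interface and check an environment against it before generating anything. The sketch below is an assumption-laden illustration: the method names mirror the list in the paragraph above, but they are hypothetical and are not Isaac Lab's actual class or method signatures.

```python
from typing import Mapping, Protocol, Sequence

Pose = Sequence[float]  # hypothetical flat pose, e.g. [x, y, z, qw, qx, qy, qz]

class MimicReadyEnv(Protocol):
    """Hooks a generation pipeline needs from a task environment (illustrative names)."""

    def get_eef_pose(self, eef_name: str) -> Pose: ...
    def target_pose_to_action(self, eef_name: str, target: Pose) -> Sequence[float]: ...
    def get_object_poses(self) -> Mapping[str, Pose]: ...
    def get_gripper_action(self, action: Sequence[float]) -> float: ...
    def is_success(self) -> bool: ...

REQUIRED_HOOKS = (
    "get_eef_pose",
    "target_pose_to_action",
    "get_object_poses",
    "get_gripper_action",
    "is_success",
)

def missing_hooks(env) -> list:
    """Return the names of required hooks the environment does not implement."""
    return [name for name in REQUIRED_HOOKS if not callable(getattr(env, name, None))]

class HalfFinishedEnv:
    """Toy environment that can report poses and success but cannot convert poses to actions."""

    def get_eef_pose(self, eef_name):
        return [0.0] * 7

    def get_object_poses(self):
        return {"bowl": [0.0] * 7}

    def is_success(self):
        return False

print(missing_hooks(HalfFinishedEnv()))
```

Running a check like this before a large generation job is cheap, and it fails early on exactly the gaps that would otherwise show up as thousands of invalid episodes.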
If you have read our guide "GR00T N1 + G1 data collection in Isaac Lab", treat GR00T-Mimic as the next stage after collecting source demonstrations. For broader Isaac Lab fundamentals, see "Isaac Lab for robot learning from scratch".
RoboCasa: kitchens as a synthetic benchmark
RoboCasa matters because it places synthetic data inside environments that look more like real homes: kitchen scenes. The RoboCasa365 paper presents 365 tasks, 2,500 kitchen environments, more than 600 hours of human demonstrations, and more than 1,600 hours of synthetic demonstrations generated with MimicGen. That scale is directly relevant to the question: what can a generalist robot learn from household simulation?
For beginners, RoboCasa is useful at three layers:
| Layer | What it teaches |
|---|---|
| Scene diversity | Cabinets, sinks, shelves, drawers, and counters are not just background |
| Task diversity | Simple pick-place is not enough to represent household work |
| Benchmark discipline | You need unseen objects, unseen layouts, and unseen tasks to measure generalization |
If you train in one beautiful kitchen scene, the policy may learn shortcuts: countertop color, camera position, or the default bowl location. A RoboCasa-style setup forces scene randomization and evaluation on unseen environments. This is the most valuable part of synthetic data: not rendering images that look real, but creating many task configurations so the model depends less on a single layout.
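Evaluating on unseen environments only works if the split is made by layout, not by episode. The following sketch assumes each episode is a dict with a `layout` key. That schema and the function name `split_by_layout` are hypothetical and are not part of RoboCasa.

```python
import random

def split_by_layout(episodes, holdout_fraction, seed=0):
    """Hold out whole layouts so no test kitchen is ever seen during training."""
    layouts = sorted({ep["layout"] for ep in episodes})
    rng = random.Random(seed)
    rng.shuffle(layouts)
    n_holdout = max(1, round(len(layouts) * holdout_fraction))
    held_out = set(layouts[:n_holdout])
    train = [ep for ep in episodes if ep["layout"] not in held_out]
    test = [ep for ep in episodes if ep["layout"] in held_out]
    return train, test

# Toy dataset: 200 episodes spread evenly over 10 kitchen layouts.
episodes = [{"id": i, "layout": f"kitchen_{i % 10}"} for i in range(200)]
train, test = split_by_layout(episodes, holdout_fraction=0.2)
print(len(train), len(test))
```

A random per-episode split would leak every kitchen into both sets, and the resulting score would mostly measure memorization of countertop colors and default object locations.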
Domain randomization: what should change?
Domain randomization is based on a simple idea: instead of making simulation match reality perfectly, randomize the parameters that may vary in the real world so the policy becomes robust. NVIDIA's Physical AI course lists common groups: object colors, textures, materials, lighting, camera pose, background, plus physics parameters such as mass, friction, restitution, damping, actuator delays, and sensor noise.
A practical starting table:
| Group | Parameter | Starting range |
|---|---|---|
| Object placement | x, y, yaw | Enough to cover the reachable workspace |
| Visual | texture, material, color | Diverse, but keep the object class recognizable |
| Lighting | intensity, direction | Cover normal shadows and glare |
| Camera | extrinsic jitter, FOV | Small for fixed cameras, larger for head cameras |
| Contact | friction, restitution | Use measured values if available |
| Robot | joint damping, delay, action noise | Start small, then increase after the policy works |
| Sensor | RGB noise, depth dropout | Model real camera errors, not only Gaussian noise |
The practical rule: randomize visual factors aggressively first, and randomize physics more carefully. Overly broad visual randomization can make policies conservative, but they usually still learn the task. Wrong physics randomization can produce actions that do not transfer to the real robot, especially in contact-rich tasks.
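That rule can be encoded as a range scale per parameter group. In the sketch below, every range is a made-up placeholder and should be replaced with values measured on your own robot and scene. The `scale` argument, a hypothetical knob, shrinks a range around its midpoint so physics starts narrow and visuals start wide.

```python
import random

# Hypothetical ranges for illustration only; use measured values where available.
VISUAL_RANGES = {
    "light_intensity": (0.3, 1.5),
    "camera_jitter_deg": (-5.0, 5.0),
}
PHYSICS_RANGES = {
    "friction": (0.6, 0.9),
    "mass_kg": (0.25, 0.40),
    "action_delay_ms": (0.0, 30.0),
}

def sample_params(ranges, rng, scale=1.0):
    """Sample each parameter uniformly; scale < 1 shrinks the range around its midpoint."""
    sampled = {}
    for name, (low, high) in ranges.items():
        mid = (low + high) / 2
        half_width = (high - low) / 2 * scale
        sampled[name] = rng.uniform(mid - half_width, mid + half_width)
    return sampled

rng = random.Random(0)
scene_params = {
    **sample_params(VISUAL_RANGES, rng, scale=1.0),    # full visual range from the start
    **sample_params(PHYSICS_RANGES, rng, scale=0.25),  # narrow physics range at first
}
print(scene_params)
```

Once the policy works on the real robot, the physics scale can be raised in steps, with a real-robot check after each increase.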
What sim-real ratio should you use?
There is no universal optimum. The paper Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation shows that co-training with synthetic simulation data can significantly improve real-world policies. The project page summarizes the recipe: use task-aware digital cousins, include diverse objects and placements, keep success criteria aligned with the real task, prefer similar camera viewpoints when possible, and use substantially more simulation than real data while tuning the sampling ratio.
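The sampling ratio is a property of each minibatch, not of the dataset sizes. A minimal sketch of that idea follows, assuming demos are simple list items and sampling is done with replacement. The function name and the batch size are illustrative and are not taken from the co-training paper.

```python
import random

def sample_minibatch(real_demos, sim_demos, batch_size, sim_ratio, rng):
    """Fill a fixed share of each batch from simulation and the rest from real data."""
    n_sim = round(batch_size * sim_ratio)
    batch = rng.choices(sim_demos, k=n_sim) + rng.choices(real_demos, k=batch_size - n_sim)
    rng.shuffle(batch)
    return batch

real_demos = [("real", i) for i in range(20)]
sim_demos = [("sim", i) for i in range(1000)]

rng = random.Random(0)
batch = sample_minibatch(real_demos, sim_demos, batch_size=32, sim_ratio=0.9, rng=rng)
n_sim_in_batch = sum(1 for source, _ in batch if source == "sim")
print(n_sim_in_batch, len(batch) - n_sim_in_batch)
```

With 20 real and 1,000 simulated demos, a 0.9 ratio puts 29 simulated and 3 real samples in a batch of 32. Each individual real demo is therefore still drawn about five times as often as each individual simulated demo, which is why the ratio has to be tuned and cannot be read off the dataset sizes.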
A practical starting point for a small lab:
| Real data available | Sim sampling ratio in minibatches | When to use |
|---|---|---|
| 10-30 real demos | 80-95% sim | New task, very little real data, need a broad prior |
| 50-200 real demos | 60-85% sim | Failure logs exist, need more robustness |
| 200-500 real demos | 40-70% sim | Real data is decent, sim covers edge cases |
| More than 500 real demos | 20-60% sim | Sim should focus on rare or unsafe variations |
This is a heuristic, not a law. The co-training paper uses settings such as 20 real demos and 1,000 simulation demos for CupPnP to study the sampling ratio, and emphasizes that the ratio must be tuned. In practice, run a small sweep:
    experiments:
      - name: real_only
        sim_sampling_ratio: 0.0
      - name: sim25_real75
        sim_sampling_ratio: 0.25
      - name: sim50_real50
        sim_sampling_ratio: 0.50
      - name: sim75_real25
        sim_sampling_ratio: 0.75
      - name: sim90_real10
        sim_sampling_ratio: 0.90
    metrics:
      - real_success_rate_seen_objects
      - real_success_rate_unseen_objects
      - contact_failure_rate
      - recovery_after_misgrasp
      - average_task_time
Do not select the ratio using simulation validation alone. Select it with real robot trials, ideally at least 20-50 trials per condition. If sim90_real10 improves success but makes the robot slow, hesitant, or overly contact-averse, your physics randomization may be too broad.
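The trial count matters because success rates from a few dozen trials are noisy. One common way to see this is a Wilson score interval, sketched below. The choice of this interval and the example counts are assumptions for illustration and are not something the co-training paper prescribes.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% confidence interval for a success rate (Wilson score interval)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width

# Hypothetical results: two sampling ratios, 20 real-robot trials each.
low_a, high_a = wilson_interval(14, 20)  # 70% observed
low_b, high_b = wilson_interval(17, 20)  # 85% observed
print(round(low_a, 2), round(high_a, 2))
print(round(low_b, 2), round(high_b, 2))
```

At 20 trials, the intervals for 70% and 85% overlap heavily, so a 15-point gap is not yet evidence that one ratio is better. More trials per condition, or a larger gap, is needed before committing to a ratio.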
Walkthrough: building a synthetic dataset for a kitchen task
Suppose the task is "pick a can from the counter, place it into a tray, then pull the tray toward the table edge." This is a reasonable MimicGen or DexMimicGen task because it has clear object frames and contact is not too complex.
    Task: can_to_tray_then_pull_tray
    Robot: humanoid upper body, two arms, parallel grippers
    Observation: head RGBD + wrist cameras + proprioception
    Action: left/right end-effector pose delta + gripper command
    Success:
      - can inside tray
      - tray final pose within target zone
      - no can drop
      - no unsafe table collision
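The four success conditions above translate directly into a rule-based filter. The sketch below assumes the simulator can report a small end-of-episode summary. The `FinalState` fields and all geometry constants are hypothetical placeholders.

```python
import math
from dataclasses import dataclass

@dataclass
class FinalState:
    can_offset_from_tray: tuple   # (dx, dy) in metres, can center relative to tray center
    tray_position: tuple          # (x, y) in metres, table frame
    can_dropped: bool             # True if the can ever fell during the episode
    unsafe_table_collision: bool  # True if any contact exceeded the safety limit

# Hypothetical geometry; replace with the real tray and target-zone dimensions.
TRAY_HALF_EXTENTS = (0.15, 0.10)
TARGET_ZONE_CENTER = (0.30, 0.00)
TARGET_ZONE_RADIUS = 0.05

def is_success(state):
    """Apply all four success conditions; every one must hold."""
    dx, dy = state.can_offset_from_tray
    can_in_tray = abs(dx) <= TRAY_HALF_EXTENTS[0] and abs(dy) <= TRAY_HALF_EXTENTS[1]
    tray_in_zone = math.dist(state.tray_position, TARGET_ZONE_CENTER) <= TARGET_ZONE_RADIUS
    return (
        can_in_tray
        and tray_in_zone
        and not state.can_dropped
        and not state.unsafe_table_collision
    )

good = FinalState((0.02, -0.01), (0.32, 0.01), False, False)
tray_short = FinalState((0.02, -0.01), (0.40, 0.00), False, False)
print(is_success(good), is_success(tray_short))
```

Keeping the same thresholds for simulation filtering and real-robot scoring helps keep the two success definitions aligned.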
Implementation steps:
- Collect 20 source demos through teleoperation. Do not optimize for speed. Each demo should succeed, move smoothly, and keep safe clearance from distracting objects.
- Segment the subtasks:

      subtasks:
        left_arm:
          - idle_until_can_in_tray
          - grasp_tray_handle
          - pull_tray_to_target
        right_arm:
          - reach_can
          - grasp_can
          - place_can_in_tray
          - release_can

- Define object frames: can, tray, tray_handle, and target_zone. If the real perception stack cannot estimate these frames, the synthetic pipeline will not transfer well.
- Randomize the scene: can pose, tray pose, can color, table texture, lighting, camera jitter, and friction between tray and table.
- Generate 5,000 candidate episodes. For a bimanual task, you may keep only 1,000-2,000 successful episodes if the success rate is low.
- Train three baselines: real-only, synthetic-only, and sim-real co-training.
- Test on the real robot. Label failures by group: missed grasp, can slip, tray stuck, collision, visual confusion, timeout.
- Add real demonstrations around the failure modes, not just more easy demos.
The last point matters most. Synthetic data is best used as a loop: simulation creates coverage, real testing reveals the gap, simulation ranges or assets are updated, and real demonstrations are added where the simulator cannot model the task.
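Closing the loop starts with counting. A small sketch follows, assuming each real-robot trial has been given exactly one label from the failure groups listed in the steps above, or `"success"`. The trial counts are invented for illustration.

```python
from collections import Counter

def top_failure_modes(trial_labels, top_k=2):
    """Return the most frequent failure labels, ignoring successful trials."""
    counts = Counter(label for label in trial_labels if label != "success")
    return counts.most_common(top_k)

# Hypothetical labels from 20 real-robot trials.
trial_labels = (
    ["success"] * 12
    + ["can_slip"] * 4
    + ["missed_grasp"] * 3
    + ["tray_stuck"] * 1
)
for label, count in top_failure_modes(trial_labels):
    print(label, count)
```

In this made-up log, can slips and missed grasps account for seven of eight failures, so the next batch of real demonstrations and the next simulator updates (for example friction ranges and grasp poses) should target those two modes first.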
Limits: physics fidelity and contact-rich tasks
Synthetic data often fails where the robot has to "feel" the world. Pick-place with rigid objects can work well. Tight cap opening, USB insertion, cable pulling, surface wiping, cloth folding, small-travel buttons, and sliding thin objects into slots are much harder. The reasons are familiar:
| Limit | Result |
|---|---|
| Simplified contact model | The policy learns the wrong contact angle or force |
| Wrong friction | Objects slide or stick differently than in sim |
| Missing actuator delay | Sim motions are too sharp; the real robot overshoots |
| Overly clean sensors | The policy depends on unrealistic depth or segmentation |
| Incorrect object mesh | A grasp looks correct, but collision geometry is wrong |
| Weak deformable simulation | Cables, cloth, bags, and soft packaging transfer poorly |
For contact-rich tasks, reduce expectations for synthetic-only training. Use simulation to learn approach, search, pre-grasp, and variation. Use real data to learn the final contact phase. A practical recipe:
    sim-heavy:
      reach, align, coarse grasp, object search, collision avoidance
    real-heavy:
      insertion, twisting, force closure, sliding contact, tactile correction
    hybrid:
      train on sim + real, but oversample real frames near contact events
If you have tactile or force-torque sensors, log them from the beginning. Synthetic video can look excellent, but contact policies without force signals often remain brittle.
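Oversampling real frames near contact events can be done with per-frame sampling weights. The sketch below assumes contact events are already available as frame indices, for example from a force-torque threshold. The window size and boost factor are arbitrary starting values.

```python
import random

def frame_weights(num_frames, contact_frames, window=5, boost=4.0):
    """Weight 1.0 for ordinary frames and `boost` for frames within `window` of a contact event."""
    weights = [1.0] * num_frames
    for contact in contact_frames:
        start = max(0, contact - window)
        stop = min(num_frames, contact + window + 1)
        for i in range(start, stop):
            weights[i] = boost
    return weights

# One real episode with 20 frames and a contact event at frame 10.
weights = frame_weights(num_frames=20, contact_frames=[10], window=2, boost=4.0)

# Draw training frames in proportion to their weights.
rng = random.Random(0)
sampled = rng.choices(range(20), weights=weights, k=1000)
near_contact = sum(1 for i in sampled if 8 <= i <= 12)
print(weights)
print(near_contact / len(sampled))
```

Here the five frames around contact carry 20 of the 35 total weight units, so they make up a bit more than half of the sampled frames even though they are only a quarter of the episode.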
Synthetic data in the ownership war
In the 2026 humanoid data war, synthetic pipelines create a new ownership layer. Source demos may come from operators. The simulator may come from a vendor. Kitchen assets may have separate licenses. The annotation schema belongs to the robot team. Generated trajectories run on company GPUs. Real evaluation logs may come from customers. Once a model is trained, the value is not in a single file; it is in the whole pipeline.
Teams should manage synthetic data like an engineering product:
    dataset_card:
      source_demos:
        owner: internal_teleop_team
        consent: operator_agreement_v2
      simulator:
        engine: Isaac Lab
        version: pinned
      assets:
        kitchen_scenes: licensed_or_internal
        object_meshes: provenance_tracked
      generation:
        randomization_seed: stored
        success_filter: documented
      mixture:
        real_sampling_ratio: tracked
        sim_sampling_ratio: tracked
      deployment:
        evaluated_on_real_robot: true
        failure_modes: logged
Without provenance tracking, you will not know whether the policy improved because of source demos, scene randomization, real data, or an easier benchmark. That is why the next post, Part 5: VLA Data Scaling, will focus on scaling laws, diversity, and diminishing returns.
Conclusion
Synthetic data pipelines are one of the fastest ways to turn a small set of high-quality demonstrations into a dataset broad enough for humanoid policy learning. MimicGen gives the object-centric recipe for single-arm tasks. DexMimicGen adds two-hand structure, coordination, and ordering. GR00T-Mimic brings the workflow into Isaac Lab and GR00T-style post-training. RoboCasa shows why scene diversity and kitchen benchmarks matter. Domain randomization reduces the gap, but it is not free: randomizing the wrong thing can make the policy slow, conservative, or wrong at contact.
The practical recommendation is simple: start with 20-50 clean real demonstrations, generate 1,000-5,000 synthetic episodes, train real-only and sim-real policies side by side, sweep the simulation sampling ratio, and decide using real-world success rate. Do not let simulation validation replace the real robot. In humanoid robotics, reality is still the final judge.