
Synthetic Data Pipelines: From Sim to Real

How to turn a few human demos into thousands of robot trajectories with MimicGen, GR00T-Mimic, RoboCasa, and sim-real co-training.

Nguyễn Anh Tuấn | June 12, 2026 | 16 min read

Why Part 4 is about synthetic data

The first three posts in this series moved from the ownership map to teleoperation and human video. If you are joining here, start with Part 1: the data war landscape, Part 2: teleoperation, and Part 3: human video mining. Part 4 answers the next practical question: if real robot data is expensive, slow, and hard to scale, how far can simulation amplify it before the model starts learning the wrong thing?

Synthetic data for humanoid robots is not just pretty rendered images. A useful synthetic data pipeline produces trajectories with actions, robot state, camera observations, object poses, physics, success labels, and a format that can be mixed with real data for policy training. For a task like "pick up a cup and place it into a tray", good synthetic data does not merely show the policy more cup colors. It must teach hand trajectories, gripper timing, contact with the tray, collision avoidance, and recovery from different initial placements.

The appeal is scale. MimicGen reports generating more than 50,000 new demonstrations across 18 tasks from about 200 source human demonstrations. DexMimicGen extends the idea to bimanual dexterous manipulation, generating more than 20,000 demonstrations from 60 source demonstrations across 9 tasks. NVIDIA's GR00T-Mimic blueprint pushes the workflow toward industrial scale: 780,000 synthetic trajectories, equivalent to 6,500 hours of demonstrations, generated in 11 hours; when combined with real data, NVIDIA reports a 40% GR00T N1 performance improvement over real-only training.

The warning is just as important: synthetic data does not replace reality. It amplifies good demonstrations, explores many variations, and reduces experiment cost. It does not automatically fix wrong physics, weak contact modeling, missing actuator delay, overly clean sensors, or force-heavy tasks such as insertion, screwing, zipping, wiping, and cable handling. The practical question is not "sim or real?". The better question is: how much simulation, which kind of simulation, and where must real data stay in the loop?

The pipeline from one demo to a dataset

Imagine you have a two-arm humanoid, or a mobile manipulator with two end-effectors. You want to teach it: "pick up a bowl, pour small objects into the bowl, and place the bowl on a tray." A minimal synthetic data pipeline looks like this:

1. Collect source demonstrations
   - sim teleoperation, real robot teleoperation, or retargeted human motion

2. Segment each demo into subtasks
   - reach bowl
   - grasp bowl
   - move to pouring zone
   - pour object
   - place bowl on tray

3. Attach each subtask to an object frame
   - bowl frame, tray frame, cup frame, drawer handle frame

4. Randomize the scene
   - object poses, color, texture, lighting, camera, friction, mass

5. Transform the trajectory
   - preserve end-effector relationships to objects in the new scene

6. Replay in simulation
   - run the controller, check collisions, check task success

7. Filter the data
   - keep successful episodes and discard bad grasps or unsafe collisions

8. Train the policy
   - behavior cloning, diffusion policy, VLA fine-tuning, or sim-real co-training

9. Evaluate on the real robot
   - measure the gap and add real demos around failure modes

For beginners, the core idea is in steps 3 and 5. Synthetic trajectory generation is not "add noise to actions". If the source demo says "move the right hand to absolute position x=0.42, y=-0.10", replaying that in a new scene fails when the bowl is somewhere else. A MimicGen-style pipeline stores relative motion: "move the hand to this pose relative to the bowl." When the bowl is randomized to a new location, the end-effector trajectory is transformed through the new object frame.
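The relative-motion idea can be sketched in a few lines. This is a minimal illustration, assuming planar poses `(x, y, yaw)` on a tabletop instead of full 6-DoF transforms; the helper names `compose`, `inverse`, and `retarget` are hypothetical and do not come from MimicGen.

```python
import math

def compose(a, b):
    """Return pose b, given in frame a, expressed in a's parent frame."""
    ax, ay, ayaw = a
    bx, by, byaw = b
    c, s = math.cos(ayaw), math.sin(ayaw)
    return (ax + c * bx - s * by, ay + s * bx + c * by, ayaw + byaw)

def inverse(p):
    """Invert a planar pose (x, y, yaw)."""
    x, y, yaw = p
    c, s = math.cos(yaw), math.sin(yaw)
    return (-(c * x + s * y), s * x - c * y, -yaw)

def retarget(ee_src, obj_src, obj_new):
    """Keep the hand-to-object relationship and move it to the new object pose."""
    ee_in_obj = compose(inverse(obj_src), ee_src)  # hand pose relative to the bowl
    return compose(obj_new, ee_in_obj)             # same relative pose, new scene

# Source demo: bowl at (0.40, -0.10), hand 2 cm further along x.
bowl_src = (0.40, -0.10, 0.0)
hand_src = (0.42, -0.10, 0.0)
# New scene: the bowl has moved and rotated by 90 degrees.
bowl_new = (0.60, 0.20, math.pi / 2)

hand_new = retarget(hand_src, bowl_src, bowl_new)
print([round(v, 3) for v in hand_new])
```

The retargeted hand stays 2 cm from the bowl along the bowl's own x-axis, which now points along world y. Replaying the absolute pose `(0.42, -0.10)` would have missed the bowl entirely.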

MimicGen: object-centric trajectory transformation

MimicGen is the cleanest example for single-arm manipulation, or tasks that can be described as a clear sequence of subtasks. The paper models each task as a sequence of object-centric subtasks. A source demonstration is split into contiguous segments, and each segment is tied to a reference object. When a new scene is randomized, the system reads the new object pose, computes the transform from the source object frame to the new object frame, applies that transform to the end-effector trajectory, replays it in simulation, and keeps the demonstration only if the task succeeds.

You can read the pipeline as a small program:

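```python
# Sketch only: randomize_scene, sample, inverse, transform_trajectory,
# bridge_from_current_robot_pose, execute_open_loop, success, no_bad_collision,
# and save stand in for simulator- and task-specific code.
for episode_id in range(target_num_episodes):
    # New scene with randomized object poses, cameras, and physics.
    scene = randomize_scene(objects, cameras, physics)
    source_demo = sample(source_demos)
    generated_actions = []

    for subtask in source_demo.subtasks:
        ref_object = scene.object(subtask.object_name)
        # Transform that maps the source object frame onto the new object frame.
        delta = ref_object.pose @ inverse(subtask.source_object_pose)
        # Move the recorded end-effector poses along with the object.
        segment = transform_trajectory(subtask.ee_poses, delta)
        # Connect the robot's current pose to the start of the segment.
        segment = bridge_from_current_robot_pose(segment)
        generated_actions.extend(segment)

    # Replay without feedback, then keep only clean, successful rollouts.
    rollout = execute_open_loop(scene, generated_actions)
    if success(rollout) and no_bad_collision(rollout):
        save(rollout)
```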

The important point is that MimicGen does not require thousands of people driving robots. It needs a smaller set of clean source demonstrations, correct subtask annotations, reliable object poses, and a simulator good enough to replay the generated motion. According to the OpenReview page, the authors generated more than 50,000 demonstrations for 18 tasks from about 200 human demos, then trained imitation learning agents for long-horizon and high-precision tasks such as assembly and coffee preparation.

For a small team, MimicGen is a good fit when the task has these properties:

| Condition | Good for synthetic data? | Why |
|---|---|---|
| Object pose is available from sim or perception | High | Trajectory transforms need reliable object frames |
| The task can be split into clear subtasks | High | Reach, grasp, move, and place are easy to annotate |
| Contact is short and simple | Medium-high | Pick-place is easier than insertion or screwing |
| Objects are deformable or cable-like | Low | Most simulators struggle with deformation fidelity |
| Success can be checked with rules | High | Automatic filtering removes many bad episodes |

A common mistake is scaling too early. If 10 source demos contain poor grasp poses, 10,000 synthetic demos will simply teach the policy the poor behavior faster. Audit the source demonstrations before generation: replay each subtask, inspect cameras, check gripper state, verify object poses, and label failure cases.
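Part of that audit can be automated before any generation run. The sketch below assumes a hypothetical `SourceDemo` record with a success flag and per-subtask step boundaries; the field names are illustrative and not tied to any dataset format. It only checks that the demo succeeded and that its subtask segments are contiguous and cover the whole episode. Grasp quality and camera inspection still need a human.

```python
from dataclasses import dataclass

@dataclass
class SourceDemo:
    name: str
    success: bool
    num_steps: int
    subtask_bounds: list  # [(start, end), ...] step indices, end exclusive

def audit(demo):
    """Return a list of human-readable problems; an empty list means the demo passed."""
    problems = []
    if not demo.success:
        problems.append("demo did not succeed")
    expected_start = 0
    for start, end in demo.subtask_bounds:
        if start != expected_start or end <= start:
            problems.append(f"segment ({start}, {end}) is not contiguous")
        expected_start = end
    if expected_start != demo.num_steps:
        problems.append("segments do not cover the whole demo")
    return problems

good = SourceDemo("demo_00", True, 100, [(0, 40), (40, 70), (70, 100)])
gap = SourceDemo("demo_01", True, 100, [(0, 50), (60, 100)])
for demo in (good, gap):
    print(demo.name, audit(demo) or "ok")
```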

DexMimicGen: when humanoids need two hands

A humanoid is not just a robot arm with legs attached. Many kitchen and warehouse tasks require two hands: one hand stabilizes a tray while the other places a cup; one hand opens a drawer while the other removes an item; both hands lift a box. This is why DexMimicGen exists.

DexMimicGen inherits the core MimicGen idea but adds three subtask types for bimanual dexterous manipulation:

| Subtask type | Example | Main challenge |
|---|---|---|
| Parallel | Each hand picks a different object | The two arms do not need to finish at the same time |
| Coordination | Both hands lift a tray | Relative pose between end-effectors must be preserved |
| Sequential | Pour into a bowl, then place the bowl | One arm may need to wait for the other |

In the paper, DexMimicGen uses per-arm segmentation, asynchronous action queues for parallel subtasks, synchronization for coordination subtasks, and ordering constraints for sequential subtasks. The numbers are important: the team generated about 21,000 demos from 60 source demos across 9 tasks; for a real2sim2real can-sorting task, the final visuomotor policy achieved 90% success, compared with 0% from using only the source demos.

For humanoids, the lesson is not merely "one demo becomes one thousand demos." The real lesson is task structure. If the task needs two arms, a single linear annotation sequence will corrupt the data. You need to know when the arms are independent, when they must stay synchronized, and when one arm must wait. This is also where data ownership becomes interesting: the source demos may be few, but the final value comes from the annotation schema, simulator, controller, and success filter.
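To make the three timing rules concrete, here is a toy scheduler. It is not DexMimicGen's implementation: the phase dictionaries, the `schedule` function, and the step counts are all assumptions made for illustration. Parallel phases let each arm advance on its own clock, coordination phases start both arms together and run them in lockstep, and sequential phases make the second arm wait for the first.

```python
def schedule(phases):
    """Return (arm, phase_name, start_step, end_step) tuples for a two-arm task."""
    clock = {"left": 0, "right": 0}
    timeline = []
    for phase in phases:
        kind, name = phase["kind"], phase["name"]
        if kind == "parallel":
            # Arms are independent; each one advances its own clock.
            for arm, steps in phase["steps"].items():
                timeline.append((arm, name, clock[arm], clock[arm] + steps))
                clock[arm] += steps
        elif kind == "coordination":
            # Both arms start together and stay in lockstep.
            start = max(clock.values())
            end = start + phase["steps"]
            for arm in clock:
                timeline.append((arm, name, start, end))
                clock[arm] = end
        elif kind == "sequential":
            # The second arm waits until the first arm has finished.
            first, second = phase["order"]
            first_end = clock[first] + phase["steps"][first]
            timeline.append((first, name, clock[first], first_end))
            clock[first] = first_end
            start = max(clock[second], first_end)
            end = start + phase["steps"][second]
            timeline.append((second, name, start, end))
            clock[second] = end
        else:
            raise ValueError(f"unknown subtask kind: {kind}")
    return timeline

phases = [
    {"name": "pick_objects", "kind": "parallel", "steps": {"left": 30, "right": 50}},
    {"name": "lift_tray", "kind": "coordination", "steps": 40},
    {"name": "pour_then_place", "kind": "sequential",
     "order": ("right", "left"), "steps": {"right": 20, "left": 10}},
]
for row in schedule(phases):
    print(row)
```

In this example the left arm finishes picking at step 30 but idles until step 50, because the tray lift cannot start before the right arm is ready. A single linear annotation sequence has no way to express that wait.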

GR00T-Mimic inside Isaac Lab

NVIDIA brings this idea closer to a production workflow with Isaac Lab Mimic and GR00T-Mimic. The Isaac Lab documentation describes collecting or loading recorded demonstrations, annotating subtasks, generating additional demonstrations, and training a policy. The docs also show Isaac Lab Mimic support for robots with multiple end-effectors, including a Fourier GR-1 humanoid pick-and-place example. One very practical detail: generated candidate demonstrations do not always succeed. The pipeline must filter them by success criteria.

The GR00T-Mimic flow can be understood like this:

```yaml
source:
  demo_type: teleoperation
  device: Apple Vision Pro, SpaceMouse, keyboard, or prerecorded sample
  format: HDF5 or LeRobot-like episode data

annotation:
  subtasks:
    - left_hand_pick_object
    - right_hand_wait
    - handover_or_place
  method: heuristic_or_manual

generation:
  simulator: Isaac Lab on Isaac Sim
  mimic: transform_and_stitch_subtask_segments
  randomization:
    visual: [lighting, material, camera, background]
    physics: [mass, friction, damping, actuator_delay, sensor_noise]

training:
  policy: BC, diffusion policy, or GR00T-style VLA post-training
  mixture: real + synthetic
```

GR00T-Mimic should not be understood as a magic "generate data" button. The task environment must expose the right APIs: get end-effector pose, convert target pose to action, get object poses, read gripper actions, and define success. Without those functions, the system cannot transform old subtask segments into valid motions in new scenes.
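One way to make that requirement explicit in your own codebase is an abstract interface that every task environment must implement. The sketch below is an assumption for illustration: the class `MimicReadyEnv` and its method names are hypothetical and are not Isaac Lab's actual API, and poses are reduced to `(x, y, z)` positions to keep it short.

```python
from abc import ABC, abstractmethod

class MimicReadyEnv(ABC):
    """Hooks a data-generation pipeline needs from a task environment (hypothetical)."""

    @abstractmethod
    def get_eef_pose(self, arm): ...

    @abstractmethod
    def target_pose_to_action(self, arm, target_pose, gripper_closed): ...

    @abstractmethod
    def get_object_poses(self): ...

    @abstractmethod
    def get_gripper_action(self, action): ...

    @abstractmethod
    def is_success(self): ...

class ToyTabletopEnv(MimicReadyEnv):
    """Toy environment that only tracks positions, for demonstration."""

    def __init__(self):
        self.eef = {"right": (0.0, 0.0, 0.3)}
        self.objects = {"can": (0.4, 0.1, 0.0)}

    def get_eef_pose(self, arm):
        return self.eef[arm]

    def target_pose_to_action(self, arm, target_pose, gripper_closed):
        # Action is a position delta from the current end-effector pose.
        delta = tuple(t - c for t, c in zip(target_pose, self.eef[arm]))
        return {"arm": arm, "delta": delta, "gripper_closed": gripper_closed}

    def get_object_poses(self):
        return dict(self.objects)

    def get_gripper_action(self, action):
        return action["gripper_closed"]

    def is_success(self):
        return False  # no task logic in this toy example

env = ToyTabletopEnv()
action = env.target_pose_to_action("right", env.get_object_poses()["can"], True)
print(action)
```

Because the base class is abstract, an environment that forgets one of the hooks fails at construction time instead of halfway through a generation run.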

If you have read our guide on GR00T N1 + G1 data collection in Isaac Lab, treat GR00T-Mimic as the next stage after collecting source demonstrations. For broader Isaac Lab fundamentals, see our post Isaac Lab for robot learning from scratch.

RoboCasa: kitchens as a synthetic benchmark

RoboCasa matters because it places synthetic data inside environments that look more like real homes: kitchen scenes. The RoboCasa365 paper presents 365 tasks, 2,500 kitchen environments, more than 600 hours of human demonstrations, and more than 1,600 hours of synthetic demonstrations generated with MimicGen. That scale is directly relevant to the question: what can a generalist robot learn from household simulation?

For beginners, RoboCasa is useful at three levels:

| Layer | What it teaches |
|---|---|
| Scene diversity | Cabinets, sinks, shelves, drawers, and counters are not just background |
| Task diversity | Simple pick-place is not enough to represent household work |
| Benchmark discipline | You need unseen objects, unseen layouts, and unseen tasks to measure generalization |

If you train in one beautiful kitchen scene, the policy may learn shortcuts: countertop color, camera position, or the default bowl location. A RoboCasa-style setup forces scene randomization and evaluation on unseen environments. This is the most valuable part of synthetic data: not rendering images that look real, but creating many task configurations so the model depends less on a single layout.
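A simple way to enforce evaluation on unseen environments is to split episodes by layout instead of by episode. The sketch below assumes each episode is a dictionary with a `layout` key; that structure and the function `split_by_layout` are hypothetical, not part of RoboCasa.

```python
import random

def split_by_layout(episodes, holdout_fraction=0.2, seed=0):
    """Hold out whole layouts so no evaluation layout is ever seen in training."""
    layouts = sorted({ep["layout"] for ep in episodes})
    rng = random.Random(seed)
    rng.shuffle(layouts)
    n_holdout = max(1, round(len(layouts) * holdout_fraction))
    holdout = set(layouts[:n_holdout])
    train = [ep for ep in episodes if ep["layout"] not in holdout]
    test = [ep for ep in episodes if ep["layout"] in holdout]
    return train, test

# 10 kitchen layouts with 5 episodes each (synthetic placeholder data).
episodes = [{"layout": f"kitchen_{k}", "episode": e} for k in range(10) for e in range(5)]
train, test = split_by_layout(episodes)
print(len(train), len(test))
```

A random per-episode split would leak every layout into both sets, which hides exactly the shortcut learning described above.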

Domain randomization: what should change?

Domain randomization is based on a simple idea: instead of making simulation match reality perfectly, randomize the parameters that may vary in the real world so the policy becomes robust. NVIDIA's Physical AI course lists common groups: object colors, textures, materials, lighting, camera pose, background, plus physics parameters such as mass, friction, restitution, damping, actuator delays, and sensor noise.

A practical starting table:

| Group | Parameter | Starting range |
|---|---|---|
| Object placement | x, y, yaw | Enough to cover the reachable workspace |
| Visual | texture, material, color | Diverse, but keep the object class recognizable |
| Lighting | intensity, direction | Cover normal shadows and glare |
| Camera | extrinsic jitter, FOV | Small for fixed cameras, larger for head cameras |
| Contact | friction, restitution | Use measured values if available |
| Robot | joint damping, delay, action noise | Start small, then increase after the policy works |
| Sensor | RGB noise, depth dropout | Model real camera errors, not only Gaussian noise |

The practical rule: randomize visual factors more aggressively first, and randomize physics more carefully. Visual randomization that is too broad can make policies conservative, but they often still learn. Physics randomization that is wrong can create actions that do not transfer to the real robot, especially in contact-rich tasks.
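That rule can be encoded as a single knob that shrinks the physics ranges around their midpoints while visual ranges stay at full width. This is a minimal sketch: every range below is a placeholder chosen for illustration, not a recommended or measured value, and the parameter names are hypothetical.

```python
import random

# Placeholder ranges for illustration only; replace with values from your own setup.
VISUAL_RANGES = {
    "light_intensity_scale": (0.4, 1.6),
    "camera_jitter_m": (0.0, 0.03),
}
PHYSICS_RANGES = {
    "tray_table_friction": (0.3, 0.9),
    "object_mass_scale": (0.8, 1.2),
    "action_delay_s": (0.0, 0.04),
}

def sample_range(rng, low, high, scale=1.0):
    """Sample uniformly from a range shrunk around its midpoint by `scale`."""
    mid = (low + high) / 2
    half = (high - low) / 2 * scale
    return rng.uniform(mid - half, mid + half)

def sample_scene_params(rng, physics_scale=0.25):
    """Full-width visual randomization, narrowed physics randomization."""
    params = {k: sample_range(rng, lo, hi) for k, (lo, hi) in VISUAL_RANGES.items()}
    params.update(
        {k: sample_range(rng, lo, hi, physics_scale) for k, (lo, hi) in PHYSICS_RANGES.items()}
    )
    return params

rng = random.Random(0)
print(sample_scene_params(rng))
```

Start with a small `physics_scale` and widen it only after the policy works on the real robot, which matches the "start small, then increase" entry in the table.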

What sim-real ratio should you use?

There is no universal optimum. The paper Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation shows that co-training with synthetic simulation data can significantly improve real-world policies. The project page summarizes the recipe: use task-aware digital cousins, include diverse objects and placements, keep success criteria aligned with the real task, prefer similar camera viewpoints when possible, and use substantially more simulation than real data while tuning the sampling ratio.
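In code, the sampling ratio is usually applied per minibatch, not by concatenating datasets. The sketch below is one simple way to do that, assuming in-memory lists of samples; `sample_minibatch` is a hypothetical helper and not the co-training paper's implementation. It samples with replacement so that a small real dataset can still fill its share of every batch.

```python
import random

def sample_minibatch(real, sim, batch_size, sim_ratio, rng):
    """Draw a batch with a fixed fraction of simulation samples."""
    n_sim = round(batch_size * sim_ratio)
    n_real = batch_size - n_sim
    # Sampling with replacement lets a small real dataset fill its quota.
    batch = rng.choices(sim, k=n_sim) + rng.choices(real, k=n_real)
    rng.shuffle(batch)
    return batch

real = [("real", i) for i in range(20)]    # e.g. 20 real demos
sim = [("sim", i) for i in range(1000)]    # e.g. 1,000 simulation demos
rng = random.Random(0)

batch = sample_minibatch(real, sim, batch_size=64, sim_ratio=0.9, rng=rng)
n_sim_in_batch = sum(1 for source, _ in batch if source == "sim")
print(n_sim_in_batch, len(batch) - n_sim_in_batch)
```

Note that the ratio is independent of dataset size: with naive concatenation, 20 real and 1,000 simulation demos would give real data only about 2% of each batch.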

A practical starting point for a small lab:

| Real data available | Sim sampling ratio in minibatches | When to use |
|---|---|---|
| 10-30 real demos | 80-95% sim | New task, very little real data, need a broad prior |
| 50-200 real demos | 60-85% sim | Failure logs exist, need more robustness |
| 200-500 real demos | 40-70% sim | Real data is decent, sim covers edge cases |
| More than 500 real demos | 20-60% sim | Sim should focus on rare or unsafe variations |

This is a heuristic, not a law. The co-training paper uses settings such as 20 real demos and 1,000 simulation demos for CupPnP to study the sampling ratio, and emphasizes that the ratio must be tuned. In practice, run a small sweep:

```yaml
experiments:
  - name: real_only
    sim_sampling_ratio: 0.0
  - name: sim25_real75
    sim_sampling_ratio: 0.25
  - name: sim50_real50
    sim_sampling_ratio: 0.50
  - name: sim75_real25
    sim_sampling_ratio: 0.75
  - name: sim90_real10
    sim_sampling_ratio: 0.90

metrics:
  - real_success_rate_seen_objects
  - real_success_rate_unseen_objects
  - contact_failure_rate
  - recovery_after_misgrasp
  - average_task_time
```

Do not select the ratio using simulation validation alone. Select it with real robot trials, ideally at least 20-50 trials per condition. If sim90_real10 improves success but makes the robot slow, hesitant, or overly contact-averse, your physics randomization may be too broad.
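The trial count matters because success rates measured on a few trials are noisy. As a rough illustration, the sketch below computes a 95% Wilson score interval for a binomial success rate; the trial counts are made-up examples, and the choice of the Wilson interval is an assumption here, not something the co-training paper prescribes.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# Same 70% observed success rate, different number of real-robot trials.
for successes, trials in [(14, 20), (35, 50)]:
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials}: {low:.2f} to {high:.2f}")
```

With 14 successes in 20 trials the interval spans roughly 48% to 85%, so two conditions that differ by 10 points cannot be separated. At 50 trials the interval narrows, which is why the upper end of the 20-50 range is worth the robot time when conditions are close.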

Walkthrough: building a synthetic dataset for a kitchen task

Suppose the task is "pick a can from the counter, place it into a tray, then pull the tray toward the table edge." This is a reasonable MimicGen or DexMimicGen task because it has clear object frames and contact is not too complex.

```yaml
Task: can_to_tray_then_pull_tray
Robot: humanoid upper body, two arms, parallel grippers
Observation: head RGBD + wrist cameras + proprioception
Action: left/right end-effector pose delta + gripper command
Success:
  - can inside tray
  - tray final pose within target zone
  - no can drop
  - no unsafe table collision
```
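The four success conditions translate directly into a rule-based filter. The sketch below is a simplified assumption: it uses 2D positions on the table plane, a square tray, and placeholder tolerances, and the `FinalState` fields are hypothetical stand-ins for whatever your simulator exposes.

```python
import math
from dataclasses import dataclass

@dataclass
class FinalState:
    can_xy: tuple
    tray_xy: tuple
    can_dropped: bool
    unsafe_table_collision: bool

def check_success(state, target_xy, tray_half_extent=0.10, target_tolerance=0.05):
    """Return (overall_success, per-condition results) for one finished episode."""
    checks = {
        # Can center lies within the tray footprint (square tray assumed).
        "can_inside_tray": all(
            abs(c - t) <= tray_half_extent for c, t in zip(state.can_xy, state.tray_xy)
        ),
        "tray_in_target_zone": math.dist(state.tray_xy, target_xy) <= target_tolerance,
        "no_can_drop": not state.can_dropped,
        "no_unsafe_table_collision": not state.unsafe_table_collision,
    }
    return all(checks.values()), checks

target = (0.30, 0.00)
ok_state = FinalState((0.32, 0.01), (0.31, 0.00), False, False)
bad_state = FinalState((0.55, 0.01), (0.31, 0.00), False, False)
for state in (ok_state, bad_state):
    print(check_success(state, target))
```

Returning the per-condition results, not just a boolean, makes it easier to see which rule rejects most candidate episodes.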

Implementation steps:

  1. Collect 20 source demos through teleoperation. Do not optimize for speed. Each demo should succeed, move smoothly, and keep safe clearance from distracting objects.

  2. Segment the subtasks:

```yaml
subtasks:
  left_arm:
    - idle_until_can_in_tray
    - grasp_tray_handle
    - pull_tray_to_target
  right_arm:
    - reach_can
    - grasp_can
    - place_can_in_tray
    - release_can
```

  3. Define object frames: can, tray, tray_handle, and target_zone. If the real perception stack cannot estimate these frames, the synthetic pipeline will not transfer well.

  4. Randomize the scene: can pose, tray pose, can color, table texture, lighting, camera jitter, and friction between tray and table.

  5. Generate 5,000 candidate episodes. For a bimanual task, you may keep only 1,000-2,000 successful episodes if the success rate is low.

  6. Train three baselines: real-only, synthetic-only, and sim-real co-training.

  7. Test on the real robot. Label failures by group: missed grasp, can slip, tray stuck, collision, visual confusion, timeout.

  8. Add real demonstrations around the failure modes, not just more easy demos.

The last point matters most. Synthetic data is best used as a loop: simulation creates coverage, real testing reveals the gap, simulation ranges or assets are updated, and real demonstrations are added where the simulator cannot model the task.
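A small tally over the labeled real-robot trials is enough to decide where the next real demonstrations should go. The labels below are made-up example data, and `rank_failures` is a hypothetical helper; the failure groups are the ones named in the steps above.

```python
from collections import Counter

FAILURE_GROUPS = {
    "missed_grasp", "can_slip", "tray_stuck", "collision", "visual_confusion", "timeout",
}

def rank_failures(trial_labels):
    """Count failures per group, most frequent first; 'success' trials are ignored."""
    unknown = set(trial_labels) - FAILURE_GROUPS - {"success"}
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(label for label in trial_labels if label != "success")
    return counts.most_common()

# Example labels from 20 real trials (illustrative data only).
labels = ["success"] * 12 + ["can_slip"] * 4 + ["tray_stuck"] * 3 + ["timeout"]
for group, count in rank_failures(labels):
    print(group, count)
```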

Limits: physics fidelity and contact-rich tasks

Synthetic data often fails where the robot has to "feel" the world. Pick-place with rigid objects can work well. Tight cap opening, USB insertion, cable pulling, surface wiping, cloth folding, small-travel buttons, and sliding thin objects into slots are much harder. The reasons are familiar:

| Limit | Result |
|---|---|
| Simplified contact model | The policy learns the wrong contact angle or force |
| Wrong friction | Objects slide or stick differently than in sim |
| Missing actuator delay | Sim motions are too sharp; the real robot overshoots |
| Overly clean sensors | The policy depends on unrealistic depth or segmentation |
| Incorrect object mesh | A grasp looks correct, but collision geometry is wrong |
| Weak deformable simulation | Cables, cloth, bags, and soft packaging transfer poorly |

For contact-rich tasks, reduce expectations for synthetic-only training. Use simulation to learn approach, search, pre-grasp, and variation. Use real data to learn the final contact phase. A practical recipe:

```yaml
sim-heavy:
  reach, align, coarse grasp, object search, collision avoidance

real-heavy:
  insertion, twisting, force closure, sliding contact, tactile correction

hybrid:
  train on sim + real, but oversample real frames near contact events
```

If you have tactile or force-torque sensors, log them from the beginning. Synthetic video can look excellent, but contact policies without force signals often remain brittle.

Synthetic data in the ownership war

In the 2026 humanoid data war, synthetic pipelines create a new ownership layer. Source demos may come from operators. The simulator may come from a vendor. Kitchen assets may have separate licenses. The annotation schema belongs to the robot team. Generated trajectories run on company GPUs. Real evaluation logs may come from customers. Once a model is trained, the value is not in a single file; it is in the whole pipeline.

Teams should manage synthetic data like an engineering product:

```yaml
dataset_card:
  source_demos:
    owner: internal_teleop_team
    consent: operator_agreement_v2
  simulator:
    engine: Isaac Lab
    version: pinned
  assets:
    kitchen_scenes: licensed_or_internal
    object_meshes: provenance_tracked
  generation:
    randomization_seed: stored
    success_filter: documented
  mixture:
    real_sampling_ratio: tracked
    sim_sampling_ratio: tracked
  deployment:
    evaluated_on_real_robot: true
    failure_modes: logged
```
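A dataset card is only useful if it is complete, so it helps to check it automatically before a dataset is released for training. The sketch below assumes the card has already been loaded into a Python dictionary with the same structure as the example above; `missing_fields` is a hypothetical helper.

```python
REQUIRED_FIELDS = {
    "source_demos": ["owner", "consent"],
    "simulator": ["engine", "version"],
    "assets": ["kitchen_scenes", "object_meshes"],
    "generation": ["randomization_seed", "success_filter"],
    "mixture": ["real_sampling_ratio", "sim_sampling_ratio"],
    "deployment": ["evaluated_on_real_robot", "failure_modes"],
}

def missing_fields(card):
    """Return 'section.field' names that are absent or empty in the dataset card."""
    missing = []
    for section, fields in REQUIRED_FIELDS.items():
        values = card.get(section) or {}
        for field in fields:
            if values.get(field) in (None, ""):
                missing.append(f"{section}.{field}")
    return missing

card = {
    "source_demos": {"owner": "internal_teleop_team", "consent": "operator_agreement_v2"},
    "simulator": {"engine": "Isaac Lab"},  # version was not pinned
    "assets": {"kitchen_scenes": "licensed_or_internal", "object_meshes": "provenance_tracked"},
    "generation": {"randomization_seed": "stored", "success_filter": "documented"},
    "mixture": {"real_sampling_ratio": "tracked", "sim_sampling_ratio": "tracked"},
}
print(missing_fields(card))
```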

Without provenance tracking, you will not know whether the policy improved because of source demos, scene randomization, real data, or an easier benchmark. That is why the next post, Part 5: VLA Data Scaling, will focus on scaling laws, diversity, and diminishing returns.

Conclusion

Synthetic data pipelines are the fastest way to turn a small set of high-quality demonstrations into a dataset broad enough for humanoid policy learning. MimicGen gives the object-centric recipe for single-arm tasks. DexMimicGen adds two-hand structure, coordination, and ordering. GR00T-Mimic brings the workflow into Isaac Lab and GR00T-style post-training. RoboCasa shows why scene diversity and kitchen benchmarks matter. Domain randomization reduces the gap, but it is not free: randomizing the wrong thing can make the policy slow, conservative, or wrong at contact.

The practical recommendation is simple: start with 20-50 clean real demonstrations, generate 1,000-5,000 synthetic episodes, train real-only and sim-real policies side by side, sweep the simulation sampling ratio, and decide using real-world success rate. Do not let simulation validation replace the real robot. In humanoid robotics, reality is still the final judge.

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
