VnRobo
AboutPricingBlogContact
🇻🇳VISign InStart Free Trial
🇻🇳VI
VnRobo logo

AI infrastructure for next-generation industrial robots.

Product

  • Features
  • Pricing
  • Knowledge Base
  • Services

Company

  • About Us
  • Blog
  • Contact

Legal

  • Privacy Policy
  • Terms of Service

© 2026 VnRobo. All rights reserved.

Made with ♥ in Vietnam

M3imic: Multimodal WBC for G1

A beginner-friendly guide to M3imic: idea, architecture, Isaac Lab setup, training, inference, and Unitree G1 results.

Nguyễn Anh Tuấn | June 15, 2026 | 14 min read

What problem does M3imic solve?

M3imic, short for Multi-Modal Mimic, is a recent paper on training a versatile whole-body controller for humanoids from heterogeneous motion references. The original paper is M3imic: Learning a Versatile Whole-Body Controller for Multimodal Motion Mimicking, and the official code is available at Renforce-Dynamics/MultiModalWBC. The repository is built on NVIDIA Isaac Sim / Isaac Lab 2.1.1, integrates RSL-RL, and currently targets the 29-DoF Unitree G1 humanoid.

If you have followed our posts on the whole-body VLA open-source landscape, G1 WBC deployment, or ASAP for Unitree G1, M3imic sits in the same practical territory: motion imitation, whole-body reinforcement learning, and sim-to-real transfer. Its key contribution is not another single-modality tracking policy. The main idea is to train one low-level controller that can consume different kinds of motion references through a shared latent command space.

For a beginner, the easiest way to understand M3imic is to ask: what should a humanoid controller use as its command? For locomotion and dance-style motion tracking, dense robot joint trajectories are convenient because every G1 joint has a clear target. For teleoperation, you often only have sparse end-effector poses from mocap or VR: hands, feet, chest, or pelvis. For large human motion datasets, you may have SMPL-X body pose rather than ready-to-use G1 joint angles. All of these signals are useful, but their shapes and semantics are very different.

A common workaround is to convert every motion reference into target robot joints with inverse kinematics. That makes the downstream controller simpler, but it adds an IK dependency during deployment and collapses useful ambiguity in sparse commands. Another approach concatenates all modalities and trains a teacher policy, then distills to a student that can handle missing inputs. That works in some systems, but the multi-stage process introduces distribution mismatch and extra engineering.

M3imic takes a cleaner route. Each modality gets its own encoder. The encoders map robot joint angles, SMPL-X human body pose, and SE(3) end-effector keypoints into a shared 64-dimensional latent space. A single actor policy then consumes robot proprioception plus that latent command and outputs G1 joint position actions. The paper evaluates the method both in simulation and on a real Unitree G1. On the unseen OMOMO test set, the end-effector policy reaches a 98.42% success rate, while the robot-joint policy obtains the best pose and joint-angle tracking accuracy.

M3imic concept: instead of relying on IK or multi-stage distillation, the method learns a shared latent command space for heterogeneous motion references — source: M3imic arXiv paper

Beginner mental model

Imagine that you are teaching a G1 to imitate motion and you have three instructors:

| Modality | Input signal | Why it matters |
| --- | --- | --- |
| Robot joint angles | 29 G1 joint angles | High-fidelity tracking when motions are already retargeted |
| Human body pose | SMPL-X, 21 joints, 6D rotations | Reusing large-scale human motion datasets |
| End-effector poses | 5 SE(3) keypoints: hands, feet, chest | Teleoperation and sparse mocap/VR commands |

A classic low-level controller prefers explicit commands such as "move this joint to this target angle." But a teleoperation system may not know those targets. If a human operator moves the right hand forward, there are many valid whole-body configurations that satisfy the hand motion: the robot can rotate the torso, shift the pelvis, bend the knees, or change stance to remain balanced. This is kinematic redundancy. Sparse commands are less precise at the joint level, but they leave more freedom for the controller to maintain balance.

M3imic turns that difference into an advantage. Robot-joint references provide tracking fidelity. End-effector references provide robustness under distribution shift. Human-pose references help connect large human datasets to the robot. A shared latent space makes these inputs compatible with one policy, so deployment does not require a separate controller for every command type.

In plain terms:

reference sequence
  -> robot encoder / human encoder / keypoint encoder
  -> latent z, 64 dimensions
  -> actor(policy) + robot proprioception
  -> joint position action for Unitree G1

The paper uses a short reference horizon of 10 frames, sampled every 2 steps. The encoder therefore sees a small window of upcoming motion, not just the current frame. This makes the command more like a short motion intent. If you have trained velocity-command locomotion with PPO, think of this as replacing a single velocity target with a compact representation of the next few frames of body motion.
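To make the reference window concrete, here is a minimal sketch of how the frame indices for one window could be computed. The function name is hypothetical, and two details are assumptions rather than facts from the paper: that the window starts at the current frame, and that the last frame is held once the clip runs out.

```python
def reference_window(t: int, num_frames: int, horizon: int = 10, stride: int = 2) -> list[int]:
    """Indices of the reference frames the encoder would see at step t."""
    if num_frames <= 0:
        raise ValueError("clip must contain at least one frame")
    last = num_frames - 1
    # Assumption: start at the current frame, step by `stride`,
    # and repeat the final frame when the clip ends.
    return [min(t + k * stride, last) for k in range(horizon)]


if __name__ == "__main__":
    print(reference_window(0, 100))   # window at the start of a 100-frame clip
    print(reference_window(95, 100))  # window near the end of the clip
```

With these defaults the window always has 10 entries, so the encoder input size stays fixed even at the end of a clip.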

Architecture of M3imic

The full architecture has four important pieces: data preprocessing, multi-modal command encoding, asymmetric actor-critic learning, and adaptive curriculum sampling.

M3imic framework: preprocess motion data, encode multiple modalities into a shared latent command, train one policy, and deploy it with different input modalities — source: M3imic arXiv paper

Data preprocessing uses LAFAN1 and 100STYLE as training datasets. LAFAN1 contains diverse and dynamic human motions, while 100STYLE focuses on many walking styles. The test set is the OMOMO test subset, which contains more complex object-manipulation motions. The paper refines SMPL-X human body models with Blender and uses GMR for retargeting. The public repo also includes CSV/NPZ utilities under scripts/data/, and the README points to a preprocessed ModelScope dataset named seulzx/gae_mimic_dataset.

Multi-modal command encoding uses three encoders and three decoders. The robot input is the 29-dimensional joint-angle vector q_t. The human input is a 21 x 6 SMPL-X body-pose representation, where each joint rotation is represented in 6D. The end-effector input is 5 x 9, combining root-relative 3D positions and 6D rotations for five keypoints. Each encoder maps a short reference sequence to a 64-dimensional latent code, and each decoder reconstructs its modality.
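The per-frame shapes above turn into flat encoder inputs once you multiply by the 10-frame horizon. The short sketch below only does that arithmetic. The class and field names are illustrative, not taken from the repo, but the resulting numbers match the signal dimensions quoted from the repo configuration later in this post.

```python
from dataclasses import dataclass

HORIZON = 10  # reference frames per window, as stated in the text


@dataclass(frozen=True)
class Modality:
    name: str
    elements: int          # joints or keypoints per frame
    dims_per_element: int  # scalars per joint or keypoint

    @property
    def frame_dim(self) -> int:
        return self.elements * self.dims_per_element

    def signal_dim(self, horizon: int = HORIZON) -> int:
        # Flattened size of one reference window for this modality.
        return self.frame_dim * horizon


MODALITIES = [
    Modality("robot", 29, 1),      # 29 joint angles
    Modality("human", 21, 6),      # 21 SMPL-X joints, 6D rotation each
    Modality("keypoints", 5, 9),   # 5 keypoints, 3D position + 6D rotation
]

if __name__ == "__main__":
    for m in MODALITIES:
        print(f"{m.name:10s} frame={m.frame_dim:4d} window={m.signal_dim():5d}")
```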

Latent alignment is the central representation-learning mechanism. The autoencoder objective includes:

| Loss | Practical meaning |
| --- | --- |
| Reconstruction loss | Each encoder-decoder pair must preserve information from its original modality |
| Alignment loss | Latents from robot, human, and keypoint inputs for the same motion should be close |
| Consistency loss | Decoding different modality latents back to robot references should be consistent |

This is why the repo describes whole-body control as a multi-modal sequence alignment problem. The system is not merely training a policy to avoid falling. It also trains the encoders so that different command sources speak a compatible latent language.
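As a rough illustration of how the three loss terms fit together, here is a pure-Python sketch on toy vectors. It is not the paper's formulation: using mean squared error for every term, averaging over modality pairs, and the three weights are all assumptions made for the example.

```python
from itertools import combinations


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def autoencoder_loss(targets, recons, latents, robot_decodes,
                     w_rec=1.0, w_align=1.0, w_cons=1.0):
    """All arguments are dicts keyed by modality name.

    robot_decodes[m] is the robot reference decoded from modality m's latent.
    """
    names = sorted(latents)
    if len(names) < 2:
        raise ValueError("need at least two modalities to align")
    pairs = list(combinations(names, 2))
    # Reconstruction: each decoder reproduces its own modality.
    rec = sum(mse(recons[m], targets[m]) for m in names) / len(names)
    # Alignment: latents of the same motion should be close across modalities.
    align = sum(mse(latents[a], latents[b]) for a, b in pairs) / len(pairs)
    # Consistency: robot references decoded from different latents should agree.
    cons = sum(mse(robot_decodes[a], robot_decodes[b]) for a, b in pairs) / len(pairs)
    total = w_rec * rec + w_align * align + w_cons * cons
    return total, {"reconstruction": rec, "alignment": align, "consistency": cons}


if __name__ == "__main__":
    targets = {"robot": [0.0, 1.0], "human": [1.0], "keypoints": [2.0, 2.0, 2.0]}
    latents = {"robot": [0.0, 0.0], "human": [1.0, 1.0], "keypoints": [0.0, 0.0]}
    decoded = {m: [0.0, 1.0] for m in targets}
    print(autoencoder_loss(targets, targets, latents, decoded))
```

In the toy call only the alignment term is non-zero, because the human latent sits away from the other two.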

Asymmetric actor-critic is a standard but important sim-to-real design. The actor only uses information that can realistically be available during deployment: the latent command, root rotation error, base angular velocity, joint positions, joint velocities, and previous action. It deliberately excludes global root position and linear velocity, because real robots often do not have reliable external localization. The critic, used only during training, receives privileged simulation information such as root position error, body link positions and orientations, and root linear velocity.
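The split between deployable and privileged observations can be sketched as two lists of observation terms. The term names below are hypothetical labels for the quantities listed above, and several sizes are placeholders (the rotation-error size and the number of tracked bodies in particular). Feeding the critic the actor terms plus the privileged ones is also an assumption, so check the task observation code for the real layout.

```python
# Terms the actor may use: everything here should be measurable on the robot.
ACTOR_TERMS = (
    "latent_command", "root_rot_error", "base_ang_vel",
    "joint_pos", "joint_vel", "last_action",
)
# Extra terms only the critic sees during training in simulation.
PRIVILEGED_TERMS = (
    "root_pos_error", "body_link_pos", "body_link_rot", "root_lin_vel",
)


def flatten(state, terms):
    """Concatenate the named observation terms into one flat list."""
    obs = []
    for name in terms:
        obs.extend(state[name])
    return obs


def build_observations(state):
    actor_obs = flatten(state, ACTOR_TERMS)
    # Assumption: the critic sees the actor terms plus the privileged ones.
    critic_obs = actor_obs + flatten(state, PRIVILEGED_TERMS)
    return actor_obs, critic_obs


# Sizes: 64 (latent) and 29 (joints) come from the text; the rest are placeholders.
SIZES = {
    "latent_command": 64, "root_rot_error": 6, "base_ang_vel": 3,
    "joint_pos": 29, "joint_vel": 29, "last_action": 29,
    "root_pos_error": 3, "body_link_pos": 14 * 3, "body_link_rot": 14 * 6,
    "root_lin_vel": 3,
}

if __name__ == "__main__":
    state = {name: [0.0] * n for name, n in SIZES.items()}
    actor_obs, critic_obs = build_observations(state)
    print(len(actor_obs), len(critic_obs))
```

Keeping the two term lists separate makes it easy to audit that nothing privileged, such as root linear velocity, leaks into the deployed policy input.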

Action output is joint position control. The repository defines G1 in whole_body_control/robots/g1.py using Isaac Lab ArticulationCfg and ImplicitActuatorCfg, with joint limits, initial posture, stiffness/damping groups, and action scaling. The GAEMimic_G1FlatEnvCfg task sets pelvis as the anchor body and tracks major links such as pelvis, hip roll, knees, ankle roll, torso, shoulder roll, elbows, and wrist yaw.
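For readers new to joint position control, the following sketch shows the usual chain from a policy action to a motor torque: scale the action, add it to a default posture, and let a PD law track the target. The clamping step and all numeric values are assumptions for illustration. The real gains, limits, and scale live in the repo's G1 configuration.

```python
def joint_targets(action, default_pos, scale, lower, upper):
    """Map raw policy actions to joint position targets (radians)."""
    targets = []
    for a, q0, lo, hi in zip(action, default_pos, lower, upper):
        q = q0 + scale * a                   # offset from the default posture
        targets.append(min(max(q, lo), hi))  # assumption: clamp to joint limits
    return targets


def pd_torque(q_target, q, qd, kp, kd):
    """PD law an implicit actuator effectively applies: stiffness kp, damping kd."""
    return kp * (q_target - q) - kd * qd


if __name__ == "__main__":
    # Two hypothetical joints with made-up limits and gains.
    targets = joint_targets(
        action=[1.0, -4.0], default_pos=[0.1, 0.0], scale=0.25,
        lower=[-1.0, -0.5], upper=[1.0, 0.5],
    )
    print(targets)
    print(pd_torque(targets[0], q=0.0, qd=1.0, kp=100.0, kd=2.0))
```

This is also why a wrong action scale or PD gain is dangerous: the same network output produces a different torque on the robot than it did in training.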

Installing the Isaac Lab environment

The repo expects Linux, Python 3.10, Isaac Sim 4.5.0, Isaac Lab 2.1.1, and a compatible NVIDIA GPU stack. Do not install this into a shared robotics environment that already runs other projects. Create a dedicated environment:

conda create -n env_mimic python=3.10
conda activate env_mimic
pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu128

Install Isaac Sim through NVIDIA's pip index:

pip install 'isaacsim[all,extscache]==4.5.0' --extra-index-url https://pypi.nvidia.com
isaacsim isaacsim.exp.full.kit

Once the Isaac Sim GUI starts correctly, install Isaac Lab at the commit specified in the repo:

git clone https://github.com/isaac-sim/IsaacLab.git
cd IsaacLab
git checkout 90b79bb2d44feb8d833f260f2bf37da3487180ba
./isaaclab.sh -i
./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Ant-v0 --headless

The Ant task is just a smoke test. If it runs, your Isaac Lab Python environment, extensions, simulator launch path, and GPU setup are in reasonable shape.

Then install RSL-RL and the M3imic package:

cd /path/to/IsaacLab
./isaaclab.sh -p -m pip install -e /path/to/MultiModalWBC/third_party/rsl_rl

cd /path/to/MultiModalWBC/source/whole_body_control
pip install -e .

The environment also needs Unitree assets:

git clone https://huggingface.co/datasets/unitreerobotics/unitree_model

Before training, verify asset paths, joint order, and action scale. On a humanoid, a visually small mismatch can make a policy unusable. A missing texture is harmless; a swapped hip joint or a wrong PD gain is not.
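A joint-order check is easy to automate. The sketch below compares the joint names a policy expects with the names an asset or SDK reports, and builds the index map between them. The function names and the joint names in the demo are illustrative, so substitute the lists from your own asset and deployment code.

```python
def joint_permutation(policy_order, asset_order):
    """Return perm such that asset_values[perm[i]] belongs to policy joint i."""
    if len(set(policy_order)) != len(policy_order):
        raise ValueError("duplicate joint names in policy order")
    if len(set(asset_order)) != len(asset_order):
        raise ValueError("duplicate joint names in asset order")
    missing = sorted(set(policy_order) - set(asset_order))
    extra = sorted(set(asset_order) - set(policy_order))
    if missing or extra:
        raise ValueError(f"joint name mismatch: missing={missing}, extra={extra}")
    index = {name: i for i, name in enumerate(asset_order)}
    return [index[name] for name in policy_order]


def reorder(values_in_asset_order, perm):
    """Rearrange per-joint values from asset order into policy order."""
    return [values_in_asset_order[i] for i in perm]


if __name__ == "__main__":
    policy = ["left_hip_pitch", "left_hip_roll", "left_knee"]  # hypothetical names
    asset = ["left_knee", "left_hip_pitch", "left_hip_roll"]
    perm = joint_permutation(policy, asset)
    print(perm, reorder([0.3, 0.1, 0.2], perm))
```

Failing loudly on a name mismatch is deliberate: a silent fallback is exactly how a swapped hip joint reaches hardware.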

Dataset layout

The README points to preprocessed data with SMPL-X and keypoints:

https://www.modelscope.cn/datasets/seulzx/gae_mimic_dataset

Place the unzipped data under datasets/. The package defines two dataset roots:

datasets/npz_datasets
datasets/extended_datasets

The public GAEMimic_G1FlatEnvCfg defaults to:

datasets/extended_datasets/lafan1_dataset
split: train
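Before launching a long run, it can help to confirm that the default dataset directory actually exists and is not empty. This is a small stand-alone check, not a repo utility. The function name is hypothetical, and it makes no assumption about the files inside the dataset beyond "there is at least one".

```python
from pathlib import Path


def check_dataset_root(repo_root, relative="datasets/extended_datasets/lafan1_dataset"):
    """Return (ok, message) for the dataset directory under repo_root."""
    root = Path(repo_root) / relative
    if not root.is_dir():
        return False, f"missing directory: {root}"
    files = [p for p in root.rglob("*") if p.is_file()]
    if not files:
        return False, f"directory is empty: {root}"
    return True, f"{len(files)} files under {root}"


if __name__ == "__main__":
    # Run from the repository root, or pass the path to your clone.
    print(check_dataset_root("."))
```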

There is also a commented path for 100style_dataset. For a first run, keep the dataset small and confirm that the task registry works:

python scripts/tools/list_envs.py

The README lists three task families:

| Task family | Purpose |
| --- | --- |
| TrackingEnvCfg | Single reference motion tracking |
| MultiTracking_TrackingEnvCfg | Multi-motion tracking from several motion clips |
| GAEMimic_TrackingEnvCfg | Multi-modal imitation with robot, SMPL-X, and SE(3) keypoints |

For a basic pipeline test, the README starts with:

python scripts/rsl_rl/train.py \
  --headless \
  --task MultiTracking-Flat-G1-v0

For true M3imic-style multi-modal training, use the GAEMimic task that appears in your local list_envs.py output after installation. Isaac Lab registers tasks dynamically, so always trust the environment list over a copied task name.

Training the policy

The G1 PPO runner in the repo uses:

| Parameter | Repo value |
| --- | --- |
| num_steps_per_env | 24 |
| max_iterations | 30,000 for single tracking, 50,000 for multi/GAEMimic |
| Actor hidden dims | [512, 256, 128] |
| Critic hidden dims | [512, 256, 128] |
| Activation | elu |
| Learning rate | 1e-3, adaptive schedule |
| gamma, lam | 0.99 and 0.95 |
| desired_kl | 0.01 |
| Entropy coefficient | 0.005 |
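The "adaptive schedule" and desired_kl entries work together: the learning rate shrinks when a policy update moves too far from the old policy and grows when it moves too little. The sketch below shows the rule in isolation. The factor of 1.5 and the bounds of 1e-5 and 1e-2 are assumptions modeled on common RSL-RL PPO behavior, so confirm them against the vendored third_party/rsl_rl code.

```python
def adapt_learning_rate(lr, kl_mean, desired_kl=0.01, factor=1.5,
                        lr_min=1e-5, lr_max=1e-2):
    """KL-based learning-rate adaptation for PPO (sketch)."""
    if kl_mean > 2.0 * desired_kl:
        # Update was too aggressive: slow down.
        return max(lr_min, lr / factor)
    if 0.0 < kl_mean < 0.5 * desired_kl:
        # Update was too timid: speed up.
        return min(lr_max, lr * factor)
    return lr


if __name__ == "__main__":
    lr = 1e-3  # starting value from the table above
    for kl in (0.05, 0.03, 0.004, 0.01):
        lr = adapt_learning_rate(lr, kl)
        print(f"kl={kl:.3f} -> lr={lr:.6f}")
```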

The GAEMimic configuration uses RslRl_Triple_AE_PPOPolicyCfg. The important signal dimensions are:

robot signal dim:     290   # 29 joints x 10 frames
human signal dim:     1260  # 126 x 10 frames
keypoints signal dim: 450   # 5 keypoints x 9 dims x 10 frames
latent dim:           64

In the public config, activate_signals="robot" is annotated as using robot signals for zero-shot training. If you want to switch modalities, inspect actor_sonic.py, actor_critic_triple_ae.py, and the task observation code. Do not assume that changing one string automatically validates a full multi-modal deployment path.

Start with a small debug run:

python scripts/rsl_rl/train.py \
  --headless \
  --task MultiTracking-Flat-G1-v0 \
  --num_envs 256 \
  --max_iterations 1000 \
  --video \
  --video_interval 500

Once the environment is stable, increase to 2048 or 4096 environments, disable video for speed, and train for the intended iteration count. The paper computes simulation metrics from 10,000 simulation steps over 4096 parallel environments, and the ablations train for 50,000 iterations on 4 RTX 4090 GPUs. A single workstation GPU can still be useful, but you should expect longer training and may need fewer environments.
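To get a feel for the scale difference between a debug run and a full run, you can multiply out the rollout size. The helper below is only arithmetic on the values quoted in this post (24 steps per environment per iteration, and the environment and iteration counts). The function name is made up for the example.

```python
def rollout_budget(num_envs, num_steps_per_env, max_iterations):
    """Return (transitions per PPO iteration, transitions for the whole run)."""
    per_iteration = num_envs * num_steps_per_env
    return per_iteration, per_iteration * max_iterations


if __name__ == "__main__":
    runs = {
        "debug": (256, 24, 1_000),
        "full": (4096, 24, 50_000),
    }
    for name, cfg in runs.items():
        per_iter, total = rollout_budget(*cfg)
        print(f"{name:5s} per-iteration={per_iter:,} total={total:,}")
```

The full configuration collects several hundred times more transitions than the debug run, which is why the debug run should only be used to catch setup errors, not to judge policy quality.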

The reward terms include tracking and regularization:

| Reward group | Meaning |
| --- | --- |
| Root velocity/orientation | Track root motion and heading |
| Body position/orientation | Track major link poses |
| Body velocity/angular velocity | Track dynamic motion, not just static pose |
| Action rate penalty | Reduce jerky control |
| Joint limit penalty | Avoid soft joint-limit violations |
| Collision penalty | Penalize unwanted collisions |

The sim-to-real setup uses domain randomization: friction sampled in [0.1, 1.6], random pushes applied as velocity perturbations of roughly [-0.5, 0.5] m/s, base COM and mass changes, default joint-position variation, anchor orientation noise, base angular-velocity noise, joint-position noise, and joint-velocity noise. If your policy only works without randomization, treat it as a simulator result, not a deployment-ready controller.
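If you have not used domain randomization before, the idea is simply to draw new physical parameters for each episode or event. The sketch below samples two of the quantities mentioned above. Uniform sampling and the dictionary layout are assumptions for illustration, and Isaac Lab applies the real randomization through its own event configuration.

```python
import random

FRICTION_RANGE = (0.1, 1.6)   # range quoted in the text
PUSH_VEL_RANGE = (-0.5, 0.5)  # m/s, approximate range quoted in the text


def sample_randomization(rng: random.Random) -> dict:
    """Draw one set of randomized parameters (assumption: uniform sampling)."""
    return {
        "friction": rng.uniform(*FRICTION_RANGE),
        "push_vel_xy": (
            rng.uniform(*PUSH_VEL_RANGE),
            rng.uniform(*PUSH_VEL_RANGE),
        ),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a debug run is repeatable
    for _ in range(3):
        print(sample_randomization(rng))
```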

Inference and policy export

After training, checkpoints are stored under:

logs/rsl_rl/<experiment_name>/<run_id>/

Run inference with play.py:

python scripts/rsl_rl/play.py \
  --task MultiTracking-Flat-G1-v0 \
  --num_envs 16 \
  --checkpoint /path/to/model.pt \
  --video \
  --video_length 400

The script loads the checkpoint with OnPolicyRunner, obtains the inference policy, and exports both TorchScript and ONNX:

<run_dir>/exported/policy.pt
<run_dir>/exported/policy.onnx

That export is the beginning of deployment, not the end. Before running on a real G1, use a staged validation path:

  1. Sim replay: confirm the policy runs in Isaac Lab without falling, action saturation, or unstable contacts.
  2. Sim-to-sim: test in another simulator if available, and verify joint order, PD gains, action scale, and control frequency.
  3. Low-power hardware test: run standing and small-amplitude motions first, with an emergency stop and full logging.
  4. Teleoperation test: increase command complexity only after sparse commands do not cause drift, crouching, or unstable foot contacts.
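One of the checks in the list above, action saturation, is easy to quantify from a log. The helper below counts how often logged actions sit at or near a limit. The function name, the 95% margin, and the idea of a single symmetric limit are assumptions, so adapt them to however your deployment clips actions.

```python
def saturation_ratio(actions, limit, margin=0.95):
    """Fraction of logged action values with |a| >= margin * limit.

    actions: list of per-step action vectors (lists of floats).
    """
    total = 0
    saturated = 0
    for step in actions:
        for a in step:
            total += 1
            if abs(a) >= margin * limit:
                saturated += 1
    return saturated / total if total else 0.0


if __name__ == "__main__":
    log = [[0.1, 1.0], [0.96, -0.5]]  # toy log: 2 steps, 2 joints
    print(f"saturated fraction: {saturation_ratio(log, limit=1.0):.2f}")
```

A ratio that climbs during a replay is a useful early warning before moving from simulation to the low-power hardware stage.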

The paper reports real-world deployment without collecting teleoperation-specific training data. For in-domain tracking, the robot encoder tracks motions such as dancing, running, and walking. For out-of-domain teleoperation, the team uses an optical motion-capture system to obtain end-effector commands from human operators. The average real-world results are close: the robot-joint policy reports 41.63 mm position error, 0.095 rad angle error, and 0.260 m/s velocity error, while the end-effector policy reports 43.22 mm, 0.105 rad, and 0.268 m/s. The sparse interface is slightly less accurate but much more practical for teleoperation.

Real-world teleoperation with optical motion capture: bending, raising hands, walking, jogging, boxing, squatting, and pushing objects — source: M3imic arXiv paper

Results worth remembering

On the LAFAN1 + 100STYLE training datasets, M3imic reports:

| Method | Success | MPKPE | MPJAE | Velocity error |
| --- | --- | --- | --- | --- |
| HOVER | 87.35% | 128.20 mm | 0.686 rad | 0.481 m/s |
| ExBody2 | 98.12% | 53.25 mm | 0.146 rad | 0.285 m/s |
| OmniH2O | 97.78% | 62.75 mm | 0.154 rad | 0.307 m/s |
| TWIST2 | 98.89% | 51.65 mm | 0.121 rad | 0.267 m/s |
| M3imic | 99.54% | 46.05 mm | 0.112 rad | 0.256 m/s |

On the unseen OMOMO test set:

| Policy | Success | MPKPE | MPJAE | Velocity error |
| --- | --- | --- | --- | --- |
| M3imic robot-joint π^r | 95.98% | 71.52 mm | 0.139 rad | 0.341 m/s |
| M3imic human-pose π^h | 95.23% | 72.21 mm | 0.140 rad | 0.339 m/s |
| M3imic end-effector π^e | 98.42% | 75.52 mm | 0.142 rad | 0.337 m/s |

The practical takeaway is clear: dense robot references improve tracking fidelity, while sparse end-effector references can improve robustness under distribution shift. This matters for teleoperation and VLA-style systems. If a high-level model only knows that a hand should move to a target pose, forcing it through rigid IK into a full-body joint reference may remove useful balance freedom. A learned low-level controller with a softer command interface can be a better abstraction.
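If the MPKPE and MPJAE columns are unfamiliar, the sketch below shows one plausible way to compute them for a single frame. Reading the acronyms as mean per-keypoint position error and mean per-joint angle error is an assumption, as is the angle wrapping, and the paper's exact averaging over frames and episodes is not reproduced here.

```python
import math


def mpkpe_mm(pred, ref):
    """Mean Euclidean distance between matching keypoints, in millimeters.

    pred, ref: lists of (x, y, z) positions in meters.
    """
    distances = [math.dist(p, r) for p, r in zip(pred, ref)]
    return 1000.0 * sum(distances) / len(distances)


def mpjae_rad(pred, ref):
    """Mean absolute joint-angle difference in radians, wrapped to [-pi, pi]."""
    errors = [
        abs(math.atan2(math.sin(p - r), math.cos(p - r)))
        for p, r in zip(pred, ref)
    ]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    pred_kp = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    ref_kp = [(0.0, 0.0, 0.03), (1.0, 0.04, 0.0)]
    print(f"MPKPE: {mpkpe_mm(pred_kp, ref_kp):.1f} mm")
    print(f"MPJAE: {mpjae_rad([0.1, -0.2], [0.0, 0.0]):.3f} rad")
```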

t-SNE visualization of M3imic latent space for different 100STYLE motion categories, showing that the encoder separates motion groups — source: M3imic arXiv paper

Practical checklist

For a small lab reproduction, use this order:

  1. Install Isaac Sim 4.5 and Isaac Lab at the required commit.
  2. Install the M3imic package with pip install -e . from source/whole_body_control.
  3. Run scripts/tools/list_envs.py.
  4. Download the preprocessed dataset and place it under datasets/extended_datasets.
  5. Start with --num_envs 256 --max_iterations 1000.
  6. Record a short video and check that G1 spawns correctly and does not saturate actions.
  7. Scale to 2048-4096 environments when the debug run is stable.
  8. Use play.py to export TorchScript and ONNX.
  9. Move toward sim-to-sim and hardware only after checking joint order, control rate, PD gains, action scale, and emergency-stop procedures.

M3imic is not a shortcut around robotics fundamentals. It is a better way to organize representation learning and RL for whole-body control. The deployment risks remain concrete: wrong joint mapping, simulator delay mismatch, torque limits, network jitter, unsafe reference motions, and insufficient hardware safeguards.

Related Posts

  • Whole-body VLA open-source guide
  • Deploy G1 whole-body controller
  • ASAP: train agile skills for Unitree G1

Nguyễn Anh Tuấn

Robotics & AI Engineer. Building VnRobo — sharing knowledge about robot learning, VLA models, and automation.
