M3imic: Multimodal WBC for G1

What problem does M3imic solve?

M3imic, short for Multi-Modal Mimic, is a recent paper on training a versatile whole-body controller (WBC) for humanoids from heterogeneous motion references. The original paper is "M3imic: Learning a Versatile Whole-Body Controller for Multimodal Motion Mimicking", and the official code is available at Renforce-Dynamics/MultiModalWBC. The repository is built on NVIDIA Isaac Sim / Isaac Lab 2.1.1, integrates RSL-RL, and currently targets the 29-DoF Unitree G1 humanoid.

If you have followed our posts on the whole-body VLA open-source landscape, G1 WBC deployment, or ASAP for Unitree G1, M3imic sits in the same practical territory: motion imitation, whole-body reinforcement learning, and sim-to-real transfer. Its key contribution is not another single-modality tracking policy. The main idea is to train one low-level controller that can consume different kinds of motion references through a shared latent command space.

For a beginner, the easiest way to understand M3imic is to ask: what should a humanoid controller use as its command? For locomotion and dance-style motion tracking, dense robot joint trajectories are convenient because every G1 joint has a clear target. For teleoperation, you often only have sparse end-effector poses from mocap or VR: hands, feet, chest, or pelvis. For large human motion datasets, you may have SMPL-X body pose rather than ready-to-use G1 joint angles. All of these signals are useful, but their shapes and semantics are very different.

A common workaround is to convert every motion reference into target robot joints with inverse kinematics. That makes the downstream controller simpler, but it adds an IK dependency during deployment and collapses useful ambiguity in sparse commands. Another approach concatenates all modalities and trains a teacher policy, then distills to a student that can handle missing inputs. That works in some systems, but the multi-stage process introduces distribution mismatch and extra engineering.

M3imic takes a cleaner route. Each modality gets its own encoder. The encoders map robot joint angles, SMPL-X human body pose, and SE(3) end-effector keypoints into a shared 64-dimensional latent space. A single actor policy then consumes robot proprioception plus that latent command and outputs G1 joint position actions. The paper evaluates the method both in simulation and on a real Unitree G1. On the unseen OMOMO test set, the end-effector policy reaches a 98.42% success rate, while the robot-joint policy obtains the best pose and joint-angle tracking accuracy.

M3imic concept: instead of relying on IK or multi-stage distillation, the method learns a shared latent command space for heterogeneous motion references — source: M3imic arXiv paper

Beginner mental model

Imagine that you are teaching a G1 to imitate motion and you have three instructors:

Modality	Input signal	Why it matters
Robot joint angles	29 G1 joint angles	High-fidelity tracking when motions are already retargeted
Human body pose	SMPL-X, 21 joints, 6D rotations	Reusing large-scale human motion datasets
End-effector poses	5 SE(3) keypoints: hands, feet, chest	Teleoperation and sparse mocap/VR commands

A classic low-level controller prefers explicit commands such as "move this joint to this target angle." But a teleoperation system may not know those targets. If a human operator moves the right hand forward, there are many valid whole-body configurations that satisfy the hand motion: the robot can rotate the torso, shift the pelvis, bend the knees, or change stance to remain balanced. This is kinematic redundancy. Sparse commands are less precise at the joint level, but they leave more freedom for the controller to maintain balance.

M3imic turns that difference into an advantage. Robot-joint references provide tracking fidelity. End-effector references provide robustness under distribution shift. Human-pose references help connect large human datasets to the robot. A shared latent space makes these inputs compatible with one policy, so deployment does not require a separate controller for every command type.

In plain terms:

reference sequence
  -> robot encoder / human encoder / keypoint encoder
  -> latent z, 64 dimensions
  -> actor(policy) + robot proprioception
  -> joint position action for Unitree G1

The paper uses a short reference horizon of 10 frames, sampled every 2 steps. The encoder therefore sees a small window of upcoming motion, not just the current frame. This makes the command more like a short motion intent. If you have trained velocity-command locomotion with PPO, think of this as replacing a single velocity target with a compact representation of the next few frames of body motion.
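
To make the reference window concrete, here is a minimal sketch of how the frame indices for one command could be picked. It assumes the window starts at the current frame, uses a stride of 2 between frames, and clamps to the last frame near the end of a clip. Those three choices and the function name `reference_window` are illustrative assumptions, not details confirmed by the paper or the repo.

```python
def reference_window(t: int, clip_len: int, horizon: int = 10, stride: int = 2) -> list[int]:
    """Return the reference-frame indices that form one command window.

    Assumptions (not confirmed by the paper): the window starts at the
    current frame t and indices past the clip end are clamped to the
    last frame.
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    last = clip_len - 1
    return [min(t + k * stride, last) for k in range(horizon)]


# Mid-clip window: ten frames spaced two steps apart.
mid_window = reference_window(t=0, clip_len=100)
# Window near the clip end: trailing indices repeat the last frame.
end_window = reference_window(t=90, clip_len=100)
print(mid_window)
print(end_window)
```

Clamping is only one way to handle the end of a clip. Resetting the episode or looping the motion are equally plausible, so check the motion loader in the repo before relying on this behavior.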

Architecture of M3imic

The full architecture has four important pieces: data preprocessing, multi-modal command encoding, asymmetric actor-critic learning, and adaptive curriculum sampling.

M3imic framework: preprocess motion data, encode multiple modalities into a shared latent command, train one policy, and deploy it with different input modalities — source: M3imic arXiv paper

Data preprocessing uses LAFAN1 and 100STYLE as training datasets. LAFAN1 contains diverse and dynamic human motions, while 100STYLE focuses on many walking styles. The test set is the OMOMO test subset, which contains more complex object-manipulation motions. The paper refines SMPL-X human body models with Blender and uses GMR for retargeting. The public repo also includes CSV/NPZ utilities under scripts/data/, and the README points to a preprocessed ModelScope dataset named seulzx/gae_mimic_dataset.

Multi-modal command encoding uses three encoders and three decoders. The robot input is the 29-dimensional joint-angle vector q_t. The human input is a 21 x 6 SMPL-X body-pose representation, where each joint rotation is represented in 6D. The end-effector input is 5 x 9, combining root-relative 3D positions and 6D rotations for five keypoints. Each encoder maps a short reference sequence to a 64-dimensional latent code, and each decoder reconstructs its modality.
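
The 9 numbers per keypoint follow from 3 position values plus a 6D rotation. The sketch below shows the commonly used 6D encoding, which keeps the first two columns of the rotation matrix and recovers the third by Gram-Schmidt. Whether the repo stores columns or rows, and in which order, is an assumption here, and the helper names are hypothetical.

```python
import math


def matrix_to_6d(rot):
    """Flatten the first two columns of a 3x3 rotation matrix into 6 numbers."""
    return [rot[r][0] for r in range(3)] + [rot[r][1] for r in range(3)]


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def sixd_to_matrix(d6):
    """Rebuild a rotation matrix from 6 numbers using Gram-Schmidt."""
    a1, a2 = list(d6[:3]), list(d6[3:])
    b1 = _normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])
    # Third column is the cross product of the first two.
    b3 = [
        b1[1] * b2[2] - b1[2] * b2[1],
        b1[2] * b2[0] - b1[0] * b2[2],
        b1[0] * b2[1] - b1[1] * b2[0],
    ]
    return [[b1[r], b2[r], b3[r]] for r in range(3)]


def keypoint_feature(position, rot):
    """Concatenate a root-relative position (3) with a 6D rotation (6)."""
    return list(position) + matrix_to_6d(rot)


# Example: a keypoint rotated 30 degrees about the z axis.
angle = math.radians(30.0)
c, s = math.cos(angle), math.sin(angle)
rot_z = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
feature = keypoint_feature((0.3, -0.2, 0.9), rot_z)
recovered = sixd_to_matrix(feature[3:])
print(len(feature))
```

The 6D form avoids the discontinuities of Euler angles and the sign ambiguity of quaternions, which is the usual reason for feeding it to a neural encoder.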

Latent alignment is the central representation-learning mechanism. The autoencoder objective includes:

Loss	Practical meaning
Reconstruction loss	Each encoder-decoder pair must preserve information from its original modality
Alignment loss	Latents from robot, human, and keypoint inputs for the same motion should be close
Consistency loss	Decoding different modality latents back to robot references should be consistent
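
The three terms in the table can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's exact objective: mean squared error is used for every term, pairs of modalities are averaged uniformly, and no loss weights are applied. The function `autoencoder_losses` and the toy encoders are hypothetical.

```python
from itertools import combinations


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def autoencoder_losses(inputs, encoders, decoders, robot_key="robot"):
    """Compute reconstruction, alignment, and consistency terms for one sample."""
    latents = {name: encoders[name](x) for name, x in inputs.items()}
    pairs = list(combinations(sorted(latents), 2))
    # Reconstruction: each modality must survive its own encoder-decoder pair.
    reconstruction = sum(mse(inputs[n], decoders[n](latents[n])) for n in inputs)
    # Alignment: latents of the same motion should be close across modalities.
    alignment = sum(mse(latents[a], latents[b]) for a, b in pairs) / len(pairs)
    # Consistency: every latent should decode to the same robot reference.
    robot_out = {n: decoders[robot_key](z) for n, z in latents.items()}
    consistency = sum(mse(robot_out[a], robot_out[b]) for a, b in pairs) / len(pairs)
    return {"reconstruction": reconstruction, "alignment": alignment, "consistency": consistency}


# Toy setup: the keypoint encoder is deliberately offset from the other two.
toy_encoders = {
    "robot": lambda x: list(x[:2]),
    "human": lambda x: list(x[:2]),
    "keypoints": lambda x: [v + 0.5 for v in x[:2]],
}
toy_decoders = {
    "robot": lambda z: list(z),
    "human": lambda z: list(z) + [0.0],
    "keypoints": lambda z: [v - 0.5 for v in z],
}
toy_inputs = {"robot": [1.0, 2.0], "human": [1.0, 2.0, 0.0], "keypoints": [1.0, 2.0]}
losses = autoencoder_losses(toy_inputs, toy_encoders, toy_decoders)
print(losses)
```

In the toy setup every modality reconstructs itself perfectly, yet alignment and consistency are non-zero because the keypoint latent sits in a shifted location. That is the failure the last two terms are meant to penalize.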

This is why the repo describes whole-body control as a multi-modal sequence alignment problem. The system is not merely training a policy to avoid falling. It also trains the encoders so that different command sources speak a compatible latent language.

Asymmetric actor-critic is a standard but important sim-to-real design. The actor only uses information that can realistically be available during deployment: the latent command, root rotation error, base angular velocity, joint positions, joint velocities, and previous action. It deliberately excludes global root position and linear velocity, because real robots often do not have reliable external localization. The critic, used only during training, receives privileged simulation information such as root position error, body link positions and orientations, and root linear velocity.
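
A small sketch helps keep the two observation sets apart. The term names mirror the paragraph above, but the per-term sizes involve assumptions: rotations are taken as 6D, the critic is assumed to also receive the actor terms, and 14 tracked bodies is a guess based on the link list in the task config. None of these sizes are read from the repo.

```python
NUM_JOINTS = 29
LATENT_DIM = 64

# Deployment-safe terms seen by the actor.
ACTOR_TERMS = {
    "latent_command": LATENT_DIM,
    "root_rotation_error": 6,  # assumption: 6D rotation encoding
    "base_angular_velocity": 3,
    "joint_positions": NUM_JOINTS,
    "joint_velocities": NUM_JOINTS,
    "previous_action": NUM_JOINTS,
}


def critic_terms(num_tracked_bodies: int) -> dict[str, int]:
    """Actor terms plus privileged simulation-only terms (assumed layout)."""
    terms = dict(ACTOR_TERMS)
    terms.update(
        {
            "root_position_error": 3,
            "root_linear_velocity": 3,
            "body_link_positions": 3 * num_tracked_bodies,
            "body_link_orientations": 6 * num_tracked_bodies,  # assumption: 6D
        }
    )
    return terms


def flatten(values: dict[str, list[float]], terms: dict[str, int]) -> list[float]:
    """Concatenate named observation chunks in a fixed order, checking sizes."""
    obs: list[float] = []
    for name, size in terms.items():
        chunk = values[name]
        if len(chunk) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(chunk)}")
        obs.extend(chunk)
    return obs


actor_dim = sum(ACTOR_TERMS.values())
critic_dim = sum(critic_terms(num_tracked_bodies=14).values())
actor_obs = flatten({k: [0.0] * n for k, n in ACTOR_TERMS.items()}, ACTOR_TERMS)
print(actor_dim, critic_dim)
```

Keeping the actor terms in one explicit table makes it easy to audit that nothing privileged, such as root linear velocity, leaks into the deployed policy.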

Action output is joint position control. The repository defines G1 in whole_body_control/robots/g1.py using Isaac Lab ArticulationCfg and ImplicitActuatorCfg, with joint limits, initial posture, stiffness/damping groups, and action scaling. The GAEMimic_G1FlatEnvCfg task sets pelvis as the anchor body and tracks major links such as pelvis, hip roll, knees, ankle roll, torso, shoulder roll, elbows, and wrist yaw.
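
Joint position control with action scaling is usually implemented as an offset from a default posture, followed by a PD law in the actuator. The sketch below assumes that convention. The gains, scale, and posture values are placeholders, not the numbers from whole_body_control/robots/g1.py.

```python
def joint_targets(actions, default_pos, scale):
    """Map policy actions to joint position targets: q* = q_default + scale * a."""
    if len(actions) != len(default_pos):
        raise ValueError("actions and default_pos must have the same length")
    return [q0 + scale * a for a, q0 in zip(actions, default_pos)]


def pd_torques(targets, positions, velocities, stiffness, damping):
    """PD law: tau = kp * (q* - q) - kd * qdot, evaluated per joint."""
    return [
        kp * (qt - q) - kd * qd
        for qt, q, qd, kp, kd in zip(targets, positions, velocities, stiffness, damping)
    ]


# Two-joint toy example with placeholder values.
targets = joint_targets(actions=[1.0, -1.0], default_pos=[0.1, 0.2], scale=0.5)
torques = pd_torques(
    targets,
    positions=[0.1, 0.2],
    velocities=[0.0, 1.0],
    stiffness=[100.0, 40.0],
    damping=[2.0, 1.0],
)
print(targets, torques)
```

The same three quantities (default posture, action scale, PD gains) must match between simulation and the real robot, which is why later sections keep returning to them.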

Installing the Isaac Lab environment

The repo expects Linux, Python 3.10, Isaac Sim 4.5.0, Isaac Lab 2.1.1, and a compatible NVIDIA GPU stack. Do not install this into a shared robotics environment that already runs other projects. Create a dedicated environment:

conda create -n env_mimic python=3.10
conda activate env_mimic
pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu128

Install Isaac Sim through NVIDIA's pip index:

pip install 'isaacsim[all,extscache]==4.5.0' --extra-index-url https://pypi.nvidia.com
isaacsim isaacsim.exp.full.kit

Once the Isaac Sim GUI starts correctly, install Isaac Lab at the commit specified in the repo:

git clone https://github.com/isaac-sim/IsaacLab.git
cd IsaacLab
git checkout 90b79bb2d44feb8d833f260f2bf37da3487180ba
./isaaclab.sh -i
./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Ant-v0 --headless

The Ant task is just a smoke test. If it runs, your Isaac Lab Python environment, extensions, simulator launch path, and GPU setup are in reasonable shape.

Then install RSL-RL and the M3imic package:

cd /path/to/IsaacLab
./isaaclab.sh -p -m pip install -e /path/to/MultiModalWBC/third_party/rsl_rl

cd /path/to/MultiModalWBC/source/whole_body_control
pip install -e .

The environment also needs Unitree assets:

git clone https://huggingface.co/datasets/unitreerobotics/unitree_model

Before training, verify asset paths, joint order, and action scale. On a humanoid, a visually small mismatch can make a policy unusable. A missing texture is harmless; a swapped hip joint or a wrong PD gain is not.
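
One way to catch a swapped joint early is to compare joint names instead of trusting indices. The following sketch builds the index mapping between two orderings and fails loudly on any mismatch. The joint names are illustrative and the function `joint_permutation` is hypothetical, so substitute the lists reported by your simulator and your deployment stack.

```python
def joint_permutation(sim_order: list[str], robot_order: list[str]) -> list[int]:
    """Return idx such that robot_vector[i] == sim_vector[idx[i]]."""
    if len(set(sim_order)) != len(sim_order):
        raise ValueError("duplicate joint names in simulator order")
    missing = set(robot_order) - set(sim_order)
    extra = set(sim_order) - set(robot_order)
    if missing or extra or len(sim_order) != len(robot_order):
        raise ValueError(f"joint sets differ: missing={sorted(missing)}, extra={sorted(extra)}")
    lookup = {name: i for i, name in enumerate(sim_order)}
    return [lookup[name] for name in robot_order]


# Illustrative names only; real lists have 29 entries for the G1.
sim_joints = ["left_hip_pitch_joint", "right_hip_pitch_joint", "left_knee_joint"]
robot_joints = ["left_hip_pitch_joint", "left_knee_joint", "right_hip_pitch_joint"]
perm = joint_permutation(sim_joints, robot_joints)

sim_action = [0.1, 0.2, 0.3]
robot_action = [sim_action[i] for i in perm]
print(perm, robot_action)
```

Running this check once at startup costs nothing and turns a silent ordering bug into an immediate error.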

Dataset layout

The README points to preprocessed data with SMPL-X and keypoints:

https://www.modelscope.cn/datasets/seulzx/gae_mimic_dataset

Place the unzipped data under datasets/. The package defines two dataset roots:

datasets/npz_datasets
datasets/extended_datasets

The public GAEMimic_G1FlatEnvCfg defaults to:

datasets/extended_datasets/lafan1_dataset
split: train

There is also a commented-out path for 100style_dataset. For a first run, keep the dataset small and confirm that the task registry works:

python scripts/tools/list_envs.py
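
Before launching training, it can help to confirm that the dataset landed where the config expects it. The sketch below assumes the split is a subdirectory of the dataset root and that motions are stored as .npz files. Both are assumptions about the layout, and `summarize_dataset` is a hypothetical helper. The demo builds a throwaway directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def summarize_dataset(root: str, split: str = "train", pattern: str = "*.npz") -> dict:
    """Count motion files under <root>/<split>; raise if the folder is missing."""
    split_dir = Path(root) / split
    if not split_dir.is_dir():
        raise FileNotFoundError(f"expected dataset split at {split_dir}")
    files = sorted(split_dir.rglob(pattern))
    return {"split_dir": str(split_dir), "num_files": len(files)}


# Demo on a temporary directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp) / "datasets" / "extended_datasets" / "lafan1_dataset"
    (demo_root / "train").mkdir(parents=True)
    for name in ("walk_01.npz", "dance_01.npz"):
        (demo_root / "train" / name).touch()
    summary = summarize_dataset(str(demo_root))
    print(summary["num_files"])
```

On a real checkout, point `root` at datasets/extended_datasets/lafan1_dataset and adjust `split` and `pattern` to whatever the unzipped archive actually contains.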

The README lists three task families:

Task family	Purpose
`TrackingEnvCfg`	Single reference motion tracking
`MultiTracking_TrackingEnvCfg`	Multi-motion tracking from several motion clips
`GAEMimic_TrackingEnvCfg`	Multi-modal imitation with robot, SMPL-X, and SE(3) keypoints

For a basic pipeline test, the README starts with:

python scripts/rsl_rl/train.py \
  --headless \
  --task MultiTracking-Flat-G1-v0

For true M3imic-style multi-modal training, use the GAEMimic task that appears in your local list_envs.py output after installation. Isaac Lab registers tasks dynamically, so always trust the environment list over a copied task name.

Training the policy

The G1 PPO runner in the repo uses:

Parameter	Repo value
`num_steps_per_env`	24
`max_iterations`	30,000 for single tracking, 50,000 for multi/GAEMimic
Actor hidden dims	`[512, 256, 128]`
Critic hidden dims	`[512, 256, 128]`
Activation	`elu`
Learning rate	`1e-3`, adaptive schedule
`gamma`, `lam`	0.99 and 0.95
`desired_kl`	0.01
Entropy coefficient	0.005
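
The adaptive schedule ties the learning rate to `desired_kl`. The sketch below follows the rule used in RSL-RL's PPO as I understand it: shrink the rate when the measured KL is more than twice the target and grow it when the KL is less than half. The factor 1.5 and the clamps at 1e-5 and 1e-2 should be checked against the vendored copy in third_party/rsl_rl.

```python
def adapt_learning_rate(lr: float, kl: float, desired_kl: float = 0.01) -> float:
    """KL-based learning-rate adaptation (constants are assumptions to verify)."""
    if kl > 2.0 * desired_kl:
        return max(1e-5, lr / 1.5)
    if 0.0 < kl < 0.5 * desired_kl:
        return min(1e-2, lr * 1.5)
    return lr


lr_too_aggressive = adapt_learning_rate(1e-3, kl=0.03)   # KL too high: step down
lr_too_cautious = adapt_learning_rate(1e-3, kl=0.001)    # KL too low: step up
lr_in_band = adapt_learning_rate(1e-3, kl=0.01)          # within band: unchanged
print(lr_too_aggressive, lr_too_cautious, lr_in_band)
```

In practice this means the configured `1e-3` is only a starting point, and the logged learning rate is a useful health signal during training.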

The GAEMimic configuration uses RslRl_Triple_AE_PPOPolicyCfg. The important signal dimensions are:

robot signal dim:     290   # 29 joints x 10 frames
human signal dim:     1260  # 126 x 10 frames
keypoints signal dim: 450   # 5 keypoints x 9 dims x 10 frames
latent dim:           64
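
These numbers can be rederived from the per-frame sizes given in the architecture section, which is a quick sanity check when changing the horizon. The sketch below only restates that arithmetic. The dictionary names are illustrative.

```python
HORIZON = 10  # reference frames per command window

# Per-frame feature sizes taken from the modality descriptions.
PER_FRAME_DIMS = {
    "robot": 29,          # joint angles
    "human": 21 * 6,      # SMPL-X joints x 6D rotation
    "keypoints": 5 * 9,   # keypoints x (3D position + 6D rotation)
}

signal_dims = {name: dim * HORIZON for name, dim in PER_FRAME_DIMS.items()}
print(signal_dims)
```

If the horizon or the keypoint set changes, every encoder input size changes with it, so the config values and the observation code must be updated together.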

In the public config, activate_signals="robot" carries a comment saying that robot signals are used for zero-shot training. If you want to switch modalities, inspect actor_sonic.py, actor_critic_triple_ae.py, and the task observation code. Do not assume that changing one string automatically validates a full multi-modal deployment path.

Start with a small debug run:

python scripts/rsl_rl/train.py \
  --headless \
  --task MultiTracking-Flat-G1-v0 \
  --num_envs 256 \
  --max_iterations 1000 \
  --video \
  --video_interval 500

Once the environment is stable, increase to 2048 or 4096 environments, disable video for speed, and train for the intended iteration count. The paper computes simulation metrics from 10,000 simulation steps over 4096 parallel environments, and the ablations train for 50,000 iterations on 4 RTX 4090 GPUs. A single workstation GPU can still be useful, but you should expect longer training and may need fewer environments.
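
The number of environments directly sets how much data each PPO update sees. A rough calculation, assuming the batch is simply `num_envs` times `num_steps_per_env` and ignoring how environments are split across GPUs:

```python
def rollout_size(num_envs: int, num_steps_per_env: int = 24) -> int:
    """Transitions collected per PPO iteration."""
    return num_envs * num_steps_per_env


debug_batch = rollout_size(256)       # small debug run
full_batch = rollout_size(4096)       # large run
total_transitions = full_batch * 50_000  # over 50,000 iterations
print(debug_batch, full_batch, total_transitions)
```

A debug run with 256 environments sees 16 times less data per update than a 4096-environment run, so its learning curves are only a smoke test and not a preview of final performance.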

The reward terms include tracking and regularization:

Reward group	Meaning
Root velocity/orientation	Track root motion and heading
Body position/orientation	Track major link poses
Body velocity/angular velocity	Track dynamic motion, not just static pose
Action rate penalty	Reduce jerky control
Joint limit penalty	Avoid soft joint-limit violations
Collision penalty	Penalize unwanted collisions
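
Tracking terms in motion-imitation rewards are often shaped with an exponential kernel, while regularizers are plain penalties. The sketch below assumes that shaping. The kernel form, the sigma, and the weights are illustrative and are not values taken from the repo's reward config.

```python
import math


def tracking_reward(squared_error: float, sigma: float) -> float:
    """Exponential kernel: 1.0 at zero error, decaying toward 0 as error grows."""
    return math.exp(-squared_error / (sigma * sigma))


def action_rate_penalty(action, prev_action) -> float:
    """Sum of squared action differences between consecutive steps."""
    return sum((a - b) ** 2 for a, b in zip(action, prev_action))


# Hypothetical weights and errors for a single step.
body_pos_term = tracking_reward(squared_error=0.09, sigma=0.3)
smoothness = action_rate_penalty([0.2, -0.1], [0.0, 0.1])
step_reward = 1.0 * body_pos_term - 0.1 * smoothness
print(body_pos_term, smoothness, step_reward)
```

The kernel keeps each tracking term bounded in (0, 1], which makes the relative weights easier to reason about than raw errors with different units.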

The sim-to-real setup uses domain randomization: friction in [0.1, 1.6], robot pushes up to roughly [-0.5, 0.5] m/s, base COM and mass changes, default joint-position variation, anchor orientation noise, base angular-velocity noise, joint-position noise, and joint-velocity noise. If your policy only works without randomization, treat it as a simulator result, not a deployment-ready controller.
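
As a rough picture of what randomization means per episode, the sketch below samples the two quantities whose ranges are stated above. Uniform sampling and the dictionary layout are assumptions, and the other randomized terms are left out because the text gives no ranges for them.

```python
import random

FRICTION_RANGE = (0.1, 1.6)     # from the text
PUSH_VEL_RANGE = (-0.5, 0.5)    # m/s, from the text


def sample_randomization(rng: random.Random) -> dict:
    """Draw one set of episode parameters (uniform sampling is an assumption)."""
    return {
        "friction": rng.uniform(*FRICTION_RANGE),
        "push_velocity_xy": (
            rng.uniform(*PUSH_VEL_RANGE),
            rng.uniform(*PUSH_VEL_RANGE),
        ),
    }


rng = random.Random(0)  # seeded for reproducible debugging
samples = [sample_randomization(rng) for _ in range(1000)]
print(samples[0])
```

Seeding the generator during debugging makes a failing episode reproducible, which is hard to achieve once randomization is fully enabled.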

Inference and policy export

After training, checkpoints are stored under:

logs/rsl_rl/<experiment_name>/<run_id>/

Run inference with play.py:

python scripts/rsl_rl/play.py \
  --task MultiTracking-Flat-G1-v0 \
  --num_envs 16 \
  --checkpoint /path/to/model.pt \
  --video \
  --video_length 400

The script loads the checkpoint with OnPolicyRunner, obtains the inference policy, and exports both TorchScript and ONNX:

<run_dir>/exported/policy.pt
<run_dir>/exported/policy.onnx

That export is the beginning of deployment, not the end. Before running on a real G1, use a staged validation path:

1. Sim replay: confirm the policy runs in Isaac Lab without falling, action saturation, or unstable contacts.
2. Sim-to-sim: test in another simulator if available, and verify joint order, PD gains, action scale, and control frequency.
3. Low-power hardware test: run standing and small-amplitude motions first, with an emergency stop and full logging.
4. Teleoperation test: increase command complexity only after sparse commands do not cause drift, crouching, or unstable foot contacts.
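
Action saturation in the first stage can be quantified from a logged rollout. The sketch below counts how often any action component reaches a chosen magnitude. The threshold of 1.0 and the log format (a list of per-step action vectors) are assumptions, so use whatever clipping range your exported policy and action scale imply.

```python
def saturation_fraction(action_log, threshold: float) -> float:
    """Fraction of logged action components with magnitude >= threshold."""
    total = 0
    saturated = 0
    for step in action_log:
        for value in step:
            total += 1
            if abs(value) >= threshold:
                saturated += 1
    if total == 0:
        raise ValueError("action_log is empty")
    return saturated / total


# Two logged steps with two joints each (toy data).
toy_log = [[0.1, 1.0], [-1.2, 0.3]]
fraction = saturation_fraction(toy_log, threshold=1.0)
print(fraction)
```

A fraction that stays near zero during replay is a reasonable gate before moving on to sim-to-sim tests, and a rising fraction usually points to a gain or scale mismatch.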

The paper reports real-world deployment without collecting teleoperation-specific training data. For in-domain tracking, the robot encoder tracks motions such as dancing, running, and walking. For out-of-domain teleoperation, the team uses an optical motion-capture system to obtain end-effector commands from human operators. The average real-world results are close: the robot-joint policy reports 41.63 mm MPKPE, 0.095 rad MPJAE, and 0.260 m/s velocity error, while the end-effector policy reports 43.22 mm, 0.105 rad, and 0.268 m/s on the same three metrics. The sparse interface is slightly less accurate but much more practical for teleoperation.

Real-world teleoperation with optical motion capture: bending, raising hands, walking, jogging, boxing, squatting, and pushing objects — source: M3imic arXiv paper

Results worth remembering

On the LAFAN1 + 100STYLE training datasets, M3imic reports:

Method	Success	MPKPE	MPJAE	Velocity error
HOVER	87.35%	128.20 mm	0.686 rad	0.481 m/s
ExBody2	98.12%	53.25 mm	0.146 rad	0.285 m/s
OmniH2O	97.78%	62.75 mm	0.154 rad	0.307 m/s
TWIST2	98.89%	51.65 mm	0.121 rad	0.267 m/s
M3imic	99.54%	46.05 mm	0.112 rad	0.256 m/s

On the unseen OMOMO test set:

Policy	Success	MPKPE	MPJAE	Velocity error
M3imic robot-joint `pi^r`	95.98%	71.52 mm	0.139 rad	0.341 m/s
M3imic human-pose `pi^h`	95.23%	72.21 mm	0.140 rad	0.339 m/s
M3imic end-effector `pi^e`	98.42%	75.52 mm	0.142 rad	0.337 m/s
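
For readers new to these columns, the sketch below shows one plausible reading of the two pose metrics: MPKPE as the mean Euclidean keypoint position error in millimetres, and MPJAE as the mean absolute joint-angle error in radians. The exact definitions in the paper (for example global versus root-relative positions) may differ, so treat this as an illustration of the units and not as the evaluation code.

```python
import math


def mpkpe_mm(pred_xyz, ref_xyz) -> float:
    """Mean per-keypoint position error, inputs in metres, output in mm."""
    dists = [math.dist(p, r) for p, r in zip(pred_xyz, ref_xyz)]
    return 1000.0 * sum(dists) / len(dists)


def mpjae_rad(pred_q, ref_q) -> float:
    """Mean per-joint absolute angle error in radians (no angle wrapping)."""
    return sum(abs(p - r) for p, r in zip(pred_q, ref_q)) / len(pred_q)


# Toy frame: two keypoints and three joints.
pos_err = mpkpe_mm([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 0.03), (1.0, 0.04, 0.0)])
ang_err = mpjae_rad([0.10, -0.20, 0.50], [0.00, -0.10, 0.40])
print(pos_err, ang_err)
```

Read this way, a gap of a few millimetres between policies in the table is small compared with the gap in success rate, which is the more important number for teleoperation.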

The practical takeaway is clear: dense robot references improve tracking fidelity, while sparse end-effector references can improve robustness under distribution shift. This matters for teleoperation and VLA-style systems. If a high-level model only knows that a hand should move to a target pose, forcing it through rigid IK into a full-body joint reference may remove useful balance freedom. A learned low-level controller with a softer command interface can be a better abstraction.

t-SNE visualization of M3imic latent space for different 100STYLE motion categories, showing that the encoder separates motion groups — source: M3imic arXiv paper

Practical checklist

For a small lab reproduction, use this order:

1. Install Isaac Sim 4.5.0 and Isaac Lab at the required commit.
2. Install RSL-RL from third_party/rsl_rl and the M3imic package, both in editable mode (pip install -e).
3. Run scripts/tools/list_envs.py.
4. Download the preprocessed dataset and place it under datasets/extended_datasets.
5. Start with --num_envs 256 --max_iterations 1000.
6. Record a short video and check that G1 spawns correctly and does not saturate actions.
7. Scale to 2048-4096 environments when the debug run is stable.
8. Use play.py to export TorchScript and ONNX.
9. Move toward sim-to-sim and hardware only after checking joint order, control rate, PD gains, action scale, and emergency-stop procedures.

M3imic is not a shortcut around robotics fundamentals. It is a better way to organize representation learning and RL for whole-body control. The deployment risks remain concrete: wrong joint mapping, simulator delay mismatch, torque limits, network jitter, unsafe reference motions, and insufficient hardware safeguards.

Nguyễn Anh Tuấn
