Whole-body VLA: combining UMI + mocap/VR for full-body humanoid data

This is the final post in the UMI + VLA series. Earlier posts (2–5) took you from printing a gripper to a working bimanual policy. This post extends to the bigger question: how do you collect data for a humanoid that needs to both walk and manipulate?

Before reading: This is not a step-by-step tutorial. No turnkey pipeline for whole-body loco-manipulation data collection exists at the time of writing. This is an architecture and design principles post — to help you understand the landscape, choose a direction, and avoid critical design mistakes.

The problem: why upper-body UMI isn't enough for humanoids

For a standard robot arm (Franka, UR5), UMI is sufficient: collect hand demos, train a policy, deploy to the arm. Humanoids are more complex because they have three coupled problems:

Problem 1: Manipulation (arms + hands)
   → UMI handles this well

Problem 2: Locomotion + balance (legs + torso)
   → UMI does not capture this
   
Problem 3: Coordination (whole body moving while arms manipulate)
   → This is the hardest part; no one fully solves it yet

When a humanoid needs to walk to a cabinet, open a drawer, and pick up an object — the policy needs to know where to place the feet, how to tilt the torso, and how to extend the arms. These three actions are mutually dependent and must be captured simultaneously.

Current landscape: who's doing what

HumanPlus (Stanford, 2024)

Approach: A single RGB camera tracks the operator's body and hands, and the estimated poses are retargeted to the humanoid in real time ("shadowing"). The shadowed demonstrations are then used to train a whole-body imitation policy.

Strengths: Natural data collection, diverse tasks, fast to operate, no mocap suit needed.

Weaknesses: Camera-based pose estimates are noisier than marker-based mocap, retargeting human kinematics to the humanoid URDF is non-trivial (different body proportions), and the human hand doesn't fully match the robot's end effector.

UMI connection: HumanPlus shares UMI's philosophy of driving data collection from natural human motion with cheap sensing, scaled to the whole body. Unlike UMI, the robot is in the loop during collection.

OmniH2O (2024)

Approach: Teleop via HMD (Meta Quest) + wrist tracking + finger tracking. Human in VR controls humanoid in real time via retargeting.

Strengths: No mocap room needed; remote teleoperation; natural, intuitive control; more flexible than mocap.

Weaknesses: End-to-end latency from VR headset to robot must stay low (around 50 ms or less); requires a real-time robot SDK; more complex setup.

Open-TeleVision / Apple Vision Pro teleop

Approach: Uses Apple Vision Pro or equivalent to track head, hands, fingers in 3D. Retargets to robot arm/hands in real time.

Strengths: Consumer hardware getting cheaper; detailed finger tracking; natural.

Weaknesses: Usually covers the upper body only (arms + hands), with no leg capture. The Apple Vision Pro is expensive.

ACT / Mobile-ALOHA (Stanford)

Approach: Direct robot teleop via a leader-follower setup: the operator moves the leader arms and the follower arms copy their joint positions (no force feedback).

Strengths: Ground truth — data from real robot, no retargeting needed.

Weaknesses: Real robot occupied; slow; expensive; doesn't scale easily.

Proposed architecture: UMI for upper body + tracking for lower body

If you want to build a whole-body data collection pipeline for a humanoid, here's the most logical architecture based on the systems above:

UPPER BODY (arms + hands):
  2x UMI handheld gripper
    → GoPro/D405 observation
    → External tracker (mocap/VR) for wrist 6DoF pose
    → Gripper width
  
LOWER BODY + TORSO (legs + waist):
  Mocap suit or VR body tracking
    → Pelvis position + orientation
    → Knee/hip/ankle poses
    → Spine/shoulder orientation
  
SYNC LAYER:
  Single host machine or NTP-sync with <5ms offset
  Shared timestamp reference
  Sync event at demo start (LED flash, force plate trigger)
  
DATA:
  left_wrist_pose, right_wrist_pose (from UMI tracker)
  left_rgb_d, right_rgb_d (from UMI camera)
  left_gripper_width, right_gripper_width
  pelvis_pose, spine_orientation
  left_hip/knee/ankle_pose, right_hip/knee/ankle_pose
  language_instruction
  timestamp
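The DATA and SYNC LAYER blocks above can be sketched as a per-frame record plus a nearest-timestamp alignment step. This is a minimal illustration, not a reference recorder: the names `WholeBodyFrame`, `nearest_index`, and `align_streams` are hypothetical, poses are assumed to be stored as position plus quaternion, and the 5 ms tolerance is taken from the sync budget above.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Sequence, Tuple

# Assumption: a pose is (x, y, z, qx, qy, qz, qw) in a shared world frame.
Pose = Tuple[float, float, float, float, float, float, float]


@dataclass
class WholeBodyFrame:
    """One synchronized sample; field names follow the DATA list above."""
    timestamp: float
    left_wrist_pose: Pose
    right_wrist_pose: Pose
    left_rgb_d: Any            # image array or file path, depending on the recorder
    right_rgb_d: Any
    left_gripper_width: float
    right_gripper_width: float
    pelvis_pose: Pose
    spine_orientation: Tuple[float, float, float, float]
    leg_poses: Dict[str, Pose] = field(default_factory=dict)  # e.g. "left_knee"
    language_instruction: str = ""


def nearest_index(timestamps: Sequence[float], t: float,
                  max_offset: float = 0.005) -> Optional[int]:
    """Index of the sample closest to t, or None if it is further than max_offset seconds.

    `timestamps` must be sorted in ascending order.
    """
    if not timestamps:
        return None
    i = bisect_left(timestamps, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    best = min(candidates, key=lambda j: abs(timestamps[j] - t))
    return best if abs(timestamps[best] - t) <= max_offset else None


def align_streams(camera_ts: Sequence[float], tracker_ts: Sequence[float],
                  body_ts: Sequence[float],
                  max_offset: float = 0.005) -> List[Tuple[int, int, int]]:
    """Match every camera frame to the nearest wrist-tracker and body-tracker sample.

    Camera frames without a partner inside the tolerance in both streams are dropped.
    """
    aligned = []
    for ci, t in enumerate(camera_ts):
        ti = nearest_index(tracker_ts, t, max_offset)
        bi = nearest_index(body_ts, t, max_offset)
        if ti is not None and bi is not None:
            aligned.append((ci, ti, bi))
    return aligned
```

Using the camera as the reference clock is a design choice made here for simplicity; dropping unmatched frames is usually safer than interpolating across a tracking dropout.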

Retargeting: the biggest technical challenge

Whether you use mocap or VR, the captured data is human motion — not robot motion. You need retargeting: mapping from the human kinematic chain to the robot URDF.

This is where many projects fail because:

1. Different body proportions. A 1.7 m person has an arm span of about 1.75 m. A robot may be 1.5 m tall with a 1.3 m span. Uniform scaling isn't enough because segment-length ratios differ as well; you need IK-constrained retargeting.

2. Different joint limits. A human hip has a large range of motion (around 120° in flexion alone). A robot hip is typically much more limited. Retargeting must project human motion into the feasible robot joint space.

3. Balance constraints. When a human bends over, they automatically adjust their CoM. A robot must solve a balance optimization problem. Copying human pose directly will make the robot fall.

4. Gripper ≠ hand. A UMI gripper has 1 DoF (open/close). A human hand has more than 20 DoF. You must decide how to project hand motion onto the gripper signal.
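Three of the points above (proportions, joint limits, hand-to-gripper projection) can be sketched with a few NumPy helpers. These are illustrative first steps under stated assumptions, not a complete retargeting pipeline: the function names are hypothetical, the arm-length scaling only produces a target that still has to go through IK, and the thumb-to-index distance is just one possible choice of gripper signal.

```python
import numpy as np


def scale_wrist_target(wrist_pos, shoulder_pos, human_arm_len, robot_arm_len):
    """Scale the human shoulder-to-wrist vector by the arm-length ratio.

    Returns the wrist target expressed relative to the robot's shoulder.
    This only fixes overall reach; it is an input to IK, not a replacement for it.
    """
    offset = np.asarray(wrist_pos, dtype=float) - np.asarray(shoulder_pos, dtype=float)
    return offset * (robot_arm_len / human_arm_len)


def clamp_to_joint_limits(q, lower, upper):
    """Project joint angles into the robot's feasible box [lower, upper]."""
    return np.clip(np.asarray(q, dtype=float), lower, upper)


def hand_to_gripper_width(thumb_tip, index_tip, max_width=0.08):
    """Collapse a tracked hand to a 1-DoF signal: thumb-index distance, capped.

    max_width (meters) is a hypothetical gripper opening, not a UMI spec.
    """
    dist = np.linalg.norm(np.asarray(thumb_tip, dtype=float) - np.asarray(index_tip, dtype=float))
    return float(min(dist, max_width))
```

Balance (point 3) is deliberately left out: it needs a whole-body controller rather than a per-joint projection.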

Retargeting tools to know:

Pinocchio — rigid body dynamics and IK
Pink — prioritized IK
URDF parsers: yourdfpy, robot_descriptions
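To make "IK-constrained" concrete, here is a minimal damped least-squares IK loop in plain NumPy for a planar two-link arm. It does not use Pinocchio's or Pink's API; the link lengths, damping value, and function names are assumptions chosen for illustration. Clamping after every step is a crude way to respect joint limits, whereas QP-based solvers typically treat limits as constraints.

```python
import numpy as np

L1, L2 = 0.30, 0.25  # hypothetical link lengths in meters


def fk(q):
    """End-effector position of a planar two-link arm."""
    return np.array([
        L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
        L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1]),
    ])


def jacobian(q):
    """2x2 Jacobian of fk with respect to the joint angles."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([
        [-L1 * s1 - L2 * s12, -L2 * s12],
        [L1 * c1 + L2 * c12, L2 * c12],
    ])


def solve_ik(target, q0, lower, upper, damping=1e-2, iters=200, tol=1e-4):
    """Damped least-squares IK with joint-limit clamping after every step."""
    q = np.array(q0, dtype=float)
    target = np.asarray(target, dtype=float)
    for _ in range(iters):
        err = target - fk(q)
        if np.linalg.norm(err) < tol:
            break
        J = jacobian(q)
        # dq = J^T (J J^T + lambda^2 I)^-1 err stays bounded near singularities
        dq = J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), err)
        q = np.clip(q + dq, lower, upper)
    return q
```

In a retargeting pipeline the `target` would be the scaled human wrist position for each frame, with the previous frame's solution used as `q0` to keep the motion continuous.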

Practical equipment options

Option A: SteamVR + wrist trackers (budget)

Hardware:
  - 2x SteamVR base stations (~$200 each)
  - 2x Valve Index controllers or SteamVR trackers (~$150 each)
  - Optional: Vive body trackers for hip/chest/feet (~$100 each)
  
Estimated cost: $1,000–$2,000
Accuracy: roughly 1–3 mm position, 0.1° rotation (under good tracking conditions)
Coverage: wrist (required) + optional body

Best for: labs without mocap, limited budget, high flexibility.

Option B: Mocap system (lab)

Hardware:
  - 6–12 OptiTrack cameras ($3,000–$15,000)
  - Passive marker clusters per body segment
  - Motive software license
  
Estimated cost: $15,000–$50,000
Accuracy: <0.5 mm
Coverage: full body, many markers

Best for: research labs, ground truth needed, multiple concurrent subjects.

Option C: Apple Vision Pro / Meta Quest (upper body only)

Hardware:
  - Apple Vision Pro ($3,500) or Meta Quest 3 ($500)
  - Hand tracking built-in
  
Coverage: hands + arms only, no legs

Best for: upper body only data, fast setup.

Whole-body policy architecture

With data in hand, a whole-body policy is more complex than arm-only:

State space (example 31-DoF humanoid):
  - left/right arm: 7 DoF each = 14
  - left/right gripper: 1 each = 2
  - spine: 3 DoF = 3
  - left/right leg: 6 DoF each = 12
  Total: 31 DoF
  
Observation space:
  - left/right wrist RGB-D (from UMI cameras)
  - head camera (optional)
  - proprioception (joint positions, velocities, IMU)
  - language instruction
  
Action space:
  - whole-body joint targets (31 DoF), or
  - end-effector + CoM trajectory (decoupled)
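The joint groups listed above sum to 31 DoF, and a flat action vector can be indexed by group. The sketch below assumes the group order given in the state-space list; the names `JOINT_GROUPS`, `build_slices`, and `split_action` are hypothetical.

```python
import numpy as np

# Group sizes copied from the state-space list; the ordering is an assumption.
JOINT_GROUPS = {
    "left_arm": 7,
    "right_arm": 7,
    "left_gripper": 1,
    "right_gripper": 1,
    "spine": 3,
    "left_leg": 6,
    "right_leg": 6,
}


def build_slices(groups):
    """Assign each joint group a contiguous slice of the flat action vector."""
    slices, start = {}, 0
    for name, size in groups.items():
        slices[name] = slice(start, start + size)
        start += size
    return slices, start


SLICES, TOTAL_DOF = build_slices(JOINT_GROUPS)


def split_action(action):
    """Split a (..., TOTAL_DOF) action array into named joint groups."""
    action = np.asarray(action)
    if action.shape[-1] != TOTAL_DOF:
        raise ValueError(f"expected {TOTAL_DOF} DoF, got {action.shape[-1]}")
    return {name: action[..., s] for name, s in SLICES.items()}
```

Keeping the layout in one table makes it harder for the recorder, the policy, and the robot SDK to disagree about joint ordering.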

Decoupled vs end-to-end:

Two main approaches:

Decoupled (easier to debug): Separate upper body policy (manipulation) + lower body controller (locomotion). Run in parallel, connected via shared state. HumanPlus uses this approach.
End-to-end (harder, potentially better): One model learns both locomotion and manipulation jointly. Better coordination in principle, but needs more data and is harder to debug.

If you're starting out: decoupled first, then try end-to-end once you have working baselines for each part.
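A decoupled setup can be sketched as two callables that read the same observation and whose outputs are concatenated into one whole-body command. This is a structural illustration only: the class name, the 16/15 split (arms and grippers versus spine and legs), and the choice to pass the upper-body command to the lower-body controller as shared state are all assumptions.

```python
from typing import Callable, Dict

import numpy as np

UPPER_DOF = 16  # two 7-DoF arms + two 1-DoF grippers
LOWER_DOF = 15  # 3-DoF spine + two 6-DoF legs


class DecoupledController:
    """Combine a manipulation policy with a separate locomotion/balance controller."""

    def __init__(self,
                 upper_policy: Callable[[Dict], np.ndarray],
                 lower_controller: Callable[[Dict], np.ndarray]):
        self.upper_policy = upper_policy
        self.lower_controller = lower_controller

    def step(self, obs: Dict) -> np.ndarray:
        upper = np.asarray(self.upper_policy(obs), dtype=float)
        if upper.shape != (UPPER_DOF,):
            raise ValueError(f"upper policy must return {UPPER_DOF} targets")
        # Shared state: the lower body sees what the arms are about to do,
        # so it can compensate (for example, lean back while reaching forward).
        shared = {**obs, "upper_targets": upper}
        lower = np.asarray(self.lower_controller(shared), dtype=float)
        if lower.shape != (LOWER_DOF,):
            raise ValueError(f"lower controller must return {LOWER_DOF} targets")
        return np.concatenate([upper, lower])
```

Because each half is an ordinary callable, either one can be swapped for a stub or a scripted controller while the other is being debugged.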

What you realistically need to prepare

[ ] Humanoid robot with full SDK (joint control, FK/IK, safety)
[ ] Retargeting pipeline from human kinematics to robot URDF
[ ] Whole-body balance controller (standalone, tested independently)
[ ] Working upper body UMI pipeline (Parts 2–5 of this series)
[ ] Lower body tracking solution (mocap/VR)
[ ] Multi-modal synchronized recorder
[ ] Safety: E-stop, joint limits, collision detection, velocity limits
[ ] Compute: robot onboard + policy server + tracker PC
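For the safety item, a software command filter is a reasonable first layer. The sketch below clamps joint targets to position limits and to a per-step velocity limit, and holds the current pose when a stop flag is set. The function name and signature are hypothetical, and a software flag like this complements a hardware E-stop rather than replacing it.

```python
import numpy as np


def safe_command(q_target, q_current, q_min, q_max, max_vel, dt, estop=False):
    """Filter a joint-position command before it reaches the robot.

    q_target, q_current: joint positions (rad)
    q_min, q_max:        joint position limits (rad)
    max_vel:             joint velocity limits (rad/s), scalar or per joint
    dt:                  control period (s)
    estop:               when True, hold the current position
    """
    q_current = np.asarray(q_current, dtype=float)
    if estop:
        return q_current.copy()
    # 1) Project the target into the joint-limit box.
    q = np.clip(np.asarray(q_target, dtype=float), q_min, q_max)
    # 2) Limit how far any joint may move in one control step.
    max_step = np.asarray(max_vel, dtype=float) * dt
    step = np.clip(q - q_current, -max_step, max_step)
    # 3) Re-clip in case the current position was already outside the limits.
    return np.clip(q_current + step, q_min, q_max)
```

Collision detection is not covered here; it needs the robot's geometry and is usually handled by the SDK or a dedicated library.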

Realistic timeline:

Complete upper body UMI (Parts 2–5): 2–4 weeks
Add lower body tracking: 2–4 weeks
Retargeting + safety: 4–8 weeks
Data collection + training + testing: 4–8 weeks
Total if done sequentially: 12–24 weeks

Minimum viable path: if you don't have a whole-body robot yet, start with upper body UMI on two robot arms. Have a working bimanual policy first, then scale to humanoid.

Conclusion

UMI handles upper body manipulation data collection well. Whole-body loco-manipulation is the natural next step but requires three additional components: lower body tracking, a retargeting pipeline, and a balance controller.

No turnkey solution exists at the time of writing. But the architecture is clear, the tools are available (Pinocchio, SteamVR, mocap), and research is advancing rapidly. If you have a working bimanual policy from Parts 2–5, you have the best possible foundation to build on.

References

HumanPlus (2024): "HumanPlus: Humanoid Shadowing and Imitation from Humans"
OmniH2O (2024): "OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Teleoperation and Learning"
Open-TeleVision (2024): "Open-TeleVision: Teleoperation with Immersive Active Visual Feedback"
UMI: real-stanford/universal_manipulation_interface (GitHub repository)
Pinocchio: rigid body dynamics library
Pink: prioritized IK built on Pinocchio

Related posts

Nguyễn Anh Tuấn

Teleop VR: from PICO/ZED to HDF5

unifolm-vla + Unitree G1 (Part 5): deploying the inference server, SSH tunnel, and parallel locomotion

VLA + WBC repos from China: Unitree, THU RDT-1B, and the open community
