Whole-body VLA: combining UMI + mocap/VR for full-body humanoid data
This is the final post in the UMI + VLA series. Earlier posts (2–5) took you from printing a gripper to a working bimanual policy. This post extends to the bigger question: how do you collect data for a humanoid that needs to both walk and manipulate?
Before reading: This is not a step-by-step tutorial. No turnkey pipeline for whole-body loco-manipulation data collection exists at the time of writing. This is an architecture and design principles post — to help you understand the landscape, choose a direction, and avoid critical design mistakes.
The problem: why upper-body UMI isn't enough for humanoids
For a standard robot arm (Franka, UR5), UMI is sufficient: collect hand demos, train policy, deploy to arm. Humanoids are more complex because they have three simultaneous problems:
Problem 1: Manipulation (arms + hands)
→ UMI handles this well
Problem 2: Locomotion + balance (legs + torso)
→ UMI does not capture this
Problem 3: Coordination (whole body moving while arms manipulate)
→ This is the hardest part; no one has fully solved it yet
When a humanoid needs to walk to a cabinet, open a drawer, and pick up an object — the policy needs to know where to place the feet, how to tilt the torso, and how to extend the arms. These three actions are mutually dependent and must be captured simultaneously.
Current landscape: who's doing what
HumanPlus (Stanford, 2024)
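Approach: A human operator is tracked with a single RGB camera. Off-the-shelf body and hand pose estimators feed a low-level whole-body policy, trained in simulation on existing human motion-capture data, so the humanoid "shadows" the operator in real time. The shadowing sessions produce the demonstrations used to train task policies from egocentric vision.
Strengths: Natural data collection, diverse tasks, fast to operate, and no mocap suit or mocap room needed at collection time.
Weaknesses: Camera-based pose estimation is noisier than marker-based mocap; retargeting human kinematics to the humanoid URDF is non-trivial (different body proportions); the robot is occupied during collection.
UMI connection: HumanPlus shares UMI's goal of making demonstration feel like natural human motion, scaled to the whole body. The difference is that the robot stays in the loop during collection, whereas UMI records with no robot present.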
OmniH2O (2024)
Approach: Teleop via HMD (Meta Quest) + wrist tracking + finger tracking. Human in VR controls humanoid in real time via retargeting.
Strengths: No mocap room needed; remote teleoperation; natural intuitive control; more flexible than mocap.
Weaknesses: Latency from VR headset to robot must be <50ms; requires real-time robot SDK; more complex setup.
Open-TeleVision / Apple Vision Pro teleop
Approach: Uses Apple Vision Pro or equivalent to track head, hands, fingers in 3D. Retargets to robot arm/hands in real time.
Strengths: Consumer hardware getting cheaper; detailed finger tracking; natural.
Weaknesses: Usually covers upper body only (arms + hands), no leg capture. Apple Vision Pro is expensive.
ACT / Mobile-ALOHA (Stanford)
Approach: Direct robot arm teleop via a bimanual leader-follower setup: the operator moves the leader arms and the follower arms mirror their joint positions.
Strengths: Ground truth — data from real robot, no retargeting needed.
Weaknesses: Real robot occupied; slow; expensive; doesn't scale easily.
Proposed architecture: UMI for upper body + tracking for lower body
If you want to build a whole-body data collection pipeline for a humanoid, here's the most logical architecture based on the systems above:
UPPER BODY (arms + hands):
2x UMI handheld gripper
→ GoPro/D405 observation
→ External tracker (mocap/VR) for wrist 6DoF pose
→ Gripper width
LOWER BODY + TORSO (legs + waist):
Mocap suit or VR body tracking
→ Pelvis position + orientation
→ Knee/hip/ankle poses
→ Spine/shoulder orientation
SYNC LAYER:
Single host machine or NTP-sync with <5ms offset
Shared timestamp reference
Sync event at demo start (LED flash, force plate trigger)
DATA:
left_wrist_pose, right_wrist_pose (from UMI tracker)
left_rgb_d, right_rgb_d (from UMI camera)
left_gripper_width, right_gripper_width
pelvis_pose, spine_orientation
left_hip/knee/ankle_pose, right_hip/knee/ankle_pose
language_instruction
timestamp
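The DATA fields above can be written down as one record per synchronized sample. This is a minimal sketch, not a format used by UMI or any of the systems mentioned: the class name `WholeBodyFrame`, the `(x, y, z, qx, qy, qz, qw)` pose layout, the per-side dictionaries, and the file-path convention for RGB-D frames are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Pose as (x, y, z, qx, qy, qz, qw). The quaternion ordering is an assumption.
Pose = Tuple[float, float, float, float, float, float, float]
Quat = Tuple[float, float, float, float]

SIDES = ("left", "right")
LEG_JOINTS = ("hip", "knee", "ankle")
IDENTITY_POSE: Pose = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0)


@dataclass
class WholeBodyFrame:
    """One synchronized sample (hypothetical layout)."""
    timestamp: float                      # seconds on the shared clock
    language_instruction: str
    wrist_pose: Dict[str, Pose]           # from the UMI wrist trackers
    rgbd_path: Dict[str, str]             # where each wrist RGB-D frame is stored
    gripper_width: Dict[str, float]       # metres
    pelvis_pose: Pose
    spine_orientation: Quat
    leg_pose: Dict[str, Dict[str, Pose]]  # side -> joint name -> pose

    def validate(self) -> None:
        """Raise ValueError if a side or a leg joint is missing."""
        for name in ("wrist_pose", "rgbd_path", "gripper_width", "leg_pose"):
            missing = set(SIDES) - set(getattr(self, name))
            if missing:
                raise ValueError(f"{name} is missing sides: {sorted(missing)}")
        for side in SIDES:
            missing = set(LEG_JOINTS) - set(self.leg_pose[side])
            if missing:
                raise ValueError(f"{side} leg is missing joints: {sorted(missing)}")
            if self.gripper_width[side] < 0.0:
                raise ValueError(f"{side} gripper width is negative")


def make_example_frame(timestamp: float = 0.0) -> WholeBodyFrame:
    """Build a placeholder frame with identity poses, useful for recorder tests."""
    return WholeBodyFrame(
        timestamp=timestamp,
        language_instruction="open the drawer",
        wrist_pose={s: IDENTITY_POSE for s in SIDES},
        rgbd_path={s: f"{s}_wrist/{timestamp:.3f}.npz" for s in SIDES},
        gripper_width={s: 0.04 for s in SIDES},
        pelvis_pose=IDENTITY_POSE,
        spine_orientation=(0.0, 0.0, 0.0, 1.0),
        leg_pose={s: {j: IDENTITY_POSE for j in LEG_JOINTS} for s in SIDES},
    )
```

Calling `validate()` on every frame at record time is cheap and catches a dropped tracker before you have collected an hour of unusable demos.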
Retargeting: the biggest technical challenge
Whether you use mocap or VR, the captured data is human motion — not robot motion. You need retargeting: mapping from the human kinematic chain to the robot URDF.
This is where many projects fail because:
1. Different body proportions. A 1.7 m person has an arm span of roughly 1.7 m. A robot may be 1.5 m tall with a 1.3 m span. Simple scaling doesn't work on its own because limb-length ratios differ too, so you need IK-constrained retargeting.
2. Different joint limits. A human hip reaches roughly 120° of flexion and has substantial range in abduction and rotation as well. A robot hip is typically much more limited. Retargeting must project human motion into the feasible robot joint space.
3. Balance constraints. When a human bends over, they automatically adjust their CoM. A robot must solve a balance optimization problem. Copying human pose directly will make the robot fall.
4. Gripper ≠ hand. A UMI gripper has 1 DoF (open/close). A human hand has 21 DoF. You must decide how to project hand motion onto the gripper signal.
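Points 1, 2, and 4 above each reduce to a small projection step. The sketch below shows the simplest possible version of each, assuming hypothetical function names and an assumed 0.08 m maximum gripper opening. The scaling step is only a starting guess for an IK solve, not a replacement for it, and the pinch mapping only matters when hands are captured by VR or mocap; with a handheld UMI gripper the width is measured directly.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def scale_wrist_target(shoulder: Vec3, wrist: Vec3,
                       human_arm_len: float, robot_arm_len: float) -> Vec3:
    """Scale the human shoulder-to-wrist vector by the arm-length ratio.

    Returns the wrist target relative to the robot's shoulder.
    """
    ratio = robot_arm_len / human_arm_len
    return tuple((w - s) * ratio for w, s in zip(wrist, shoulder))


def clamp_to_joint_limits(q: Sequence[float],
                          limits: Sequence[Tuple[float, float]]) -> List[float]:
    """Project joint angles into the robot's feasible box, joint by joint."""
    return [min(max(qi, lo), hi) for qi, (lo, hi) in zip(q, limits)]


def pinch_to_gripper_width(thumb_tip: Vec3, index_tip: Vec3,
                           max_width: float = 0.08) -> float:
    """Map thumb-to-index fingertip distance to a 1-DoF gripper width (metres).

    max_width is an assumed gripper stroke; replace it with your hardware's value.
    """
    distance = math.dist(thumb_tip, index_tip)
    return min(distance, max_width)
```

Per-joint clamping is the crudest way to respect limits: it can move the end effector away from its target, which is why constrained IK is the usual next step.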
Retargeting tools to know:
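- Pinocchio: rigid-body kinematics and dynamics (FK, Jacobians), the building blocks for IK
- Pink: task-based differential IK built on Pinocchio
- URDF tooling: yourdfpy (parser), robot_descriptions (loader for ready-made robot models)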
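To make "IK-constrained retargeting" concrete without depending on any of the libraries above, here is a toy stand-in: closed-form IK for a planar two-link arm, with unreachable targets projected onto the reachable workspace. It does not use the Pinocchio or Pink APIs, and the function names and link lengths are illustrative assumptions. A real pipeline would solve the same kind of problem over the full 3D kinematic chain.

```python
import math
from typing import Tuple


def forward_kinematics(q1: float, q2: float, l1: float, l2: float) -> Tuple[float, float]:
    """End-effector position of a planar two-link arm (shoulder at the origin)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


def inverse_kinematics(x: float, y: float, l1: float, l2: float) -> Tuple[float, float]:
    """Closed-form IK returning the solution with a non-negative elbow angle.

    Targets outside the reachable annulus are first projected onto it, which
    is the planar analogue of projecting human motion into the robot's
    feasible space.
    """
    r = math.hypot(x, y)
    r_min, r_max = abs(l1 - l2), l1 + l2
    r_clamped = min(max(r, r_min), r_max)
    if r > 0.0:
        x, y = x * r_clamped / r, y * r_clamped / r
    else:
        x, y = r_clamped, 0.0  # direction is undefined at the origin; pick +x

    cos_q2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    cos_q2 = min(1.0, max(-1.0, cos_q2))  # guard against rounding error
    q2 = math.acos(cos_q2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2
```

A human wrist target scaled to the robot and then passed through `inverse_kinematics` always yields a pose the arm can reach, even when the operator stretches further than the robot can.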
Practical equipment options
Option A: SteamVR + wrist trackers (budget)
Hardware:
- 2x SteamVR base stations (~$200 each)
- 2x Valve Index controllers or SteamVR trackers (~$150 each)
- Optional: Vive body trackers for hip/chest/feet (~$100 each)
Estimated cost: $1,000–$2,000
Accuracy: 1–3mm position, 0.1° rotation
Coverage: wrist (required) + optional body
Best for: labs without mocap, limited budget, high flexibility.
Option B: Mocap system (lab)
Hardware:
- 6–12 OptiTrack cameras ($3,000–$15,000)
- Passive marker clusters per body segment
- Motive software license
Cost: $15,000–$50,000
Accuracy: <0.5mm
Coverage: full body, many markers
Best for: research labs, ground truth needed, multiple concurrent subjects.
Option C: Apple Vision Pro / Meta Quest (upper body only)
Hardware:
- Apple Vision Pro ($3,500) or Meta Quest 3 ($500)
- Hand tracking built-in
Coverage: hands + arms only, no legs
Best for: upper body only data, fast setup.
Whole-body policy architecture
With data in hand, a whole-body policy is more complex than arm-only:
State space (example 31-DoF humanoid):
- left/right arm: 7 DoF each = 14
- left/right gripper: 1 each = 2
- spine: 3 DoF = 3
- left/right leg: 6 DoF each = 12
Total: 31 DoF
Observation space:
- left/right wrist RGB-D (from UMI cameras)
- head camera (optional)
- proprioception (joint positions, velocities, IMU)
- language instruction
Action space:
- whole-body joint targets (31 DoF), or
- end-effector + CoM trajectory (decoupled)
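Summing the joint groups listed above gives 14 + 2 + 3 + 12 = 31 joint targets. A small layout table keeps that arithmetic in one place and lets you slice a flat action vector by body part. The group ordering and the names `JOINT_GROUPS`, `build_layout`, and `split_action` are assumptions for illustration; use whatever order your robot SDK expects.

```python
from typing import Dict, List, Sequence, Tuple

# (group name, number of joints), in an assumed order.
JOINT_GROUPS: List[Tuple[str, int]] = [
    ("left_arm", 7),
    ("right_arm", 7),
    ("left_gripper", 1),
    ("right_gripper", 1),
    ("spine", 3),
    ("left_leg", 6),
    ("right_leg", 6),
]


def total_dof(groups: Sequence[Tuple[str, int]]) -> int:
    """Total number of joint targets in the flat action vector."""
    return sum(count for _, count in groups)


def build_layout(groups: Sequence[Tuple[str, int]]) -> Dict[str, slice]:
    """Map each group name to its slice of the flat action vector."""
    layout: Dict[str, slice] = {}
    start = 0
    for name, count in groups:
        layout[name] = slice(start, start + count)
        start += count
    return layout


def split_action(action: Sequence[float],
                 layout: Dict[str, slice]) -> Dict[str, List[float]]:
    """Split a flat action vector into per-group joint targets."""
    expected = max(s.stop for s in layout.values())
    if len(action) != expected:
        raise ValueError(f"expected {expected} values, got {len(action)}")
    return {name: list(action[s]) for name, s in layout.items()}
```

Keeping the layout explicit also makes it easy to mask out the legs and train an arms-only policy on the same dataset.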
Decoupled vs end-to-end:
Two main approaches:
- Decoupled (easier to debug): Separate upper body policy (manipulation) + lower body controller (locomotion). Run in parallel, connected via shared state. HumanPlus uses a related hierarchical split: a low-level whole-body tracking policy underneath a task-level imitation policy.
- End-to-end (harder, potentially better): One model learns both locomotion and manipulation jointly. Better coordination in principle, but needs more data and is harder to debug.
If you're starting out: decoupled first, then try end-to-end once you have working baselines for each part.
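The decoupled option can be sketched as two components that only talk through a small shared state. Everything here is hypothetical: the class names, the stub behaviour, the 0.6 m stopping distance, and the 16/15 split (arms and grippers in the upper policy, spine and legs in the lower controller). For simplicity the sketch calls the two parts sequentially in one loop rather than in parallel.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

UPPER_DOF = 16  # 2 arms x 7 + 2 grippers x 1
LOWER_DOF = 15  # spine 3 + 2 legs x 6


@dataclass
class SharedState:
    """The only channel between the two components."""
    base_velocity_cmd: Tuple[float, float, float] = (0.0, 0.0, 0.0)  # vx, vy, yaw rate
    manipulating: bool = False


class StubManipulationPolicy:
    """Placeholder for a learned upper-body policy."""

    def act(self, observation: Dict[str, float], shared: SharedState) -> List[float]:
        # Ask the base to stop once the target is within (assumed) reach.
        close_enough = observation["distance_to_target"] < 0.6
        shared.manipulating = close_enough
        shared.base_velocity_cmd = (0.0, 0.0, 0.0) if close_enough else (0.3, 0.0, 0.0)
        return [0.0] * UPPER_DOF


class StubLocomotionController:
    """Placeholder for a balance and locomotion controller."""

    def act(self, proprio: Dict[str, float], shared: SharedState) -> List[float]:
        # A real controller would track shared.base_velocity_cmd while balancing.
        return [0.0] * LOWER_DOF


def whole_body_step(policy, controller, observation, proprio,
                    shared: SharedState) -> List[float]:
    """Run one control tick and return the concatenated joint targets."""
    upper = policy.act(observation, shared)  # runs first so the base command is fresh
    lower = controller.act(proprio, shared)
    if len(upper) != UPPER_DOF or len(lower) != LOWER_DOF:
        raise ValueError("unexpected action size")
    return upper + lower
```

Because each half can be swapped for a stub like these, you can test the locomotion controller with a frozen upper body, and the manipulation policy on a fixed base, before running them together.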
What you realistically need to prepare
[ ] Humanoid robot with full SDK (joint control, FK/IK, safety)
[ ] Retargeting pipeline from human kinematics to robot URDF
[ ] Whole-body balance controller (standalone, tested independently)
[ ] Working upper body UMI pipeline (Parts 2–5 of this series)
[ ] Lower body tracking solution (mocap/VR)
[ ] Multi-modal synchronized recorder
[ ] Safety: E-stop, joint limits, collision detection, velocity limits
[ ] Compute: robot onboard + policy server + tracker PC
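For the synchronized recorder item in the checklist, the core operation is aligning streams that were stamped by different clocks. The sketch below assumes you can see one shared sync event (for example the LED flash) in both streams, uses it to estimate a constant clock offset, and then matches each reference timestamp to the nearest sample within a 5 ms tolerance. The function names are hypothetical, and a constant offset ignores clock drift over long recordings.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def estimate_offset(event_time_ref: float, event_time_other: float) -> float:
    """Offset to add to the other stream so the sync event lines up."""
    return event_time_ref - event_time_other


def nearest_index(timestamps: Sequence[float], t: float) -> int:
    """Index of the timestamp closest to t. timestamps must be sorted ascending."""
    if not timestamps:
        raise ValueError("timestamps is empty")
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[i - 1], timestamps[i]
    return i if (after - t) < (t - before) else i - 1


def align(ref_times: Sequence[float], other_times: Sequence[float],
          offset: float, max_gap: float = 0.005) -> List[Optional[int]]:
    """For each reference time, return the matching index in the other stream.

    Entries are None when the nearest sample is further than max_gap seconds
    away, so dropped frames show up explicitly instead of being papered over.
    """
    shifted = [t + offset for t in other_times]
    matches: List[Optional[int]] = []
    for t in ref_times:
        j = nearest_index(shifted, t)
        matches.append(j if abs(shifted[j] - t) <= max_gap else None)
    return matches
```

Counting the `None` entries per demo gives a quick quality metric: a demo with many unmatched frames is usually better discarded than interpolated.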
Realistic timeline (roughly 12–24 weeks in total if the stages run back to back):
- Complete upper body UMI (Parts 2–5): 2–4 weeks
- Add lower body tracking: 2–4 weeks
- Retargeting + safety: 4–8 weeks
- Data collection + training + testing: 4–8 weeks
Minimum viable path: if you don't have a whole-body robot yet, start with upper body UMI on two robot arms. Have a working bimanual policy first, then scale to humanoid.
Conclusion
UMI handles upper body manipulation data collection well. Whole-body loco-manipulation is the natural next step but requires three additional components: lower body tracking, a retargeting pipeline, and a balance controller.
No turnkey solution exists at the time of writing. But the architecture is clear, the tools are available (Pinocchio, SteamVR, mocap), and research is advancing rapidly. If you have a working bimanual policy from Parts 2–5, you have the best possible foundation to build on.
References
- HumanPlus (2024): "HumanPlus: Humanoid Shadowing and Imitation from Humans"
- OmniH2O (2024): "OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Teleoperation and Learning"
- Open-TeleVision (2024): "Open-TeleVision: Teleoperation with Immersive Active Visual Feedback"
- real-stanford/universal_manipulation_interface
- Pinocchio
- Pink: task-based differential IK built on Pinocchio