Ego Videos for Locomotion LAMs

What This Article Is For

Article 1 mapped the WholeBodyVLA pipeline: egocentric images and language enter the VLA; two Latent Action Models (LAMs) provide discrete latent supervision; a lightweight decoder emits robot-level commands at about 10 Hz; and the LMO policy executes locomotion at 50 Hz. Article 2 asks a more practical question: if you do not yet have a Motion Capture studio, and you cannot teleoperate a humanoid for hundreds of hours, what data can you collect first?

The answer from WholeBodyVLA is useful for small labs: at the latent pretraining stage, the locomotion LAM can learn from egocentric videos without action labels. The paper describes a low-cost data recipe: one operator, one head-mounted monocular camera, humanoid-relevant motion primitives such as advancing, turning, and squatting, plus goal-directed motion toward potential manipulation or contact targets. The manipulation LAM is trained separately on real-robot manipulation data, in the paper using AgiBot World, while the locomotion LAM uses self-collected egocentric locomotion videos.

This article turns that recipe into a capture protocol a beginner can follow. It does not replace teleoperation; that stage is covered separately in article 4. But it lets you build a clean video corpus early, with the right viewpoint, primitives, metadata, and quality checks, so the model can learn a locomotion vocabulary before you spend expensive robot time.

For broader research context, see our WholeBodyVLA ICLR 2026 analysis and WholebodyVLA architecture guide. This article is the first concrete data step: turning the paper's action-free recipe into a capture workflow you can audit.

Why Action-Free Ego Video?

In robot learning, video is usually not the expensive part. The expensive part is action-aligned trajectory data: joint targets, gripper commands, base commands, contact state, robot state, and synchronized timestamps. Collecting that data requires the robot, teleoperation hardware, control software, safety supervision, scene resets, and many failed attempts. For humanoid loco-manipulation, the cost is even higher because the robot must walk, balance, and use its arms at the same time.

WholeBodyVLA uses latent action learning to reduce this bottleneck. A LAM observes two consecutive frames and learns a discrete code describing the visual change from the earlier frame to the later frame. That code is not a true motor command, but it can act as a pseudo-action label for VLA pretraining. When a head-mounted video shows a person walking toward a table, turning toward a cart, lowering their body to reach a box, or stopping near a handle, the LAM can learn visual changes corresponding to "advance," "turn," "squat," and "stop near the target."

The key caveat is that "action-free" does not mean "any video is fine." The paper stresses that the locomotion data should be manipulation-aware locomotion. The operator should not wander randomly. They should move in ways that establish preconditions for manipulation: approach the object to be picked, rotate to face a contact target, squat to reach a low container, sidestep to align before placing an object, or stop in front of a handle. This makes the video directly useful for loco-manipulation learning rather than just teaching the model that the scene moves when the camera moves.

Separate the Locomotion LAM and Manipulation LAM

WholeBodyVLA does not train one shared LAM over all videos. The paper explains that manipulation and locomotion videos generate very different visual patterns. In manipulation clips, the camera is often almost static; the important changes happen around hands, grippers, objects, and contact points. In egocentric locomotion clips, the camera moves continuously; the whole scene shifts because the head or body is advancing, turning, sidestepping, or squatting. If a single LAM is trained naively on both sources, it can learn conflicting attention behavior: sometimes it should focus locally on hands and objects, while at other times it should attend to the whole scene to understand ego-motion.

Your dataset should therefore have two buckets from the beginning:

Bucket	Primary data	What it teaches	Avoid mixing in
Locomotion LAM	Egocentric video while the person or robot moves through the scene	Advance, retreat, sidestep, turn, squat, stop near target	Clips that only show tabletop hand motion
Manipulation LAM	Object manipulation video, preferably from robot data or a robot dataset	Grasp, lift, place, push, pull, contact affordance	Long walking clips without clear manipulation
Teleop fine-tuning	Trajectories with robot state and real commands	Ground latents into joint targets and locomotion commands	Action-free video without synchronized actions

This article focuses on the locomotion LAM bucket. The capture protocol still needs to be manipulation-aware. The goal is not to make a walking dataset for sports analysis. The goal is to capture how a body approaches useful manipulation poses.

WholeBodyVLA learns locomotion and manipulation as a connected stack

Minimum Equipment

The WholeBodyVLA paper describes a locomotion data pipeline that only needs a single operator with a head-mounted monocular camera, avoiding expensive MoCap or teleoperation for this pretraining stage. In a small lab, you can start with a very simple setup:

Component	Practical minimum	Notes
Camera	Action camera, phone on a head strap, or lightweight RGB camera	Prioritize stable video and an eye-level viewpoint
Mount	Head strap, helmet mount, or head-level adapter	Head-mounted is closer to a humanoid head camera than chest-mounted
Video format	1080p at 30 fps is a reasonable start	60 fps helps with faster motion but increases storage
Audio	Optional	Useful for spoken instructions, but consider privacy
Time sync	System clock or clap marker at the start of clips	Useful later when joining metadata and scene logs
Scene props	Table, shelf, box, basket, cart, handle, objects to pick	Choose objects with clear contact meaning

You do not need RGB-D for the locomotion LAM recipe. If you have a RealSense-style camera, it can be useful, but do not let hardware requirements block the first data collection pass. For beginner workflows, the important parts are temporal order, a viewpoint similar to a robot head camera, and task-relevant motion.

Motion Primitives to Capture

WholeBodyVLA highlights core loco-manipulation primitives such as advancing, turning, and squatting. The appendix further describes task suites combining forward and backward walking, lateral stepping, in-place turning, and squatting with single-arm or dual-arm manipulation. Your protocol should cover at least these groups:

Primitive	Example clip	What a good clip shows	Common mistake
Advance	Walk from a doorway to a table with a box	Target grows larger, path is clear, final stop is near the workspace	Operator moves too fast or stops too far away
Backward	Step away from a table after placement	Context remains visible, motion is controlled	Long backward walking in an empty scene
Lateral step	Sidestep to align with a basket or carton	Target shifts laterally, then ends near the center of view	Sidestep without a visible target
Turn	Rotate in place to face a cart or shelf	Heading changes clearly, target ends centered	Operator both turns and advances too much
Squat	Lower the body to reach a low container	Camera height lowers, low target becomes easier to reach	Squat is not tied to any object
Approach-to-contact	Move toward a handle, box, shelf edge, or tabletop	Operator stops at arm-reachable distance	Operator simply passes by the target

Each clip does not need to contain only one primitive. Real loco-manipulation is sequential: approach the table, sidestep to align, squat, make a small turn, and stop. In the first data pass, collect both single-primitive clips and short sequences. Single clips help you debug whether the LAM can distinguish the basic motions; sequences help the VLA learn task context.

A minimal two-hour collection plan can look like this:

Scene A: table + box + basket on the floor
  20 clips approaching the table from 3 distances
  20 clips sidestepping left/right to align with the basket
  20 clips squatting in front of the basket
  20 clips: advance -> sidestep -> squat -> stop

Scene B: shelf + cart with handle
  20 clips turning left/right to face the handle
  20 clips approaching the handle and stopping within reach
  20 clips of short simulated pushing motion, no heavy force required
  20 clips: advance -> turn -> approach handle

Scene C: low box or floor-level object
  20 squat clips from multiple approach angles
  20 clips stepping backward after a squat
  20 clips changing heading and then squatting

Those numbers are only a starting point for pipeline testing. The WholeBodyVLA experiments used about 300 hours of collected egocentric locomotion video, so a serious training run should expand across scenes, operators, layouts, lighting conditions, and target types.

Goal-Directed Motion Toward Contact Targets

The most important part of the recipe is goal-directed execution. The operator should move toward targets that could become manipulation or contact targets. In other words, every locomotion clip should answer the question: "After stopping here, what could a humanoid do with its hands?"

Good targets include:

Contact target	Why it is useful
Table edge or tabletop	The robot must stop at a workable distance before pick/place
Basket, carton, or low box	Combines approach, lateral alignment, and squat
Cart handle	Combines turning, bimanual placement, and forward push
Door or drawer handle	Requires precise orientation before manipulation
Multi-level shelf	Requires height and distance adjustment
Object on the floor	Requires squatting while keeping the target in view

Weak clips look like this:

Walking around the room without stopping near a target.
Rotating the camera by moving the neck or hand instead of rotating the body.
Squatting in empty space.
Running through the scene so the object appears only briefly.
Capturing from a handheld camera much lower than head level.

When you add manual labels, avoid generic labels like walk. Record the relation to the target: approach_table, align_to_basket, turn_to_cart_handle, squat_to_low_box. The LAM does not need these labels for frame reconstruction training, but good metadata helps filtering, failure analysis, validation split design, and later VLA instruction generation.

A Concrete Capture Protocol

The following protocol is detailed enough to hand to a new data collector.

1. Prepare the Scene

Choose a small area with stable lighting and limited foot traffic. Place three to five contact targets: a table, a box on the table, a basket on the floor, a cart or object with a handle, and a shelf. Mark several start positions on the floor with tape: near, medium, far; left offset, right offset; and headings rotated 30-60 degrees from the target.

The aim is controlled variation. If every clip starts from the same pose, the model will overfit. If the scene changes too much on day one, debugging becomes difficult. Start structured, then add diversity.

2. Mount the Camera and Check the Field of View

The camera should sit near eye level and point slightly downward, so it can see low targets and the approximate hand workspace. If the angle is too high, the video shows mostly walls and shelves. If it is too low, the data behaves more like chest-camera video than head-camera video. Before recording a full batch, walk through one sequence and check:

Is the target centered when the operator stops?
Are the tabletop, basket, or handle recognizable?
During a squat, does the low target stay in view?
Is the video stable enough to avoid constant motion blur?
Are file timestamps and clip IDs ordered correctly?

If your camera has stabilization, use it moderately. Excessive stabilization may remove some ego-motion cues, but severe shake makes reconstruction much harder. A LAM needs real camera motion, not unusable blur.

3. Attach an Instruction to Every Clip

Each clip should have a short intent. You can speak it at the beginning, show it to the operator on a phone, or store it in a CSV file. For example:

clip_id,instruction,start_pose,target,primitive_sequence
A_001,"go to the table and stop within reach",far_center,table_edge,advance
A_014,"sidestep right to align with the basket",near_left,basket,lateral_right
B_008,"turn left to face the cart handle",near_angle_right,cart_handle,turn_left
C_022,"squat to reach the low box",near_center,low_box,squat

Instructions do not need to be long. They do need to match the video. If the instruction says "turn to the cart" but the operator walks straight to the table, you create noisy metadata that will hurt downstream filtering and evaluation.

4. Record Short Clips With Clear Starts and Stops

Single-primitive clips can be 3-12 seconds long. Short sequences can be 10-30 seconds. Begin when the camera is stable. End after the operator has stopped for one or two seconds at the final pose. That final pause matters: loco-manipulation requires not only reaching the target, but reaching a stable stance where manipulation is possible.

A simple single-primitive format:

0.0-1.0 s: stand still at the start pose
1.0-5.0 s: execute the main primitive
5.0-7.0 s: stop with the target within manipulation range

A short sequence format:

0.0-1.0 s: stand still
1.0-5.0 s: advance
5.0-8.0 s: turn or sidestep
8.0-11.0 s: squat or align
11.0-13.0 s: hold a stable final stance near the target

5. Write Metadata During the Session

Do not rely on memory after the session. Create manifest.csv or manifest.jsonl as you record. A minimal schema:

{
  "clip_id": "B_008",
  "video_path": "raw/session_2026_06_10/B_008.mp4",
  "scene_id": "shelf_cart_lab",
  "operator_id": "op_03",
  "camera_mount": "head",
  "camera_model": "action_cam_1080p30",
  "instruction": "turn left to face the cart handle",
  "target": "cart_handle",
  "primitive_sequence": ["turn_left", "stop"],
  "start_pose_bucket": "near_angle_right",
  "quality": "unchecked"
}

You do not need exact 6D pose labels at this stage. You do need enough metadata to filter clips later: keep only squat clips, remove shaky clips, split by scene, or test whether the model has overfit to one operator.

Data Quality Rules

A good capture protocol defines not only what to record, but also what to reject. Action-free videos are especially easy to pollute because there is no action label to validate. Use a concrete checklist:

Criterion	Pass	Fail
Temporal continuity	Clip is continuous, with no jump cuts	Heavy frame drops or cuts in the middle
Viewpoint	Head-mounted egocentric view	Handheld shake or camera far below eye level
Target visibility	Target is visible before the final stop	Target is occluded or only appears for an instant
Primitive clarity	Main motion is recognizable	Motion mixes advance, turn, and squat without clear intent
Final stance	1-2 seconds of stable stop	Clip ends while the camera is still moving wildly
Manipulation relevance	Stop is within reach or at a contact point	Walking unrelated to any manipulation target
Privacy and safety	No bystander faces, unsafe areas, or risky collisions	Sensitive environment or unsafe path

Use three QA labels: keep, maybe, and drop. Do not try to save every clip. Bad clips can push the LAM codebook toward camera shake, blur, lighting changes, or irrelevant movement rather than useful locomotion semantics.

Split the Dataset to Avoid Leakage

Beginners often split train and validation clips randomly. That can leak the same scene, operator, target, and route into both sets. The validation loss looks good, but the model may fail when the room, camera, or operator changes.

Better split options:

Split	How to define it	Purpose
Train	Most clips across most scenes and operators	Learn the main primitive vocabulary
In-scene validation	Same scenes, different clips	Debug reconstruction and latent quality
Held-out scene validation	Reserve one room or layout	Test visual generalization
Held-out operator validation	Reserve one operator	Test gait and head-motion bias
Stress set	Different lighting, low/high targets, offset paths	Find failures before scaling

If you only have one lab room, create held-out layouts: move the table, change box color, swap the basket, or alter start distances. WholeBodyVLA evaluates generalization under changed start poses, objects, layouts, and appearance; your locomotion dataset should follow the same spirit.

Minimal Preprocessing Before LAM Training

A LAM learns from consecutive frames, so preprocessing must preserve time. Do not shuffle individual frames. Do not crop so aggressively that the target disappears. A light pipeline:

def preprocess_clip(video):
    frames = decode_video(video, fps=10)  # or sample from 30 fps down to 10 fps
    frames = trim_static_intro_outro(frames, keep_final_stop=True)
    frames = resize_short_side(frames, 256)
    frames = center_or_letterbox_crop(frames, size=224)
    pairs = [(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    return pairs

The sampling rate depends on the model and compute budget. If sampling is too sparse, frame-to-frame change becomes too large. If it is too dense, many pairs are almost identical. Since the WholeBodyVLA VLA backbone operates at about 10 Hz during deployment, thinking about video around 10 fps is a reasonable starting point. Still, you should test several rates and measure reconstruction or relative reconstruction gain instead of choosing by intuition.

Keep metadata next to processed frames:

processed/
  frames/
    B_008/
      000000.jpg
      000001.jpg
  pairs.jsonl
  manifest.jsonl
  qa_report.csv

Do Not Start With MoCap If the Goal Is the LAM

Motion Capture can be valuable for retargeting, imitation, whole-body motion priors, and evaluation. But if the immediate goal is to train a locomotion LAM following the WholeBodyVLA action-free recipe, MoCap is not the first requirement. The LAM needs structured visual change from egocentric video; it does not require a 3D skeleton for every frame.

A practical comparison:

Data collection path	Cost	Best stage	Risk
Head-mounted monocular video	Low	Locomotion LAM pretraining	Needs strong QA to avoid noisy clips
Robot manipulation dataset	Medium to high if self-collected, low if open	Manipulation LAM	Domain gap from your target robot
Real-robot teleoperation	High	Decoder grounding and fine-tuning	Consumes robot time and expert operators
Whole-body MoCap	Very high	Retargeting, motion prior, evaluation	Does not automatically provide ego visual latents

The practical rule is simple: on day one, record good ego video before building a MoCap-heavy pipeline. Once the LAM pipeline works, decide where MoCap actually helps: retargeting, reward design, or benchmarking. This matches the spirit of the paper: use low-cost action-free video to scale pretraining, then use targeted teleoperation to ground latents into the real robot.

Capture-Day Checklist

[ ] Scene has clear manipulation targets: table, basket, box, handle, shelf.
[ ] Camera is head-mounted, and low targets remain visible during squats.
[ ] Primitive list covers advance, backward, lateral, turn, squat, approach-to-contact.
[ ] Every clip has an instruction or intent metadata.
[ ] Every clip starts and ends with 1-2 seconds of stillness.
[ ] Manifest records clip_id, scene_id, operator_id, target, and primitive.
[ ] QA labels are defined: keep/maybe/drop.
[ ] Validation has a held-out scene, layout, or operator.
[ ] No private or unsafe areas are recorded.

Conclusion

Ego video for a locomotion LAM is cheap, but it is not casual data. To stay close to WholeBodyVLA, keep three principles: action-free videos must still be temporally clean; locomotion must be manipulation-aware, meaning the operator advances, turns, sidesteps, and squats toward contact targets; and the locomotion LAM should be separated from the manipulation LAM because the two video types create different visual learning problems.

Done well, this dataset lets the VLA learn latent concepts for advancing, turning, aligning, and lowering the body before you collect large amounts of real-robot trajectory data. That is the practical value of the recipe: it does not remove teleoperation, but it reduces how much expensive teleoperation you need and gives fine-tuning a better prior.

Main Technical Sources

What This Article Is For

Why Action-Free Ego Video?

Separate the Locomotion LAM and Manipulation LAM

Your dataset should therefore have two buckets from the beginning:

Bucket	Primary data	What it teaches	Avoid mixing in
Locomotion LAM	Egocentric video while the person or robot moves through the scene	Advance, retreat, sidestep, turn, squat, stop near target	Clips that only show tabletop hand motion
Manipulation LAM	Object manipulation video, preferably from robot data or a robot dataset	Grasp, lift, place, push, pull, contact affordance	Long walking clips without clear manipulation
Teleop fine-tuning	Trajectories with robot state and real commands	Ground latents into joint targets and locomotion commands	Action-free video without synchronized actions

WholeBodyVLA learns locomotion and manipulation as a connected stack

Minimum Equipment

Component	Practical minimum	Notes
Camera	Action camera, phone on a head strap, or lightweight RGB camera	Prioritize stable video and an eye-level viewpoint
Mount	Head strap, helmet mount, or head-level adapter	Head-mounted is closer to a humanoid head camera than chest-mounted
Video format	1080p at 30 fps is a reasonable start	60 fps helps with faster motion but increases storage
Audio	Optional	Useful for spoken instructions, but consider privacy
Time sync	System clock or clap marker at the start of clips	Useful later when joining metadata and scene logs
Scene props	Table, shelf, box, basket, cart, handle, objects to pick	Choose objects with clear contact meaning

Motion Primitives to Capture

Primitive	Example clip	What a good clip shows	Common mistake
Advance	Walk from a doorway to a table with a box	Target grows larger, path is clear, final stop is near the workspace	Operator moves too fast or stops too far away
Backward	Step away from a table after placement	Context remains visible, motion is controlled	Long backward walking in an empty scene
Lateral step	Sidestep to align with a basket or carton	Target shifts laterally, then ends near the center of view	Sidestep without a visible target
Turn	Rotate in place to face a cart or shelf	Heading changes clearly, target ends centered	Operator both turns and advances too much
Squat	Lower the body to reach a low container	Camera height lowers, low target becomes easier to reach	Squat is not tied to any object
Approach-to-contact	Move toward a handle, box, shelf edge, or tabletop	Operator stops at arm-reachable distance	Operator simply passes by the target

A minimal two-hour collection plan can look like this:

Scene A: table + box + basket on the floor
  20 clips approaching the table from 3 distances
  20 clips sidestepping left/right to align with the basket
  20 clips squatting in front of the basket
  20 clips: advance -> sidestep -> squat -> stop

Scene B: shelf + cart with handle
  20 clips turning left/right to face the handle
  20 clips approaching the handle and stopping within reach
  20 clips of short simulated pushing motion, no heavy force required
  20 clips: advance -> turn -> approach handle

Scene C: low box or floor-level object
  20 squat clips from multiple approach angles
  20 clips stepping backward after a squat
  20 clips changing heading and then squatting

Goal-Directed Motion Toward Contact Targets

Good targets include:

Contact target	Why it is useful
Table edge or tabletop	The robot must stop at a workable distance before pick/place
Basket, carton, or low box	Combines approach, lateral alignment, and squat
Cart handle	Combines turning, bimanual placement, and forward push
Door or drawer handle	Requires precise orientation before manipulation
Multi-level shelf	Requires height and distance adjustment
Object on the floor	Requires squatting while keeping the target in view

Weak clips look like this:

Walking around the room without stopping near a target.
Rotating the camera by moving the neck or hand instead of rotating the body.
Squatting in empty space.
Running through the scene so the object appears only briefly.
Capturing from a handheld camera much lower than head level.

A Concrete Capture Protocol

The following protocol is detailed enough to hand to a new data collector.

1. Prepare the Scene

2. Mount the Camera and Check the Field of View

Is the target centered when the operator stops?
Are the tabletop, basket, or handle recognizable?
During a squat, does the low target stay in view?
Is the video stable enough to avoid constant motion blur?
Are file timestamps and clip IDs ordered correctly?

3. Attach an Instruction to Every Clip

Each clip should have a short intent. You can speak it at the beginning, show it to the operator on a phone, or store it in a CSV file. For example:

clip_id,instruction,start_pose,target,primitive_sequence
A_001,"go to the table and stop within reach",far_center,table_edge,advance
A_014,"sidestep right to align with the basket",near_left,basket,lateral_right
B_008,"turn left to face the cart handle",near_angle_right,cart_handle,turn_left
C_022,"squat to reach the low box",near_center,low_box,squat

4. Record Short Clips With Clear Starts and Stops

A simple single-primitive format:

0.0-1.0 s: stand still at the start pose
1.0-5.0 s: execute the main primitive
5.0-7.0 s: stop with the target within manipulation range

A short sequence format:

0.0-1.0 s: stand still
1.0-5.0 s: advance
5.0-8.0 s: turn or sidestep
8.0-11.0 s: squat or align
11.0-13.0 s: hold a stable final stance near the target

5. Write Metadata During the Session

Do not rely on memory after the session. Create manifest.csv or manifest.jsonl as you record. A minimal schema:

{
  "clip_id": "B_008",
  "video_path": "raw/session_2026_06_10/B_008.mp4",
  "scene_id": "shelf_cart_lab",
  "operator_id": "op_03",
  "camera_mount": "head",
  "camera_model": "action_cam_1080p30",
  "instruction": "turn left to face the cart handle",
  "target": "cart_handle",
  "primitive_sequence": ["turn_left", "stop"],
  "start_pose_bucket": "near_angle_right",
  "quality": "unchecked"
}

Data Quality Rules

Criterion	Pass	Fail
Temporal continuity	Clip is continuous, with no jump cuts	Heavy frame drops or cuts in the middle
Viewpoint	Head-mounted egocentric view	Handheld shake or camera far below eye level
Target visibility	Target is visible before the final stop	Target is occluded or only appears for an instant
Primitive clarity	Main motion is recognizable	Motion mixes advance, turn, and squat without clear intent
Final stance	1-2 seconds of stable stop	Clip ends while the camera is still moving wildly
Manipulation relevance	Stop is within reach or at a contact point	Walking unrelated to any manipulation target
Privacy and safety	No bystander faces, unsafe areas, or risky collisions	Sensitive environment or unsafe path

Split the Dataset to Avoid Leakage

Better split options:

Split	How to define it	Purpose
Train	Most clips across most scenes and operators	Learn the main primitive vocabulary
In-scene validation	Same scenes, different clips	Debug reconstruction and latent quality
Held-out scene validation	Reserve one room or layout	Test visual generalization
Held-out operator validation	Reserve one operator	Test gait and head-motion bias
Stress set	Different lighting, low/high targets, offset paths	Find failures before scaling

Minimal Preprocessing Before LAM Training

A LAM learns from consecutive frames, so preprocessing must preserve time. Do not shuffle individual frames. Do not crop so aggressively that the target disappears. A light pipeline:

def preprocess_clip(video):
    frames = decode_video(video, fps=10)  # or sample from 30 fps down to 10 fps
    frames = trim_static_intro_outro(frames, keep_final_stop=True)
    frames = resize_short_side(frames, 256)
    frames = center_or_letterbox_crop(frames, size=224)
    pairs = [(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    return pairs

Keep metadata next to processed frames:

processed/
  frames/
    B_008/
      000000.jpg
      000001.jpg
  pairs.jsonl
  manifest.jsonl
  qa_report.csv

Do Not Start With MoCap If the Goal Is the LAM

A practical comparison:

Data collection path	Cost	Best stage	Risk
Head-mounted monocular video	Low	Locomotion LAM pretraining	Needs strong QA to avoid noisy clips
Robot manipulation dataset	Medium to high if self-collected, low if open	Manipulation LAM	Domain gap from your target robot
Real-robot teleoperation	High	Decoder grounding and fine-tuning	Consumes robot time and expert operators
Whole-body MoCap	Very high	Retargeting, motion prior, evaluation	Does not automatically provide ego visual latents

Capture-Day Checklist

[ ] Scene has clear manipulation targets: table, basket, box, handle, shelf.
[ ] Camera is head-mounted, and low targets remain visible during squats.
[ ] Primitive list covers advance, backward, lateral, turn, squat, approach-to-contact.
[ ] Every clip has an instruction or intent metadata.
[ ] Every clip starts and ends with 1-2 seconds of stillness.
[ ] Manifest records clip_id, scene_id, operator_id, target, and primitive.
[ ] QA labels are defined: keep/maybe/drop.
[ ] Validation has a held-out scene, layout, or operator.
[ ] No private or unsafe areas are recorded.

What This Article Is For

Why Action-Free Ego Video?

Separate the Locomotion LAM and Manipulation LAM

Minimum Equipment

Motion Primitives to Capture

Goal-Directed Motion Toward Contact Targets

A Concrete Capture Protocol

1. Prepare the Scene

2. Mount the Camera and Check the Field of View

3. Attach an Instruction to Every Clip

4. Record Short Clips With Clear Starts and Stops

5. Write Metadata During the Session

Data Quality Rules

Split the Dataset to Avoid Leakage

Minimal Preprocessing Before LAM Training

Do Not Start With MoCap If the Goal Is the LAM

Capture-Day Checklist

Conclusion

Main Technical Sources

Related Posts

Nguyễn Anh Tuấn

Related Posts

Bản đồ pipeline WholeBodyVLA

Từ PD giữ dáng đến WBID có tiếp xúc

Teleop toàn thân: TWIST và HOMIE

What This Article Is For

Why Action-Free Ego Video?

Separate the Locomotion LAM and Manipulation LAM

Minimum Equipment

Motion Primitives to Capture

Goal-Directed Motion Toward Contact Targets

A Concrete Capture Protocol

1. Prepare the Scene

2. Mount the Camera and Check the Field of View

3. Attach an Instruction to Every Clip

4. Record Short Clips With Clear Starts and Stops

5. Write Metadata During the Session

Data Quality Rules

Split the Dataset to Avoid Leakage

Minimal Preprocessing Before LAM Training

Do Not Start With MoCap If the Goal Is the LAM

Capture-Day Checklist

Conclusion

Main Technical Sources

Related Posts

Nguyễn Anh Tuấn

Related Posts

Bản đồ pipeline WholeBodyVLA

Từ PD giữ dáng đến WBID có tiếp xúc

Teleop toàn thân: TWIST và HOMIE