Robot Parkour: Jumping and Climbing Stairs with RL

Parkour -- The Hardest Test for Robot Locomotion

In robot locomotion, parkour is the most demanding all-round test. Unlike walking or running on flat ground, parkour requires the robot to jump across gaps, climb tall obstacles, crawl under low barriers, and balance on narrow surfaces, all in real time, with perception and control working in sync.

Why is parkour so hard? Three main reasons:

Diverse skills in one policy: the robot must know when to jump, when to climb, and when to crawl, and must transition smoothly between these skills
Vision-based decision making: proprioception (joint sensing) alone is not enough; the robot has to "see" the terrain ahead in order to plan
Precise timing: a jump that is off by 5 cm or late by 50 ms can make the robot fall

In this post, I analyze the three most important works on robot parkour with reinforcement learning: Extreme Parkour, Robot Parkour Learning, and SoloParkour, from training architecture to real-world results.

Extreme Parkour (Cheng et al., ICRA 2024)

The paper Extreme Parkour with Legged Robots by Xuxin Cheng, Kexin Shi, Ananye Agarwal, and Deepak Pathak at Carnegie Mellon University is a major breakthrough. The headline result: the parkour policy is trained in under 20 hours and deployed zero-shot on a Unitree A1 with a single front-facing depth camera.

Teacher-Student Framework

The core architecture of Extreme Parkour is teacher-student distillation:

Stage 1 -- Teacher policy (privileged information):

The teacher has access to the ground-truth terrain heightmap around the robot
It is trained with PPO in Isaac Gym across thousands of parallel environments
The teacher learns all the parkour skills: jumping onto a 0.6 m box, jumping across gaps, climbing stairs
The reward function combines forward velocity, a survival bonus, and an energy penalty

Stage 2 -- Student policy (vision-based):

The student receives depth images from the camera in place of the privileged heightmap
Knowledge is distilled from the teacher: the student tries to reproduce the teacher's behavior from visual input alone
A depth encoder (CNN) turns each depth image into a latent representation
The student policy outputs joint position targets, just like the teacher
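
The distillation step can be illustrated with a deliberately tiny example. The sketch below is not the paper's implementation: it replaces the CNN and the policy network with linear maps, uses random vectors in place of depth latents, and simply regresses the student's joint targets onto the teacher's with a mean-squared-error loss. OBS_DIM, W_teacher, teacher_action, and distill are hypothetical names introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM = 32   # assumed size of the student's input features (toy value)
ACT_DIM = 12   # 12 joint position targets for a 12-joint quadruped

# Stand-in "teacher": a fixed linear map from features to joint targets.
W_teacher = rng.normal(size=(OBS_DIM, ACT_DIM))


def teacher_action(obs):
    """Joint targets the (toy) teacher would output for these observations."""
    return obs @ W_teacher


def distill(steps=200, batch=256, lr=0.1):
    """Fit a linear student to the teacher's actions by gradient descent on MSE."""
    W_student = np.zeros((OBS_DIM, ACT_DIM))
    losses = []
    for _ in range(steps):
        obs = rng.normal(size=(batch, OBS_DIM))    # fresh batch of observations
        err = obs @ W_student - teacher_action(obs)
        # Loss: squared error summed over joints, averaged over the batch.
        losses.append(float(np.mean(np.sum(err ** 2, axis=1))))
        grad = 2.0 * obs.T @ err / batch           # gradient of that loss
        W_student -= lr * grad
    return W_student, losses
```

In a real pipeline the observations would come from rollouts in simulation (often with the student acting, as in DAgger) rather than from a random generator, but the supervised objective has the same shape.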

Terrain Curriculum

The key ingredient is an automatic terrain curriculum:

Level 1: Flat terrain → basic walking
Level 2: Small steps (10cm) → stepping
Level 3: Medium gaps (30cm) → jumping
Level 4: High boxes (40cm) → climbing
Level 5: Mixed obstacles → full parkour

Terrain difficulty increases automatically once the robot's success rate exceeds 80% at the current level. This lets the policy learn progressively, from easy to hard, instead of being overwhelmed by obstacles that are too difficult from the start.
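
The promotion rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the class name TerrainCurriculum, the sliding window of 100 episodes, and the choice to reset the statistics after each promotion are all assumptions; only the five levels and the 80% threshold come from the text.

```python
from collections import deque


class TerrainCurriculum:
    """Promote to the next terrain level when the recent success rate is high enough."""

    def __init__(self, num_levels=5, threshold=0.8, window=100):
        self.level = 1
        self.num_levels = num_levels
        self.threshold = threshold
        self.window = window                      # assumed evaluation window
        self.results = deque(maxlen=window)       # most recent episode outcomes

    def record(self, success):
        """Log one episode outcome and return the (possibly updated) level."""
        self.results.append(bool(success))
        window_full = len(self.results) == self.window
        if window_full and self.level < self.num_levels:
            rate = sum(self.results) / self.window
            if rate > self.threshold:             # strictly above 80%
                self.level += 1
                self.results.clear()              # start fresh at the new level
        return self.level
```

Calling record(True) or record(False) at the end of every episode is enough to drive the curriculum; the terrain generator then reads level to decide which obstacles to spawn.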

Results on the Unitree A1

The Unitree A1 (low-cost, ~12 kg) with a single Intel RealSense D435 camera:

Climbs onto a 0.6 m box (2.3 times the robot's height)
Jumps across a 0.8 m gap
Climbs stairs continuously
All in real time, with no fine-tuning on the real robot

Notably, the depth stream is low-frequency, jittery, and full of artifacts, yet a single neural network policy still outputs highly precise control. This shows that large-scale RL in simulation can overcome imprecise sensing.

Robot Parkour Learning (Zhuang et al., CoRL 2023)

The paper Robot Parkour Learning by Ziwen Zhuang, Zipeng Fu, and colleagues (Stanford, CMU) tackles a different problem: learning diverse parkour skills in a single end-to-end policy, without reference motion data.

Explicit vs Implicit Depth Encoding

The paper's main contribution is a comparison of two ways to process depth information:

Explicit depth encoder:

Depth image → CNN → explicit heightmap prediction
The robot knows the exact geometry of the obstacle ahead
Pros: interpretable, easy to debug
Cons: reconstruction error accumulates

Implicit depth encoder:

Depth image → CNN → latent embedding (no heightmap reconstruction)
The network decides for itself which features to extract from depth
Pros: can capture information that the explicit method discards
Cons: black box, hard to debug

The paper concludes that implicit encoding works better for complex parkour tasks, because the network can learn features relevant to each specific skill instead of being limited by heightmap reconstruction.

Diverse Skills from a Simple Reward

Instead of designing a separate reward for each skill (jump reward, climb reward, crawl reward), the paper uses a single, simple reward function:

reward = forward_velocity + alive_bonus - energy_penalty - contact_penalty
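
As a rough illustration of how such a reward might be computed per simulation step, here is a minimal sketch. The weights in RewardWeights, the energy proxy (sum of |torque x joint velocity|), and the notion of counting "bad" contacts are assumptions made for this example; the article does not give numeric values or exact term definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights, chosen only to make the example concrete.
    forward_velocity: float = 1.0
    alive_bonus: float = 0.5
    energy: float = 1e-3
    contact: float = 0.1


def step_reward(forward_vel, torques, joint_vels, num_bad_contacts, alive,
                w=RewardWeights()):
    """reward = forward_velocity + alive_bonus - energy_penalty - contact_penalty"""
    # Mechanical power proxy: sum over joints of |torque * joint velocity|.
    energy = sum(abs(t * qd) for t, qd in zip(torques, joint_vels))
    reward = w.forward_velocity * forward_vel
    reward += w.alive_bonus if alive else 0.0
    reward -= w.energy * energy
    reward -= w.contact * num_bad_contacts   # e.g. body or thigh touching an obstacle
    return reward
```

Note that nothing in the function mentions jumping, climbing, or crawling; which behavior pays off depends entirely on the terrain the robot faces.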

The diverse skills emerge naturally from the terrain curriculum:

Climbing: 0.40m obstacles (1.53x robot height)
Leaping: 0.60m gaps (1.5x robot length)
Crawling: 0.20m barriers (0.76x robot height)
Squeezing: 0.28m slits (narrower than the robot's width, so the robot has to tilt its body)

This is a nice example of emergent behavior: complex skills arise from simple objectives combined with diverse environments.

SoloParkour (Chane-Sane et al., CoRL 2024)

SoloParkour, from LAAS-CNRS (France), proposes a different method: constrained reinforcement learning for visual parkour, demonstrated on the Solo-12 robot.

Constrained RL Formulation

Instead of relying on complex reward shaping, SoloParkour formulates parkour as a constrained optimization problem:

Objective: maximize agile locomotion skills
Constraints: stay within the robot's physical limits (torque limits, joint limits, stability)

This approach has a major advantage: the robot is encouraged to attempt aggressive maneuvers (high jumps, fast running) while being constrained not to exceed its physical limits, which reduces the risk of hardware damage at deployment.
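
For readers who prefer symbols, a generic constrained MDP objective looks like the following. This is the textbook form, given here only as an assumption about the general shape of the problem; it is not claimed to be SoloParkour's exact formulation or solver. Here r is the task reward, c_i are constraint costs (for example torque or joint-limit violations), d_i are their allowed budgets, and gamma is the discount factor.

```latex
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\Big[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\Big[ \sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t) \Big] \le d_i,
\qquad i = 1, \dots, m
```

The practical difference from reward shaping is that the limits appear as separate inequalities rather than as penalty terms whose weights must be traded off against the task reward.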

Privileged Experience Warm-Start

The training pipeline has two phases:

Phase 1 -- Privileged policy (no vision required):

Train a policy with privileged information (terrain heightmap, exact robot state)
This policy reaches high performance because it has complete state information

Phase 2 -- Visual policy (from depth images):

Use experience from the privileged policy to warm-start an off-policy RL algorithm
Instead of training from scratch (expensive), the visual policy starts from good initial behaviors
Far more sample-efficient than on-policy methods such as PPO
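
The warm-start idea can be sketched with an ordinary replay buffer that is pre-filled before the visual policy takes its first gradient step. Everything below is illustrative: ReplayBuffer, warm_start, and the transition layout are hypothetical, and the sketch assumes depth images were rendered and stored while the privileged policy was rolling out, so that the visual learner can consume the same transitions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer of transitions for an off-policy learner."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size, rng):
        """Sample without replacement (at most as many items as are stored)."""
        k = min(batch_size, len(self.data))
        return rng.sample(list(self.data), k)

    def __len__(self):
        return len(self.data)


def warm_start(buffer, privileged_rollouts):
    """Copy transitions collected by the privileged policy into the buffer.

    Each rollout is a list of transitions; each transition is assumed to be a
    dict holding the depth observation, action, reward, next depth
    observation, and done flag. Returns the number of transitions offered.
    """
    count = 0
    for episode in privileged_rollouts:
        for transition in episode:
            buffer.add(transition)
            count += 1
    return count
```

After warm_start, the off-policy learner samples from the buffer as usual, and its own transitions gradually replace the privileged ones as the buffer fills.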

Single Policy, Multiple Terrains

SoloParkour trains a single policy on a curriculum that spans several terrain types:

Crawl parkour: floating objects that the robot must crawl under
Step/hurdle parkour: obstacles to climb onto and jump over
Leap parkour: gaps to jump across
Difficulty increases gradually during training

The result: a single policy can walk, climb, leap, and crawl, all from depth pixels, without switching between multiple specialized policies.

Comparing the 3 Methods

Criterion	Extreme Parkour	Robot Parkour Learning	SoloParkour
Robot	Unitree A1	Unitree A1	Solo-12
Vision	Depth camera	Depth camera	Depth camera
Framework	Teacher-student	End-to-end	Constrained RL
Depth encoding	Implicit	Explicit + Implicit	Implicit
Training	PPO + distillation	PPO + curriculum	Constrained RL + warm-start
Skills	Jump, climb, stairs	Climb, leap, crawl, squeeze	Walk, climb, leap, crawl
Reference motion	No	No	No
Conference	ICRA 2024	CoRL 2023	CoRL 2024
Max obstacle height	0.6m	0.4m	0.3m

ANYmal Parkour -- Industrial Scale

Beyond research on small robots, ETH Zurich has also demonstrated parkour capabilities on ANYmal, an industrial-grade quadruped weighing about 50 kg. The paper Learning Agile Locomotion on Risky Terrains shows ANYmal-D reaching a peak velocity of 2.5 m/s on stepping stones and traversing narrow balance beams.

ETH's approach differs in that it formulates parkour as a navigation task rather than velocity tracking: the robot chooses its own speed based on terrain difficulty. On easy terrain it runs fast; when it meets a hard obstacle it slows down and moves more carefully.
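
The difference between the two formulations shows up directly in the reward. The sketch below contrasts a typical velocity-tracking term with one possible goal-reaching term; both functions, their parameters (sigma, reward_window), and the specific shapes are assumptions for illustration and are not taken from the ETH paper.

```python
import math


def velocity_tracking_reward(v_xy, v_cmd_xy, sigma=0.25):
    """Reward for matching a commanded planar velocity at every step."""
    err = (v_xy[0] - v_cmd_xy[0]) ** 2 + (v_xy[1] - v_cmd_xy[1]) ** 2
    return math.exp(-err / sigma)


def goal_reaching_reward(pos_xy, goal_xy, t, episode_len, reward_window=1.0):
    """Reward for being near the goal, paid only in the last seconds of an episode.

    Before the final `reward_window` seconds the reward is zero, so the policy
    is free to choose how fast or slowly to move along the way.
    """
    if t < episode_len - reward_window:
        return 0.0
    dist_sq = (pos_xy[0] - goal_xy[0]) ** 2 + (pos_xy[1] - goal_xy[1]) ** 2
    return 1.0 / (1.0 + dist_sq)
```

With the first function the robot is penalized whenever it deviates from the commanded speed, even on dangerous terrain; with the second, only arriving matters, which leaves room for the cautious-when-needed behavior described above.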

A General Training Pipeline for Robot Parkour

Based on the three papers above, a general pipeline for robot parkour consists of the following steps:

Step 1: Terrain Generation

Generate diverse terrain in simulation (Isaac Gym or MuJoCo):

Flat ground, slopes, stairs
Random boxes, gaps, barriers
Combination terrains (parkour courses)
Parametric difficulty control
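
Parametric difficulty control usually means that a single scalar drives the obstacle dimensions. The following sketch builds a heightfield for a short course with one gap and one box; the function name, the course layout, the 5 cm grid resolution, and the linear mapping from difficulty to size are all assumptions. The upper bounds (0.8 m gap, 0.6 m box) are borrowed from the Extreme Parkour numbers quoted earlier.

```python
import numpy as np


def make_gap_and_box_course(difficulty, length_m=8.0, width_m=2.0, res=0.05):
    """Return a heightfield in meters, indexed [x, y], for difficulty in [0, 1]."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")

    nx = int(round(length_m / res))
    ny = int(round(width_m / res))
    heights = np.zeros((nx, ny))

    gap_width = 0.1 + 0.7 * difficulty    # 0.1 m (easy) to 0.8 m (hard)
    box_height = 0.1 + 0.5 * difficulty   # 0.1 m (easy) to 0.6 m (hard)
    gap_depth = 1.0                       # deep enough that falling in is a failure

    # Gap starting 2 m into the course, spanning the full width.
    g0 = int(round(2.0 / res))
    g1 = g0 + int(round(gap_width / res))
    heights[g0:g1, :] = -gap_depth

    # Box 1 m long starting 5 m into the course.
    b0 = int(round(5.0 / res))
    b1 = b0 + int(round(1.0 / res))
    heights[b0:b1, :] = box_height

    return heights
```

A curriculum would call this with a small difficulty at first and raise it as the policy improves; the resulting array can then be loaded as a heightfield in the simulator of choice.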

Step 2: Privileged Teacher

Train a teacher policy with complete state information:

Ground-truth heightmap or terrain parameters
Perfect robot state (no sensor noise)
PPO with large batches collected from 4096+ parallel environments

Step 3: Visual Student (Distillation)

Transfer the knowledge to a vision-based student:

Depth image input → CNN encoder → latent → policy
Behavior cloning or DAgger from the teacher
Or a constrained RL warm start (the SoloParkour approach)

Step 4: Domain Randomization

Randomize the simulation so the policy is robust in the real world:

Camera noise, latency, field of view
Terrain friction, compliance
Robot mass, COM position
Actuator strength, delay
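
In practice this often comes down to resampling a set of physical and sensor parameters at the start of every episode. The sketch below shows that pattern; the parameter names and every numeric range in RANGES are hypothetical placeholders, not values from any of the papers, and terrain compliance is left out because how it is parameterized depends on the simulator.

```python
import random

# Hypothetical (low, high) ranges, chosen only for illustration.
RANGES = {
    "camera_noise_std_m": (0.0, 0.02),
    "camera_latency_s": (0.0, 0.1),
    "camera_fov_deg": (80.0, 90.0),
    "terrain_friction": (0.4, 1.2),
    "added_mass_kg": (-1.0, 3.0),
    "com_offset_m": (-0.05, 0.05),
    "motor_strength_scale": (0.8, 1.2),
    "action_delay_s": (0.0, 0.02),
}


def sample_episode_params(rng):
    """Draw one value per parameter, uniformly within its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
```

Passing an explicit random.Random instance keeps the sampling reproducible, which helps when a specific failure in simulation needs to be replayed.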

Step 5: Real-World Deployment

Deploy and iterate:

Zero-shot transfer (preferred)
Fine-tune if needed (minimal)
Monitor failure cases and add them to the training curriculum

Remaining Challenges

Impressive as it already is, robot parkour still faces many challenges:

Long-horizon planning: current policies react over short horizons. Real parkour requires planning several moves ahead (looking 5-10 m ahead)
Recovery from failure: when the robot falls partway through a parkour course, it needs a recovery policy to stand up and continue
Generalization: policies are trained in simulation on specific terrains, but the real world has infinite variety
Bipedal parkour: most research targets quadrupeds. Bipedal parkour is many times harder (see Humanoid Parkour Learning)

Conclusion

Robot parkour advanced extremely quickly in 2023-2024. From Extreme Parkour (teacher-student distillation) to Robot Parkour Learning (emergent skills from simple rewards) to SoloParkour (constrained RL), the common recipe for agile locomotion is RL plus vision plus a terrain curriculum.

If you are just getting started, read the earlier parts of this series first.

The next post, Part 6: Bipedal Walking, will dig into the challenges of controlling two-legged robots, from Cassie to humanoids.