RL cho Manipulation: MDP, Reward Design và Environment Setup

Bạn đã bao giờ tự hỏi — làm thế nào để một cánh tay robot học được cách cầm nắm đồ vật, mà không cần ai lập trình từng chuyển động? Câu trả lời nằm ở Reinforcement Learning (RL) — một paradigm cho phép robot tự khám phá, thử nghiệm, và dần dần "ngộ ra" cách thao tác với thế giới vật lý.

Đây là bài đầu tiên trong series RL for Manipulation, nơi chúng ta sẽ đi từ nền tảng lý thuyết đến huấn luyện robot thực hiện các tác vụ manipulation phức tạp. Trong bài này, chúng ta sẽ tập trung vào ba trụ cột: MDP formulation, reward design, và environment setup trong MuJoCo.

Nếu bạn chưa quen với RL cơ bản, hãy đọc trước bài RL Basics cho Robotics để nắm vững các khái niệm nền tảng.

Manipulation dưới góc nhìn MDP

Markov Decision Process là gì?

Mọi bài toán RL đều được mô hình hóa dưới dạng Markov Decision Process (MDP), định nghĩa bởi tuple $(S, A, T, R, \gamma)$:

Thành phần	Ý nghĩa	Ví dụ trong Manipulation
$S$ (State space)	Tập trạng thái	Vị trí khớp, vận tốc, pose vật thể
$A$ (Action space)	Tập hành động	Torque hoặc vị trí mục tiêu cho khớp
$T$ (Transition)	Xác suất chuyển trạng thái	Vật lý mô phỏng (MuJoCo dynamics)
$R$ (Reward)	Hàm phần thưởng	+1 khi nắm được vật, -0.01 mỗi bước
$\gamma$ (Discount)	Hệ số chiết khấu	Thường 0.99 cho manipulation

Điểm mấu chốt khi áp dụng MDP vào manipulation là thiết kế đúng state space và action space. Nếu bạn cho robot quá ít thông tin (state nghèo nàn), nó không thể học. Nếu bạn cho quá nhiều thông tin không cần thiết, việc học trở nên chậm và không ổn định.

State Space cho Manipulation

State space trong manipulation thường bao gồm ba nhóm thông tin chính:

1. Robot State (Proprioception)

Vị trí các khớp (joint positions): $q \in \mathbb{R}^n$
Vận tốc các khớp (joint velocities): $\dot{q} \in \mathbb{R}^n$
Vị trí và hướng gripper (end-effector pose): $x_{ee} \in SE(3)$
Trạng thái gripper (mở/đóng): $g \in [0, 1]$

2. Object State

Vị trí vật thể (position): $p_{obj} \in \mathbb{R}^3$
Hướng vật thể (orientation): $q_{obj} \in SO(3)$
Vận tốc tuyến tính và góc: $v_{obj}, \omega_{obj}$

3. Task-Specific Information

Vị trí mục tiêu (goal position): $p_{goal} \in \mathbb{R}^3$
Thông tin tiếp xúc (contact forces)
Dữ liệu cảm biến xúc giác (tactile sensing)

import numpy as np

def get_manipulation_state(sim):
    """Trích xuất state vector cho manipulation task."""
    # Robot proprioception
    joint_pos = sim.data.qpos[:7]      # 7-DOF arm
    joint_vel = sim.data.qvel[:7]
    gripper_pos = sim.data.qpos[7:9]   # 2-finger gripper

    # End-effector pose
    ee_pos = sim.data.site_xpos[0]     # Vị trí end-effector
    ee_rot = sim.data.site_xmat[0].reshape(3, 3)  # Ma trận xoay

    # Object state
    obj_pos = sim.data.qpos[9:12]      # Vị trí vật thể
    obj_quat = sim.data.qpos[12:16]    # Quaternion vật thể
    obj_vel = sim.data.qvel[9:15]      # 6-DOF velocity

    # Relative information (rất quan trọng!)
    rel_pos = obj_pos - ee_pos         # Vector từ gripper đến vật

    state = np.concatenate([
        joint_pos, joint_vel,           # 14 dims
        gripper_pos,                    # 2 dims
        ee_pos,                        # 3 dims
        obj_pos, obj_quat, obj_vel,    # 13 dims
        rel_pos,                       # 3 dims
    ])
    return state  # Total: 35 dims

Mẹo quan trọng: Luôn bao gồm relative position (khoảng cách từ gripper đến vật thể) trong state. Thông tin tương đối này giúp policy tổng quát hóa tốt hơn so với chỉ dùng vị trí tuyệt đối.

Action Space: Torque vs Position vs Velocity

Việc chọn action space ảnh hưởng rất lớn đến khả năng học và chất lượng chuyển động:

Action Space	Ưu điểm	Nhược điểm	Khi nào dùng
Joint Torque	Linh hoạt nhất, kiểm soát lực trực tiếp	Khó học, cần nhiều sample, chuyển động giật	Nghiên cứu force control
Joint Position	Dễ học, chuyển động mượt	Không kiểm soát lực trực tiếp	Đa số tác vụ manipulation
Joint Velocity	Cân bằng giữa torque và position	Cần tuning PD gains	Pick-and-place, trajectory following
End-effector Delta	Trực quan, dễ transfer	Cần IK solver, singularity issues	Teleoperation-style tasks

Trong thực tế, Joint Position control (hay Delta Position) được dùng phổ biến nhất vì nó cho phép policy tập trung vào "đi đâu" thay vì "dùng bao nhiêu lực":

# Action space: delta joint position
action_space = spaces.Box(
    low=-0.05,  # Giới hạn thay đổi mỗi bước
    high=0.05,
    shape=(8,),  # 7 joints + 1 gripper
    dtype=np.float32
)

def apply_action(sim, action):
    """Áp dụng delta position action."""
    current_pos = sim.data.qpos[:8].copy()
    target_pos = current_pos + action
    # Clip vào joint limits
    target_pos = np.clip(target_pos, joint_lower, joint_upper)
    sim.data.ctrl[:8] = target_pos

Reward Design: Nghệ thuật và Khoa học

Reward design là phần khó nhất và quan trọng nhất khi làm RL cho manipulation. Một reward function tốt có thể giúp robot học trong vài giờ, trong khi reward function tệ khiến robot không bao giờ học được gì — hoặc tệ hơn, học được những hành vi "hack reward" kỳ quái.

Sparse vs Dense Reward

Sparse Reward — chỉ cho phần thưởng khi hoàn thành task:

def sparse_reward(achieved_goal, desired_goal, threshold=0.05):
    """Reward = 1 nếu đạt goal, 0 nếu không."""
    distance = np.linalg.norm(achieved_goal - desired_goal)
    return 1.0 if distance < threshold else 0.0

Dense Reward — cho phản hồi liên tục dựa trên tiến trình:

def dense_reward(ee_pos, obj_pos, goal_pos, grasped):
    """Reward liên tục dựa trên khoảng cách và trạng thái."""
    # Phase 1: Tiến đến vật thể
    reach_dist = np.linalg.norm(ee_pos - obj_pos)
    reach_reward = 1.0 - np.tanh(5.0 * reach_dist)

    # Phase 2: Nâng vật thể lên goal
    if grasped:
        place_dist = np.linalg.norm(obj_pos - goal_pos)
        place_reward = 1.0 - np.tanh(5.0 * place_dist)
        return reach_reward + 5.0 * place_reward
    
    return reach_reward

Tiêu chí	Sparse Reward	Dense Reward
Dễ thiết kế	Rất dễ	Khó, dễ sai
Tốc độ học	Rất chậm	Nhanh hơn nhiều
Risk of reward hacking	Thấp	Cao nếu thiết kế sai
Cần HER?	Gần như bắt buộc	Không nhất thiết
Scalable	Tốt	Khó scale cho multi-step

Nguyên tắc thiết kế Reward Function

Qua nhiều năm nghiên cứu, cộng đồng RL đã đúc kết một số nguyên tắc vàng:

1. Chia reward thành phases: Mỗi phase tương ứng một giai đoạn của task (tiếp cận → nắm → di chuyển → thả).

2. Dùng distance-based shaping: Hàm tanh hoặc exponential decay giúp gradient ổn định:

# Tốt: gradient ổn định ở mọi khoảng cách
reward = 1.0 - np.tanh(5.0 * distance)

# Kém: gradient quá nhỏ khi xa, quá lớn khi gần
reward = -distance

# Kém: gradient = 0 ở mọi nơi trừ boundary
reward = float(distance < threshold)

3. Penalize undesired behavior: Thêm penalty cho hành động quá lớn, va chạm mạnh, hoặc rơi vật thể:

# Penalty cho action quá lớn (smooth motion)
action_penalty = -0.01 * np.sum(action ** 2)

# Penalty cho rơi vật (nếu đã nắm)
drop_penalty = -10.0 if was_grasped and not is_grasped else 0.0

4. Normalize reward components: Đảm bảo các thành phần reward có scale tương đương để agent không bỏ qua thành phần quan trọng.

Setup Manipulation Environment trong MuJoCo

Tại sao MuJoCo?

MuJoCo (Multi-Joint dynamics with Contact) là simulator hàng đầu cho manipulation RL nhờ:

Contact dynamics chính xác — mô phỏng ma sát, va chạm phức tạp
Tốc độ cao — hàng nghìn bước mỗi giây
Miễn phí (từ 2022, khi DeepMind mua lại)
Tích hợp tốt với Gymnasium, Stable-Baselines3

Tạo Custom Manipulation Environment

Dưới đây là code hoàn chỉnh để tạo một environment manipulation cơ bản — robot arm phải nắm một khối lập phương và đưa đến vị trí mục tiêu:

import gymnasium as gym
from gymnasium import spaces
import mujoco
import mujoco.viewer
import numpy as np


# ---- MuJoCo XML model ----
MANIPULATION_XML = """
<mujoco model="simple_manipulation">
  <option timestep="0.002" gravity="0 0 -9.81"/>
  
  <worldbody>
    <!-- Ground plane -->
    <geom type="plane" size="1 1 0.1" rgba="0.9 0.9 0.9 1"/>
    
    <!-- Table -->
    <body name="table" pos="0.5 0 0.4">
      <geom type="box" size="0.4 0.4 0.02" rgba="0.6 0.3 0.1 1"
            mass="100" contype="1" conaffinity="1"/>
    </body>
    
    <!-- Robot Arm (simplified 4-DOF) -->
    <body name="base" pos="0.0 0 0.42">
      <joint name="j0" type="hinge" axis="0 0 1" range="-3.14 3.14"/>
      <geom type="cylinder" size="0.04 0.05" rgba="0.2 0.2 0.8 1"/>
      
      <body name="link1" pos="0 0 0.1">
        <joint name="j1" type="hinge" axis="0 1 0" range="-1.57 1.57"/>
        <geom type="capsule" fromto="0 0 0 0.3 0 0" size="0.03" rgba="0.3 0.3 0.9 1"/>
        
        <body name="link2" pos="0.3 0 0">
          <joint name="j2" type="hinge" axis="0 1 0" range="-2.0 2.0"/>
          <geom type="capsule" fromto="0 0 0 0.25 0 0" size="0.025" rgba="0.4 0.4 1.0 1"/>
          
          <body name="link3" pos="0.25 0 0">
            <joint name="j3" type="hinge" axis="0 1 0" range="-2.0 2.0"/>
            <geom type="capsule" fromto="0 0 0 0.1 0 0" size="0.02" rgba="0.5 0.5 1.0 1"/>
            
            <!-- Simple Gripper -->
            <body name="gripper" pos="0.1 0 0">
              <site name="grip_site" pos="0 0 0" size="0.01"/>
              <body name="finger_left" pos="0 0.02 0">
                <joint name="finger_l" type="slide" axis="0 1 0" range="0 0.04"/>
                <geom type="box" size="0.01 0.005 0.03" rgba="0.8 0.2 0.2 1"
                      contype="1" conaffinity="1" friction="2 0.05 0.01"/>
              </body>
              <body name="finger_right" pos="0 -0.02 0">
                <joint name="finger_r" type="slide" axis="0 -1 0" range="0 0.04"/>
                <geom type="box" size="0.01 0.005 0.03" rgba="0.8 0.2 0.2 1"
                      contype="1" conaffinity="1" friction="2 0.05 0.01"/>
              </body>
            </body>
          </body>
        </body>
      </body>
    </body>
    
    <!-- Object to manipulate -->
    <body name="object" pos="0.5 0 0.45">
      <freejoint name="obj_joint"/>
      <geom type="box" size="0.025 0.025 0.025" rgba="0.1 0.8 0.1 1"
            mass="0.1" contype="1" conaffinity="1" friction="1 0.05 0.01"/>
    </body>
    
    <!-- Goal marker (visual only) -->
    <body name="goal" pos="0.4 0.2 0.55">
      <geom type="sphere" size="0.02" rgba="1 0 0 0.5" contype="0" conaffinity="0"/>
    </body>
  </worldbody>
  
  <actuator>
    <position name="a_j0" joint="j0" kp="100"/>
    <position name="a_j1" joint="j1" kp="100"/>
    <position name="a_j2" joint="j2" kp="100"/>
    <position name="a_j3" joint="j3" kp="100"/>
    <position name="a_fl" joint="finger_l" kp="50"/>
    <position name="a_fr" joint="finger_r" kp="50"/>
  </actuator>
</mujoco>
"""


class SimpleManipulationEnv(gym.Env):
    """Custom MuJoCo manipulation environment.
    
    Task: Robot arm phải nắm khối lập phương xanh
    và đưa đến vị trí goal (hình cầu đỏ).
    """
    metadata = {"render_modes": ["human", "rgb_array"]}
    
    def __init__(self, render_mode=None, reward_type="dense"):
        super().__init__()
        
        self.render_mode = render_mode
        self.reward_type = reward_type
        
        # Load MuJoCo model
        self.model = mujoco.MjModel.from_xml_string(MANIPULATION_XML)
        self.data = mujoco.MjData(self.model)
        
        # Action: 4 joint deltas + 1 gripper command
        self.action_space = spaces.Box(
            low=-1.0, high=1.0, shape=(5,), dtype=np.float32
        )
        
        # Observation: joint pos/vel + ee pos + obj pos/vel + goal + relative
        obs_dim = 4 + 4 + 3 + 3 + 3 + 3 + 3  # = 23
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(obs_dim,), dtype=np.float64
        )
        
        self.max_steps = 200
        self.current_step = 0
        self._goal_pos = np.array([0.4, 0.2, 0.55])
        
    def _get_obs(self):
        """Xây dựng observation vector."""
        joint_pos = self.data.qpos[:4]
        joint_vel = self.data.qvel[:4]
        ee_pos = self.data.site_xpos[0]  # grip_site position
        obj_pos = self.data.qpos[6:9]     # Object position (freejoint)
        obj_vel = self.data.qvel[6:9]
        rel_obj = obj_pos - ee_pos
        rel_goal = self._goal_pos - obj_pos
        
        return np.concatenate([
            joint_pos, joint_vel, ee_pos,
            obj_pos, obj_vel, rel_obj, rel_goal
        ])
    
    def _compute_reward(self, obs):
        """Tính reward dựa trên trạng thái hiện tại."""
        ee_pos = obs[8:11]
        obj_pos = obs[11:14]
        
        reach_dist = np.linalg.norm(ee_pos - obj_pos)
        goal_dist = np.linalg.norm(obj_pos - self._goal_pos)
        
        if self.reward_type == "sparse":
            return 1.0 if goal_dist < 0.05 else 0.0
        
        # Dense reward: phần thưởng liên tục
        reach_reward = 1.0 - np.tanh(5.0 * reach_dist)
        goal_reward = 1.0 - np.tanh(5.0 * goal_dist)
        
        # Bonus khi object ở gần goal
        success_bonus = 10.0 if goal_dist < 0.05 else 0.0
        
        return reach_reward + 2.0 * goal_reward + success_bonus
    
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        mujoco.mj_resetData(self.model, self.data)
        
        # Randomize object position trên bàn
        self.data.qpos[6] = 0.5 + self.np_random.uniform(-0.1, 0.1)
        self.data.qpos[7] = self.np_random.uniform(-0.1, 0.1)
        self.data.qpos[8] = 0.45
        
        # Randomize goal
        self._goal_pos = np.array([
            0.4 + self.np_random.uniform(-0.1, 0.1),
            0.2 + self.np_random.uniform(-0.1, 0.1),
            0.55 + self.np_random.uniform(-0.05, 0.05),
        ])
        
        mujoco.mj_forward(self.model, self.data)
        self.current_step = 0
        
        return self._get_obs(), {}
    
    def step(self, action):
        # Scale actions
        joint_delta = action[:4] * 0.05  # Max 0.05 rad per step
        gripper_cmd = (action[4] + 1) / 2 * 0.04  # [0, 0.04]
        
        # Apply actions
        self.data.ctrl[:4] = self.data.qpos[:4] + joint_delta
        self.data.ctrl[4] = gripper_cmd   # Left finger
        self.data.ctrl[5] = gripper_cmd   # Right finger
        
        # Simulate
        for _ in range(10):  # 10 substeps
            mujoco.mj_step(self.model, self.data)
        
        obs = self._get_obs()
        reward = self._compute_reward(obs)
        
        self.current_step += 1
        terminated = self.current_step >= self.max_steps
        truncated = False
        
        return obs, reward, terminated, truncated, {}

Huấn luyện với Stable-Baselines3

Với environment đã tạo, chúng ta có thể huấn luyện agent bằng PPO hoặc SAC:

from stable_baselines3 import PPO, SAC
from stable_baselines3.common.vec_env import SubprocVecEnv
from stable_baselines3.common.callbacks import EvalCallback

def make_env(rank, seed=0):
    def _init():
        env = SimpleManipulationEnv(reward_type="dense")
        env.reset(seed=seed + rank)
        return env
    return _init

# Tạo parallel environments
num_envs = 8
env = SubprocVecEnv([make_env(i) for i in range(num_envs)])

# Train PPO
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=256,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    verbose=1,
    tensorboard_log="./logs/"
)

# Callback để đánh giá
eval_env = SimpleManipulationEnv(reward_type="dense")
eval_callback = EvalCallback(
    eval_env, 
    best_model_save_path="./best_model/",
    eval_freq=10000,
    n_eval_episodes=20,
    deterministic=True
)

# Bắt đầu huấn luyện
model.learn(total_timesteps=1_000_000, callback=eval_callback)
model.save("manipulation_ppo")

So sánh Sparse vs Dense Reward trong thực nghiệm

Sau khi huấn luyện 1 triệu timesteps, đây là kết quả điển hình:

Metric	Sparse Reward (PPO)	Dense Reward (PPO)	Dense Reward (SAC)
Success Rate	~5%	~65%	~78%
Avg Return	0.12	145.3	187.6
Convergence	Không converge	~400K steps	~300K steps
Behavior quality	Random	Hợp lý	Mượt mà

Sparse reward gần như không thể học được với PPO thuần túy — bạn sẽ cần Hindsight Experience Replay (HER) để giải quyết, điều chúng ta sẽ thảo luận chi tiết trong bài 4 về Pick-and-Place.

Gymnasium Robotics: Environments có sẵn

Thay vì tự tạo environment từ đầu, bạn có thể dùng Gymnasium Robotics — bộ sưu tập environments manipulation chuẩn:

import gymnasium as gym
import gymnasium_robotics

# Fetch environments (7-DOF arm + gripper)
env = gym.make("FetchReach-v3")        # Đưa gripper đến mục tiêu
env = gym.make("FetchPush-v3")         # Đẩy vật đến vị trí
env = gym.make("FetchSlide-v3")        # Trượt vật trên mặt bàn
env = gym.make("FetchPickAndPlace-v3") # Nhặt và đặt

# Shadow Hand environments (24-DOF)
env = gym.make("HandReach-v2")         # Đưa ngón tay đến vị trí
env = gym.make("HandManipulateBlock-v2")  # Xoay khối

Mỗi environment cung cấp cả observation (state đầy đủ) và desired_goal / achieved_goal — cấu trúc lý tưởng cho HER.

Tài liệu tham khảo

Gymnasium Robotics: A Standard Interface for Reinforcement Learning in Robot Manipulation — Towers et al., 2024
MuJoCo: A physics engine for model-based control — Todorov et al., 2012
A Survey on Robot Manipulation — Cui et al., 2023

Tiếp theo trong Series

Bài tiếp theo — Grasping với RL: Stable Grasp và Object Variety — chúng ta sẽ áp dụng nền tảng đã học để huấn luyện robot nắm đồ vật với nhiều hình dạng và kích thước khác nhau. Chúng ta sẽ so sánh PPO vs SAC và thiết kế curriculum learning cho grasping.