
RL Force Control: Balancing a Cup of Water

Train a robot to carry a cup of water without spilling using RL — reward design for force control, impedance baseline, and SAC training.

Nguyễn Anh Tuấn · March 18, 2026 · 10 min read

Imagine this: you pick up a full cup of coffee and walk from the kitchen to your desk. Your brain continuously adjusts grip force, wrist tilt angle, and walking speed — all subconsciously. Now, teach a robot to do the same. This is not just a grasping problem, but a force control problem — delicate force regulation so the object does not tilt, shake, or spill.

In the previous post — Grasping with RL — we learned how to grasp objects. Now, we level up: the robot must grasp and keep balanced a cup of water throughout its motion.

The Cup-of-Water Problem

Why is Force Control Hard?

Force control is fundamentally different from position control:

| Criterion | Position Control | Force Control |
| --- | --- | --- |
| Objective | Move end-effector to position X | Maintain force/torque at value Y |
| Feedback | Encoder (joint position) | Force/torque sensor |
| Sensitivity | Tolerant to a few mm | Sensitive to 0.1 N |
| Contact dynamics | Less important | Critically important |
| Stability | Easy to stabilize | Prone to oscillation |

When carrying a cup of water, the robot must simultaneously:

  1. Grip with just the right force — too strong crushes the cup, too weak drops it
  2. Keep the cup upright — tilt > 15 degrees = water spills
  3. Move smoothly — sudden acceleration = water sloshes out
  4. Minimize vibration — high jerk = water surface oscillates

Balancing a cup with precise control

Impedance Control Baseline

Before using RL, let us understand impedance control — the traditional method for force control. Impedance control models the robot as a spring-damper system:

$$F = K(x_{desired} - x) + D(\dot{x}_{desired} - \dot{x})$$

where $K$ is stiffness and $D$ is damping.
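To build intuition, here is a 1-D sketch of this law acting on a unit point mass. With $K = 100$ and $D = 20$ (the same defaults as the controller code in this post), the system is exactly critically damped for $M = 1$ kg, since $D = 2\sqrt{KM}$, so it settles at the target without overshoot:

```python
import numpy as np

# 1-D impedance law F = K(x_d - x) + D(xd_dot - x_dot) on a unit mass.
# K=100, D=20 gives critical damping for M=1 (D = 2*sqrt(K*M)).
K, D, M, dt = 100.0, 20.0, 1.0, 1e-3
x, v, x_d = 0.0, 0.0, 1.0          # start at 0, target at 1 m

overshoot = 0.0
for _ in range(2000):               # 2 s, semi-implicit Euler
    F = K * (x_d - x) + D * (0.0 - v)
    v += F / M * dt
    x += v * dt
    overshoot = max(overshoot, x - x_d)

print(round(x, 3), round(overshoot, 4))
```

Raising $K$ with $D$ fixed makes the response underdamped and the mass oscillates around the target — the "prone to oscillation" failure mode from the comparison above.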

import numpy as np

class ImpedanceController:
    """Variable Impedance Controller for cup balancing."""
    
    def __init__(self, kp_pos=100.0, kd_pos=20.0, 
                 kp_rot=50.0, kd_rot=10.0):
        self.kp_pos = kp_pos  # Position stiffness
        self.kd_pos = kd_pos  # Position damping
        self.kp_rot = kp_rot  # Rotation stiffness
        self.kd_rot = kd_rot  # Rotation damping
    
    def compute_action(self, current_pos, desired_pos,
                       current_vel, desired_vel,
                       current_rot, desired_rot,
                       current_angvel):
        """Compute required force/torque."""
        # Position error
        pos_error = desired_pos - current_pos
        vel_error = desired_vel - current_vel
        
        # Force command
        force = self.kp_pos * pos_error + self.kd_pos * vel_error
        
        # Rotation error (simplified - keep upright)
        rot_error = desired_rot - current_rot
        torque = self.kp_rot * rot_error - self.kd_rot * current_angvel
        
        return np.concatenate([force, torque])
    
    def compute_grasp_force(self, object_mass, tilt_angle):
        """Compute grasp force based on mass and tilt angle."""
        # Minimum grasp force to prevent slipping
        min_force = object_mass * 9.81 / (2 * 0.5)  # mu = 0.5
        
        # Increase force when tilted (compensate gravity component)
        safety_factor = 1.5 + 0.5 * abs(tilt_angle) / np.pi
        
        return min_force * safety_factor
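As a quick sanity check of `compute_grasp_force`, take a roughly 0.3 kg cup held upright, then tilted 30 degrees. These lines just re-trace the formula; the 0.5 friction coefficient is the same assumption as in the code, and 0.3 kg is an illustrative mass:

```python
import numpy as np

mass, mu = 0.3, 0.5                         # kg; assumed finger-cup friction
min_force = mass * 9.81 / (2 * mu)          # two-finger grip against gravity
upright = min_force * (1.5 + 0.5 * 0 / np.pi)              # tilt = 0
tilted = min_force * (1.5 + 0.5 * np.radians(30) / np.pi)  # tilt = 30 deg

print(round(min_force, 2), round(upright, 2), round(tilted, 2))
```

So the controller grips with a comfortable margin above the slip threshold and squeezes slightly harder as the cup tilts.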

Impedance control works well for simple trajectories, but its fixed gains cannot adapt to changing object properties (a new cup, a different liquid mass), sloshing dynamics, or external perturbations.

This is why we need RL.

Reward Design for Cup Balancing

Multi-Objective Reward

The reward function for cup balancing must balance multiple competing objectives:

class CupBalanceReward:
    """Reward function for cup-of-water balancing task."""
    
    def __init__(self):
        self.tilt_threshold = np.radians(15)   # Max 15 degrees
        self.spill_threshold = np.radians(30)  # Spilling point
        self.jerk_weight = 0.1
    
    def compute(self, cup_tilt, cup_angular_vel, ee_vel, 
                ee_accel, goal_dist, action, grasping):
        """
        Args:
            cup_tilt: Cup tilt angle from vertical (rad)
            cup_angular_vel: Cup angular velocity [3]
            ee_vel: End-effector velocity [3]
            ee_accel: End-effector acceleration [3]
            goal_dist: Distance to goal
            action: Action vector
            grasping: Bool, currently grasping
        """
        if not grasping:
            return -50.0, {'spill': True}  # Large penalty if dropped
        
        rewards = {}
        
        # 1. TILT PENALTY — Keep cup upright
        tilt_magnitude = abs(cup_tilt)
        if tilt_magnitude > self.spill_threshold:
            rewards['tilt'] = -20.0  # Water spilled!
            rewards['spill'] = True
        else:
            rewards['tilt'] = -5.0 * (tilt_magnitude / self.tilt_threshold) ** 2
            rewards['spill'] = False
        
        # 2. ANGULAR VELOCITY PENALTY — Reduce oscillation
        ang_vel_mag = np.linalg.norm(cup_angular_vel)
        rewards['angular_vel'] = -2.0 * np.tanh(3.0 * ang_vel_mag)
        
        # 3. JERK PENALTY — Smooth motion
        jerk = np.linalg.norm(ee_accel)
        rewards['jerk'] = -self.jerk_weight * np.tanh(jerk)
        
        # 4. PROGRESS REWARD — Move toward goal
        rewards['progress'] = 2.0 * (1.0 - np.tanh(3.0 * goal_dist))
        
        # 5. SPEED REWARD — Fast but not too fast
        speed = np.linalg.norm(ee_vel)
        if goal_dist > 0.1:
            rewards['speed'] = 0.5 * min(speed, 0.3) / 0.3
        else:
            rewards['speed'] = -1.0 * speed
        
        # 6. SUCCESS BONUS
        if goal_dist < 0.05 and tilt_magnitude < self.tilt_threshold:
            rewards['success'] = 20.0
        else:
            rewards['success'] = 0.0
        
        # 7. ACTION SMOOTHNESS
        rewards['action_smooth'] = -0.01 * np.sum(action ** 2)
        
        # Sum only the numeric terms; 'spill' is a bool flag, not a reward
        total = sum(v for k, v in rewards.items() if k != 'spill')
        return total, rewards

Analyzing the Trade-offs

This reward encodes several competing objectives: the progress and speed terms push the arm to move fast, while the tilt, angular-velocity, and jerk penalties push it to move carefully. The success bonus only fires when both goals are met — close to the goal *and* upright — so tuning the relative weights shifts the policy along this speed-versus-stability frontier.
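To see the tension concretely, plug two states into the tilt and speed terms (re-derived here standalone with the same constants as the reward class):

```python
import numpy as np

tilt_threshold = np.radians(15)

def tilt_penalty(tilt):
    # Term 1 of the reward: quadratic penalty inside the threshold
    return -5.0 * (abs(tilt) / tilt_threshold) ** 2

def speed_reward(speed, goal_dist):
    # Term 5: reward speed far from the goal, penalize it near the goal
    return 0.5 * min(speed, 0.3) / 0.3 if goal_dist > 0.1 else -1.0 * speed

# Moving fast far from the goal tilts the cup more:
fast = tilt_penalty(np.radians(12)) + speed_reward(0.3, 0.5)  # 12° tilt
slow = tilt_penalty(np.radians(3)) + speed_reward(0.1, 0.5)   # 3° tilt
print(round(fast, 3), round(slow, 3))
```

In this pair the slow, upright state scores higher; it is the progress term that counterbalances caution, so the weights decide where the policy lands between the two.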

MuJoCo Environment: Cup with Liquid Approximation

MuJoCo does not support fluid simulation directly, but we can approximate it using rigid body dynamics:

import mujoco
import numpy as np

CUP_BALANCE_XML = """
<mujoco model="cup_balance">
  <option timestep="0.002" gravity="0 0 -9.81"/>
  
  <worldbody>
    <light pos="0 0 3" dir="0 0 -1"/>
    <geom type="plane" size="2 2 0.1" rgba="0.9 0.9 0.9 1"/>
    
    <!-- Table -->
    <body name="table" pos="0.5 0 0.4">
      <geom type="box" size="0.6 0.6 0.02" rgba="0.6 0.4 0.2 1" mass="100"/>
    </body>
    
    <!-- Robot arm (simplified 5-DOF) -->
    <body name="base" pos="0 0 0.42">
      <joint name="j0" type="hinge" axis="0 0 1" range="-3.14 3.14" damping="2"/>
      <geom type="cylinder" size="0.05 0.04" rgba="0.3 0.3 0.3 1"/>
      
      <body name="l1" pos="0 0 0.08">
        <joint name="j1" type="hinge" axis="0 1 0" range="-1.57 1.57" damping="2"/>
        <geom type="capsule" fromto="0 0 0 0.3 0 0" size="0.035" rgba="0.7 0.7 0.7 1"/>
        
        <body name="l2" pos="0.3 0 0">
          <joint name="j2" type="hinge" axis="0 1 0" range="-2.5 2.5" damping="1.5"/>
          <geom type="capsule" fromto="0 0 0 0.25 0 0" size="0.03" rgba="0.7 0.7 0.7 1"/>
          
          <body name="l3" pos="0.25 0 0">
            <joint name="j3" type="hinge" axis="0 0 1" range="-3.14 3.14" damping="1"/>
            <geom type="capsule" fromto="0 0 0 0.1 0 0" size="0.025" rgba="0.5 0.5 0.5 1"/>
            
            <body name="wrist" pos="0.1 0 0">
              <joint name="j4" type="hinge" axis="1 0 0" range="-1.57 1.57" damping="1"/>
              <site name="ee" pos="0 0 0" size="0.01"/>
              
              <!-- Gripper fingers -->
              <body name="fl" pos="0 0.025 0">
                <joint name="jfl" type="slide" axis="0 1 0" range="0 0.035" damping="5"/>
                <geom type="box" size="0.008 0.004 0.04" rgba="0.8 0.2 0.2 1"
                      contype="1" conaffinity="1" friction="2 0.5 0.01"/>
              </body>
              <body name="fr" pos="0 -0.025 0">
                <joint name="jfr" type="slide" axis="0 -1 0" range="0 0.035" damping="5"/>
                <geom type="box" size="0.008 0.004 0.04" rgba="0.8 0.2 0.2 1"
                      contype="1" conaffinity="1" friction="2 0.5 0.01"/>
              </body>
            </body>
          </body>
        </body>
      </body>
    </body>
    
    <!-- Cup -->
    <body name="cup" pos="0.45 0 0.44">
      <freejoint name="cup_free"/>
      <site name="cup_top" pos="0 0 0.06" size="0.005"/>
      
      <!-- Cup walls (hollow cylinder approximation) -->
      <geom name="cup_bottom" type="cylinder" size="0.03 0.003" pos="0 0 0" 
            rgba="0.9 0.9 1 0.8" mass="0.05" contype="1" conaffinity="1"/>
      <geom name="cup_wall1" type="box" size="0.003 0.03 0.03" pos="0.03 0 0.03"
            rgba="0.9 0.9 1 0.8" mass="0.01"/>
      <geom name="cup_wall2" type="box" size="0.003 0.03 0.03" pos="-0.03 0 0.03"
            rgba="0.9 0.9 1 0.8" mass="0.01"/>
      <geom name="cup_wall3" type="box" size="0.03 0.003 0.03" pos="0 0.03 0.03"
            rgba="0.9 0.9 1 0.8" mass="0.01"/>
      <geom name="cup_wall4" type="box" size="0.03 0.003 0.03" pos="0 -0.03 0.03"
            rgba="0.9 0.9 1 0.8" mass="0.01"/>
      
      <!-- Liquid approximation: ball inside cup -->
      <body name="liquid" pos="0 0 0.02">
        <joint name="liquid_x" type="slide" axis="1 0 0" range="-0.02 0.02" damping="5"/>
        <joint name="liquid_y" type="slide" axis="0 1 0" range="-0.02 0.02" damping="5"/>
        <geom name="liquid_ball" type="sphere" size="0.02" rgba="0.2 0.5 1 0.6" 
              mass="0.2" contype="0" conaffinity="0"/>
      </body>
    </body>
    
    <!-- Goal position -->
    <body name="goal" pos="0.5 0.3 0.55">
      <geom type="sphere" size="0.03" rgba="0 1 0 0.3" contype="0" conaffinity="0"/>
      <site name="goal_site" pos="0 0 0" size="0.01"/>
    </body>
  </worldbody>
  
  <actuator>
    <position name="a0" joint="j0" kp="200"/>
    <position name="a1" joint="j1" kp="200"/>
    <position name="a2" joint="j2" kp="200"/>
    <position name="a3" joint="j3" kp="100"/>
    <position name="a4" joint="j4" kp="100"/>
    <position name="afl" joint="jfl" kp="80"/>
    <position name="afr" joint="jfr" kp="80"/>
  </actuator>
</mujoco>
"""


class CupBalanceEnv:
    """Environment for cup balancing task."""
    
    def __init__(self):
        self.model = mujoco.MjModel.from_xml_string(CUP_BALANCE_XML)
        self.data = mujoco.MjData(self.model)
        self.reward_fn = CupBalanceReward()
        self.max_steps = 300
        self.goal_pos = np.array([0.5, 0.3, 0.55])
        self.prev_ee_vel = np.zeros(3)
        self.prev_ee_pos = np.zeros(3)
        
    def get_cup_tilt(self):
        """Compute cup tilt angle from vertical."""
        # qpos layout: 7 robot joints, then the cup freejoint (3 pos + 4 quat)
        cup_quat = self.data.qpos[10:14]
        rot = np.zeros(9)
        mujoco.mju_quat2Mat(rot, cup_quat)
        rot = rot.reshape(3, 3)
        cup_up = rot[:, 2]
        cos_angle = cup_up[2]
        tilt = np.arccos(np.clip(cos_angle, -1, 1))
        return tilt
    
    def get_liquid_offset(self):
        """Get relative position of liquid ball."""
        # liquid slide joints come after the 7 robot joints + 7 freejoint qpos
        liq_x = self.data.qpos[14]
        liq_y = self.data.qpos[15]
        return np.array([liq_x, liq_y])
    
    def step(self, action):
        joint_delta = action[:5] * 0.03  # Smaller for smoothness
        gripper = (action[5] + 1) / 2 * 0.035
        
        self.data.ctrl[:5] = self.data.qpos[:5] + joint_delta
        self.data.ctrl[5] = gripper
        self.data.ctrl[6] = gripper
        
        for _ in range(10):
            mujoco.mj_step(self.model, self.data)
        
        ee_pos = self.data.site('ee').xpos.copy()
        ee_vel = (ee_pos - self.prev_ee_pos) / (0.002 * 10)
        ee_accel = (ee_vel - self.prev_ee_vel) / (0.002 * 10)
        
        cup_tilt = self.get_cup_tilt()
        cup_angular_vel = self.data.qvel[10:13]
        goal_dist = np.linalg.norm(ee_pos - self.goal_pos)
        liquid_offset = self.get_liquid_offset()
        
        grasping = self._check_grasp()
        
        reward, info = self.reward_fn.compute(
            cup_tilt, cup_angular_vel, ee_vel,
            ee_accel, goal_dist, action, grasping
        )
        
        self.prev_ee_vel = ee_vel.copy()
        self.prev_ee_pos = ee_pos.copy()
        
        return self._get_obs(), reward, info.get('spill', False), info
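`get_cup_tilt` relies on MuJoCo's (w, x, y, z) quaternion convention. The same math can be checked standalone with a hand-written conversion (this mirrors, not calls, `mju_quat2Mat` — only the third matrix column is needed):

```python
import numpy as np

def tilt_from_quat(q):
    """Angle between the body z-axis and world vertical; quat is (w, x, y, z)."""
    w, x, y, z = q
    # R[2,2], i.e. the z-component of the body z-axis in the world frame
    up_z = 1.0 - 2.0 * (x * x + y * y)
    return np.arccos(np.clip(up_z, -1.0, 1.0))

print(tilt_from_quat([1, 0, 0, 0]))                              # upright
print(tilt_from_quat([np.cos(np.pi/4), np.sin(np.pi/4), 0, 0]))  # 90° about x
```

The identity quaternion gives zero tilt, and a 90-degree roll gives π/2 — useful unit tests before trusting the tilt penalty during training.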

Training with SAC

from stable_baselines3 import SAC

# cup_env: a CupBalanceEnv wrapped to expose the gymnasium.Env interface
model = SAC(
    "MlpPolicy",
    cup_env,
    learning_rate=1e-4,       # Lower LR for stability
    buffer_size=500_000,
    batch_size=512,
    tau=0.001,                # Slow target update
    gamma=0.995,              # Higher than usual
    train_freq=2,
    gradient_steps=2,
    ent_coef="auto",
    target_entropy="auto",
    verbose=1,
    policy_kwargs=dict(
        net_arch=[256, 256, 128],  # Larger network
    )
)

model.learn(total_timesteps=3_000_000)

RL vs Impedance Control Comparison

| Metric | Impedance Control | SAC (RL) |
| --- | --- | --- |
| Max tilt (avg) | 12.3° | 6.8° |
| Spill rate | 18% | 4% |
| Avg travel time | 8.2 s | 5.1 s |
| Jerk (smoothness) | 15.6 | 8.3 |
| Adapts to new cups | Requires retuning | Self-adapts |
| Adapts to perturbation | Poor | Good |

RL clearly outperforms, especially in adaptability: the learned policy slows down before turning, tilts the cup slightly to compensate for centrifugal force, and reacts quickly to perturbations.

Smooth trajectory comparison

Advanced Technique: Variable Impedance RL

A powerful approach is to combine impedance control with RL — the RL policy does not directly control joint commands, but instead controls impedance parameters ($K$, $D$):

class VariableImpedancePolicy:
    """RL policy outputs impedance parameters instead of raw commands."""
    
    def __init__(self, base_controller):
        self.controller = base_controller
    
    def act(self, obs, rl_output):
        """
        Args:
            obs: current end-effector state; assumed here to be a dict with
                 'ee_pos', 'ee_vel', 'ee_rot', 'ee_angvel'
            rl_output: [log_kp_x, log_kp_y, log_kp_z,
                        log_kd_x, log_kd_y, log_kd_z,
                        offset_x, offset_y, offset_z]
        """
        # RL selects stiffness and damping in log space;
        # for rl_output in [-2.3, 2.3]: kp in ~[5, 500], kd in ~[0.5, 50]
        kp = np.exp(rl_output[:3]) * 50
        kd = np.exp(rl_output[3:6]) * 5
        
        # RL selects a small offset around the nominal trajectory
        desired_offset = rl_output[6:9] * 0.02  # max 2 cm
        
        # Per-axis gains, applied elementwise to the error vectors
        self.controller.kp_pos = kp
        self.controller.kd_pos = kd
        
        current_desired = self.get_trajectory_point() + desired_offset
        
        return self.controller.compute_action(
            obs['ee_pos'], current_desired,
            obs['ee_vel'], np.zeros(3),
            obs['ee_rot'], np.array([0, 0, 1]),
            obs['ee_angvel']
        )

This approach has a major advantage for sim-to-real transfer — the impedance controller provides safety bounds while RL provides adaptability. For details on sim-to-real for force control, see Domain Randomization.

References

  1. Variable Impedance Control in End-Effector Space: An Action Space for Reinforcement Learning in Contact-Rich Tasks — Martín-Martín et al., 2019
  2. Learning Variable Impedance Control — Buchli et al., 2011
  3. Reinforcement Learning for Contact-Rich Manipulation — Survey, 2023

Next in the Series

Next up — Precision Pick-and-Place: Position & Orientation Control — we tackle placing objects with sub-cm accuracy, including orientation alignment. Hindsight Experience Replay (HER) will be the star.
