The Robot Fleet Problem
Scaling from 5 to 50 to 500 robots makes manual SSH updates impossible. You need automated deployment, rollback, monitoring, and scaling — exactly what Kubernetes does.
But standard Kubernetes is too heavy for edge devices. K3s is a lightweight Kubernetes distribution: the server runs in roughly 512 MB of RAM and agents need even less, small enough for a Raspberry Pi or NVIDIA Jetson.
K3s Architecture for Robot Fleet
┌──────────────────────────────────────┐
│  Cloud Control Plane (K3s server)    │
│  - Deployment manifests              │
│  - ConfigMaps (robot config)         │
│  - Secrets (API keys)                │
└──────────────────────────────────────┘
                    │
      ┌─────────────┼─────────────┐
      v             v             v
┌───────────┐ ┌───────────┐ ┌───────────┐
│  Robot 1  │ │  Robot 2  │ │  Robot N  │
│(K3s agent)│ │(K3s agent)│ │(K3s agent)│
└───────────┘ └───────────┘ └───────────┘
Set Up K3s on a Robot
# Install K3s agent on robot
curl -sfL https://get.k3s.io | K3S_URL=https://control-plane:6443 \
K3S_TOKEN=mytoken sh -
# Verify installation (run kubectl on the control-plane node)
kubectl get nodes
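The manifests later in this article select robots by a `robot-id` node label, so each agent's node needs that label. One way to set it (a sketch; the label key, node name, and token are our conventions, not K3s requirements) is at install time via the K3s installer's `INSTALL_K3S_EXEC` variable and the agent's `--node-label` flag:

```shell
# Install the agent and label the node in one step, so Deployments
# can target this specific robot via nodeSelector
curl -sfL https://get.k3s.io | \
  K3S_URL=https://control-plane:6443 \
  K3S_TOKEN=mytoken \
  INSTALL_K3S_EXEC="agent --node-label robot-id=robot-001" \
  sh -

# Or label an already-joined node from the control plane
kubectl label node robot-001 robot-id=robot-001
```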
Deploy Robot Software via GitOps
Create a Deployment manifest for the robot services (in practice you would template this per robot with Helm or Kustomize):
# robot-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: robot-nav
  namespace: fleet
spec:
  replicas: 1
  selector:
    matchLabels:
      app: robot-navigation
  template:
    metadata:
      labels:
        app: robot-navigation
    spec:
      nodeSelector:
        robot-id: robot-001   # pin this Deployment to one robot's node
      containers:
      - name: nav
        image: ghcr.io/myorg/robot-nav:v1.2.3
        resources:
          limits:
            memory: "256Mi"
            cpu: "500m"
        securityContext:
          privileged: true    # needed for raw access to the serial device
        volumeMounts:
        - name: device-lidar
          mountPath: /dev/ttyUSB0
      - name: monitoring
        image: ghcr.io/myorg/robot-monitor:v1.0
        resources:
          limits:
            memory: "128Mi"
            cpu: "200m"
      volumes:
      - name: device-lidar
        hostPath:
          path: /dev/ttyUSB0
          type: CharDevice    # fail fast if the LiDAR is not plugged in
Apply the manifest; the control plane schedules the pod onto the matching node:
# The agent on the selected robot pulls the image and starts the pod
kubectl apply -f robot-deployment.yaml
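The manifest above pins one Deployment to one robot via `nodeSelector`, so covering a whole fleet means one manifest per robot. A minimal sketch of rendering those from a template (the robot IDs, image name, and template fields are illustrative; Helm or Kustomize overlays do the same job with less custom code):

```python
# Render one Deployment per robot from a string template, each pinned
# to its robot's node via the robot-id nodeSelector.

TEMPLATE = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: robot-nav-{robot_id}
  namespace: fleet
spec:
  replicas: 1
  selector:
    matchLabels:
      app: robot-navigation
      robot: {robot_id}
  template:
    metadata:
      labels:
        app: robot-navigation
        robot: {robot_id}
    spec:
      nodeSelector:
        robot-id: {robot_id}
      containers:
      - name: nav
        image: ghcr.io/myorg/robot-nav:{version}
"""

def render_manifests(robot_ids, version):
    """Return (robot_id, manifest) pairs, ready to write out and kubectl apply."""
    return [(rid, TEMPLATE.format(robot_id=rid, version=version))
            for rid in robot_ids]

if __name__ == "__main__":
    for rid, manifest in render_manifests(["robot-001", "robot-002"], "v1.2.3"):
        print(f"# --- {rid} ---")
        print(manifest)
```

Write each rendered manifest to a file under the GitOps repo, or pipe it straight to `kubectl apply -f -`.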
Rolling Update Without Downtime
# Update image
kubectl set image deployment/robot-nav \
nav=ghcr.io/myorg/robot-nav:v1.2.4 \
-n fleet
# Monitor rollout
kubectl rollout status deployment/robot-nav -n fleet
# Rollback if needed
kubectl rollout undo deployment/robot-nav -n fleet
Monitoring Fleet Health
# Monitor all robots: list every pod and report which robot (node) runs it
import json
import subprocess

result = subprocess.run(
    ['kubectl', 'get', 'pods', '-A', '-o', 'json'],
    capture_output=True, text=True, check=True,
)
pods = json.loads(result.stdout)
for pod in pods['items']:
    # The node name is the robot's hostname, so it identifies the robot
    robot_id = pod['spec'].get('nodeName', '<unscheduled>')
    status = pod['status']['phase']
    print(f"Robot {robot_id}: {pod['metadata']['name']} is {status}")
Prometheus + Grafana Monitoring
# Install monitoring stack
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--set prometheus.prometheusSpec.retention=7d
Custom Robot Metrics
from prometheus_client import Counter, Gauge, start_http_server

battery_level = Gauge('robot_battery_percent', 'Battery level', ['robot_id'])
mission_count = Counter('robot_missions_completed', 'Missions done', ['robot_id'])

# Expose /metrics for Prometheus to scrape (avoid 9090, Prometheus's own port)
start_http_server(8000)

# In the main loop
battery_level.labels(robot_id="amr-001").set(85.5)
mission_count.labels(robot_id="amr-001").inc()
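Once robots expose these metrics, the monitoring stack can alert on them. A sketch of a PrometheusRule for the kube-prometheus-stack installed above (the threshold, rule name, and severity label are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: robot-alerts
  namespace: monitoring
  labels:
    release: monitoring   # so the stack's default rule selector picks this up
spec:
  groups:
  - name: robot-fleet
    rules:
    - alert: RobotBatteryLow
      expr: robot_battery_percent < 20
      for: 5m               # ignore brief dips during charging handoff
      labels:
        severity: warning
      annotations:
        summary: "Battery below 20% on {{ $labels.robot_id }}"
```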
GitOps with FluxCD
GitOps makes Git the single source of truth:
# Install FluxCD
flux install
# Connect to Git repo
flux create source git robot-fleet \
--url=ssh://[email protected]/vnrobo/fleet-config \
--branch=main
# Auto-deploy when repo changes
flux create kustomization robot-apps \
--source=robot-fleet \
--path="./apps" \
--prune=true \
--interval=5m
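For reference, the `flux create kustomization` command above produces a Kustomization object roughly equivalent to the following (a sketch; the field values mirror the flags):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: robot-apps
  namespace: flux-system
spec:
  interval: 5m          # how often to reconcile cluster state with Git
  path: ./apps
  prune: true           # delete resources removed from the repo
  sourceRef:
    kind: GitRepository
    name: robot-fleet
```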
Workflow: Commit → GitHub → FluxCD → K3s applies → Robots updated
Handling Unstable Networks
Robots connect via WiFi, so the network can drop at any time:
- K3s agent auto-reconnect: the agent rejoins the control plane when the network recovers
- Tolerations: keep pods bound to a temporarily unreachable node instead of evicting them immediately
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300  # wait 5 minutes before evicting the pod
Comparison: Fleet Management Solutions
| Criterion | K3s + GitOps | Ansible | Balena |
|---|---|---|---|
| Auto-healing | Yes | No | Yes |
| Rolling update | Yes | Manual | Yes |
| Offline support | Good | No | Good |
| Learning curve | High | Medium | Low |
| Flexibility | Very high | High | Medium |
| Cost | Free | Free | Paid |
Best Practices
- Start small: 3-5 robots first, scale after comfortable
- WireGuard VPN: Between server and robots for security
- Private registry: Own container registry, avoid Docker Hub dependency
- Test rollbacks: Rehearse the rollback procedure as part of every deploy
- Resource limits: Set CPU/memory limits to prevent robot lockup
K3s + GitOps is the production way to manage robot fleets at scale. Combined with MQTT for telemetry, you have complete fleet infrastructure.