
Robotic Visual Instruction: A New Way to Command Robots with Hand-Drawn Sketches


The Problem with Natural Language in Robotics

Natural language has become the go-to medium for human-robot interaction, thanks to advances in large language models (LLMs). But it's not perfect. Describing precise spatial details, such as exact positions, directions, or distances, is often clunky and ambiguous. And in quiet environments like libraries or hospitals, talking to a robot just isn't practical.

Enter Robotic Visual Instruction (RoVI), a novel paradigm that replaces verbose language with simple, hand-drawn sketches. Think of it as giving a robot a visual to-do list: circles to highlight objects, arrows to show movement, and colors to sequence actions. It’s intuitive, precise, and—most importantly—silent.

How RoVI Works

RoVI breaks down tasks into basic visual primitives:

  • Arrows: Represent movement trajectories, with tails (starting points), shafts (waypoints), and heads (endpoints).
  • Circles: Mark interaction points, like where to grasp an object or press a button.
  • Colors: Define the order of operations (e.g., green for step one, blue for step two).

For example, to instruct a robot to "pick up the cup and move it left," you’d draw a circle around the cup (grasp point) and an arrow pointing left (movement path). No words needed.
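To make that concrete, here is a minimal Python sketch of how such an instruction might be encoded as primitives. The `Primitive` dataclass, field names, and color-to-step mapping are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical encoding of RoVI primitives; the structure and field names
# are illustrative assumptions, not the dataset's actual format.
@dataclass
class Primitive:
    kind: str                                                     # "circle" or "arrow"
    color: str                                                    # encodes step order
    points: list[tuple[int, int]] = field(default_factory=list)   # pixel coordinates

# "Pick up the cup and move it left":
# a circle marks the grasp point, an arrow gives the movement path (tail -> head).
instruction = [
    Primitive(kind="circle", color="green", points=[(412, 305)]),
    Primitive(kind="arrow",  color="green", points=[(412, 305), (180, 305)]),
]

# Colors map to execution order (assumed convention based on the article).
STEP_ORDER = {"green": 1, "blue": 2}
plan = sorted(instruction, key=lambda p: STEP_ORDER[p.color])
```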

The VIEW Pipeline: Turning Sketches into Actions

To make RoVI actionable, the researchers developed Visual Instruction Embodied Workflow (VIEW), a pipeline that translates 2D sketches into 3D robotic actions. Here's how it works (a rough code sketch follows the steps):

  1. VLM Interpretation: A vision-language model (VLM) analyzes the sketch and scene, generating a step-by-step plan in natural language and executable code (e.g., move() or grasp()).
  2. Keypoint Extraction: A YOLOv8-based module extracts spatial constraints (start/end points, waypoints) from the sketch.
  3. Action Execution: The robot follows the code, using the keypoints as guides for precise movement.
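The skeleton below sketches this three-stage flow in Python. The helper functions (`vlm_plan`, `extract_keypoints`, `execute`) and the `robot` interface are hypothetical stand-ins for the paper's actual VLM, YOLOv8-based keypoint module, and controller, which aren't shown here.

```python
# Minimal sketch of a VIEW-style pipeline, under assumed interfaces.

def vlm_plan(scene_image, sketch_image):
    """Step 1: a VLM turns the sketch + scene into a step-by-step plan
    expressed as executable calls such as grasp(...) or move(...)."""
    return [
        {"action": "grasp", "target": "cup"},
        {"action": "move", "target": "cup", "direction": "left"},
    ]

def extract_keypoints(sketch_image):
    """Step 2: a detector (YOLOv8-based in the paper) recovers start/end
    points and waypoints from the drawn arrows and circles."""
    return {"start": (412, 305), "waypoints": [], "end": (180, 305)}

def execute(scene_image, sketch_image, robot):
    """Step 3: run each planned action, constrained by the sketch keypoints."""
    plan = vlm_plan(scene_image, sketch_image)
    kps = extract_keypoints(sketch_image)
    for step in plan:
        if step["action"] == "grasp":
            robot.grasp(pixel=kps["start"])   # grasp at the circled point
        elif step["action"] == "move":
            robot.move(path=[kps["start"], *kps["waypoints"], kps["end"]])
```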

The system was trained on RoVI Book, a dataset of 15K annotated sketches paired with task descriptions and code snippets. Fine-tuning smaller VLMs (like LLaVA-7B) enabled edge deployment with minimal computational overhead.
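For a sense of what one of those training pairs might look like, here is a hypothetical record linking a sketch to its task description and code snippet; the field names and code format are assumptions, not RoVI Book's actual schema.

```python
# Hypothetical shape of a single RoVI Book training example.
example = {
    "sketch_image": "sketches/000123.png",   # hand-drawn overlay on the scene
    "scene_image": "scenes/000123.png",
    "task_description": "Pick up the cup and move it to the left.",
    "code": "grasp(target='cup')\nmove(target='cup', direction='left')",
}
```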

Real-World Performance

In tests across 11 novel tasks—including cluttered environments and multi-step operations—VIEW achieved an 87.5% success rate, outperforming language-based methods like VoxPoser and CoPa. Key advantages:

  • Spatial Precision: Unlike language, sketches provide pixel-level accuracy for trajectories and object interactions.
  • Generalization: Works in unseen environments with new objects.
  • Silent Operation: Ideal for noise-sensitive settings.

Why This Matters

RoVI isn’t just a quirky alternative to language—it solves real problems. By decoupling intent (the sketch) from execution (the robot’s actions), it reduces ambiguity and verbosity. It’s also more user-friendly than alternatives like goal images or full trajectories, which require users to imagine end states or entire motion paths.

The Future of Visual Robot Commands

The team plans to expand RoVI Book with more free-form sketches and optimize smaller models for edge devices. Imagine a factory worker quickly sketching instructions for a robot, or a nurse silently directing a hospital bot. The potential is huge.

For now, RoVI and VIEW offer a glimpse into a future where robots don’t just understand our words—they understand our drawings, too.