Research · Policy · 15 min read

VLAs do not need more parameters — they need contact

TL;DR

VLAs hit a ceiling on contact-rich tasks because vision is the wrong sensing modality for contact. No amount of visual data or model scaling closes this gap. Force-aware demonstrations do.

01

The scaling assumption in robot learning

The NLP community’s scaling laws [1] suggested a seductive hypothesis: with enough data and parameters, language models would acquire capabilities not explicitly trained for. The robotics community has largely adopted this hypothesis wholesale. RT-2 [2], Pi-0 [3], and OpenVLA [4] all follow the same template: pretrain a large vision-language model on internet-scale data, then fine-tune on robot demonstrations. The implicit bet is that scale solves the generalization problem.

On manipulation tasks that do not require contact force — pick-and-place, drawer opening, object sorting — this bet largely pays off. Pi-0 achieves 68% zero-shot success on novel objects in a standard tabletop pick-and-place suite. OpenVLA reaches similar numbers. These are genuinely impressive results that were not achievable five years ago.

But the tasks these models struggle with are revealing. Peg-in-hole insertion with less than 1mm clearance: under 10% success. Screwing a cap onto a bottle: under 15%. Picking up a deformable bag by handles: 22%. These are not exotic tasks. They are the tasks that define useful robot capability in any industrial or household setting.

The hypothesis under test

If VLA failure on contact-rich tasks is a data and scale problem, then 10× more demonstrations should substantially improve performance. If it is a modality problem, 10× more vision-only demonstrations will not help — but a small number of force-aware demonstrations will.

02

An information-theoretic argument

Consider the information required to determine whether a robot gripper has made contact with a surface. From vision, the relevant signal is the spatial relationship between gripper and surface pixels — but at typical camera resolutions (1080p at 30fps), a 1mm surface contact can be invisible if the camera is more than 30cm away. Even if the contact is visible, determining whether contact has been made requires resolving sub-pixel displacements in the image, which are lost to JPEG compression and camera noise.

From a force-torque sensor, contact is unmistakable: a step function in force readings with sub-millisecond latency and millinewton-scale sensitivity. The information content of a contact event in force space is qualitatively different from its representation in image space.

More formally: let C be the contact state (binary: contact or not), V be the visual observation, and F be the force observation. We claim that I(C; F) ≫ I(C; V) for typical manipulation tasks. A policy must, at some point, condition its behavior on C. A vision-only policy must infer C from V — which, as argued, is a low-mutual-information estimation problem. A force-aware policy conditions directly on F. No amount of additional visual demonstrations increases I(C; V) — it is a physical property of the observation channel.
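To make this concrete, here is a minimal numerical sketch of the gap between I(C; F) and I(C; V) under a toy observation model. The constants are illustrative assumptions (a 2 N contact step against roughly 10 mN of sensor noise, versus a 0.2-pixel image displacement against roughly 1 pixel of camera and compression noise), not measured values from our setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy observation model (illustrative constants, not measured values):
# contact adds a 2 N step to the force reading against ~10 mN sensor noise;
# the same event shifts the gripper-surface edge by ~0.2 px against ~1 px of
# camera and compression noise.
contact = rng.integers(0, 2, n)                      # C: binary contact state
force   = 2.0 * contact + rng.normal(0.0, 0.01, n)   # F: force reading [N]
vision  = 0.2 * contact + rng.normal(0.0, 1.0, n)    # V: apparent displacement [px]

def mutual_info_bits(c, x, bins=64):
    """Histogram (plug-in) estimate of I(C; X) in bits."""
    joint, _, _ = np.histogram2d(c, x, bins=(2, bins))
    p = joint / joint.sum()
    pc = p.sum(axis=1, keepdims=True)   # marginal over C
    px = p.sum(axis=0, keepdims=True)   # marginal over X
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (pc @ px)[nz])))

print(f"I(C; F) = {mutual_info_bits(contact, force):.3f} bits")   # ~1.0 bit, the ceiling H(C)
print(f"I(C; V) = {mutual_info_bits(contact, vision):.3f} bits")  # close to zero
```

Under this toy model the force channel recovers essentially the full bit of contact information while the visual channel recovers almost none; collecting more episodes changes neither number, only the policy's ability to exploit what the channel already carries.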

[Fig. 1 panels: force signal (F/T sensor, 1kHz) showing a clear step at contact; visual signal (camera, 30fps) remaining ambiguous.]
Fig. 1 — Schematic comparison of information available from vision vs. force sensing at a contact event. The contact onset is visible in force data as a clear step; in visual data it is buried in noise at typical camera resolutions. Temporal resolution also differs by roughly 30×.
03

Experimental design

We fine-tuned Pi-0 (3B parameters) and OpenVLA (7B parameters) on four benchmark tasks: (1) 0.5mm tolerance peg-in-hole, (2) bolt tightening to 2 Nm torque, (3) deformable bag manipulation (pick by handles), and (4) precision screwdriving. Tasks were chosen to require progressively more contact information: peg-in-hole is force-detectable on insertion, bolt tightening requires continuous torque monitoring, bag manipulation requires detecting handle engagement, and screwdriving requires detecting cross-threading vs. proper threading.

We collected demonstrations in three conditions: (A) vision-only (RGB cameras at 30fps), (B) force-aware (vision + 6-axis F/T at 1kHz), and (C) 10× vision-only (same as A but with 10× as many demonstrations). All demonstrations were performed by the same set of human teleoperators. Force data was recorded alongside video during collection but was excluded from the condition A training set even when available.

For fine-tuning, we used the models’ official fine-tuning recipes with a single modification for condition B: force-torque readings were tokenized using a simple scalar quantization and appended to the observation token sequence, following prior work on multi-modal robot learning [5]. No architectural changes to the base VLA were made.
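As a sketch of what that modification looks like in code, the snippet below is one plausible reading of the recipe; the clipping range, bin count, and vocabulary offset are assumptions chosen for illustration rather than values from the actual fine-tuning configuration.

```python
import numpy as np

N_BINS = 256                                             # bins per wrench component (assumed)
FT_RANGE = np.array([50.0, 50.0, 50.0, 5.0, 5.0, 5.0])   # clip range: ±N for forces, ±Nm for torques (assumed)
FT_VOCAB_OFFSET = 32_000                                  # start of reserved F/T token ids (assumed)

def ft_to_tokens(wrench: np.ndarray) -> np.ndarray:
    """Scalar-quantize a 6-axis reading [Fx, Fy, Fz, Tx, Ty, Tz] into 6 token ids."""
    normalized = np.clip(wrench / FT_RANGE, -1.0, 1.0)        # -> [-1, 1]
    bins = np.round((normalized + 1.0) / 2.0 * (N_BINS - 1))  # -> {0, ..., N_BINS - 1}
    return (FT_VOCAB_OFFSET + bins).astype(np.int64)

# The resulting tokens are simply appended to the existing observation sequence.
obs_tokens = np.array([101, 57, 892])                      # placeholder image/proprioception tokens
wrench = np.array([1.2, -0.4, 8.7, 0.05, -0.01, 0.30])     # a reading near contact onset [N, Nm]
obs_tokens = np.concatenate([obs_tokens, ft_to_tokens(wrench)])
```

Because the 1kHz force stream is much faster than the policy's control rate, the readings would in practice be subsampled or pooled to one reading per observation step; that choice is left out of the sketch.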

04

Results and the data scaling wall

Fine-tuning data                   Peg    Bolt   Bag    Screw   Avg
Zero-shot (no fine-tuning)          7%     4%    19%     3%      8%
100 demos, vision-only             22%    18%    41%    14%     24%
1,000 demos, vision-only (10×)     31%    24%    49%    20%     31%
100 demos, force-aware             73%    68%    81%    63%     71%
Table 2 — Task success rates across fine-tuning conditions. Each condition was evaluated over 200 trials per task on held-out robot hardware. 'Force-aware' means the 6-axis F/T signal was tokenized into the observation sequence.

The results are stark. Scaling vision-only data by 10× improves average success from 24% to 31% — a 29% relative improvement at 10× the data collection cost. Force-aware demonstrations at the original count (100) achieve 71% — roughly 2.3× the 10× vision-only baseline, with a tenth of its data. The marginal return on vision-only data is clearly diminishing; from 100 to 1,000 demonstrations, the gain is 7 percentage points. The implied scaling curve suggests 10,000 vision-only demonstrations would reach approximately 40–45% — still well short of force-aware performance.
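For readers who want to check the arithmetic, the short sketch below recomputes the relative gains from Table 2 and runs a naive log-linear extrapolation of vision-only success against demonstration count; the extrapolation is a back-of-the-envelope fit through two points, not a measured scaling law.

```python
# Average success rates from Table 2 (percent).
zero_shot, vis_100, vis_1000, force_100 = 8, 24, 31, 71

print(f"10x more vision-only data: {vis_1000 / vis_100:.2f}x ({vis_1000 - vis_100} points)")
print(f"force-aware (100) vs vision-only (1,000): {force_100 / vis_1000:.2f}x")  # ~2.3x
print(f"force-aware (100) vs vision-only (100):   {force_100 / vis_100:.2f}x")   # ~3.0x

# Naive log-linear fit: +7 points per decade of vision-only data, which lands
# near 38% at 10,000 demos -- the low end of the ballpark quoted above, and
# still far below the force-aware result.
points_per_decade = vis_1000 - vis_100       # log10(1000) - log10(100) = 1 decade
est_10k = vis_1000 + points_per_decade
print(f"extrapolated vision-only success at 10,000 demos: ~{est_10k}% (vs. {force_100}% force-aware)")
```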

In numbers: 100 force-aware demonstrations reach 71% average success, versus 31% for 1,000 vision-only demonstrations; that is 2.3× the success rate from a tenth of the data.
05

What this means for architecture and pretraining

This result does not argue against VLAs — it argues for a specific architectural change. The transformer architecture underlying these models has no principled reason to exclude force-torque observations; they are just another token sequence. The intervention required is: (1) collect force data during pretraining demonstrations, and (2) include force tokens in the observation sequence. Neither requires fundamental architectural innovation.

The harder problem is dataset scale. The internet contains essentially zero labeled force-torque data. Force-aware pretraining requires physical robot demonstrations with instrumented end-effectors — expensive to collect, hard to scale. This creates a tension: the pretraining advantage of VLAs comes from internet-scale visual data, which is inherently force-free. The fine-tuning advantage of force-awareness comes from physical data, which is expensive to collect.

The optimal architecture likely involves a two-stage process: visual pretraining at internet scale for object recognition and geometric reasoning, followed by force-aware fine-tuning on physical demonstrations. The visual foundation handles “what to do”; the force-aware stage handles “how hard to push.”
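One way to picture that recipe is as a configuration sketch rather than a tested implementation; the stage names, data mixes, and choice of what to freeze below are illustrative assumptions, not the setup used in our experiments.

```python
# Illustrative two-stage recipe; every field is an assumption for the sake of
# the sketch, not a configuration we have validated end to end.
stage_1_visual_pretraining = {
    "data": "internet-scale image-text pairs + vision-only robot episodes",
    "observation_tokens": ["image", "language", "proprioception"],
    "objective": "standard VLA action prediction",
    "trainable": "all parameters",  # learns 'what to do': objects, geometry, task structure
}

stage_2_force_aware_finetuning = {
    "data": "instrumented teleop demonstrations (RGB + 6-axis F/T at 1kHz)",
    "observation_tokens": ["image", "language", "proprioception", "force_torque"],
    "objective": "same action head, force tokens appended to the observation sequence",
    # Keeping most of the visual backbone frozen preserves the pretrained
    # 'what to do' prior while the new force tokens teach 'how hard to push'.
    "trainable": "force-token embeddings, late transformer blocks, action head",
}
```

How much of the backbone to unfreeze in the second stage is an open question; our condition B results used the models' standard fine-tuning recipes rather than this staged freezing.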

A useful analogy from neuroscience: human infants develop visual object permanence well before developing precise fingertip force control. The developmental sequence matters — force control is calibrated against a stable visual world model. The pretraining-then-force-tuning approach mirrors this developmental trajectory.

06

A note on tactile sensing

F/T sensing is a wrist-mounted, 6-axis measurement — it tells you the total wrench on the end-effector but not the distribution of forces across individual fingertips. Tactile sensor arrays (e.g., GelSight [6], Digit [7]) provide per-contact-point force distribution at high spatial resolution. Our experiments used F/T sensing because it is instrumentally simpler and more widely available.

We ran a small supplementary experiment (n=50 per condition) comparing F/T vs. tactile sensing for screwdriving, the task we expected to benefit most from fingertip-level force distribution. Tactile sensing improved screwdriving success from 63% (F/T-only) to 71% — a meaningful but not dramatic gain. This suggests F/T captures most of the force information needed for these tasks, and that tactile sensing provides an incremental improvement rather than a qualitative leap. We expect this ratio to shift for tasks requiring fine in-hand manipulation, where fingertip slip detection is critical.

07

References

  [1] Kaplan, J. et al. (2020). Scaling laws for neural language models. arXiv preprint.
  [2] Brohan, A. et al. (2023). RT-2: Vision-language-action models transfer web knowledge to robotic control. CoRL 2023.
  [3] Black, K. et al. (2024). π0: A vision-language-action flow model for general robot control. arXiv preprint.
  [4] Kim, M. J. et al. (2024). OpenVLA: An open-source vision-language-action model. CoRL 2024.
  [5] Shi, L. et al. (2024). Yell at your robot: Improving on-the-fly from language corrections. RSS 2024.
  [6] Yuan, W., Dong, S., and Adelson, E. H. (2017). GelSight: High-resolution robot tactile sensors for estimating geometry and force. Sensors, 17(12).
  [7] Lambeta, M. et al. (2020). DIGIT: A novel design for a low-cost compact high-resolution tactile sensor with application to in-hand manipulation. IEEE RA-L, 2020.