
RealWebAssist: The Future of AI Web Assistance Just Got Real


The dream of AI assistants that can seamlessly help us navigate the web just took a big step forward—and hit some major roadblocks. A new benchmark called RealWebAssist, introduced in a recent arXiv paper, is the first to test how well AI can follow real-world users’ instructions across long, complex web sessions. And the results? Let’s just say we’re not quite living in the age of Jarvis yet.

The Problem with Current AI Web Assistants

Most existing benchmarks for web-based AI agents focus on single, clearly defined tasks—think "book a flight to New York" or "find the cheapest laptop on Amazon." But in the real world, our interactions with the web are messy, ambiguous, and full of context-dependent instructions. We change our minds, refer back to previous steps, and expect assistants to just "get" what we mean without spelling it out.

Enter RealWebAssist, a new benchmark developed by researchers from Johns Hopkins University and Amazon. Unlike previous tests, it uses instructions collected from real users performing actual web tasks—like shopping for gifts, planning trips, or booking concert tickets—over sessions lasting up to 40 minutes. The dataset includes 1,885 instructions across 66 different websites, from Amazon to Ticketmaster to Google Maps.

Why RealWebAssist Matters

The key innovation here is realism. Real users don’t say, "Click the ‘Add to Cart’ button for the third product in the search results." They say things like, "Get the cheapest one" or "Go back to that other laptop we saw earlier." These instructions require the AI to:

  1. Understand spatial context: Knowing which item "the one on the right" points to in the current page layout.
  2. Track temporal context: Remembering what "the other laptop" was from three steps ago.
  3. Plan multi-step actions: Translating "compare prices on DoorDash" into a sequence of clicks.
  4. Learn user routines: Picking up, after a few repetitions, on how a particular user prefers to do things, such as their usual way of booking flights.
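To make the temporal part of this concrete, here is a minimal, hypothetical sketch of the kind of session state an assistant would need in order to resolve a back-reference like "that other laptop we saw earlier." This is not code from the paper; every class, field, and function name below is an illustrative assumption.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: the session state an assistant might keep so it can
# resolve back-references such as "that other laptop we saw earlier."
# Names and fields are illustrative, not taken from the RealWebAssist paper.

@dataclass
class Step:
    instruction: str          # what the user said, e.g. "get the cheapest one"
    screenshot_path: str      # GUI state the instruction referred to
    action: str               # what the agent did, e.g. "click(412, 318)"
    items_on_screen: list[str] = field(default_factory=list)

@dataclass
class Session:
    steps: list[Step] = field(default_factory=list)

    def resolve_reference(self, phrase: str) -> str | None:
        """Naive temporal lookup: scan earlier steps for an item the user
        might mean by a back-reference such as 'that other laptop'."""
        keyword = phrase.split()[-1]            # e.g. "laptop"
        for step in reversed(self.steps[:-1]):  # skip the current step
            for item in step.items_on_screen:
                if keyword.lower() in item.lower():
                    return item
        return None
```

Even this toy version only handles simple back-references; real instructions also need spatial grounding against the rendered page, which is exactly what the grounding models discussed below attempt.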

The Results: AI Still Struggles

The researchers tested state-of-the-art models, including GPT-4o, Gemini 2.0, and Claude 3.7 Sonnet, paired with GUI grounding tools like UGround-V1. The results were… not great. The best-performing model (Claude 3.7 Sonnet + UGround-V1) managed a 12.1% task success rate. That means it could complete a full sequence of user instructions without errors just over one in ten times. Even the average progress—how far the AI got before its first mistake—was only 25%.
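For a rough feel of how such a pipeline fits together, here is a sketch with placeholder functions standing in for the real models. The paper pairs VLMs/LRMs (e.g., Claude 3.7 Sonnet) with grounding models like UGround-V1, but none of the function names, signatures, or the metric definition below come from the paper; they are assumptions for illustration.

```python
# Sketch of a two-stage instruction-following step: a reasoning model makes
# the user's utterance explicit, then a grounding model maps it to a click.
# All functions are placeholders, not real model APIs.

def rewrite_instruction(raw_instruction: str, history: list[str]) -> str:
    """Stand-in for the reasoning model: turn an ambiguous utterance
    ('get the cheapest one') into an explicit target description,
    using recent session history for temporal context."""
    context = " | ".join(history[-5:])
    return f"Target element for: '{raw_instruction}' given context: {context}"

def ground_to_click(target_description: str, screenshot: bytes) -> tuple[int, int]:
    """Stand-in for the GUI grounding model: map a target description plus
    a screenshot to pixel coordinates. Returns dummy coordinates here."""
    return (0, 0)

def step(raw_instruction: str, history: list[str], screenshot: bytes) -> tuple[int, int]:
    explicit = rewrite_instruction(raw_instruction, history)
    x, y = ground_to_click(explicit, screenshot)
    history.append(raw_instruction)
    return (x, y)

def average_progress(per_step_correct: list[bool]) -> float:
    """Assumed metric: fraction of instructions completed before the first
    error. The paper's exact formulation may differ."""
    for i, ok in enumerate(per_step_correct):
        if not ok:
            return i / len(per_step_correct)
    return 1.0
```

Read this way, the 25% average progress figure means that in a typical session the agent gets through roughly a quarter of the user's instructions before its first wrong action.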

Some key failure modes:

  • Grounding models (which translate instructions into clicks) often misinterpret spatial references, like clicking the wrong "cheapest" item.
  • VLMs/LRMs (which rewrite instructions for clarity) sometimes fail at temporal reasoning, like misunderstanding which "first tab" a user meant.
  • Long-context learning didn’t help much. Adding more past steps as context actually hurt performance, suggesting models aren’t great at distilling useful routines.

The Path Forward

Finetuning on real user data did help, boosting performance by up to 22.8% in average progress. But the bigger takeaway is that today’s AI still lacks the nuanced understanding of human intent needed for true web assistance. As the authors note, "Real-world user instructions are not just about grounding clicks—they’re about reasoning, planning, and adapting."

For businesses, this is a wake-up call. AI-powered web assistants are coming, but they’ll need better multimodal reasoning, memory, and user adaptation to handle real-world complexity. Until then, we’re stuck doing a lot of the clicking ourselves.

Want to dive deeper? Check out the full paper on arXiv or the project page.