From Chat Logs to Collective Insights: How AI Can Extract Big-Picture Trends from Millions of Conversations
Large language model (LLM)-powered chatbots are generating an unprecedented volume of conversational data—millions of interactions daily. But what if we could analyze these logs not just as isolated exchanges, but as a collective dataset revealing societal trends, demographic concerns, and emerging topics? A new paper from researchers at the University of Waterloo and Cornell introduces Aggregative Question Answering (AQA), a novel AI task designed to extract high-level insights from massive conversational datasets.
The Problem: Missing the Forest for the Trees
Current AI systems treat chatbot interactions as independent events, missing the bigger picture. As the paper notes, conversations don’t happen in isolation—they’re shaped by time, location, and user demographics. This oversight means we’re losing valuable insights into:
- Temporal trends: How do user concerns shift over weeks or months?
- Demographic patterns: What topics resonate differently across age groups or regions?
- Emerging issues: Can we detect early signals of new societal concerns?
Introducing Aggregative Question Answering
The researchers propose AQA as a solution—a task requiring AI models to reason across thousands of conversations to answer questions like:
- "What topics were Californians discussing before the election?"
- "How have attitudes toward AI evolved this month?"
- "Which programming languages are trending among developers in Europe?"
Unlike traditional summarization (condensing a few documents) or database queries (structured lookups), AQA demands holistic reasoning over vast, unstructured dialogue logs.
The WildChat-AQA Benchmark
To enable research, the team built WildChat-AQA—a benchmark with:
- 182,330 real-world chatbot conversations (from the WildChat dataset)
- 6,027 aggregative questions covering 28 topics, 455 subtopics, and 14,482 keyword categories
Each question comes with 10 candidate answers, requiring models to rank them by relevance. The dataset includes metadata like timestamps, user locations, and inferred topics—crucial for demographic and temporal analysis.
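To make the task concrete, here is a rough sketch of what a single benchmark instance could look like. The field names below are illustrative assumptions for exposition, not the dataset's actual schema; consult the released benchmark for the real format.

```python
# Hypothetical illustration of a WildChat-AQA-style record.
# Field names are assumptions, not the dataset's real schema.

conversation = {
    "conversation_id": "abc123",
    "timestamp": "2023-04-12T18:03:00Z",
    "location": "California, US",         # coarse, anonymized geography
    "topic": "Technology",                # one of 28 inferred topics
    "subtopic": "Programming Languages",  # one of 455 subtopics
    "turns": [
        {"role": "user", "content": "How do I parse JSON in Rust?"},
        {"role": "assistant", "content": "You can use the serde_json crate..."},
    ],
}

question = {
    "question": "Which programming languages are trending among developers in Europe?",
    # Ten candidates in total; the model must rank them by relevance.
    "candidates": ["Rust", "Python", "Go", "..."],
}
```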
Why Existing AI Struggles with AQA
The paper evaluates current methods, revealing significant gaps:
- Standard retrieval-augmented generation (RAG) fails to identify broad patterns, focusing too narrowly on individual conversations.
- Fine-tuning LLMs on the dataset showed minimal improvement, suggesting current architectures can’t internalize aggregative knowledge effectively.
- Computational costs explode when processing millions of tokens for each query.
PROBE: A Smarter Retrieval Approach
The team developed PROBE (Probing Retrieval Of Broad Evidence), a two-step method (a toy sketch follows the list):
- Broad Query Generation: An LLM creates diverse sub-queries to capture different facets of the question (e.g., for "emerging concerns among young adults", it might generate queries about mental health, finances, and climate).
- Evidence Aggregation: Retrieved documents are pooled and re-ranked, prioritizing those relevant to multiple sub-queries.
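The sketch below illustrates both steps under heavy simplifications: the sub-queries are hardcoded where PROBE would have an LLM generate them, and a bag-of-words cosine similarity stands in for a real retriever. It is an illustration of the idea, not the authors' implementation.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a dense encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def probe(sub_queries: list[str], corpus: list[str], k: int = 3) -> list[str]:
    # Step 1 (Broad Query Generation) is stubbed: in PROBE an LLM expands the
    # question into diverse sub-queries; here they are passed in directly.
    # Step 2 (Evidence Aggregation): retrieve top-k per sub-query, then pool
    # and re-rank, rewarding documents that match multiple sub-queries.
    hits: dict[int, list[float]] = {}
    for sq in sub_queries:
        q_vec = embed(sq)
        scored = sorted(
            ((cosine(q_vec, embed(doc)), i) for i, doc in enumerate(corpus)),
            reverse=True,
        )[:k]
        for score, i in scored:
            hits.setdefault(i, []).append(score)
    # Breadth (how many sub-queries matched) dominates; mean similarity breaks ties.
    ranked = sorted(
        hits,
        key=lambda i: (len(hits[i]), sum(hits[i]) / len(hits[i])),
        reverse=True,
    )
    return [corpus[i] for i in ranked]

corpus = [
    "I'm anxious about finding a job after graduation",
    "rising rent is stressing me out",
    "tips for baking sourdough bread",
    "worried about climate change and my future",
]
# For "emerging concerns among young adults", assumed LLM-generated sub-queries:
print(probe(
    ["mental health worries young adults",
     "financial stress rent jobs",
     "climate anxiety future"],
    corpus,
))
```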
PROBE outperformed standard RAG by 14.8–23.8 points of NDCG (a ranking metric; a reference implementation follows this list), but challenges remain:
- Even with perfect retrieval, models struggled to synthesize insights from raw conversations.
- Summarized inputs helped (+4.0–6.6 NDCG), suggesting noisy dialogue logs hinder reasoning.
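For readers unfamiliar with the metric: NDCG (normalized discounted cumulative gain) scores a ranking by rewarding relevant items near the top, normalized so a perfect ordering scores 1.0; a gain of "+4.0 points" refers to this score scaled to 0–100. One common linear-gain formulation, in a few lines:

```python
import math

def dcg(relevances: list[float], k: int) -> float:
    # Discounted cumulative gain: each item's relevance is discounted
    # by the log of its rank, so early positions count more.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg(relevances: list[float], k: int) -> float:
    # Normalize by the DCG of the ideal (sorted) ordering.
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0

# A ranking that puts a mildly relevant answer first and the best one third:
print(round(ndcg([1, 0, 3, 2, 0], k=5), 3))  # 0.706, vs. 1.0 for a perfect order
```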
The Computational Bottleneck
AQA can require processing up to 10⁸ tokens for a single query, which is prohibitively expensive with today's models. The best performer, OpenAI's o4-mini, achieved an NDCG@1 of 75.7% but required nearly 300M input tokens for some queries.
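A back-of-envelope calculation shows why this matters. The per-token price below is a placeholder assumption (API prices change frequently); the point is that hundreds of millions of input tokens per query dominate cost at any realistic rate.

```python
# Hypothetical pricing; substitute your provider's current rates.
PRICE_PER_1M_INPUT_TOKENS = 1.00  # USD, placeholder assumption

tokens_per_query = 300_000_000    # ~300M tokens, as in the worst cases above
queries_per_day = 1_000           # assumed analytics workload

daily_cost = tokens_per_query / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS * queries_per_day
print(f"${daily_cost:,.0f} per day")  # $300,000 per day under these assumptions
```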
Ethical Considerations
The paper flags potential misuse—AQA could analyze sensitive topics like elections or public health. To mitigate risks, WildChat-AQA uses public, anonymized data with an ODC-BY license, encouraging transparent research.
What’s Next?
The authors outline key frontiers for AQA:
- Efficient long-context reasoning: New architectures to handle massive inputs.
- Streaming analytics: Real-time aggregation as new conversations arrive.
- Cost reduction: Techniques like hierarchical indexing to curb computational expenses (sketched below).
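Hierarchical indexing is one plausible way to attack the cost problem: group conversations into clusters, summarize each cluster once, and answer a query by scanning only the short summaries before drilling into a few matching clusters. A toy sketch of that routing step (the clusters and summaries are hardcoded assumptions here; a real system would compute them):

```python
# Toy two-level index: scan short cluster summaries first, then search
# only inside the best-matching clusters, instead of reading every log.

clusters = {
    "job market anxiety": ["worried about layoffs...", "resume help...", "..."],
    "home cooking":       ["sourdough starter tips...", "..."],
    "ai policy debate":   ["is AI regulation coming?", "..."],
}

def route(query: str, top: int = 1) -> list[str]:
    # Crude relevance signal: word overlap between query and cluster summary.
    words = set(query.lower().split())
    ranked = sorted(clusters, key=lambda s: len(words & set(s.split())), reverse=True)
    docs: list[str] = []
    for summary in ranked[:top]:
        docs.extend(clusters[summary])  # drill down only where the query routed
    return docs

# Only the routed cluster's conversations need to enter the LLM's context.
print(route("how do users feel about ai policy and regulation"))
```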
Why This Matters for Business
For enterprises, AQA offers a lens into:
- Customer sentiment at scale: Spot shifting concerns across regions or demographics.
- Market trends: Detect emerging topics before they trend on social media.
- Product feedback: Aggregate nuanced pain points from support chats.
As LLM chatbots become ubiquitous, tools like AQA could transform conversational data from a byproduct into a strategic asset—provided we solve the technical and ethical challenges.
The benchmark and code are available on GitHub, with an interactive data viewer at http://65.108.32.135:3000/dataview.