Smaller Needles Are Harder for LLMs to Find: How Gold Context Size Impacts Long-Context Performance
Large language models (LLMs) are increasingly being used for tasks that require reasoning over vast amounts of information, from synthesizing scientific literature to navigating complex codebases. A critical challenge in these applications is the "needle-in-a-haystack" problem, where relevant information (the "needle") must be extracted from a sea of irrelevant context (the "haystack"). While previous research has focused on positional bias and the quantity of distractors, a new study from researchers at NIH and Johns Hopkins University reveals that the size of the relevant context—what they call the "gold context"—plays a surprisingly significant role in model performance.
The Gold Context Problem
The study, titled *Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find*, systematically evaluates how variations in gold context length impact LLM performance on long-context question-answering tasks. The researchers constructed three variants of the gold context for each benchmark:
- Small Gold: Minimal span sufficient to answer the question.
- Medium Gold: Additional explanatory or supporting content.
- Large Gold: Complete reasoning process or extended relevant context.
These were embedded within a fixed pool of distractor documents to simulate real-world scenarios where relevant information is scattered and noisy.
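A minimal sketch of how such an evaluation input could be assembled is shown below. It is illustrative only (the function and parameter names are not the authors' released code): one gold variant is inserted into a fixed, shuffled pool of distractor documents at a controlled position.

```python
# Illustrative sketch of the setup described above: embed one gold context
# variant ("small", "medium", or "large") inside a fixed pool of distractor
# documents at a chosen position in the context window.
import random

def build_context(gold_variants: dict[str, str],
                  distractors: list[str],
                  variant: str = "small",
                  gold_position: float = 0.5,
                  seed: int = 0) -> str:
    """Return a single context string with the gold span inserted among distractors.

    gold_variants: maps "small" / "medium" / "large" to the corresponding gold text.
    gold_position: fraction of the way through the distractors (0.0 = start, 1.0 = end).
    """
    rng = random.Random(seed)
    docs = distractors[:]          # keep the distractor pool fixed
    rng.shuffle(docs)              # but vary its order across trials
    insert_at = round(gold_position * len(docs))
    docs.insert(insert_at, gold_variants[variant])
    return "\n\n".join(docs)

# Example: a small gold span placed near the end of the window,
# mirroring the hardest condition reported in the paper.
prompt_context = build_context(
    gold_variants={"small": "...", "medium": "...", "large": "..."},
    distractors=["distractor doc 1", "distractor doc 2", "distractor doc 3"],
    variant="small",
    gold_position=0.9,
)
```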
Key Findings
- Smaller Gold Contexts Lead to Worse Performance: Across all benchmarks and models, performance dropped sharply when the gold context was shorter. For example, on CARDBiomedBench, Gemini-2.0-Flash’s accuracy fell from 73% with large gold contexts to just 48% with small ones. GPT-4o showed a similar trend, dropping from 98% to 77%.
- Smaller Gold Contexts Amplify Positional Sensitivity: Models were markedly more sensitive to the placement of smaller gold spans, with accuracy declining when the relevant content appeared later in the context window. For instance, Gemini-2.0-Flash achieved 94% accuracy when small gold contexts were placed at the start of the input but only 33% when they appeared near the end, a 61-point drop. Larger gold contexts were more resilient to positional variation.
- Domain-Specific Tasks Are Harder: The effects were more pronounced in specialized domains such as biomedical and mathematical reasoning than in general-knowledge tasks, suggesting that information type and gold context size compound the difficulty of aggregating relevant evidence.
Implications for AI Systems
These findings have significant implications for designing robust, context-aware LLM-driven systems:
- Retrieval-Augmented Generation (RAG): Systems that retrieve only small snippets of relevant information may underperform those that retrieve larger, more explanatory passages (see the retrieval sketch after this list).
- Agentic Systems: Autonomous agents that integrate scattered, fine-grained information must account for the fragility of small gold contexts.
- Evaluation Benchmarks: Future benchmarks should include variable gold context sizes to better reflect real-world conditions.
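One way to act on the RAG point above is a "retrieve small, read large" pattern: score fine-grained chunks for relevance, but pass the model the larger parent passage surrounding each hit. The sketch below assumes a generic retriever whose `search` method returns `(doc_id, chunk_start, chunk_end)` tuples; that interface is an assumption for illustration, not a specific library API.

```python
# Hedged sketch of "retrieve small, read large": score fine-grained chunks,
# then expand each hit to a larger window from its source document so the
# model sees an explanatory passage rather than a bare span.

def expand_hit(doc_text: str, chunk_start: int, chunk_end: int, window: int = 1500) -> str:
    """Grow a retrieved character span to include surrounding context."""
    start = max(0, chunk_start - window)
    end = min(len(doc_text), chunk_end + window)
    return doc_text[start:end]

def retrieve_expanded(query: str, index, documents: dict[str, str], k: int = 5) -> list[str]:
    """Retrieve the top-k small chunks, then return their expanded parent passages.

    `index.search` is a placeholder for whatever retriever is in use; it is
    assumed to return (doc_id, chunk_start, chunk_end) tuples.
    """
    hits = index.search(query, k=k)
    return [expand_hit(documents[doc_id], start, end) for doc_id, start, end in hits]
```

The `window` size trades prompt length against the resilience that larger gold contexts showed in the study.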
The study evaluated seven state-of-the-art LLMs, including GPT-4o, Gemini-2.0-Flash, and LLaMA-3 variants, and the results were consistent across architectures and scales. Notably, models achieved near-perfect scores in no-distractor settings, confirming that the failures are due to aggregation breakdowns rather than task difficulty.
Practical Recommendations
The researchers offer several guidelines for practitioners:
- Expand Critical Evidence: Where possible, structure or expand retrieved evidence so the model is not forced to rely on a minimal span.
- Mitigate Positional Bias: Use techniques like attention calibration or positional encoding adjustments to reduce sensitivity to gold context placement; a simple ordering heuristic is sketched after this list.
- Benchmark Realistically: Evaluate systems under varying gold context sizes to ensure robustness.
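As a concrete (if simplistic) illustration of the positional-bias recommendation, the sketch below orders retrieved passages by relevance score before building the prompt, so the strongest candidate evidence lands near the start of the context window, where the study observed the highest accuracy. The scoring function is a stand-in for whatever retriever or reranker score is available; this heuristic is not taken from the paper.

```python
# Illustrative mitigation for positional sensitivity: place the highest-scoring
# passages first so likely gold evidence appears early in the context window.

def order_for_prompt(passages: list[str], scores: list[float]) -> list[str]:
    """Sort passages by descending relevance score."""
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in ranked]

def assemble_prompt(question: str, passages: list[str], scores: list[float]) -> str:
    """Build a question-answering prompt with the strongest evidence up front."""
    ordered = order_for_prompt(passages, scores)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(ordered))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```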
Conclusion
This work highlights an often-overlooked bottleneck in LLM capabilities: the size of relevant evidence matters just as much as its location. As language models become central to applications requiring precise and trustworthy reasoning—from scientific discovery to personalized assistants—addressing context-size variability will be crucial for ensuring reliability and user trust.
For more details, check out the full paper on arXiv or the GitHub repository.