Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, Gal Yona
Published: February 15, 2026
Authors: 5
Word count: 17,395
Code: included

LLM factual errors largely reflect lost keys rather than empty shelves, and the two failure modes call for different solutions.

Abstract

Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95--98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.

Key Takeaways

  1. Factual errors in LLMs stem from either missing knowledge or inaccessible encoded information, not just accuracy failures.

  2. A knowledge profiling framework distinguishes facts that were never learned from facts that were learned but cannot be recalled.

  3. Measuring encoding through proposition completion and contextual questioning enables assessment without access to model weights.
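The profiling taxonomy above can be sketched as a small classifier. This is a minimal illustration, not the paper's implementation: the function name, labels, and the assumption that each fact reduces to three boolean probe outcomes (encoded, directly recallable, recallable with thinking) are ours.

```python
def profile_fact(encoded: bool, direct_recall: bool, thinking_recall: bool) -> str:
    """Classify a fact by the encoded-vs-accessible distinction.

    Hypothetical sketch: inputs are assumed to come from behavioral probes
    (e.g. proposition completion for encoding, direct and thinking-mode
    questioning for recall). Labels are illustrative, not the paper's.
    """
    if not encoded:
        # "Empty shelf": the fact was never learned.
        return "not_encoded"
    if direct_recall:
        # Encoded and accessible without inference-time computation.
        return "recalled_directly"
    if thinking_recall:
        # "Lost key" recovered via thinking at inference time.
        return "recalled_with_thinking"
    # "Lost key": encoded but inaccessible under any tested probe.
    return "encoded_not_recalled"
```

Under this sketch, the paper's central finding corresponds to frontier models landing in `not_encoded` only 2 to 5% of the time, with the remaining errors split between the two lost-key categories.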

Limitations

  • The framework's reliance on compatibility with closed-source models limits its applicability across frontier AI systems.

  • The script cuts off mid-definition of recall measurement, leaving the methodology incomplete.

Keywords

factuality evaluations, LLMs, factual knowledge, encoded facts, recall accessibility, long-tail facts, reverse questions, thinking, reasoning, encoding saturation
