Recent advances in large language models (LLMs) have pushed context windows to 128K–1M tokens, yet benchmarks such as NoLiMA ("Long-Context Evaluation Beyond Literal Matching") reveal critical gaps in associative reasoning over extended sequences.
NoLiMA demonstrates that a model like GPT-4o, which scores 99.3% in short contexts (under 1K tokens), degrades to 69.7% at 32K tokens. The benchmark’s two-hop associative tasks (e.g., linking “Saxony” to the “Semper Opera House” to the character “Yuki”) show that models fail to preserve transitive relationships across windows of 16K tokens and beyond.
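To make the task concrete, here is a paraphrased sketch of what such an item looks like: the question shares no keywords with the needle, so answering it requires the latent hops rather than literal matching. The field names and exact wording are illustrative assumptions, not the benchmark’s released schema.

```python
# A NoLiMA-style item, paraphrased for illustration (field names are assumptions,
# not the benchmark's actual format).
nolima_item = {
    # The "needle" is buried somewhere inside a long haystack of unrelated text.
    "needle": "Actually, Yuki lives next to the Semper Opera House.",
    # The question has no lexical overlap with the needle; answering it requires
    # the latent hops: Semper Opera House -> Dresden -> Saxony -> Yuki.
    "question": "Which character has been to Saxony?",
    "answer": "Yuki",
}
```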
The NoLiMA benchmark highlights a fundamental truth: scaling context windows alone cannot overcome attention mechanisms' inability to model latent relationships. Property graphs provide the missing structural layer, offering explicit relationship encoding and metadata-aware retrieval.
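As a minimal sketch of what that structural layer looks like, the snippet below encodes the benchmark’s example entities as a property graph and answers the two-hop question with an explicit traversal. It uses networkx as a lightweight stand-in for a graph database such as Neo4j; the node labels, relation names, and helper function are illustrative assumptions rather than a fixed schema.

```python
import networkx as nx

# Property graph: nodes carry typed metadata, edges carry explicit relations.
G = nx.MultiDiGraph()

G.add_node("Yuki", label="Character")
G.add_node("Semper Opera House", label="Landmark", city="Dresden")
G.add_node("Saxony", label="Region")

# Relationships are encoded explicitly instead of being left latent in text.
G.add_edge("Yuki", "Semper Opera House", relation="LIVES_NEXT_TO")
G.add_edge("Semper Opera House", "Saxony", relation="LOCATED_IN")


def characters_linked_to(region: str) -> set[str]:
    """Two-hop traversal: region <- landmark <- character."""
    found = set()
    # First hop: landmarks located in the region.
    for landmark, _, data in G.in_edges(region, data=True):
        if data.get("relation") != "LOCATED_IN":
            continue
        # Second hop: characters attached to those landmarks.
        for character, _, d in G.in_edges(landmark, data=True):
            if d.get("relation") == "LIVES_NEXT_TO":
                found.add(character)
    return found


print(characters_linked_to("Saxony"))  # -> {'Yuki'}
```

The traversal is trivial precisely because the transitive relationship that trips up long-context attention is stored as data, not inferred from token proximity.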
For AI architects, integrating graph-native storage with LLMs isn’t optional—it’s imperative for building systems capable of robust, multi-hop reasoning at scale.
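One common integration pattern is graph-augmented generation: retrieve the relevant edges from the property graph and hand the model a short, explicit chain of facts instead of the raw long context. The sketch below assumes the traversal output from the previous example; `call_llm` is a hypothetical placeholder for whatever model client you use.

```python
def build_prompt(triples: list[tuple[str, str, str]], question: str) -> str:
    """Serialize retrieved graph edges as explicit facts, so the model reasons
    over a short, structured context instead of a 32K-token haystack."""
    facts = "\n".join(f"({h}) -[{r}]-> ({t})" for h, r, t in triples)
    return f"Known relationships:\n{facts}\n\nQuestion: {question}\nAnswer:"


# Triples as they might come back from the graph traversal (assumed output shape).
retrieved = [
    ("Yuki", "LIVES_NEXT_TO", "Semper Opera House"),
    ("Semper Opera House", "LOCATED_IN", "Saxony"),
]

prompt = build_prompt(retrieved, "Which character has been to Saxony?")
# answer = call_llm(prompt)  # `call_llm` is a hypothetical stand-in for your model API.
```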
Models like GPT-4o may still have decent base scores, but their effective context length remains limited when dealing with associative reasoning without literal cues.