Artificial intelligence is advancing at a breakneck pace, yet one of the most unsettling realities of today’s systems is how uneven they are. Demis Hassabis, co-founder of DeepMind, recently called out this problem in an interview with Business Insider, describing it as “inconsistency.” An AI model can ace college-level math or generate working code, but stumble on simple commonsense reasoning or misinterpret a straightforward logical puzzle. To Hassabis, these failures are not minor bugs. They expose a deeper structural challenge in building artificial general intelligence (AGI).
This jagged profile of abilities is sometimes referred to as “jagged intelligence.” It raises uncomfortable questions. Should intelligence be defined not just by peak performance but also by the smoothness and predictability of performance across domains? Can we trust a system that soars in some areas and collapses in others, often in ways that defy intuition? And perhaps most critically, how do we measure and address this inconsistency before scaling AI into domains where brittle failures could have high stakes?
What Jagged Intelligence Looks Like
The term captures the unevenness of current AI capabilities. Consider how large language models can draft elegant essays or parse programming syntax but produce bizarre mistakes in tasks that require step-by-step logic. This inconsistency is not random. It reflects a deeper entanglement between knowledge and reasoning inside today’s neural architectures.
A recent study titled General Reasoning Requires Learning to Reason from the Get-go (Han et al., 2025) highlights this problem directly. The authors found that LLMs often fail to transfer reasoning skills when confronted with new tasks outside their training distribution. In other words, they have learned statistical correlations, not reasoning processes. The result is a model that looks brilliant in some contexts but suddenly crumbles in others, a textbook example of jagged intelligence.
Why Consistency Matters
Some may argue that all human intelligence is jagged. A Nobel-winning physicist might struggle to tie a knot or recall the name of a neighbor. Yet humans display bounded unpredictability. When faced with a novel reasoning task, most people can at least attempt to apply transferable logic. They fail, but they fail predictably.
By contrast, AI failures can be opaque and dangerous. A medical AI might correctly identify complex genetic patterns in one case, then miss an obvious diagnosis in another. A self-driving system could handle urban traffic smoothly but make catastrophic errors in rare weather conditions. These failures are not just inconvenient. They undermine trust and raise safety concerns.
For policymakers and businesses, this means evaluating AI systems requires more than measuring peak accuracy on benchmark tests. It requires understanding the “shape” of performance across tasks. Consistency, or at least transparency about where models are strong and where they are brittle, becomes as important as raw capability.
Measuring and Benchmarking Consistency
This raises another challenge: how do we quantify consistency? Current benchmarks often emphasize headline results on narrow tasks. They reward peak scores, not smoothness of ability. Researchers are now pushing for richer evaluation methods that capture variance across domains.
The Thinking Beyond Tokens framework (2025) proposes benchmarks that stress modular reasoning, memory, and adaptation. Instead of testing static recall, these tasks assess whether a model can generalize logic to new situations or coordinate across modules. Similarly, Han et al. argue that AI needs synthetic curricula of reasoning tasks to build transferable logic from the ground up. These ideas move toward a definition of intelligence that prizes stability and predictability, not just brilliance in cherry-picked cases.
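To make the idea concrete, here is a minimal sketch, in Python, of what a variance-aware evaluation summary could look like. It is not drawn from either paper; the domain names, accuracy figures, and the “consistency” ratio (worst case divided by peak) are illustrative assumptions.

```python
# Minimal, illustrative sketch: score a model by the shape of its performance
# across domains, not just its best result. The metric names are assumptions.
from statistics import mean, pstdev

def consistency_report(domain_accuracies: dict[str, float]) -> dict[str, float]:
    """Summarize per-domain accuracies: peak, mean, worst case, and spread."""
    scores = list(domain_accuracies.values())
    peak, worst = max(scores), min(scores)
    return {
        "peak": peak,                      # what headline benchmarks report
        "mean": mean(scores),
        "worst": worst,                    # where the model is brittle
        "std_dev": pstdev(scores),         # how jagged the profile is
        "consistency": worst / peak if peak > 0 else 0.0,  # 1.0 = flat profile
    }

# Hypothetical results: strong headline numbers, but a jagged profile.
print(consistency_report({
    "college_math": 0.92,
    "code_generation": 0.88,
    "commonsense": 0.55,
    "logic_puzzles": 0.48,
}))
```

Reporting the worst case and the spread alongside the peak makes a jagged profile visible at a glance, which is exactly what a single headline number hides.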
Architectural Paths Forward
If the core of jagged intelligence lies in architecture, what might smooth it out? Several lines of research point to promising directions.
One path is hybrid design. A 2025 paper titled A Hybrid Cognitive Architecture for AGI (Hans et al.) outlines a model that combines neural networks for perception with symbolic engines for reasoning, linked through graph neural networks. The symbolic layer handles tasks like commonsense rules and structured logic, domains where today’s LLMs falter. Neural modules, meanwhile, manage perception and pattern recognition. The goal is to achieve complementary strengths rather than isolated peaks of ability.
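As a toy illustration of that division of labor (not the architecture from the paper, and without the graph-neural-network coupling), the Python sketch below pairs a stand-in for a neural perception module with a small symbolic rule layer. Every function name, label, and rule is invented for this example.

```python
# Toy sketch of a hybrid design: a neural-style perception stub proposes
# candidates, and a symbolic rule layer vetoes ones that break explicit
# commonsense constraints. Labels and rules are invented for illustration.

def neural_perception(image_features: list[float]) -> list[tuple[str, float]]:
    """Stand-in for a learned model: returns candidate labels with confidences."""
    # A real system would run a trained network here; this is hard-coded.
    return [("cat", 0.62), ("dog", 0.30), ("flying_cat", 0.08)]

# Explicit symbolic knowledge: rules a purely statistical model can violate.
COMMONSENSE_RULES = {
    "flying_cat": "cats do not fly",
}

def symbolic_filter(candidates: list[tuple[str, float]]):
    """Reject candidates that contradict an explicit rule, recording the reason."""
    accepted, rejected = [], []
    for label, confidence in candidates:
        if label in COMMONSENSE_RULES:
            rejected.append((label, COMMONSENSE_RULES[label]))
        else:
            accepted.append((label, confidence))
    return accepted, rejected

candidates = neural_perception([0.1, 0.4, 0.9])
accepted, rejected = symbolic_filter(candidates)
print("accepted:", accepted)   # neural strength: ranking plausible patterns
print("rejected:", rejected)   # symbolic strength: enforcing hard constraints
```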
Another path is neuro-symbolic integration. Alessandro Oltramari’s work on cognitive neuro-symbolic systems argues that combining explicit rules with neural learning can help disentangle knowledge from reasoning. This approach may allow systems to adapt better to unfamiliar situations while avoiding the brittle gaps of purely statistical learning.
A third line of thought comes from surveys like Architectural Precedents for General Agents (Wray et al., 2025), which catalog design patterns that recur across AI architectures. They identify memory, deliberation, planning, and modularity as critical ingredients for building more consistent agents. These features echo cognitive science insights, suggesting that AGI will require not just scale but also structural alignment with how intelligence is organized.
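For a rough sense of how those ingredients fit together, here is a bare-bones, hypothetical agent skeleton in Python. It is not the survey’s formalism, just an illustration of memory, deliberation, and planning living in separate modules within one loop.

```python
# Bare-bones agent skeleton: memory, deliberation, and planning as separate
# modules inside one loop. Purely illustrative; not the survey's patterns.
from dataclasses import dataclass, field

@dataclass
class Agent:
    memory: list = field(default_factory=list)        # persistent experience

    def deliberate(self, observation: str) -> str:
        """Weigh a new observation against stored memory (toy heuristic)."""
        return "routine" if observation in self.memory else "novel"

    def plan(self, assessment: str) -> list:
        """Produce a short action plan; a module separate from deliberation."""
        if assessment == "novel":
            return ["gather_more_information", "act_cautiously"]
        return ["act_from_memory"]

    def step(self, observation: str) -> list:
        assessment = self.deliberate(observation)
        actions = self.plan(assessment)
        self.memory.append(observation)                # update memory each cycle
        return actions

agent = Agent()
print(agent.step("stop sign in fog"))   # novel input -> cautious plan
print(agent.step("stop sign in fog"))   # remembered  -> routine plan
```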
The Ethical and Safety Dimension
The implications extend beyond technical debates. Inconsistent AGI systems could introduce systemic risks if deployed too early. A finance AI that performs flawlessly in routine cases but collapses under rare scenarios could destabilize markets. A military system that shows jagged intelligence might misinterpret ambiguous inputs with catastrophic results.
This is why safety researchers emphasize not just alignment of goals but robustness of performance. If jagged intelligence is left unaddressed, AGI could behave unpredictably in precisely the domains where trust and control matter most. Policymakers may need to insist on evaluation standards that highlight variance, not just peak accuracy, before green-lighting deployment in sensitive fields.
A Broader Reflection
There is also a philosophical dimension. Defining intelligence only by peak accomplishments risks repeating a mistake we have made with humans: valuing brilliance while ignoring reliability. A student who gets every answer right on one exam but fails completely on the next does not seem “intelligent” in a meaningful sense. Likewise, a machine that oscillates between genius and nonsense is not truly general.
The concept of jagged intelligence forces us to rethink what kind of intelligence we want from machines. Do we prize occasional brilliance, or do we demand steady competence across contexts? The answer will shape how we design systems, how we regulate them, and how we trust them.
Conclusion
Hassabis’ warning about inconsistency is more than a critique of today’s models. It is a reminder that the road to AGI is not just about scale, compute, or datasets. It is about smoothing the jagged edges, so intelligence looks less like a mountain range and more like a steady landscape.
To get there, researchers are exploring hybrid architectures, neuro-symbolic systems, modular reasoning frameworks, and new benchmarks that measure variance as well as peaks. The challenge is daunting, but the reward is immense: a form of intelligence that is not just dazzling in bursts but trustworthy, predictable, and safe to rely on.
Until then, the unevenness of today’s AI should serve as a warning. Raw performance may impress, but true intelligence will be judged by its consistency.