Yann LeCun vs Everyone: Why the Top LLM Skeptic Matters



Every major AI lab is racing to build systems that can reason, plan, and act autonomously. OpenAI, Google DeepMind, Anthropic — they’re all betting heavily that scaling large language models is the path to artificial general intelligence. And then there’s Yann LeCun, Chief AI Scientist at Meta, Turing Award winner, and arguably the most credentialed voice in the room willing to say, loudly and repeatedly: you’re all going in the wrong direction. That tension — between the scaling optimists and LeCun’s persistent skepticism — is one of the most useful intellectual debates happening in AI right now. Not because LeCun is definitely right. But because engaging seriously with his arguments forces you to think harder about what intelligence actually is, and what we’re actually building.

Who Is Yann LeCun, and Why Should You Care What He Thinks?

LeCun isn’t a contrarian for sport. He’s one of the three people (alongside Geoffrey Hinton and Yoshua Bengio) who won the 2018 Turing Award for foundational work on deep learning — the same family of techniques that powers everything from image recognition to ChatGPT. His work on convolutional neural networks in the 1980s and 90s is the reason your phone can unlock with your face. He’s not a skeptic of AI. He’s a skeptic of a specific approach to AI, which is a very different thing.

Since joining Meta (then Facebook) as VP and Chief AI Scientist in 2013, LeCun has had a platform that most academics can only dream of. He uses it constantly — on X (formerly Twitter), in academic papers, in conference keynotes, and in a growing number of podcast appearances. His positions are specific, technical, and often deliberately provocative. He thinks Geoffrey Hinton and others who warn about near-term existential AI risk are wrong. He thinks GPT-4, Claude, and Gemini are impressive but fundamentally limited. And he has a detailed alternative vision for what real machine intelligence would look like. Whether you agree with him or not, his framework is worth understanding. If you want a broader map of the people shaping these debates, the 25 AI Thinkers and Creators Worth Following in 2026 is a useful companion.

The Core Argument: LLMs Can’t Think, They Predict

LeCun’s central critique of large language models is that they are, at their core, next-token predictors. They learn statistical patterns from text. They don’t build a model of the world. They don’t understand causality. They can’t plan multi-step actions in the physical world. And they hallucinate — not occasionally, but structurally — because generating plausible-sounding text and generating true text are not the same objective.

He’s made this point in various forms across multiple venues. In a widely shared 2022 paper titled “A Path Towards Autonomous Machine Intelligence,” LeCun laid out what he thinks is actually required for human-level AI: a system that can build persistent world models, reason about the future, plan hierarchically, and learn from much less data than current LLMs require. He argues that humans learn to understand the physical world primarily through sensorimotor experience — watching, touching, moving through space — not through reading text. A child who has never read a single word still develops a rich model of gravity, object permanence, and social dynamics. LLMs skip all of that and try to reconstruct world understanding from text alone. LeCun thinks this is a fundamental architectural mistake.

His proposed alternative is something he calls the Joint Embedding Predictive Architecture (JEPA). Rather than predicting exact pixel values or token sequences, JEPA learns to predict abstract representations of future states — building a kind of compressed world model. Meta’s AI research team (FAIR) has been actively working on this, with I-JEPA and V-JEPA released as research models. These are genuinely interesting research directions, though they remain far from deployed, general-purpose systems. LeCun is honest about this — he tends to describe the path to his vision as a decade or more of hard research, not something that will emerge from the next training run.

The Debates That Made Him Famous (and Controversial)

LeCun doesn’t just publish papers. He argues, publicly and often. A few specific exchanges are worth knowing because they illuminate where the real fault lines are.

LeCun vs. Hinton on AI risk: When Geoffrey Hinton left Google in 2023 and began speaking publicly about existential risks from AI, LeCun pushed back hard. His position, stated across multiple interviews and posts, is roughly: current AI systems are not remotely close to human-level intelligence, and treating them as if they are leads to misallocated concern. He thinks the AI doom narrative is not just premature but actively harmful because it distracts from real, present-day harms — bias, misuse, economic disruption — and gives AI systems credit for capabilities they don’t actually have. Hinton disagrees. Both are serious scientists. The disagreement is real and unresolved.

LeCun vs. the scaling hypothesis: The dominant assumption in most frontier AI labs is that intelligence scales with data, compute, and model size. LeCun’s challenge is that even a perfectly scaled LLM is still doing next-token prediction over text, and text is a lossy, impoverished representation of the world. He’s compared this to trying to understand physics by reading physics textbooks without ever running an experiment. Sam Altman and Demis Hassabis have both, in various ways, expressed more optimism that emergent capabilities will continue to surprise us. LeCun’s counterpoint: emergence is interesting but not sufficient — architecture matters.

Podcast appearances worth finding: LeCun has appeared on Lex Fridman’s podcast multiple times; episodes 258 (2022) and 416 (2024) are particularly substantive, covering both his technical arguments and his broader views on AGI timelines and AI safety. He has also laid out the JEPA architecture, and his case that the AI safety discourse is misdirected, at length in other long-form interviews and lectures. These are conversations where his arguments are developed in detail rather than compressed into posts, and they’re worth the time if you want to engage with his actual position rather than a summary of it.

Where LeCun Is Probably Right (and Where He Might Be Wrong)

Engaging seriously with LeCun means being honest about both sides of his ledger.

Where he’s likely right: LLMs do hallucinate structurally, and this is a real limitation for high-stakes applications. The best models — GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro — still confabulate facts, fail basic physical reasoning tasks, and struggle with multi-step planning in novel environments. Autonomous agents built on LLMs (like early versions of AutoGPT or Devin) have shown real limitations in reliability over long task horizons. LeCun’s intuition that something important is missing feels grounded in observable behavior, not just theoretical concern.

Where he might be underestimating LLMs: The rate of improvement in LLMs has consistently surprised experts, including skeptics. Reasoning models like OpenAI’s o1 and o3, Google’s Gemini 2.0 Flash Thinking, and Anthropic’s extended thinking in Claude 3.7 Sonnet show that explicit chain-of-thought reasoning at inference time meaningfully improves performance on tasks LeCun would have predicted LLMs couldn’t handle — complex math, multi-step logic, even some physical reasoning problems. Whether this constitutes genuine reasoning or an increasingly sophisticated form of pattern completion is still an open question, and it is exactly the question LeCun’s framework forces you to ask. The honest answer is that nobody knows yet.

LeCun’s Specific Technical Arguments Against LLMs (And What the Evidence Actually Shows)

LeCun’s critique isn’t vibes-based. He has three concrete technical claims, each pointing to a different structural limitation. Here they are, stated precisely, with the evidence he cites.

Claim 1: LLMs Have No World Model

A world model, in LeCun’s framing, is an internal representation that lets a system predict the consequences of actions before taking them. You don’t have to touch a hot stove to know it will burn you. You simulate the outcome first. LLMs have no such thing. They have token distributions — statistical summaries of what text looks like, not causal maps of how the world works.

His 2022 paper “A Path Towards Autonomous Machine Intelligence” (available on OpenReview) lays this out explicitly. The paper proposes a modular architecture he calls JEPA — Joint Embedding Predictive Architecture — which learns representations in an abstract embedding space rather than predicting raw pixel or token outputs. The key idea: a system that predicts in abstract space can build compressed, causal representations of the world. A system that predicts the next token is optimizing for something else entirely.
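
For readers who think in code, here is a toy sketch of that objective: predict the embedding of a future observation from the embedding of the current one, and compute the loss in that abstract space. This is an illustration of the idea only, written in PyTorch with invented module names and sizes; it is not Meta’s JEPA implementation, and it omits the machinery (EMA target encoders, anti-collapse regularization) that makes the real thing work.

```python
# Toy sketch of a JEPA-style objective (illustrative only, not Meta's implementation).
# Instead of predicting raw future observations (pixels/tokens), the model predicts
# the *embedding* of the future state, so the loss lives in abstract representation space.

import torch
import torch.nn as nn

class ToyJEPA(nn.Module):
    def __init__(self, obs_dim=64, embed_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))
        # Target encoder: in practice an EMA copy of the encoder, used with stop-gradient
        # to avoid representational collapse.
        self.target_encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))
        self.predictor = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))

    def loss(self, obs_now, obs_future):
        z_now = self.encoder(obs_now)                   # embedding of current observation
        with torch.no_grad():                           # target branch is not backpropagated through
            z_future = self.target_encoder(obs_future)  # embedding of future observation
        z_pred = self.predictor(z_now)                  # predict the future *in embedding space*
        return ((z_pred - z_future) ** 2).mean()        # compare representations, not raw pixels/tokens

model = ToyJEPA()
obs_now, obs_future = torch.randn(8, 64), torch.randn(8, 64)
print(model.loss(obs_now, obs_future).item())
```

The contrast with an LLM is the loss target: next-token prediction compares predicted and actual tokens, while this objective compares predicted and actual representations of a future state.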

The practical consequence LeCun draws: LLMs will always struggle with physical intuition, spatial reasoning, and any task where the right answer depends on simulating a process rather than pattern-matching to prior text. This is why GPT-4 can write a recipe for a soufflé but will confidently describe physically impossible outcomes when asked about novel mechanical systems it has no training examples for.

Claim 2: LLMs Cannot Plan

Planning, in a technical sense, means searching over a space of possible action sequences to find one that achieves a goal. It requires being able to evaluate intermediate states — to say “if I do X, then Y becomes possible, and that leads toward Z.” LeCun argues LLMs cannot do this because they have no persistent state between tokens, no ability to simulate forward, and no mechanism to evaluate whether a partial plan is on track.

He demonstrated this point publicly in a 2023 post on X that got significant attention: he argued that a 3-year-old child can stack four blocks reliably, something no LLM-driven robot can do from scratch without extensive additional engineering. The gap isn’t data. It’s that the child has a model of gravity, balance, and object permanence built from physical interaction. The LLM has descriptions of block-stacking.

The evidence in the literature supports a version of this. The 2022 paper “Large Language Models Still Can’t Plan” from Subbarao Kambhampati’s group at Arizona State (Valmeekam et al.) tested GPT-3-class models on standard planning benchmarks from the automated planning community (Blocksworld, Logistics, etc.), and follow-up work extended the evaluation to GPT-4. Performance was poor and degraded as plan length increased — exactly what you’d expect from a system doing pattern matching rather than search. Kambhampati, who runs the Yochan lab and has engaged directly with LeCun’s arguments, concluded that LLMs need to be paired with external planners to be reliable on multi-step tasks.
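
To see why “pair the LLM with an external planner” works, it helps to notice how little code it takes to verify a block-stacking plan deterministically: verification is just simulating state transitions, which is exactly what a next-token predictor does not reliably do. The encoding below is a simplified invention for illustration, not the benchmark’s actual format.

```python
# Minimal Blocksworld-style plan checker (simplified; not the benchmark's real encoding).
# State: dict mapping each block to what it sits on ("table" or another block).
# Move: (block, destination) meaning "put block on destination".

def is_clear(state, x):
    """A block (or the table) is clear if nothing sits on top of it; the table is always clear."""
    return x == "table" or all(below != x for below in state.values())

def validate_plan(state, plan):
    state = dict(state)
    for step, (block, dest) in enumerate(plan, 1):
        if not is_clear(state, block) or not is_clear(state, dest) or block == dest:
            return False, f"step {step}: illegal move ({block} -> {dest}) in state {state}"
        state[block] = dest  # apply the move and keep tracking state
    return True, state

# Example: C sits on A; A and B are on the table. An LLM-proposed plan tries to move A first.
start = {"A": "table", "B": "table", "C": "A"}
ok, info = validate_plan(start, [("A", "B"), ("C", "table")])  # illegal: A is not clear (C is on it)
print(ok, info)
```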

Claim 3: LLM Reasoning Is Autocomplete, Not Inference

LeCun’s third claim is that what looks like reasoning in LLMs is mostly sophisticated pattern completion. When GPT-4 solves a math problem, it’s not running an algorithm — it’s generating tokens that look like the solution to math problems it has seen. This works often enough to be impressive. It breaks down in ways that actual reasoning wouldn’t.

The clearest public evidence for this: the “reversal curse” paper by Berglund et al. (2023) showed that a model fine-tuned on facts of the form “A is B” does not reliably learn “B is A,” and that GPT-4 shows the same asymmetry on real-world facts it saw during pretraining. A system doing logical inference would handle both directions identically. A system doing token prediction gets tripped up by the direction it saw in training. That’s not a bug in the model — it’s structural evidence of what the model is actually doing.

LeCun’s position: this isn’t fixable by scaling. You can’t reach genuine reasoning by adding more parameters to a next-token predictor. You need a different objective function.

A Practical Stress-Test: Finding LeCun’s Predicted Failure Modes in Your Own LLM Product

If you’re building something with GPT-4o, Claude 3.5, Gemini 1.5, or any other LLM, LeCun’s framework gives you a concrete checklist of where your system is most likely to fail in production. Here’s how to actually run that test.

Test 1: Multi-Step Planning Under Novel Constraints

Give your LLM a task that requires a plan of 6 or more steps where at least one constraint isn’t present in common training examples. Don’t use “plan a trip to Paris” — that’s heavily represented in training data. Use something like: “A warehouse has three robots, two charging stations, and five delivery zones. Robot A is currently charging, Robot B is in Zone 3 with a low battery, and Robot C is idle in Zone 1. A package needs to move from Zone 5 to Zone 2. Generate a step-by-step coordination plan that avoids any robot running out of battery.”

Watch for: confident plans that violate stated constraints, plans that ignore the battery state of Robot B, and plans that change the problem setup mid-response. Per LeCun’s prediction, the model will generate plausible-sounding sequences but won’t reliably track state across steps. Test this with at least 10 variations. In most LLM products without external state tracking, failure rates on constraint-heavy planning tasks above 5 steps are high enough to matter in production.
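
A minimal harness for running that batch of variations is sketched below. Everything in it is illustrative: call_llm() is a placeholder for your own model wrapper, the variation grid is arbitrary, and the violation checks are crude string tests you would replace with checks for your actual constraints.

```python
# Sketch of a constraint-violation harness for the warehouse planning test.
# call_llm() is a placeholder for your own model wrapper (OpenAI, Anthropic, etc.).

import re

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real API call.
    return "Step 1: Robot B picks up the package in Zone 5..."

BASE_PROMPT = (
    "A warehouse has three robots, two charging stations, and five delivery zones. "
    "Robot A is currently charging, Robot B is in Zone {b_zone} with a low battery, "
    "and Robot C is idle in Zone 1. A package needs to move from Zone {src} to Zone {dst}. "
    "Generate a step-by-step coordination plan that avoids any robot running out of battery."
)

def violates_constraints(plan_text: str) -> list[str]:
    """Crude textual checks; replace with checks specific to your constraints."""
    problems = []
    if re.search(r"Robot B", plan_text) and "charg" not in plan_text.lower():
        problems.append("uses low-battery Robot B without ever charging it")
    if "Robot D" in plan_text:
        problems.append("invents a robot that does not exist")
    return problems

# Extend this grid until you have at least 10 variations.
variations = [dict(b_zone=z, src=s, dst=d) for z in (2, 3, 4) for s, d in [(5, 2), (4, 1)]]
for i, v in enumerate(variations, 1):
    plan = call_llm(BASE_PROMPT.format(**v))
    print(f"variation {i}: violations = {violates_constraints(plan)}")
```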

Test 2: Physical Intuition Outside Training Distribution

Describe a physically novel scenario — one involving uncommon materials or geometries — and ask for a prediction. Example prompt: “A hollow aluminum sphere 30cm in diameter with a wall thickness of 2mm is placed on top of a flat rubber surface. A 5kg steel cylinder is placed on top of the sphere. Describe exactly what happens over the next 5 seconds, including any deformation or movement.” There is no single correct answer to memorize here. A system with a world model would reason from material properties. A system doing token prediction will generate something that sounds plausible but may be physically inconsistent.

Look for: internal contradictions within the same response, confident assertions about outcomes that contradict each other if you ask a follow-up, and refusal to express appropriate uncertainty. This is LeCun’s world model gap showing up directly.
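
One lightweight way to surface those contradictions: ask the scenario, then immediately ask a pointed follow-up about a detail the first answer committed to, and compare the two answers by hand. A minimal sketch follows; call_llm() is a placeholder for your own model wrapper, and in a real test you would send the follow-up as a second turn in the same conversation rather than concatenating strings.

```python
# Sketch: probe for internal contradictions on the novel-physics prompt.

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real API call.
    return "The sphere deforms slightly and stays in place."

SCENARIO = (
    "A hollow aluminum sphere 30cm in diameter with a wall thickness of 2mm is placed on a flat "
    "rubber surface. A 5kg steel cylinder is placed on top of the sphere. Describe exactly what "
    "happens over the next 5 seconds, including any deformation or movement."
)
FOLLOW_UP = (
    "In your previous answer, does the sphere stay exactly where it was placed, or does it move, "
    "roll, or collapse? Answer in one sentence and state how confident you are."
)

first = call_llm(SCENARIO)
second = call_llm(SCENARIO + "\n\nPrevious answer: " + first + "\n\n" + FOLLOW_UP)
print("--- initial answer ---\n" + first)
print("--- follow-up ---\n" + second)  # read the pair side by side and flag contradictions by hand
```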

Test 3: Causal Direction Reversal

This is the reversal curse test applied to your domain. If your product uses an LLM to answer questions about your own documentation, run this: take 20 factual pairs from your docs (e.g., “Feature X requires Setting Y to be enabled”), feed the model questions in both directions (“Does Feature X require Setting Y?” and “What features require Setting Y to be enabled?”), and compare accuracy. Per Berglund et al., you will likely see asymmetric performance. The direction the fact appears in training data will have better recall. This matters if your product makes claims in any domain where both directions of a factual relationship are operationally important — compliance, medical, legal, technical support.
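
Here is what that check can look like in code, as a sketch: the fact pairs, the prompts, and the naive substring grading are all placeholders (and call_llm() again stands in for your model wrapper). The point is only the structure — query each fact in both directions and score the directions separately.

```python
# Sketch: reversal-curse check over your own documentation facts.

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real API call.
    return "Setting Y"

# (feature, setting) pairs pulled from your docs: "Feature X requires Setting Y".
FACT_PAIRS = [("Bulk export", "Admin API access"), ("SSO login", "SAML configuration")]

forward_correct = backward_correct = 0
for feature, setting in FACT_PAIRS:
    fwd = call_llm(f"What setting does the '{feature}' feature require?")
    bwd = call_llm(f"Which features require '{setting}' to be enabled?")
    forward_correct += setting.lower() in fwd.lower()   # fact asked in the direction it appears in docs
    backward_correct += feature.lower() in bwd.lower()  # fact asked in the reversed direction

n = len(FACT_PAIRS)
print(f"forward accuracy:  {forward_correct}/{n}")
print(f"backward accuracy: {backward_correct}/{n}")  # per Berglund et al., expect this one to lag
```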


LeCun’s Specific Technical Claims Against LLMs (And What the Evidence Actually Shows)

LeCun’s critique isn’t “LLMs are bad.” It’s more precise than that, and the precision is what makes it useful. He has three core technical arguments, each one pointing at a different structural limitation.

Argument 1: LLMs Have No World Model

In his 2022 paper “A Path Towards Autonomous Machine Intelligence” (available on OpenReview), LeCun argues that intelligence requires a persistent, updatable internal model of how the world works — one that predicts the consequences of actions before taking them. LLMs don’t have this. They have a compressed statistical representation of text about the world, which is not the same thing.

The practical difference: a system with a world model can mentally simulate “if I push this glass off the table, it falls.” An LLM learns that glasses falling is associated with breaking because that’s what the training text says. It passes the verbal test. It fails the underlying reasoning test. LeCun demonstrated this distinction publicly at a 2023 NYU lecture, using the example of a robot trying to learn to walk — the token prediction objective gives you no gradient signal about physical cause and effect.

The evidence backing this up is suggestive rather than airtight. Microsoft’s “Sparks of AGI” evaluation of GPT-4 documented impressive capabilities alongside systematic failures on tasks that require planning and careful state tracking, and later robustness studies (Apple’s GSM-Symbolic work on mathematical reasoning, for example) found that performance drops sharply when the surface details of familiar problems are perturbed. The failures aren’t random: they cluster roughly where LeCun’s theory predicts they would.

Argument 2: LLMs Can’t Plan — They Can Only Complete

LeCun draws a hard line between sequence completion and planning. Planning requires holding a goal, generating candidate action sequences, evaluating them against a world model, and selecting the best one. That’s the architecture of how humans decide what to do next. LLMs do something categorically different: they generate the next plausible token given everything that came before.

This is why chain-of-thought prompting helps but has a ceiling. You can prompt an LLM to “think step by step” and it will produce text that looks like planning. But it’s generating plausible planning-shaped text, not actually searching a space of possible futures. LeCun pointed this out specifically in a January 2024 post on X, noting that systems like AutoGPT and early agent frameworks fail on tasks requiring more than a few interdependent decisions precisely because they have no mechanism to backtrack and revise a plan mid-execution.

The evidence: the ARC-AGI benchmark, designed by François Chollet to test genuine reasoning rather than memorization, still breaks most frontier LLMs. OpenAI’s o3 made progress on it, but through massive compute scaling on a test-taking strategy — not through the kind of architecture LeCun argues is necessary.

Argument 3: The Data Efficiency Problem Reveals What’s Missing

LeCun makes a pointed empirical observation: an infant works out object permanence and basic intuitive physics within its first year or two of life, from a comparatively tiny amount of sensory experience and no text at all. GPT-4 was reportedly trained on roughly 13 trillion tokens of text. If the approach were fundamentally correct, the sample-efficiency numbers shouldn’t be this far apart.

His proposed alternative, which he calls the Joint Embedding Predictive Architecture (JEPA), is designed to learn from raw sensory data by predicting abstract representations of future states rather than raw pixel values or tokens. Meta’s V-JEPA, released in 2024, is an early working implementation: it learns how scenes evolve from raw video, without labels and without token prediction. It’s not AGI, but it’s a proof of concept for his alternative path.

A Practical Stress-Test: Find LeCun’s Failure Modes in Your Own LLM Product

If you’re building something on top of GPT-4, Claude, or Gemini, LeCun’s arguments translate directly into testable predictions about where your product will break. Here’s a structured way to run that test yourself, using prompts you can run right now.

Test 1: Novel Physical or Causal Reasoning

Give the model a scenario it almost certainly hasn’t seen verbatim. Something like: “A glass of water is sitting on a table. A cat bumps the table from underneath at a 30-degree angle with enough force to move the table 4 inches in 0.2 seconds. Does the water spill? Walk through the physics.” Don’t ask for a formula — ask for a judgment. Strong models will produce confident-sounding answers. Check whether the reasoning is physically coherent or just plausible-sounding. LeCun’s prediction: the model will confabulate a physically consistent-sounding chain of logic that doesn’t actually simulate the scenario.

Test 2: Multi-Step Planning with Mid-Course Revision

Give your product (or raw API) a task like: “Plan a 5-step process to accomplish X. After step 3, I’m going to tell you something that invalidates step 2. Revise only what needs to change and explain what downstream effects the revision has.” Watch whether the model genuinely backtracks and propagates the change, or just appends a correction while leaving the original broken structure in place. LeCun’s prediction: models will patch rather than replan, because they have no internal plan representation to actually revise.
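
A sketch of that test as a two-turn exchange follows. chat() is a placeholder for whatever multi-turn API your product uses, and the migration task is an invented example; what matters is that the invalidation arrives after the model has committed to a plan.

```python
# Sketch: mid-course plan revision test as a two-turn exchange.

def chat(messages: list[dict]) -> str:
    # Placeholder: replace with a real chat-completions call.
    return "Revised plan: ..."

TASK = "Plan a 5-step process to migrate a production database to a new region with zero downtime."
INVALIDATION = (
    "New information: step 2 is impossible because the replication feature is not available "
    "in the target region. Revise only what needs to change and explain every downstream effect."
)

messages = [{"role": "user", "content": TASK}]
original_plan = chat(messages)
messages += [{"role": "assistant", "content": original_plan},
             {"role": "user", "content": INVALIDATION}]
revised_plan = chat(messages)

# Review by hand (or with a rubric): did the model actually replan around the missing step,
# or did it append a patch while leaving dependent steps untouched?
print(original_plan, "\n---\n", revised_plan)
```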

Test 3: Identify the Memorization Boundary

Take a task your product handles well and introduce a small structural twist that changes the correct answer but keeps the surface form similar. If your product does contract analysis, swap a standard clause for one with an inverted condition. If it does code review, introduce a bug that looks syntactically identical to a common pattern but has opposite semantics. Measure whether accuracy drops sharply. LeCun’s prediction: performance will collapse at exactly the point where pattern matching stops working and genuine reasoning would be required.
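
Sketched below with a single invented contract-clause pair (the prompts, expected answers, and prefix-match grading are all illustrative, and call_llm() is the usual placeholder): run the familiar form and its inverted twin through the same pipeline and compare.

```python
# Sketch: memorization-boundary check with paired prompts.
# Each pair keeps the surface form but flips the semantics.

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real API call.
    return "Yes"

# (prompt, expected answer) pairs: first the familiar form, then the inverted twin.
PAIRS = [
    ("Clause: 'Either party may terminate with 30 days notice.' "
     "Can the vendor terminate unilaterally?", "yes"),
    ("Clause: 'Neither party may terminate except with 30 days notice AND written consent "
     "of the other party.' Can the vendor terminate unilaterally?", "no"),
]

familiar_ok = call_llm(PAIRS[0][0]).strip().lower().startswith(PAIRS[0][1])
twisted_ok = call_llm(PAIRS[1][0]).strip().lower().startswith(PAIRS[1][1])
print("familiar form correct:", familiar_ok)
print("inverted form correct:", twisted_ok)  # a sharp drop here is the memorization boundary
```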

Decision Framework: What to Use Instead

| Problem type | LeCun’s predicted weakness (and expected reliability) | Better architecture or mitigation |
| --- | --- | --- |
| Text summarization, drafting, reformatting | High reliability: pattern completion is exactly what LLMs are built for | Plain LLMs are fine here |
| Factual recall from training data | Medium: degrades with recency, specificity, and niche domains | LLM plus retrieval-augmented generation (RAG) |
| Multi-step planning with hard constraints | Low: state drift, constraint violations, no forward simulation | Pair the LLM with a symbolic planner (e.g., a PDDL solver) or add explicit state tracking in code |
| Multi-step reasoning with dependencies | Low: errors compound and there is no backtracking mechanism | Neuro-symbolic systems, explicit planners, or LLM plus verifier |
| Novel mathematical or logical reasoning | Low: autocomplete mimics reasoning and fails outside the training distribution | LLM plus an external verifier or solver |
| Physical world predictions, interaction, and robotics | Very low: no world model, no sensorimotor grounding, outputs pattern-matched plausibility | Physics simulation or JEPA-style architectures and model-based reinforcement learning, with the LLM handling only the language interface |
| Long-horizon autonomous agents | Very low: planning without a world model fails past a few decision points | Hybrid systems: LLM for the language interface, separate planning and memory modules |
| Learning from small amounts of novel data | Very low: sample inefficiency is structural, not a tuning problem | Few-shot learning architectures, JEPA, structured world models |

The point isn’t to abandon LLMs. It’s to stop using them where LeCun’s analysis says they’ll structurally fail, and start building the hybrid architectures that compensate for those specific gaps.
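
In its simplest form, “hybrid architecture” can mean nothing more exotic than a propose-and-verify loop: the LLM drafts, deterministic code checks, and nothing unverified ships. The sketch below is illustrative only; call_llm(), verify(), and the retry logic are placeholders you would replace with your own model wrapper and a real verifier (a state simulator, schema validator, test run, or symbolic planner).

```python
# Sketch: propose-and-verify loop. The LLM drafts a plan, deterministic code checks it,
# and the system retries with the verifier's feedback instead of trusting the first answer.

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real API call.
    return "1. move C to table\n2. move A to B"

def verify(plan_text: str) -> tuple[bool, str]:
    # Placeholder verifier: in a real system this is a state simulator, schema validator,
    # unit-test run, or symbolic planner check -- something that cannot be talked into agreement.
    ok = plan_text.strip().startswith("1.")
    return ok, "" if ok else "plan must be a numbered list of moves"

def plan_with_verification(task: str, max_attempts: int = 3) -> str:
    feedback = ""
    for _ in range(max_attempts):
        draft = call_llm(task + ("\nVerifier feedback: " + feedback if feedback else ""))
        ok, feedback = verify(draft)
        if ok:
            return draft
    raise RuntimeError("no verified plan within attempt budget")

print(plan_with_verification("Stack block A on block B, given that C sits on A."))
```

The design point is LeCun’s: the reliability comes from the component that actually simulates or checks state, not from the language model.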

Ty Sutherland

Ty Sutherland is the Chief Editor of AI Rising Trends. Living in what he believes to be the most transformative era in history, Ty is deeply captivated by the boundless potential of emerging technologies like the metaverse and artificial intelligence. He envisions a future where these innovations seamlessly enhance every facet of human existence. With a fervent desire to champion the adoption of AI for humanity's collective betterment, Ty emphasizes the urgency of integrating AI into our professional and personal spheres, cautioning against the risk of obsolescence for those who lag behind. AI Rising Trends stands as a testament to his mission, dedicated to spotlighting the latest in AI advancements and offering guidance on harnessing these tools to elevate one's life.
