Why Nvidia Controls the AI Race — And What Could End That


In January 2025, Jensen Huang walked onto a Las Vegas stage and announced that the world's installed base of data center infrastructure is now worth over a trillion dollars. Not one year's revenue: infrastructure value installed in the world. The crowd was full of CEOs who were quietly panicking about whether they'd ordered enough GPUs. That moment captures everything about where AI actually lives right now: not in the models, not in the apps, but in the chips. Whoever controls the compute controls the race. And right now, Nvidia controls the compute.

How Nvidia Ended Up Owning the AI Stack

This didn’t happen by accident. Nvidia spent roughly fifteen years building something that almost nobody cared about until suddenly everyone did. CUDA — Nvidia’s parallel computing platform — launched in 2007. For years it was a niche tool for researchers running physics simulations. Then deep learning happened, and it turned out that training neural networks was exactly the kind of massively parallel workload that GPUs were built for.

By the time OpenAI trained GPT-3 in 2020, the playbook was locked in: you needed Nvidia's data center GPUs (V100s then, A100s soon after), you needed CUDA, and everything else in the ML ecosystem (PyTorch, TensorFlow, Hugging Face, you name it) had been built on top of that stack. Switching costs became enormous. Not impossible, but enormous.
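
To make those switching costs concrete, here is a minimal PyTorch sketch of the kind of training code written at every AI lab (the model and tensor shapes are arbitrary placeholders). Nothing in it is exotic, and that is the point: the "cuda" device string and the Nvidia libraries it pulls in sit on the default path of the entire toolchain.

```python
# Minimal sketch: typical PyTorch training code assumes an Nvidia GPU.
# Every tensor, module, and library default below leans on the CUDA stack;
# porting to another vendor means auditing every call site like these.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4096, 4096).to(device)   # weights land in GPU memory when CUDA is present
x = torch.randn(64, 4096, device=device)   # activations allocated on the same device
loss = model(x).sum()
loss.backward()                            # autograd dispatches to Nvidia's GPU kernels (cuBLAS) here

print(f"running on: {device}")
```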

The H100, released in 2022, became the workhorse of frontier model training for the generation that followed, the chip every major lab scrambled to secure. At peak scarcity in 2023, H100s were selling on secondary markets for around $40,000 per unit. Cloud providers had waitlists months long. Startups were making hiring decisions based on GPU allocation. Elon Musk bought 10,000 H100s for xAI and made sure everyone knew about it. The chip became a proxy for AI ambition.

Then came the Blackwell architecture — the B100, B200, and the GB200 NVL72 rack systems — which Nvidia shipped into production throughout 2024 and into 2025. The performance jumps are real: Blackwell delivers roughly 2.5x the training throughput of Hopper for dense transformer workloads, and the NVL72 rack system essentially turns 72 GPUs into one giant interconnected compute unit. Microsoft, Google, Amazon, and Meta have all placed massive Blackwell orders. The scarcity problem hasn’t gone away — it’s just moved up the stack to a new chip generation.
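
A quick back-of-the-envelope shows why a generational jump like that matters at frontier scale. Everything in the sketch below is an illustrative assumption (the total FLOP budget, the per-GPU throughput, the cluster size), it ignores utilization and communication losses, and the only number taken from above is the rough 2.5x multiplier.

```python
# Back-of-the-envelope: how a ~2.5x per-GPU throughput gain changes wall-clock
# training time for a fixed compute budget. All inputs are illustrative
# assumptions, not published specs.

TRAINING_FLOPS = 1e25          # assumed total compute budget for one large training run
HOPPER_FLOPS_PER_SEC = 4e14    # assumed sustained throughput per Hopper-class GPU
BLACKWELL_SPEEDUP = 2.5        # rough multiplier cited above for dense transformer training
NUM_GPUS = 10_000              # assumed cluster size

def training_days(flops_per_gpu: float) -> float:
    """Idealized wall-clock days, ignoring communication and utilization losses."""
    seconds = TRAINING_FLOPS / (flops_per_gpu * NUM_GPUS)
    return seconds / 86_400

print(f"Hopper-class cluster:    {training_days(HOPPER_FLOPS_PER_SEC):.0f} days")
print(f"Blackwell-class cluster: {training_days(HOPPER_FLOPS_PER_SEC * BLACKWELL_SPEEDUP):.0f} days")
```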

The Numbers That Explain Everything

To understand why Nvidia’s position is so durable, you need to look at a few specific numbers:

  • ~80-85% market share in data center GPUs used for AI training as of early 2025. AMD is the closest competitor. The gap is wide.
  • $30.8 billion in data center revenue in Q3 FY2025 alone. For context, AMD’s entire annual revenue is around $25 billion.
  • CUDA ecosystem lock-in: Estimates suggest over 4 million developers actively use CUDA. The entire ML toolchain assumes it.
  • NVLink bandwidth: The GB200 NVL72 system offers 1.8 terabytes per second of bidirectional bandwidth between GPUs. This matters because the bottleneck in training large models is increasingly data movement, not raw compute; see the rough arithmetic sketched just after this list.
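
Here is the rough arithmetic behind that last bullet. The sketch estimates how long it takes just to move one copy of a large model's gradients between GPUs at different interconnect speeds; the 70B-parameter model and bf16 gradient size are illustrative assumptions, and a real all-reduce adds overhead on top of this idealized figure.

```python
# Rough sketch: time to move one full set of gradients during data-parallel
# training. Model size and the slower comparison bandwidth are assumptions;
# the 1.8 TB/s figure is the NVLink number cited above.

PARAMS = 70e9                              # assumed 70B-parameter model
BYTES_PER_GRAD = 2                         # bf16 gradients
GRADIENT_BYTES = PARAMS * BYTES_PER_GRAD   # ~140 GB per synchronization

def sync_time_ms(bandwidth_tb_per_s: float) -> float:
    """Idealized time to ship one gradient copy at the given bandwidth."""
    return GRADIENT_BYTES / (bandwidth_tb_per_s * 1e12) * 1e3

# Every optimizer step pays a cost on this order, thousands of times per run.
print(f"NVLink-class (1.8 TB/s):   {sync_time_ms(1.8):.0f} ms")
print(f"Ethernet-class (0.1 TB/s): {sync_time_ms(0.1):.0f} ms")
```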

Andrej Karpathy has been vocal about the fact that GPUs aren’t going anywhere as the substrate for AI — the architectural fit between transformers and GPU parallelism is too good. His point is essentially that we haven’t hit the wall where GPUs stop scaling, so there’s no forcing function to abandon them. That’s Nvidia’s moat in one sentence.

Who’s Actually Challenging Nvidia — and How Seriously

The honest answer: there are real challengers, but none of them are close to unseating Nvidia for frontier model training right now. Here’s the actual landscape:

Competitor | Key Hardware | Realistic Threat Level | The Catch
AMD | MI300X, MI325X | Medium: real traction for inference | ROCm software stack still lags CUDA; ecosystem adoption slow
Google (TPUs) | TPU v5p, Trillium (v6) | High, but only inside Google | TPUs are mostly captive to Google's own workloads; limited external availability
Amazon (Trainium) | Trainium2 | Low to medium for training | Narrow software support; mostly used by Amazon internally and a few large customers
Cerebras | CS-3 (wafer-scale) | Niche but real for specific workloads | Extremely expensive; not a general-purpose replacement
Groq | LPU (Language Processing Unit) | Medium, for fast inference only | Not competitive for training; inference-specific architecture
Tenstorrent | Grayskull, Wormhole | Early stage | Interesting architecture (Jim Keller is involved), but limited production deployment

AMD is the most interesting story here. The MI300X has gotten real traction for inference workloads — Microsoft deployed it for some Azure inference capacity, and several hyperscalers are running mixed fleets. But the gap in software tooling is significant. When you ask ML engineers why they don’t use AMD, the answer is almost always ROCm, AMD’s CUDA equivalent. It works, but it’s not as polished, the documentation lags, and most tutorials and libraries assume CUDA. That’s not a hardware problem — it’s a software ecosystem problem. It’s fixable, but fixing it takes years and developer trust, not quarters.
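
To see where the friction actually lives, consider a device-agnostic PyTorch sketch. PyTorch's ROCm builds expose AMD GPUs through the same torch.cuda API (HIP underneath), so simple code like this runs unchanged on either vendor; the gap engineers describe sits in the long tail of libraries, custom kernels, and tutorials that assume Nvidia specifically.

```python
# Sketch: the same PyTorch code path covers Nvidia (CUDA) and AMD (ROCm/HIP).
# torch.version.hip is set on ROCm builds and None on CUDA builds.
import torch

if torch.cuda.is_available():
    backend = "ROCm/HIP" if torch.version.hip else "CUDA"
    device = torch.device("cuda")
else:
    backend, device = "CPU", torch.device("cpu")

x = torch.randn(1024, 1024, device=device)
y = x @ x                                  # routed to cuBLAS on Nvidia, rocBLAS on AMD
print(f"backend: {backend}, device: {device}")
```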

Google’s TPUs are genuinely impressive and probably the most credible technical alternative for transformer training. The Trillium chips reportedly offer better performance-per-watt than Blackwell on certain workloads. But Google builds them for Google. They’re available on GCP, but external adoption is limited. Most AI labs default to Nvidia because that’s what their researchers know, that’s what the open-source tooling targets, and that’s what they can get in volume from multiple cloud providers simultaneously.
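
For completeness, here is what targeting a TPU looks like from the researcher's side. This is a minimal JAX sketch rather than anything Google-internal: the same jitted function runs on whatever accelerator is attached, a TPU on a Cloud TPU VM or an Nvidia GPU elsewhere, which is why the barrier is availability and ecosystem rather than code.

```python
# Minimal JAX sketch: hardware-agnostic by default. The function below is a
# stand-in for a training step, not a real workload.
import jax
import jax.numpy as jnp

@jax.jit
def step(w, x):
    # One dense layer plus a reduction, compiled by XLA for the local device.
    return jnp.tanh(x @ w).sum()

w = jnp.ones((512, 512))
x = jnp.ones((64, 512))

# jax.devices() reports the attached accelerator: a TPU on Cloud TPU, a GPU elsewhere, CPU otherwise.
print(jax.devices()[0].platform, float(step(w, x)))
```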

The Custom Silicon Play: Why Big Tech Is Building Its Own Chips

The most strategically interesting story in compute right now isn’t AMD or Groq. It’s the hyperscalers building their own silicon to reduce Nvidia dependence. This is a multi-year bet with real stakes.

Apple showed the world what's possible with custom silicon when it launched the M1 in 2020, delivering performance-per-watt gains that x86 hadn't seen in a decade. The hyperscalers watched that and started asking: what if we built something tuned to our own AI workloads?
