Why NVIDIA’s Growth Is Accelerating, Not Slowing, in 2026


NVIDIA’s stock was down 30 cents on the day Jensen Huang sat down at GTC 2026 to explain why the company’s growth is actually accelerating. He wasn’t particularly bothered. When you’ve watched your stock climb roughly 22,000% over the prior decade, a slow Tuesday is just a slow Tuesday. What Huang laid out at that conference — in front of 3,500 attendees representing a combined $40 trillion in market cap — wasn’t a defense of a hot stock. It was a structural argument about why the demand for compute is nowhere near a ceiling, and why most people are still misreading the moment.

One attendee told Huang his latest earnings print might be the single best in recorded human history. Huang’s response: “It must be only recorded humanity. I’m sure somebody had better returns.” That combination — genuine humility plus total confidence in the underlying thesis — is worth paying attention to. Because the thesis is specific, falsifiable, and worth understanding in full.

What “The Inference Inflection” Actually Means

There are two phases of AI compute demand: training and inference. Training is what most people think about when they imagine AI — massive GPU clusters running for weeks or months to produce a model. GPT-4 got trained. Gemini got trained. Llama got trained. That’s training compute.

Inference is everything after that. Every time you ask Claude a question, every time Copilot generates a line of code, every time an AI agent inside Salesforce autonomously updates a CRM record, every token produced in response to any query anywhere — that’s inference. And inference is where the volume lives. Training happens once (or periodically). Inference happens billions of times per day.

Huang specifically named a new phase at GTC 2026: what he called “the inflection of inference” — or the inference inflection. His argument is that we’ve crossed a threshold where inference demand is large enough, and growing fast enough, that it is now the primary engine driving NVIDIA’s business forward. The end markets, in his framing, are “really growing” — not flattening, not plateauing, not cooling off. And because inference scales with usage, and usage is compounding as AI gets embedded into more products and workflows, the demand curve bends upward, not sideways.

This is why Huang said directly: “Our growth is accelerating at a larger scale. That’s surprising for people.” It surprises people because conventional wisdom about hardware companies assumes some saturation point — you build the data centers, you fill them, growth slows. The inference inflection breaks that model. Every new AI feature, every new agent, every new software product that runs on tokens creates new ongoing inference demand. There is no “done.” This shift is part of a broader pattern that many analysts have called the AI inflection point — a moment where the technology stops being experimental and starts being structural.

Compute Equals Revenues. Full Stop.

Huang’s first structural argument is the simplest and the most direct: “Every single company will need compute for revenues.”

The chain he’s describing goes: compute → intelligence → digital workforce → revenues. This isn’t futurism. Huang frames it as something already happening. Companies are deploying AI agents that do real work — handling customer queries, processing documents, writing and reviewing code, running analyses that would have required teams of people. That work requires inference. Inference requires compute. Compute requires NVIDIA (or a competitor, but predominantly NVIDIA at scale today).

The implication is that compute stops being a cost center and becomes a revenue-generating asset. If your AI agents are closing deals, resolving support tickets, or accelerating your engineering output, then every dollar of compute you buy has a calculable return attached to it. That reframes the purchasing decision entirely. It’s not “how much can we afford to spend on infrastructure?” It’s “how much revenue are we leaving on the table by not buying more compute?”

That’s a fundamentally different demand dynamic than, say, storage or networking. And it’s why Huang said, with unusual directness: “You can’t hold the stock back. You can’t hold it back.” He’s not talking about momentum trading. He’s talking about a structural connection between compute purchase and revenue generation that creates sustained, growing demand regardless of market sentiment on any given Tuesday.

The Internet Already Proved This Works

One of the strongest parts of Huang’s argument is that it’s no longer theoretical. The hyperscalers — Meta, Google, AWS — have already run the experiment at scale and gotten the answer.

His framing: all major CSPs took their entire capital expenditure and converted it to generative and agentic AI infrastructure. Why? Because AI improves every single thing the internet does. Search gets better. Shopping recommendations get better. Ad targeting gets better. Social feeds get better. Every core internet business case — the ones that have generated trillions in value — improves with AI, and the companies running those businesses have demonstrated it with real revenue.

Huang’s direct quote here is worth sitting with: “The entire internet industry could take 100% of their capex and make it AI because it’s better. We’ve proven it to be better.”

The word “proven” is doing serious work in that sentence. This isn’t a pitch. Meta has disclosed the returns on its AI infrastructure investment in recommendation systems. Google has integrated AI into search, shopping, and ads and watched engagement and monetization metrics move. AWS is reselling AI compute capacity faster than it can build data centers. These are companies with the world’s most sophisticated CFOs running the numbers. They converted because the ROI was there, not because of enthusiasm.

For anyone trying to understand where we are in this cycle, that’s actually the most important signal: the biggest, most analytically rigorous buyers in the world ran the test and kept buying. They didn’t slow down after an initial deployment phase. They accelerated. The compounding effect of that investment is something Peter Diamandis’s abundance thesis anticipated — the idea that AI returns don’t taper once deployed, they multiply.

Every Software Company Becomes a Compute Buyer

Here’s where the argument gets interesting for people who aren’t thinking about hyperscalers. Huang’s fourth thesis is about the rest of the software industry — the Salesforces, the SAPs, the Oracles, the ServiceNows, the thousands of SaaS companies that don’t currently think of themselves as compute buyers at scale.

His framing: “The entire software industry will be token driven.”

Every software company, in Huang’s model, ends up in one of two positions: they either produce tokens themselves (running AI models to generate outputs for their users), or they resell tokens (acting as a distribution layer on top of AI infrastructure). There is no third option where a software company just stays as it is, offering static features and traditional workflows, and remains competitive.

If you produce tokens, you need compute. If you resell tokens, you are buying them from someone who needs compute. Either path leads to compute demand. As Huang put it: “For the first time, the entire IT industry will have to be fueled by compute.”

How to Think About Inference Costs as a Product Economic Input

Most founders treat inference cost as a line item to minimize. That’s the wrong frame. The right frame is: what does one token of inference buy me in terms of user value, and does that math compound in my favor or against me?

Here’s what the actual pricing looks like right now, so you can do real math:

Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for
GPT-4o | $2.50 | $10.00 | Complex reasoning, multimodal
GPT-4o mini | $0.15 | $0.60 | High-volume, cost-sensitive tasks
Claude 3.5 Haiku | $0.80 | $4.00 | Fast, capable, mid-tier volume
Claude 3.5 Sonnet | $3.00 | $15.00 | High-quality generation, agents
Gemini 1.5 Flash | $0.075 | $0.30 | Cheapest capable option at scale
Llama 3.1 70B (self-hosted) | ~$0.10–0.30 (compute cost) | ~$0.10–0.30 | High volume if ops overhead is worth it

Now work through a concrete example. Say you’re building a B2B SaaS tool that auto-drafts customer-facing emails. A typical use case: user clicks a button, your system pulls CRM context (roughly 800 input tokens), generates a draft (roughly 400 output tokens), user edits and sends.

On GPT-4o, that single interaction costs you: (800 × $2.50 / 1,000,000) + (400 × $10.00 / 1,000,000) = $0.002 + $0.004 = $0.006 per draft.

If your user generates 50 drafts a month and pays $49/month, your gross inference cost for that user is $0.30. That’s 0.6% of revenue. Totally fine. You’re not in a compute cost problem — you’re in a growth problem.

Now change the product: you’re building an AI agent that monitors a customer’s inbox 24/7, triages tickets, and generates responses autonomously. Same user might generate 2,000 inference calls per month at similar token counts. Now you’re at $12/month in inference cost against a $49 price point. That’s 24% of revenue before you pay for anything else. That’s a real problem, and it changes your model selection, your architecture, and possibly your pricing.

The number that matters is inference cost as a percentage of gross revenue per user. A healthy AI-native SaaS product wants this below 10% at scale. Above 20% and you are building a GPU reselling business with product features on top, and the economics only work if you’re also charging like it.
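
If you want to run this check against your own product, the arithmetic fits in a few lines. The sketch below reproduces the two scenarios above; the token counts, call volumes, and the $49 price point are the illustrative assumptions from this section, not benchmarks.

```python
# Back-of-envelope inference economics for the two scenarios above.
# GPT-4o list prices per 1M tokens are taken from the table; every other
# number is an illustrative assumption from this section.

INPUT_PRICE_PER_M = 2.50    # USD per 1M input tokens (GPT-4o)
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens (GPT-4o)


def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Inference cost of a single call, in USD."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def inference_share_of_revenue(calls_per_month: int, price_per_month: float,
                               input_tokens: int = 800, output_tokens: int = 400) -> float:
    """Monthly inference cost as a fraction of what the user pays you."""
    monthly_cost = calls_per_month * cost_per_call(input_tokens, output_tokens)
    return monthly_cost / price_per_month


# Scenario 1: click-to-draft tool, 50 drafts a month on a $49 plan.
print(f"Draft tool:  {inference_share_of_revenue(50, 49.0):.1%} of revenue")    # ~0.6%

# Scenario 2: always-on inbox agent, 2,000 calls a month on the same plan.
print(f"Inbox agent: {inference_share_of_revenue(2000, 49.0):.1%} of revenue")  # ~24%
```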

A Decision Framework: Rent vs. Own Compute, and When It Flips

The inference inflection creates a real operational decision for any team hitting scale: at what point does renting inference from OpenAI, Anthropic, or Google stop making sense, and when does running your own infrastructure — either self-hosted open weights models or reserved GPU capacity — become the better bet?

Here’s a framework to work through it. Answer these four questions:

1. What is your monthly inference spend right now?

Pull your actual API bills. If you’re under $5,000/month total, stop reading this section — you are nowhere near the crossover point. Optimization here is premature and will cost you engineering time worth more than what you’d save.

If you’re between $5,000 and $30,000/month, you’re in the evaluation zone. If you’re above $30,000/month, you should have already run this math.

2. Is your workload latency-sensitive and model-agnostic, or quality-dependent?

Not every inference task needs GPT-4o. Classification, extraction, summarization of structured data, and routing decisions are almost always fine on a smaller open-weight model like Llama 3.1 8B or Mistral 7B, which you can run on a single A100 for roughly $2–3/hour on Lambda Labs or Together AI. A single A100 can handle roughly 1–2 million tokens per hour at those model sizes. If your workload is 10 million tokens/day of classification tasks, that’s roughly 5–10 A100-hours/day, or $10–30/day, versus roughly $25–100/day if the same volume runs through GPT-4o at the table prices above. That gap is your budget for the operational overhead self-hosting adds.

But if your product’s core value prop is output quality — and users would notice and churn if quality dropped — switching models to save money is cutting the wrong thing.
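
Here is the question 2 arithmetic as a quick script. The hourly rate, throughput, and GPT-4o prices are the rough figures quoted above, so treat the output as an order-of-magnitude check rather than a quote.

```python
# Order-of-magnitude comparison for question 2: a latency-tolerant
# classification workload on a rented A100 versus the same volume on GPT-4o.
# Throughput, hourly rates, and prices are the rough figures from this section.

DAILY_TOKENS = 10_000_000                        # classification tokens per day

A100_HOURLY = (2.00, 3.00)                       # USD/hour, rented (low, high)
A100_TOKENS_PER_HOUR = (1_000_000, 2_000_000)    # rough serving throughput (low, high)

GPT4O_INPUT, GPT4O_OUTPUT = 2.50, 10.00          # USD per 1M tokens, from the table

# Self-hosted small open-weight model on a single A100.
hours_low = DAILY_TOKENS / A100_TOKENS_PER_HOUR[1]   # best-case throughput
hours_high = DAILY_TOKENS / A100_TOKENS_PER_HOUR[0]  # worst-case throughput
self_host_low = hours_low * A100_HOURLY[0]
self_host_high = hours_high * A100_HOURLY[1]

# Same volume through GPT-4o; the real number depends on your input/output mix,
# so bound it with the all-input and all-output extremes.
api_low = DAILY_TOKENS * GPT4O_INPUT / 1_000_000
api_high = DAILY_TOKENS * GPT4O_OUTPUT / 1_000_000

print(f"Self-hosted A100: ${self_host_low:.0f}-{self_host_high:.0f}/day")   # ~$10-30/day
print(f"GPT-4o API:       ${api_low:.0f}-{api_high:.0f}/day")               # ~$25-100/day
```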

3. Do you have the operational overhead to run infrastructure?

Self-hosting is not free. Budget at minimum 0.25 FTE of engineering time to manage model serving, uptime, versioning, and cost monitoring if you go this route. If your team is under 10 people, that is a significant tax. Factor it in.

A middle path that often gets overlooked: providers like Together AI, Fireworks AI, and Replicate let you run open-weight models on their infrastructure at per-token pricing that can be 5–10x cheaper than OpenAI for equivalent model sizes, without you managing any servers. Fireworks AI currently prices Llama 3.1 70B at $0.90 per million input tokens and $0.90 per million output tokens. That is the same model class as Claude 3.5 Haiku at nearly one-fifth the cost for many workloads.

4. What is the revenue threshold where owning makes sense?

A reserved A100 on CoreWeave or Lambda Labs runs roughly $2.00–2.50/hour on an annual contract, or about $1,500–1,800/month per GPU. One H100 reserved annually runs roughly $2.50–3.50/hour, or $1,800–2,500/month. If your current API spend on a replaceable workload exceeds $3,000/month and you have the ops capacity to manage it, self-hosting or reserved inference capacity starts to pencil out. If your spend is $10,000+/month on tasks a Llama-class model handles adequately, it almost certainly pencils out and the question is just execution risk.
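
A minimal break-even sketch for question 4, using the reserved prices above. The GPU count, the loaded engineering cost, and the replaceable API spend are placeholder assumptions to swap for your own numbers.

```python
# Break-even check for question 4: reserved GPU capacity plus the ops tax
# versus your current API spend on workloads a Llama-class model can handle.
# Reserved rates are the rough figures above; the GPU count, loaded engineer
# cost, and replaceable spend are placeholder assumptions.

RESERVED_A100_MONTHLY = (1_500, 1_800)           # USD/month per GPU, annual contract
OPS_FTE = 0.25                                   # engineering time to run it (from above)
LOADED_ENGINEER_ANNUAL = 180_000                 # assumed fully loaded cost, USD/year


def self_host_monthly(gpu_count: int) -> tuple[float, float]:
    """Low/high monthly cost of a reserved A100 cluster including the ops tax."""
    ops = OPS_FTE * LOADED_ENGINEER_ANNUAL / 12
    low = gpu_count * RESERVED_A100_MONTHLY[0] + ops
    high = gpu_count * RESERVED_A100_MONTHLY[1] + ops
    return low, high


replaceable_api_spend = 10_000   # USD/month currently going to APIs for Llama-class tasks
low, high = self_host_monthly(gpu_count=2)

print(f"Reserved 2x A100 + ops: ${low:,.0f}-{high:,.0f}/month "
      f"vs ${replaceable_api_spend:,.0f}/month on APIs")
# At $10k/month of replaceable spend this pencils out comfortably; closer to
# $3k/month the answer hinges on how much of the ops tax you are already paying.
```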

Use this as your decision checklist:

  • Monthly API inference spend above $5,000: start auditing which workloads are model-agnostic
  • Monthly API inference spend above $15,000: run a serious cost model on Together AI or Fireworks as an intermediate step
  • Monthly API inference spend above $30,000: reserved GPU capacity or hybrid routing should be in your architecture plan
  • Workload is latency-tolerant and output-quality-flexible: open-weight models on managed inference providers first
  • Workload is quality-critical and user-facing: stay on frontier APIs, optimize prompt length instead
  • Team under 10 people: avoid self-managed GPU infrastructure, use managed open-weight serving instead

How to Think About Inference Costs as a Product Economic Input

Most founders and operators treat inference costs as a line item to minimize. That’s the wrong frame. Inference cost is a unit economics variable that determines whether your AI-powered product can ever be profitable at scale. Here’s how to think through it concretely.

Start with what you’re actually paying per token today. As of mid-2025, the major tiers look roughly like this:

Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for
GPT-4o | $5.00 | $15.00 | High-accuracy tasks, customer-facing
GPT-4o mini | $0.15 | $0.60 | High-volume, latency-sensitive tasks
Claude 3.5 Sonnet | $3.00 | $15.00 | Long-context reasoning, document work
Claude 3 Haiku | $0.25 | $1.25 | Classification, summarization, routing
Gemini 1.5 Flash | $0.075 | $0.30 | Multimodal, cheap background tasks
Llama 3 70B (via Groq) | $0.59 | $0.79 | Speed-first, open weight flexibility

Now work backwards from your product. If a user session in your app generates roughly 2,000 input tokens and 500 output tokens, and you’re running on GPT-4o, that session costs you about $0.017 in inference alone. That sounds trivial. But if your product is priced at $20 per month and a heavy user runs 200 sessions per month, your inference cost for that user is $3.40 — before hosting, support, or any other variable cost. Your gross margin is already structurally constrained.

The worked example that actually matters:

1. Define your median session token consumption (input + output separately — output costs more).
2. Multiply by your 90th percentile user’s monthly session count. This is your worst-case inference cost per user per month.
3. Compare that number to your monthly revenue per user. If inference alone exceeds 20% of that revenue at the 90th percentile, you have a unit economics problem that will get worse as your best users engage more.
4. Now run the same calculation with a cheaper model. If your 90th percentile cost drops below 8% of monthly revenue per user using Claude Haiku or GPT-4o mini, you have a routing opportunity — serve power users with a cheaper model for high-volume tasks and reserve the expensive model for the moments that actually require it.

This is what model routing is for. It’s not an optimization you do later. It’s a product architecture decision you make at the start, because inference costs at scale are not linear — they compound with your best users, who are also usually your loudest advocates and hardest to churn.
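
Here is the four-step check as a short script, using the table prices above. The session sizes, the 90th-percentile usage, and the $20 price point are placeholder assumptions; swap in your own telemetry.

```python
# The four-step check above as a script. Prices come from the table; the
# session sizes, 90th-percentile usage, and $20 price point are placeholders.

PRICES = {                      # USD per 1M tokens: (input, output)
    "gpt-4o": (5.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-haiku": (0.25, 1.25),
}

ARPU = 20.00                    # monthly revenue per user
P90_SESSIONS = 200              # 90th-percentile user's monthly session count
IN_TOK, OUT_TOK = 2_000, 500    # median session token consumption


def monthly_inference_cost(model: str) -> float:
    """Worst-case monthly inference cost for one heavy user, in USD."""
    in_price, out_price = PRICES[model]
    return P90_SESSIONS * (IN_TOK * in_price + OUT_TOK * out_price) / 1_000_000


for model in PRICES:
    cost = monthly_inference_cost(model)
    share = cost / ARPU
    verdict = ("unit economics problem (>20%)" if share > 0.20
               else "routing opportunity (<8%)" if share < 0.08
               else "watch closely")
    print(f"{model:14s} ${cost:5.2f}/user/month  {share:5.1%}  {verdict}")
# GPT-4o lands around 17% of revenue for this heavy user; the cheaper models
# drop to roughly 1% or less, which is the routing opportunity described above.
```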

The Own-vs-Rent Decision Framework for Compute

The inference inflection creates a specific decision point that most operators hit without a clear framework: at what scale does renting inference from a cloud API become more expensive than owning or co-locating the compute yourself?

Here is a direct way to think through it.

Stage 1: Under $50K/month in inference spend — rent everything

Below this threshold, the operational overhead of managing your own GPU infrastructure — hiring the ML platform engineers, dealing with CUDA compatibility issues, managing uptime — costs more than the premium you’re paying to OpenAI or Anthropic. Use the APIs. Optimize model selection and prompt efficiency instead.

Stage 2: $50K–$300K/month in inference spend — audit and route aggressively

This is where most scaling startups live longer than they should without making changes. At $100K/month on GPT-4o, you can often cut that to $30K–$40K by doing three things: routing classification and retrieval tasks to cheaper models, caching repeated prompt patterns with something like GPTCache or semantic caching in LangChain, and shortening system prompts. A 60% cost reduction at this stage doesn’t require owning any hardware — it requires engineering discipline.
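
A minimal sketch of the routing idea: send well-bounded task types to a cheap model and keep the frontier model for open-ended work. The task labels and model names are illustrative, not a prescribed taxonomy.

```python
# A minimal routing sketch for Stage 2: cheap, well-bounded task types go to a
# small model; open-ended reasoning stays on the frontier model. The task
# labels and model names are illustrative; plug pick_model() into whichever
# client you already use, and log the chosen model and token counts per call.

CHEAP_MODEL = "gpt-4o-mini"        # or an open-weight model on managed serving
FRONTIER_MODEL = "gpt-4o"

# Task types that small models handle at a fraction of the cost.
CHEAP_TASKS = {"classify", "route", "tag", "extract"}


def pick_model(task_type: str) -> str:
    """Route by task type; default to the frontier model when unsure."""
    return CHEAP_MODEL if task_type in CHEAP_TASKS else FRONTIER_MODEL


if __name__ == "__main__":
    for task in ("classify", "draft_reply", "tag", "analyze_contract"):
        print(f"{task:16s} -> {pick_model(task)}")
```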

Specific audit checklist for this stage:

  • What percentage of your inference calls use the same system prompt verbatim? If it’s above 40%, implement prefix caching. Anthropic’s API supports prompt caching natively; it reduces cost on repeated context by up to 90% (see the sketch after this list).
  • Are you logging token counts per call type? If not, you’re flying blind. Add token logging to every inference call this week before doing anything else.
  • What’s your input-to-output token ratio? If input tokens dominate, you’re paying for context you may be able to compress or cache. If output tokens dominate, consider whether streaming and truncation thresholds are set correctly.
  • Are any of your inference calls classifying, routing, or tagging — tasks that a fine-tuned small model handles at 95% accuracy for one-tenth the price? Identify those call types and separate them from your reasoning-heavy calls.
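
As a sketch of the prefix-caching item above: with Anthropic's prompt caching, you mark the large, verbatim-repeated system prompt as cacheable and then check the usage fields on the response to confirm cache hits. The request shape below follows Anthropic's documented prompt-caching API at the time of writing; the model alias and prompt contents are placeholders, so verify the field names against the current API reference.

```python
# Sketch of the prefix-caching checklist item, using Anthropic's prompt caching:
# mark the large, verbatim-repeated system prompt as cacheable and check the
# usage fields on the response to confirm cache hits. Model alias and prompt
# text are placeholders; verify field names against the current API docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = (
    "You are a support triage assistant. "
    "<several thousand tokens of policy, tone guidance, and worked examples>"
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks this block as a cacheable prefix; later calls that reuse it
            # verbatim are billed at the much lower cache-read rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Customer says their invoice total is wrong."}],
)

# usage includes cache_creation_input_tokens and cache_read_input_tokens, which
# doubles as the per-call token logging the checklist asks for.
print(response.usage)
print(response.content[0].text)
```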

Stage 3: Above $300K/month in inference spend — model the own-vs-rent math directly

At this level, dedicated inference capacity starts to make financial sense. A single H100 SXM5 on-demand from AWS (p5 instances) runs roughly $32 per hour, or about $23,000 per month. A reserved instance drops that by 30–40%. If your inference workload is predictable and consistent rather than spiky, reserved or owned hardware pays back in 12–18 months at this spend level.

The calculation to run:

1. Take your current monthly API inference spend.
2. Estimate what throughput (tokens per second) you need at peak and at median load.
3. Size the H100 or H200 cluster that covers your median load. Overflow to API for spikes.
4. Add $150K–$250K annually per ML infrastructure engineer required to run it.
5. Compare total annual cost of ownership against your current API run rate projected forward 24 months, assuming 40% annual usage growth.

If owned infrastructure is cheaper over 24 months even after full engineering cost, the decision is clear. If it’s close, stay rented — operational risk and distraction have real costs that don’t show up in the spreadsheet. Most companies that move to owned inference too early underestimate how much engineering attention it pulls away from the product itself.
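
The five-step calculation above, as a sketch. Every input (spend, growth rate, cluster size, reserved discount, engineering cost) is a placeholder to replace with your own numbers; the point is the comparison structure, not the specific totals.

```python
# The Stage 3 calculation as a sketch: 24 months of projected API spend at an
# assumed growth rate versus a reserved cluster plus the engineers to run it.
# All inputs (spend, growth, cluster size, discount, salaries) are placeholders.

def projected_api_cost(monthly_spend: float, months: int = 24,
                       annual_growth: float = 0.40) -> float:
    """Total API spend over the window, compounding usage growth monthly."""
    monthly_growth = (1 + annual_growth) ** (1 / 12) - 1
    return sum(monthly_spend * (1 + monthly_growth) ** m for m in range(months))


def owned_cluster_cost(gpus: int, gpu_monthly: float, engineers: float,
                       engineer_annual: float = 200_000, months: int = 24) -> float:
    """Reserved-hardware cost plus ML infrastructure engineering over the window."""
    return months * (gpus * gpu_monthly + engineers * engineer_annual / 12)


# Median-load cluster sized per step 3, with API overflow absorbing spikes.
api_24mo = projected_api_cost(monthly_spend=300_000)
owned_24mo = owned_cluster_cost(
    gpus=16,
    gpu_monthly=23_000 * 0.65,   # ~35% reserved discount on the ~$23K/month figure above
    engineers=2,
)

print(f"API, 24 months:   ${api_24mo:,.0f}")     # roughly $10M at 40% annual growth
print(f"Owned, 24 months: ${owned_24mo:,.0f}")   # roughly $6.5M for this placeholder cluster
# If owned stays cheaper after full engineering cost, owning pencils out; if the
# two numbers are close, the operational-risk argument above says stay rented.
```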

The inference inflection means this calculation becomes relevant for more companies, and sooner, than anything the previous generation of SaaS operators ever had to think about. Running the numbers explicitly — rather than assuming APIs are always cheaper or always more expensive — is the actual work.

Ty Sutherland

Ty Sutherland is the Chief Editor of AI Rising Trends. Living in what he believes to be the most transformative era in history, Ty is deeply captivated by the boundless potential of emerging technologies like the metaverse and artificial intelligence. He envisions a future where these innovations seamlessly enhance every facet of human existence. With a fervent desire to champion the adoption of AI for humanity's collective betterment, Ty emphasizes the urgency of integrating AI into our professional and personal spheres, cautioning against the risk of obsolescence for those who lag behind. "Airising Trends" stands as a testament to his mission, dedicated to spotlighting the latest in AI advancements and offering guidance on harnessing these tools to elevate one's life.
