AI Cost Control: Routing, Effort, and Caching

Two teams can run the same AI workload and get bills that differ by ten times. Same models, same tasks, same output quality. The difference is not the model they chose. It is three habits the cheaper team has and the expensive team does not: routing, effort tuning, and caching. None of them is technical enough to need an engineer. All of them are ignored by most people paying frontier-model prices.

I started paying attention to this when a fractional COO client’s AI spend tripled in a quarter with no change in what they were producing. Nothing was broken. They had simply defaulted everything to the most powerful model, left every dial at maximum, and re-sent the same long context on every call. Fixing those three things cut the bill by roughly 70% and changed nothing a user could see. Here is what each skill is and how to build it.

Routing: send each task to the cheapest model that can do it

Routing is deciding, per task, which model handles it, instead of defaulting everything to the flagship.

The rule that drives it is one question: can you verify the output yourself? If you can read the result and judge whether it is right, a cheap model is the correct choice, because the extra horsepower of a frontier model buys you nothing on work you are going to check anyway. Reserve the expensive tiers for long, complex, hard-to-verify jobs where a subtle error is costly and you cannot personally catch it.

The price gaps make this the biggest lever you have. Claude Fable 5 costs $10 per million input tokens and $50 per million output; Sonnet 4.6 is a fraction of that; a mini model is cheaper still. Routing a task from the flagship to the right workhorse is often a 20-to-1 cost change with no quality loss. The practical version is an escalation pattern: default to the cheap model, add a quick check, and escalate only the cases that fail. I cover the cross-provider mechanics in the Claude, GPT, or Gemini routing guide, but the habit matters more than the tooling. You can route well by hand.

Effort: stop paying for thinking that doesn’t change the answer

Modern models let you set how hard they think. Claude has effort levels; GPT-5.5 exposes low, medium, high, and xhigh; the others have their own versions. The setting trades latency and tokens for depth of reasoning.

Most people leave it at maximum out of a vague sense that more is safer. It is not safer; it is just slower and pricier. The right default is the standard or medium level for routine work, with the top tiers reserved for genuinely hard analysis where capturing every nuance matters and a missed detail is expensive. As Simon Willison observed in his early testing of Fable 5, the gains from more reasoning are real but concentrated in the hard cases. On easy cases, cranking effort costs you and returns nothing.

The discipline is the same as routing, applied to a dial instead of a model: spend more only where it changes the outcome. Once you internalize that, you reach for high effort deliberately, on the few tasks that earn it, rather than leaving it on by accident across everything.

Caching: stop paying full price to re-send the same context

This is the one almost nobody uses, and it is often the easiest win.

Prompt caching lets the provider store a chunk of context you reuse across many calls (a long system prompt, a style guide, a codebase, a reference document) and charge you a steep discount when it appears again instead of full input price every time. If you are running hundreds of calls that all begin with the same 20,000-token preamble, caching can cut the input cost of that preamble dramatically after the first call.

The pattern to look for: any workload where a large, stable block of context repeats. Customer support bots that reload the same product manual on every query. Coding agents that re-read the same codebase. Content pipelines that prepend the same brand guidelines to every generation. All of those are paying full freight for tokens the model has effectively already seen. Structure the prompt so the stable context comes first and is marked cacheable, keep the variable part at the end, and the provider does the rest. Most major APIs support some form of this now, and the savings compound with volume.

Why these three and not a dozen

There are plenty of advanced techniques, but routing, effort, and caching are the ones with the best ratio of savings to difficulty. They require no model fine-tuning, no infrastructure, and no engineering team. They are decisions: which model, how hard, and what to cache. That is why they are the line between people who treat AI as a flat monthly cost and people who treat it as something they actually control.

If you do nothing else, write down a routing rule for your team and turn the effort dial down to medium as your default. Those two changes alone will move your bill more than any vendor negotiation. Add caching on your highest-volume repeated workload and you have captured most of the savings available without touching the quality of a single output.

For the model-by-model version of the routing call, including exactly when the frontier tier is worth its price, see how to get the most out of Claude Fable 5.

The Three Skills That Separate AI Power Users: Routing, Effort, and Caching

Routing: send each task to the cheapest model that can do it

Effort: stop paying for thinking that doesn’t change the answer

Caching: stop paying full price to re-send the same context

Why these three and not a dozen

Recent Posts