Most conversations about AI agents are still happening in the future tense. But a growing number of businesses — from Fortune 500s to 10-person startups — are running agents in production right now, and some of them are genuinely working. Not demos. Not proofs of concept collecting dust. Actual deployed systems handling real workloads, saving real money, and occasionally doing things their builders didn’t fully anticipate. The gap between “AI agents are coming” and “AI agents are here” closed faster than most people expected, and 2025 was the year the receipts started showing up.
What “Working” Actually Means in This Context
Before getting into specific deployments, it’s worth being precise about the word “working.” A lot of agent deployments are technically functional but economically marginal — they automate something that wasn’t really a bottleneck, or they require so much human supervision that the ROI is questionable. That’s not nothing, but it’s not the same as a deployment that demonstrably reduces headcount requirements, accelerates a core workflow by a measurable factor, or unlocks something the business literally couldn’t do before at scale.
The deployments worth paying attention to share a few characteristics: they’re operating in a constrained, well-defined domain; they have clear success metrics; they’ve survived contact with real-world messiness (edge cases, bad inputs, system failures); and the humans overseeing them have figured out where to trust the agent and where to verify. Andrej Karpathy has made the point that current LLMs are like “a brilliant intern who just started” — capable and fast, but requiring thoughtful supervision structures. The businesses getting real results have internalized that framing and built accordingly.
Customer Support: The Highest-Volume Success Story
If there’s one domain where AI agents have clearly crossed the threshold from experiment to infrastructure, it’s customer support. Klarna’s deployment of its OpenAI-powered assistant handling the equivalent of 700 full-time agents’ workload became one of the most cited examples of 2024-2025, and while some of the headline numbers deserve scrutiny, the underlying dynamic is real: for high-volume, text-based customer interactions with well-documented resolution paths, agents are now genuinely cost-effective at scale.
Salesforce’s Agentforce platform has been deployed by companies like Wiley (academic publishing) and OpenTable to handle first-contact resolution on common support queries. What makes these deployments work isn’t magic — it’s that customer support is structurally suited for agents. The inputs are relatively constrained (someone has a problem with an order, a subscription, an account), the resolution paths are documentable, and the cost of a wrong answer is usually recoverable (escalate to a human). The agent doesn’t need to be perfect; it needs to be right often enough and smart enough to know when it’s not.
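To make that escalation logic concrete, here is a minimal sketch in Python. The intent matching is a trivial keyword lookup standing in for a real model call, and the playbook, confidence threshold, and field names are hypothetical rather than any vendor's implementation:

```python
# Minimal sketch of the "resolve when confident, escalate when not" pattern.
# draft_reply() stands in for the model call a support platform would make;
# the playbook entries and threshold are illustrative only.
from dataclasses import dataclass
from typing import Optional

PLAYBOOK = {
    "refund": "Refunds on orders under 30 days old are issued automatically...",
    "password reset": "Send the self-service reset link and confirm the account email...",
}
CONFIDENCE_FLOOR = 0.85  # tuned per deployment; illustrative value


@dataclass
class AgentReply:
    text: str
    confidence: float
    matched_path: Optional[str]  # which documented resolution path matched, if any


def draft_reply(ticket_text: str) -> AgentReply:
    """Stand-in for a real intent-classification + drafting model call."""
    lowered = ticket_text.lower()
    for intent, resolution in PLAYBOOK.items():
        if intent in lowered:
            return AgentReply(resolution, confidence=0.92, matched_path=intent)
    return AgentReply("I'm not sure how to resolve this.", confidence=0.30, matched_path=None)


def handle_ticket(ticket_text: str) -> dict:
    reply = draft_reply(ticket_text)
    # Auto-resolve only when the agent matched a documented resolution path
    # and is confident; everything else lands in a human queue with a draft attached.
    if reply.matched_path and reply.confidence >= CONFIDENCE_FLOOR:
        return {"action": "auto_resolve", "reply": reply.text}
    return {"action": "escalate_to_human", "draft": reply.text}


print(handle_ticket("I need a refund on order #1234"))
print(handle_ticket("Your app crashed and deleted my data"))
```

The interesting design decision in real deployments is where that confidence floor sits: set it too low and bad answers leak through, set it too high and the agent escalates everything and saves nothing.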
Zendesk’s AI agents, built on technology from its acquisition of Ultimate.ai, are now handling tens of millions of support tickets per month across its customer base. The realistic headline number from deployments that have published data: 60-80% automated resolution on tier-1 support, with human agents handling the remainder. That’s not replacing support teams — it’s dramatically changing their composition and what those humans spend time on.
Software Development: Where Agents Are Moving Fastest
The software development use case is where agent capabilities are advancing most visibly, and where the gap between what’s possible and what’s deployed in production is currently narrowest. A few specific deployments:
Cursor + Claude Sonnet is now the standard development environment for a meaningful portion of early-adopter engineering teams. This isn’t just autocomplete — teams are using Cursor’s Agent mode to handle full feature implementations from a spec, debug production issues by feeding in error logs and codebase context, and write tests. The honest picture: it works well for greenfield features in well-documented codebases, struggles with deeply entangled legacy systems, and still requires a competent engineer in the loop to catch hallucinated function calls and logic errors.
GitHub Copilot Workspace takes this further — you describe a task in natural language, it generates a plan, proposes code changes across multiple files, and you review before committing. Early adopters at companies like Accenture report meaningful acceleration on well-scoped tasks. The caveat is that “well-scoped” is doing a lot of work in that sentence.
Devin from Cognition AI has been deployed at a handful of companies for specific narrow tasks — particularly writing boilerplate, handling minor bug fixes, and updating documentation. The real-world performance on complex engineering tasks has been more modest than the initial demo suggested, but on the narrow tasks it’s been pointed at, it delivers. This is a pattern worth generalizing: agents that are deployed against their actual current capabilities, not their theoretical future ones, tend to work.
Back-Office Automation: The Quiet Wins
The least glamorous and arguably most economically significant agent deployments are happening in back-office operations — the routine work of moving data between systems, processing documents, and managing workflows that previously required armies of coordinators.
Accounts payable and invoice processing is a category where companies like Stampli and BILL have deployed AI agents that can extract data from invoices, match against purchase orders, flag exceptions, and route approvals — with minimal human intervention on clean inputs. The scale at which this is operating is meaningful: BILL processes over $300 billion in payment volume annually, and a substantial portion of that document processing is now agent-assisted.
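A simplified sketch of that matching-and-routing step is below. The upstream document extraction (OCR or an LLM pulling fields off the invoice) is omitted, and the data model and variance tolerance are illustrative, not how Stampli or BILL actually implement it:

```python
# Illustrative sketch of the workflow described above: match an extracted
# invoice against a purchase order, flag exceptions, route for approval.
from dataclasses import dataclass


@dataclass
class Invoice:
    vendor: str
    po_number: str
    amount: float


@dataclass
class PurchaseOrder:
    po_number: str
    vendor: str
    amount: float


AMOUNT_TOLERANCE = 0.02  # 2% variance allowed before a human reviews it


def route_invoice(inv: Invoice, purchase_orders: dict[str, PurchaseOrder]) -> str:
    po = purchase_orders.get(inv.po_number)
    if po is None:
        return "exception: no matching PO, route to AP specialist"
    if po.vendor != inv.vendor:
        return "exception: vendor mismatch, route to AP specialist"
    if abs(inv.amount - po.amount) > AMOUNT_TOLERANCE * po.amount:
        return "exception: amount variance, route to approver"
    return "clean match: auto-approve and schedule payment"


pos = {"PO-1001": PurchaseOrder("PO-1001", "Acme Supply", 1200.00)}
print(route_invoice(Invoice("Acme Supply", "PO-1001", 1210.00), pos))
print(route_invoice(Invoice("Acme Supply", "PO-9999", 500.00), pos))
```

The agent earns its keep on the clean-match path; the exception branches are exactly where humans stay in the loop.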
Legal document review has seen real deployment at mid-size firms using tools like Harvey AI (built on GPT-4 class models, specifically fine-tuned on legal corpora) and Ironclad for contract management. The use case isn’t replacing lawyers — it’s handling the first-pass review that associates used to spend hours on: flagging non-standard clauses, summarizing NDAs, identifying missing provisions. Allen & Overy (now A&O Shearman) was an early Harvey adopter and has been public about the time savings on document review tasks.
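For a sense of what first-pass review looks like mechanically, here is a hedged sketch using a general-purpose model API rather than Harvey's actual pipeline; the checklist, prompt, and model choice are all illustrative assumptions:

```python
# Minimal sketch of first-pass contract review with a general-purpose model
# API. This is not Harvey's implementation; checklist and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REVIEW_CHECKLIST = [
    "non-standard indemnification or liability caps",
    "missing governing-law or dispute-resolution clauses",
    "unusual termination or auto-renewal terms",
]


def first_pass_review(contract_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {
                "role": "system",
                "content": "You are a contract review assistant. Flag issues for a "
                           "lawyer's second-pass review; do not give legal advice.",
            },
            {
                "role": "user",
                "content": "Review this agreement and flag: "
                           + "; ".join(REVIEW_CHECKLIST)
                           + f"\n\n---\n{contract_text}",
            },
        ],
    )
    return response.choices[0].message.content


# Usage: print(first_pass_review(open("nda.txt").read()))
```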
Data pipeline maintenance is an emerging category — agents that monitor data pipelines, detect anomalies, write and test fixes, and alert humans only when they’ve exhausted their remediation playbook. Startups like Sifflet and Monte Carlo have moved in this direction, and engineering teams at data-heavy companies are experimenting with custom agents built on the OpenAI Assistants API or Anthropic’s Claude API for this purpose.
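A rough sketch of that monitor, remediate, escalate loop follows, using Anthropic's Claude API to draft the incident summary once the playbook is exhausted. The checks, remediation steps, and model ID are placeholders, not a production setup:

```python
# Sketch of a pipeline-maintenance agent: check health, walk a remediation
# playbook, and only involve a human when the playbook is exhausted.
# The checks and playbook are stand-ins; the model ID is illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def freshness_ok() -> bool:
    # Stand-in for a real check, e.g. "has the orders table updated in 24h?"
    return False


def rerun_ingestion_job() -> bool:
    # Stand-in remediation step; returns True if the retry fixed the issue.
    return False


PLAYBOOK = [("re-run ingestion job", rerun_ingestion_job)]


def monitor_once() -> str:
    if freshness_ok():
        return "healthy"

    attempted = []
    for name, remediation in PLAYBOOK:
        attempted.append(name)
        if remediation():
            return f"remediated via: {name}"

    # Playbook exhausted: have the model draft a summary for the on-call human.
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # illustrative model ID
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                "A data freshness check failed and these remediations did not "
                f"resolve it: {attempted}. Draft a concise incident summary and "
                "suggest next diagnostic steps for the on-call engineer."
            ),
        }],
    )
    return "escalated to human:\n" + message.content[0].text


print(monitor_once())
```

The point of the structure is the ordering: cheap deterministic checks and known-good fixes first, the model only for the judgment-heavy step of explaining what happened, and a human always at the end of the chain.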
What Separates Deployments That Work From Ones That Don’t
After looking at a wide range of deployments — successful and failed — there are consistent patterns on both sides. Here’s a framework for thinking about agent deployment readiness:
| Factor | High Success Signal | High Risk Signal |
|---|---|---|
| Task definition | Clear inputs, clear success criteria, bounded scope | Fuzzy goals, requires judgment calls on values |
| Error cost | Recoverable — human can catch and correct | Irreversible — wrong action has real consequences |
