How to Tell If Your AI Agent Is Wasting Money
I talked to a team last month that was spending $1,200/month on an AI agent that answered shipping questions. The agent worked fine. Customers were happy. But when we dug into their token logs, 40% of their spend was the agent re-fetching the same tracking data it had already retrieved earlier in the conversation. The agent was having a conversation with itself about information it already had, and billing them for the privilege.
This kind of waste is everywhere. Most teams have no idea it's happening because they look at one number: the monthly LLM bill. That number tells you almost nothing useful.
## The Five Cost Traps
After watching hundreds of agent deployments through ClawTrait, I keep seeing the same patterns drain budgets.
**Redundant API calls** are the most common. An agent calls an external API, gets a result, then three turns later calls the same API with the same parameters because the result fell out of its context window. Or worse, the agent calls the API in a loop, re-verifying data it just verified. One team had an agent making 14 Stripe API calls per customer interaction when 3 would have been enough. That's not a rounding error. At scale, it was costing them $400/month in unnecessary token spend just on the API call descriptions and responses.
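The fix is usually a per-conversation cache keyed on tool name and parameters, so an identical call returns the stored result instead of hitting the API again. Here's a minimal sketch; the tool names and the fetch function are hypothetical stand-ins, not a real client.

```python
import hashlib
import json

class ToolResultCache:
    """Caches tool results for the lifetime of one conversation."""

    def __init__(self):
        self._results = {}

    def _key(self, tool_name, params):
        # Same tool + same parameters -> same cache key.
        raw = tool_name + json.dumps(params, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def call(self, tool_name, params, fetch):
        k = self._key(tool_name, params)
        if k not in self._results:
            self._results[k] = fetch(tool_name, params)  # only hit the API once
        return self._results[k]

cache = ToolResultCache()
calls = []

def fake_fetch(tool, params):
    calls.append((tool, params))  # stands in for a real API call
    return {"status": "delivered"}

# Three identical lookups in one conversation -> one real API call.
for _ in range(3):
    result = cache.call("get_tracking", {"order_id": "A123"}, fake_fetch)

print(len(calls))  # 1
```

The key detail is sorting the parameter keys before hashing, so `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` dedupe to the same entry.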
**Oversized models for simple tasks** are the second-biggest trap. If your agent uses Claude 3.5 Sonnet to classify whether a message is a billing question or a technical question, you're paying roughly $0.03 for something Haiku can do for $0.005. The quality difference for simple classification is negligible. I've tested this across thousands of routing decisions, and the accuracy gap between Sonnet and Haiku for binary classification is under 2%.
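A router can be as simple as a dictionary lookup. This sketch uses the per-call figures above; the model names and task mix are illustrative, not measurements.

```python
# Illustrative per-call costs from the figures above.
COST_PER_CALL = {"haiku": 0.005, "sonnet": 0.03}

def pick_model(task_type):
    # Simple/binary classification rarely needs the large model.
    return "haiku" if task_type == "simple_classification" else "sonnet"

# Hypothetical month: 800 routing decisions, 200 open-ended tasks.
tasks = ["simple_classification"] * 800 + ["open_ended"] * 200
routed = sum(COST_PER_CALL[pick_model(t)] for t in tasks)
all_sonnet = len(tasks) * COST_PER_CALL["sonnet"]

print(f"routed: ${routed:.2f} vs all-Sonnet: ${all_sonnet:.2f}")
```

At that task mix, routing cuts the bill from $30 to $10 for the same thousand calls, and the 2% accuracy gap only applies to the simple tasks you rerouted.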
**Runaway loops** are rarer but expensive when they happen. An agent hits an error, retries, hits the same error, retries again. Without a circuit breaker, this can burn through $50-100 in tokens before anyone notices. I've seen a single stuck conversation cost more than the agent's entire weekly budget.
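A circuit breaker for this is a few lines: cap both retry attempts and token spend, and stop the loop when either budget runs out. The limits below are assumptions you'd tune to your own costs.

```python
class CircuitBreaker:
    """Stops a retry loop before it burns the budget."""

    def __init__(self, max_retries=3, token_budget=20_000):
        self.max_retries = max_retries
        self.token_budget = token_budget
        self.attempts = 0
        self.tokens_spent = 0

    def allow(self):
        return (self.attempts < self.max_retries
                and self.tokens_spent < self.token_budget)

    def record(self, tokens):
        self.attempts += 1
        self.tokens_spent += tokens

breaker = CircuitBreaker(max_retries=3)
while breaker.allow():
    breaker.record(tokens=5_000)  # each failed attempt still costs tokens
    # ... attempt the task here, break out on success ...

print(breaker.attempts)  # stops at 3, not forever
```

Whichever limit trips first wins, so a single enormous retry can't slip under a low attempt count.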
**Bloated system prompts** are a slow leak. Every token in your system prompt gets sent with every API call. A 4,000-token system prompt costs roughly $0.012 per call on Sonnet. If your agent handles 500 conversations per day with an average of 6 API calls each, that's $36/day just for the system prompt. Trim it to 1,500 tokens and you save $22/day, or about $660/month.
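The arithmetic above is worth wiring into a function so you can test prompt trims before shipping them. This assumes the $3-per-million-input-tokens Sonnet price implied by the numbers in the text.

```python
# Assumed Sonnet input price: $3 per million tokens.
PRICE_PER_TOKEN = 3 / 1_000_000

def daily_prompt_cost(prompt_tokens, conversations_per_day, calls_per_conversation):
    # The system prompt is re-sent on every API call, not once per conversation.
    calls = conversations_per_day * calls_per_conversation
    return prompt_tokens * PRICE_PER_TOKEN * calls

before = daily_prompt_cost(4_000, 500, 6)  # the 4,000-token prompt
after = daily_prompt_cost(1_500, 500, 6)   # the trimmed version

print(f"${before - after:.2f}/day saved")
```

The multiplier that surprises people is `calls_per_conversation`: six calls per conversation means every prompt token is billed six times per customer.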
**Failed tasks that still cost money** round out the list. When an agent attempts a task, fails, and escalates to a human, you've paid for the tokens but gotten zero value. If your failure rate is 20%, one-fifth of your token spend is pure waste. Most teams don't separate successful-task costs from failed-task costs, so they never see this.
## How to Measure Cost-Per-Task
Stop looking at monthly totals. The metric that matters is cost per successful task completion.
Here's how to calculate it:
Total monthly token cost + infrastructure cost = total cost. Divide by the number of tasks completed successfully without human intervention. That's your real unit cost.
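As code, the formula is one division, and the denominator is what changes everything. The figures below mirror the shipping-question example: a $0.45 headline cost per attempt, with only 625 of 1,000 attempts resolved without escalation.

```python
def cost_per_success(total_cost, successful_tasks):
    # Failed attempts still sit in the numerator; only successes divide it.
    return total_cost / successful_tasks

total_cost = 1_000 * 0.45  # monthly token + infrastructure cost, all attempts
per_success = cost_per_success(total_cost, 625)

print(f"${per_success:.2f}")  # $0.72
```

Same spend, same volume; only the denominator moved, and the unit cost jumped from $0.45 to $0.72.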
For the shipping-question team I mentioned earlier, their headline number was $0.45 per interaction. After we separated successful completions from failures and escalations, their real cost per resolved question was $0.72, sixty percent more than the number they were steering by.
ClawTrait breaks this down automatically: cost per attempt versus cost per success, split by task type. The difference between those two numbers is your waste margin.
## What to Watch For
Set up alerts on these three things:
**Token spend per conversation that exceeds 2x your median.** This catches runaway loops and redundant API calls. If your median conversation costs $0.30, any conversation hitting $0.60+ deserves investigation.
**Model usage by task complexity.** Tag your tasks as simple, medium, or complex. If more than 30% of your expensive-model calls are on simple tasks, you're overspending. Route those to a cheaper model.
**Failure rate by task type.** Some task types will have a 5% failure rate. Others might be at 35%. The high-failure tasks are where your money disappears. Either fix them or stop running them through the agent.
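The three checks above can each be a few lines against whatever metrics store you already have. This is a sketch with made-up sample data; the thresholds (2x median, 30%) come from the text.

```python
import statistics

def conversation_cost_alerts(costs):
    # Flag any conversation costing more than 2x the median.
    median = statistics.median(costs)
    return [c for c in costs if c > 2 * median]

def overspending_on_simple(expensive_calls_by_complexity):
    # True if >30% of expensive-model calls went to simple tasks.
    total = sum(expensive_calls_by_complexity.values())
    return expensive_calls_by_complexity.get("simple", 0) / total > 0.30

def high_failure_tasks(failure_rate_by_type, threshold=0.30):
    return [t for t, r in failure_rate_by_type.items() if r > threshold]

costs = [0.25, 0.30, 0.32, 0.28, 0.95]  # median $0.30 -> alert at $0.60+
print(conversation_cost_alerts(costs))                                    # [0.95]
print(overspending_on_simple({"simple": 40, "medium": 30, "complex": 30}))  # True
print(high_failure_tasks({"refund": 0.35, "tracking": 0.05}))             # ['refund']
```

Median rather than mean matters for the first check: one runaway conversation inflates the mean and hides itself, but it barely moves the median.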
## The Uncomfortable Math
Here's what nobody wants to hear: some agents shouldn't exist. If your agent has a 40% failure rate on a task type that represents half its volume, and fixing the prompts hasn't helped after three iterations, that task type might just be too hard for current models. Routing it to a human might be cheaper than the combined cost of failed attempts plus successful attempts plus the engineering time you spend trying to improve it.
I know that's not the exciting AI-future narrative. But tracking real numbers forces real decisions. ClawTrait shows you the numbers. What you do with them is up to you.