This article originally appeared on Medium. Tim O’Brien has given us permission to repost here on Radar.

When you’re working with AI tools like Cursor or GitHub Copilot, the real power isn’t just having access to different models; it’s knowing when to use them. Some jobs are fine on Auto. Others need a stronger model. And sometimes you should bail and switch rather than keep spending money throwing a lower-quality model at a complex problem. If you don’t, you’ll waste both time and money.

This is the missing discussion in code generation. There are a few “camps” here: the majority of people writing about it treat it as a fantastical, fun “vibe coding” experience, while a few people out there are trying to use this technology to deliver real products. If you are in that latter category, you’ve probably started to realize that you can spend a fantastic amount of money if you don’t have a strategy for model selection.

Let’s make it very specific—if you sign up for Cursor and drop $20/month on a subscription using Auto and you are happy with the output, there’s not much to worry about. But if you are starting to run agents in parallel and are paying for token consumption atop a monthly subscription, this post will make sense. In my own experience, a single developer working alone can easily spend $200–$300/day (or four times that figure) if they are trying to tackle a project and have opted for the most expensive model.

And if you are a company giving your developers unlimited access to these tools, get ready for some surprises.

My Escalation Ladder for Models…

Start here: Auto. Let Cursor route to a strong model with good capacity. If output quality degrades or the model gets stuck in a loop, escalate. (Cursor explicitly says Auto selects among premium models and will switch when output is degraded.)

Medium-complexity tasks: Sonnet 4/GPT‑5/Gemini. Use for focused tasks on a handful of files: robust unit tests, targeted refactors, API remodels.

Heavy lift: Sonnet 4 with a 1M-token context. When I need something that requires more context but still don’t want to pay top dollar, I move up to models that don’t quickly max out on context.

Ultraheavy lift: Opus 4/4.1. Use this when the task spans multiple projects or requires long context and careful reasoning, then switch back once the big move is done. (Anthropic positions Opus 4 as a deep‑reasoning, long‑horizon model for coding and agent workflows.)

Auto works fine, but there are times when you can sense it has selected the wrong model. If you use these models enough, you can tell when you’re looking at Gemini Pro output by its verbosity, or at one of the ChatGPT models by the way it goes about solving a problem.

I’ll admit that my heavy and ultraheavy choices are biased toward the models I’ve had the most experience with; your own experience may vary. Still, you should have a similar escalation list of your own. Start with Auto and only upgrade when you need to; otherwise, you’re going to learn some expensive lessons about what this costs. A sketch of this ladder as a simple routing function follows below.
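
To make the ladder concrete, here’s a minimal sketch of it as a routing function. The tier names follow the list above; the file-count and context thresholds are illustrative guesses on my part, not Cursor settings or published limits.

```python
# A minimal sketch of the escalation ladder as a routing function.
# The tiers mirror the article; the thresholds are illustrative only.

from dataclasses import dataclass

@dataclass
class Task:
    files_touched: int          # rough scope of the change
    context_tokens: int         # estimated context you need loaded
    needs_deep_reasoning: bool  # analysis, trade-offs, devil's advocate

def pick_model(task: Task) -> str:
    """Return a model tier for the task, escalating only when needed."""
    if task.needs_deep_reasoning or task.files_touched > 20:
        return "opus-4.1"       # ultraheavy lift: long context plus careful reasoning
    if task.context_tokens > 200_000:
        return "sonnet-4-1m"    # heavy lift: big context without top-dollar pricing
    if task.files_touched > 1:
        return "sonnet-4"       # medium complexity: focused multi-file work
    return "auto"               # default: let Cursor route it

# Example: a targeted refactor across three files stays on a mid-tier model.
print(pick_model(Task(files_touched=3, context_tokens=40_000, needs_deep_reasoning=False)))
```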

Watch Out for “Thinking” Model Costs

Some models support explicit “thinking” (longer reasoning). Useful, but costlier. Cursor’s docs note that enabling thinking on specific Sonnet versions can count as two requests under team request accounting, and in the individual plans, the same idea translates to more tokens burned. In short, thinking mode is excellent—use it when you need it.

And when do you need it? My rule of thumb: when I already understand what needs to be done, say when I’m asking for a unit test to be polished or a method to be implemented in the pattern of another, I usually don’t need a thinking model. On the other hand, if I’m asking it to analyze a problem and propose options for me to choose from, or (something I do often) asking it to challenge my decisions and play devil’s advocate, I’ll pay the premium for the best model.
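
Here’s a rough sketch of that rule of thumb as a helper you might run before deciding which mode to use. The cue phrases and the cheap-by-default behavior are my own illustrative choices, not anything Cursor or the model providers expose.

```python
# Reserve "thinking" mode for exploratory prompts (analysis, trade-offs,
# devil's advocate) and keep mechanical edits on the cheaper path.
# The cue list below is illustrative, not exhaustive.

ANALYSIS_CUES = ("analyze", "propose options", "trade-off",
                 "devil's advocate", "challenge my")

def should_enable_thinking(prompt: str) -> bool:
    """True when the prompt is exploratory enough to justify the extra cost."""
    lowered = prompt.lower()
    return any(cue in lowered for cue in ANALYSIS_CUES)

# Mechanical request: stays on the cheaper, non-thinking path.
print(should_enable_thinking("Polish this unit test to match the others"))    # False
# Exploratory request: worth paying for deeper reasoning.
print(should_enable_thinking("Challenge my decision to split this service"))  # True
```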

Max Mode and When to Use It

If you need giant context windows or extended reasoning (e.g., sweeping changes across 20+ files), Max Mode can help—but it will consume more usage. Make Max Mode a temporary tool, not your default. If you find yourself constantly requiring Max Mode to be turned on, there’s a good chance you are “overapplying” this technology.

If it needs to consume a million tokens for hours on end? That’s usually a hint that you need another programmer. More on that later, but what I’ve seen too often is managers who think this is the “vibe coding” they’ve been watching. Spoiler alert: Vibe coding is that thing people do in presentations because it takes five minutes to make a silly video game. It’s 100% not programming, and here’s the secret to using codegen: You have to understand how to program.

Max Mode and thinking models are not a shortcut, and neither are they a replacement for good programmers. If you think they are, you are going to be paying top dollar for code that will one day have to be rewritten by a good programmer using these same tools.

Most Important Tip: Watch Your Bill as It Happens

The most important tip is to monitor your utilization and usage fees in Cursor regularly, since they appear within a minute or two of running something. You can see usage by the minute, the number of tokens consumed, and in some cases, how much you’re being charged beyond your subscription. Make a habit of checking at least a couple of times a day, and ideally every half hour during heavy sessions. This helps you catch runaway costs, like spending $100 an hour, before they get out of hand, which is entirely possible if you’re running many parallel agents or doing resource-intensive work. Paying attention keeps you in control of both your usage and your bill.
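
If you want to make that habit mechanical, here’s a small sketch that sums spend per hour from a usage export and flags anything alarming. The file name and the timestamp/cost_usd columns are hypothetical; adapt them to however you actually capture usage data.

```python
# Sum spend per clock hour from a usage export and flag hours that exceed
# a threshold. The CSV layout here is an assumption, not a Cursor format.

import csv
from collections import defaultdict
from datetime import datetime

HOURLY_ALERT_USD = 25.0  # pick a threshold that would surprise you

def hourly_spend(path: str) -> dict[str, float]:
    """Sum cost per clock hour from a usage export."""
    totals: dict[str, float] = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            hour = datetime.fromisoformat(row["timestamp"]).strftime("%Y-%m-%d %H:00")
            totals[hour] += float(row["cost_usd"])
    return totals

if __name__ == "__main__":
    for hour, spend in sorted(hourly_spend("usage_export.csv").items()):
        flag = "  <-- check your agents" if spend > HOURLY_ALERT_USD else ""
        print(f"{hour}  ${spend:.2f}{flag}")
```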

Keep Track and Avoid Loops

The other thing you need to do is keep track of what works and what doesn’t. Over time, you’ll notice it’s very easy to make mistakes, and the models themselves can sometimes fall into loops. You might give an instruction, and instead of resolving it, the system keeps running the same process again and again. If you’re not paying attention, you can burn through a lot of tokens, and a lot of money, without actually getting usable output. That’s why it’s essential to watch your sessions closely and be ready to interrupt if something looks stuck.
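
One way to formalize “be ready to interrupt” is a simple loop guard that compares each agent output to the last few and stops when they’re near-duplicates. This is a sketch with arbitrary thresholds, not a feature of Cursor or any particular agent framework.

```python
# Detect when an agent keeps producing nearly identical output so you can
# interrupt instead of paying for another lap. Threshold and window are
# arbitrary choices for illustration.

from difflib import SequenceMatcher

class LoopGuard:
    def __init__(self, threshold: float = 0.95, window: int = 3):
        self.threshold = threshold
        self.window = window
        self.recent: list[str] = []

    def is_stuck(self, output: str) -> bool:
        """True when the last few outputs are near-duplicates of this one."""
        repeats = sum(
            1 for prev in self.recent[-self.window:]
            if SequenceMatcher(None, prev, output).ratio() >= self.threshold
        )
        self.recent.append(output)
        return repeats >= self.window - 1

guard = LoopGuard()
for step_output in ["retrying build...", "retrying build...", "retrying build..."]:
    if guard.is_stuck(step_output):
        print("Loop detected: interrupt the session before it burns more tokens.")
```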

Another pitfall is pushing the models beyond their limits. There are tasks they can’t handle well, and when that happens, it’s tempting to keep rephrasing the request and asking again, hoping for a better result. In practice, that often leads to the same cycle of failure, except you’re footing the bill for every attempt. Knowing where the boundaries are and when to stop is critical.

A practical way to stay on top of this is to maintain a running diary of what worked and what didn’t. Record prompts, outcomes, and notes about efficiency so you can learn from experience instead of repeating expensive mistakes. Combined with keeping an eye on your live usage metrics, this habit will help you refine your approach and avoid wasting both time and money.
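
The diary doesn’t need to be fancy. Here’s a minimal sketch that appends one JSON line per session; the field names are my own and can be swapped for whatever you actually want to track.

```python
# Append one JSON line per codegen session: prompt, model, rough token count,
# cost, and whether it worked. Field names are illustrative.

import json
from datetime import datetime, timezone

def log_session(path: str, prompt: str, model: str, tokens: int,
                cost_usd: float, worked: bool, notes: str = "") -> None:
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "model": model,
        "tokens": tokens,
        "cost_usd": cost_usd,
        "worked": worked,
        "notes": notes,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

log_session("codegen_diary.jsonl",
            prompt="Refactor the billing module to use the new client",
            model="sonnet-4", tokens=85_000, cost_usd=1.40,
            worked=True, notes="One pass, no loops.")
```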
