← Back to Blog

The $200 Unlimited Plan Is Dead — and Your Procurement Team Should Care More Than Your Engineers

Most pieces about hybrid AI focus on the engineering: model routing, token economics, edge inference. This isn't one of those. The "AI is going hybrid" thesis is becoming consensus — every vendor blog, every analyst report, every conference keynote. I'll cover it briefly because the foundation matters, but the more interesting questions are downstream. Specifically: what does this mean for the multi-year cloud commit on your CFO's desk, the laptop refresh your IT team just approved, and the security model your CISO hasn't updated for any of this?

For the last three years, we've all been living off subsidized AI. A $20 ChatGPT Plus subscription. A $200 Claude Max plan with effectively unlimited Opus access. VC money and strategic investments papered over the fact that frontier inference is genuinely expensive — and that "rewrite this email" was costing the same compute as a coding agent refactoring a service.

That era is ending. Quietly, in pricing pages and product launches, the AI industry is rewriting its assumptions. And the answer it's converging on isn't more efficient cloud — it's a hybrid where a lot of compute moves back to where the data already lives: your laptop, your phone, your desk.

This isn't speculation. The pieces are already shipping. But the architectural shift is the easy part to write about. The hard part — the part that's actually going to determine whether your organization wins or loses on AI economics over the next three years — is what you do with that knowledge across functions that don't usually read AI architecture posts. That's where I'm going to spend most of my time.

The economics broke first

In April 2026, Anthropic moved its enterprise pricing from flat-rate tiers to per-token billing. The reason was straightforward: agentic usage broke the math. A chat conversation burns a few hundred tokens per turn. An agent running through your codebase or your inbox burns millions. A heavy Claude Code Max user on the $200 plan was consuming what would cost thousands at API rates.

OpenAI is in the same bind. ChatGPT Plus reportedly loses money on power users. Sam Altman has admitted the company has been "forced to do unnatural things" — borrowing compute from the research division to keep consumer products running. OpenAI's head of ChatGPT publicly mused that "having an unlimited plan is like having an unlimited electricity plan. It just doesn't make sense."

Both labs are reportedly running at roughly 40% gross margins, with inference as the variable cost squeezing them. As both companies move toward IPOs, that margin profile has to improve. The subsidies are coming off.

This is the demand-side pressure. The supply side — your devices — quietly got ready to absorb some of it.

The hardware quietly got ready

While we were arguing about GPT-5 vs. Opus 4.6, Apple shipped the M5 with neural accelerators inside every GPU core. An M5 Max runs Llama-class 70B models at ~35 tokens/sec locally, with 128k context windows, no API key, no rate limit. Qualcomm's Snapdragon X2 hits 80 TOPS on the NPU. Microsoft set 40 TOPS as the minimum floor for Copilot+ certification.

The model side caught up too. OpenAI's gpt-oss-20B, released under Apache 2.0, runs on 16GB of memory and approaches o3-mini quality on reasoning benchmarks. Llama 3.3 70B and Gemma 4 handle production inference for tasks that genuinely required GPT-4-class APIs eighteen months ago. Quantization improved enough that 4-bit models retain most of their quality at a quarter of the memory footprint.

I've been running this setup myself for the last several months — Ollama on a Mac, with a small cast of local models for the routine stuff and cloud APIs reserved for things that actually need frontier reasoning. The honest take: for maybe 70% of what I throw at AI in a day — drafting emails, summarizing meeting notes, code completion, quick lookups, reformatting JSON — a local model on the M-series silicon is fine. Not as good as Opus, but fine. And the latency advantage is real. There's no round trip. It just answers.

The other 30% is where I want Opus 4.7 or GPT-5: complex multi-file refactors, real research synthesis, anything where I'd be embarrassed to ship the local model's output. The split feels obvious in retrospect. We were just paying frontier prices for non-frontier work because the routing didn't exist.

I learned the cost of not routing the hard way one Saturday morning when I burned my entire weekly ChatGPT quota in under four minutes on a single document with default settings. I'll unpack the full story in Part 2 — for now, the short version is that the lesson was free because I was on a flat-rate plan. On per-token enterprise pricing, it would have been an expensive Saturday.

The macro point holds: the routing infrastructure to make these decisions is being built. The question is what the rest of the stack — your contracts, your devices, your security posture, your people — does in response.

Nvidia's interesting balancing act

Here's a wrinkle worth pausing on, because most coverage misses it. Nvidia's Vera Rubin architecture, shipping through 2026, claims a 10x reduction in inference token costs versus Blackwell. That genuinely improves the cloud side of the equation — every Rubin deployment makes the API model more economically defensible. So you might expect Nvidia to lean entirely into data center.

Except they can't. The same Nvidia just announced 35% faster LLM inference on RTX-powered AI PCs and shipped Neotron 3, an edge inference chip explicitly aimed at local deployment. Nvidia is simultaneously making cloud inference cheaper and arming the consumer hardware that competes with it.

This isn't strategic confusion — it's strategic necessity. If Nvidia ignores consumer AI hardware, Apple Silicon, Qualcomm Snapdragon X, and AMD's Ryzen AI Max define the on-device category without them, and Nvidia loses the developer mindshare that makes CUDA sticky. If Nvidia ignores cloud inference cost reduction, hyperscalers accelerate their move to custom silicon (Google's TPUs, Amazon's Trainium, Microsoft's Maia), and Nvidia's data center moat erodes from the other end.

So they're playing both sides, and that's actually good news for the hybrid future. Nvidia's investment in consumer AI throughput legitimizes local inference even further. The economics improve at both ends of the stack simultaneously — which means the routing problem (what runs where) becomes the dominant question, not the capability problem (can it run at all).

The labs are already building the routing

The frontier labs themselves are setting up the infrastructure for this hybrid future.

Apple Intelligence is the canonical example. When you make a request, the OS first decides whether the on-device ~3B parameter model can handle it. If it can, the request never leaves your phone. If it can't, Apple's Private Cloud Compute takes over — but only the relevant context goes up, processed on Apple-controlled silicon, with the data destroyed afterward. This is the hybrid routing pattern, in production, at iPhone scale.

Anthropic's Cowork, launched in January 2026, is a different angle on the same idea. Cowork lives on your desktop. It reads, writes, and modifies files in folders you designate. Compute still happens in Anthropic's cloud, but the work — your files, your context, your artifacts — stays local. You're no longer uploading your project to a chat window. The agent comes to where the data is. That's a meaningful architectural shift, and once that pattern is established, swapping in a local model for the lightweight steps inside an agent loop becomes the obvious next step.

OpenAI's gpt-oss release in August 2025 was the strangest move from a closed-by-default lab — and it makes sense only in this context. By open-sourcing 20B and 120B models under Apache 2.0, OpenAI created a release valve for everything that doesn't need frontier capability. The Codex CLI already supports routing to local Ollama or LM Studio endpoints. You can configure profiles: local model for "explain this function," cloud model for "refactor this module to use dependency injection." That's not a hypothetical. That's a config file today.

The harnesses these products bring — Claude Code's tool use, Codex's terminal integration, Cowork's file system access — are still better than what raw Ollama gives you. Ollama is a model runner, not a workflow. But Codex CLI already speaks to it. Claude Code can be pointed at local OpenAI-compatible endpoints. The harness and the local runtime are converging from both directions.

The cycle parallel — with a caveat

It's tempting to read this as the classic pendulum: mainframes gave way to PCs, PCs gave way to cloud, and now cloud is giving way to edge. There's truth to that — the gravitational pull of "compute should live close to the user" keeps reasserting itself whenever the local hardware gets capable enough.

But the more accurate frame isn't a swing back. It's stratification. Just as we settled into client-server after the PC era — neither pure mainframe nor pure desktop — the AI stack is settling into a tiered model:

  • On-device for the high-frequency, low-complexity, latency-sensitive, privacy-relevant majority of tasks. "On-device" is broader than it sounds — it covers laptops and phones, but also edge servers, branch appliances, and on-prem GPUs running quietly in a closet. The unifying property isn't form factor; it's that the inference happens on hardware the customer controls.
  • Network edge / AI RAN — a tier most enterprise pieces are missing. Telcos are beginning to treat base stations and regional aggregation points as future inference real estate, not just connectivity infrastructure. Nvidia AI Aerial, the Ericsson-Nvidia AI RAN partnership, and trials from SoftBank, NTT, and T-Mobile are all bets on the same idea: inference at the network edge, milliseconds away from the device, without ever hitting a hyperscaler region. This matters most for mobile use cases, AR/VR, autonomous systems, and any enterprise scenario where data sovereignty plus low latency both matter (manufacturing, healthcare, logistics with mobile workforces).
  • Private cloud / on-prem for sensitive enterprise workloads where you need more capability than the device or the network edge can offer, but the data can't leave the perimeter. This is where governed enterprise inference platforms increasingly live.
  • Frontier cloud APIs for the genuinely hard stuff — long-horizon reasoning, multi-step agentic work, anything that justifies the cost.

The interesting engineering problem of the next 18 months is the router — the layer that decides which tier handles which request. There's already academic work on this: uncertainty-aware models that escalate when they're unsure, classifiers that predict whether the local model's answer will be acceptable. Apple has shipped one version. The labs will ship others. The telcos, notably, want to be the routing point for mobile workloads — which is why AI RAN matters strategically even if it's invisible in the average enterprise architecture diagram today.

So what should you actually do about this?

Here's where most "hybrid AI" pieces stop. They explain the architectural shift, throw in a "prepare for the future" platitude, and end. That's not useful. The actual question is: if you're sitting in a corporate function watching this unfold, what specific decisions are you about to get wrong?

There's plenty written for engineers about model routing. Plenty for analysts about market dynamics. Much less for the three roles that are about to make the most consequential — and most reversible-only-at-high-cost — decisions: your CFO and procurement team, your IT director, and your CISO. The decisions you make in the next two quarters will shape your cost structure, your hardware fleet, and your risk surface for years. Three specific calls to action:

If you're in finance, procurement, or vendor management: be very careful about signing multi-year, high-dollar commits with frontier labs or hyperscalers based on current run-rates. Today's consumption mix — where every email summary and every code completion hits a frontier API — is not what next year's mix will look like. As routing matures and on-device handles the routine load, your token consumption against frontier APIs could drop 40-60% for the same end-user productivity. Lock yourself into a three-year commit at today's volumes and you're paying for capacity you won't use. Negotiate flexibility: shorter terms, ramp clauses, the right to shift commit between products, or consumption pools that span on-prem and cloud.

If you're in IT procurement or end-user computing: your three-year refresh cycle just became a strategic decision, not a routine one. Devices bought in 2026 will be in service through 2029. The gap between an AI-capable laptop (40+ TOPS NPU, 32GB+ unified memory or equivalent) and a non-AI-capable one will be the difference between handling 70% of AI workload locally versus paying API costs for everything. Stop buying minimum-spec machines because "everything runs in the cloud." That assumption is about to be wrong. Build NPU floors and memory minimums into your standard configurations now, even if it costs $200-400 more per device.

The payback math is straightforward — and faster than most procurement teams realize. The upgrade delta from a standard laptop to a properly AI-capable one is roughly $400 per device. For a typical knowledge worker, shifting around 50,000 frontier tokens per day to a local model — that's roughly twenty drafted emails, a couple of meeting summaries, and the routine code or document lookups that pile up over a workday — covers the upgrade cost within the laptop's three-year life on basic chat usage alone. For anyone using agentic tools like Claude Code or Codex, where a single workflow can burn millions of tokens, the payback collapses to a matter of weeks. Either way, the device pays for itself before the warranty expires. The only way this math doesn't work is if your team isn't using AI at all — in which case you have a different problem.

If you're in cybersecurity: this opens up a genuinely new attack surface, and most security programs aren't ready for it. A few of the new questions:

  • Local models running on employee devices mean sensitive data is being processed by software your security stack may not inspect. What's your DLP story when an LLM is summarizing confidential documents on the endpoint?
  • Agentic tools like Cowork have file-system access and can be hijacked through prompt injection in the documents they read. Anthropic itself flagged this. How do you sandbox an agent that needs broad access to do its job?
  • Open-weight models (Llama, gpt-oss, Qwen) downloaded onto corporate endpoints are software supply chain risk vectors. Who's verifying weights haven't been tampered with? Who's patching them?
  • The routing layer itself — the thing that decides what stays local and what goes to the cloud — becomes a sensitive policy enforcement point. Misconfigure it and your "local-only" sensitive data ends up in an API request.
  • Audit trails fragment. When inference happens partly on-device and partly in three different clouds, reconstructing what an agent did and why becomes meaningfully harder.

None of these are reasons to avoid the hybrid future. They're reasons to start designing for it now, before shadow IT does it for you.

The TL;DR for your next leadership meeting

If you only take one artifact from this post, take this:

Function What changes Decision to make in the next 90 days Failure mode if you wait
CFO / Procurement Token spend becomes variable and workload-sensitive Avoid multi-year frontier commits at current run-rates; negotiate flexibility, ramp clauses, cross-product pools Locked into inflated commitments for capacity that shifts on-device
IT / End-User Computing Laptop specs directly affect AI cost structure Set NPU floors (40+ TOPS) and memory minimums (32GB+) in standard configs Buying devices that force every AI task through a paid API
CISO / Security Inference moves onto endpoints, agents, and the network edge Update DLP for local inference; sandbox agentic file access; build a routing policy enforcement model Shadow local AI with no governance; fragmented audit trails

The free AI gravy train was never going to last. What replaces it isn't a price hike — it's a smarter division of labor. Your laptop runs what your laptop can run. The cloud handles what only the cloud can handle. And somewhere in the middle, a router quietly makes the call you used to make by reaching for ChatGPT every time.

The pendulum isn't swinging back. The architecture is just growing up. The question is whether your organization grows up with it.

Part 2 extends this beyond the buying and governance functions, because the next failure mode is behavioral: employees and agents continuing to spend frontier-model money on routine work. All the routing infrastructure in the world doesn't matter if your team is still reaching for Opus 4.7 to rewrite an email.

Is Your Organization Ready for Hybrid AI?

The shift from flat-rate to per-token is happening now. Let's build a strategy that saves you from locking into the wrong contracts, hardware, and security posture.

Book Your Free Assessment →