You’re Renting Intelligence — And Your Contract Doesn’t Cover What Matters
The intelligence you’re paying for needs a guarantee, and right now it has none
Over the 2025 holidays, Anthropic opened the floodgates. Higher limits, more Opus 4.5 access — a gift to anyone willing to code through Christmas while everyone else was eating pie. I took the deal. My GitHub commit rate over that break was the best stretch I’ve had in years. Opus 4.5 was sharp. Tool calling was crisp. I was building real things, fast.
Then January hit, and everything got worse. Way worse.
Tool calling got sloppy. I had to keep saying “resume” and “don’t repeat yourself.” The problem-solving ability dipped — not dramatically, but enough that you could feel it in your bones if you’d been deep in it. Reddit confirmed it wasn’t just me. Then Margin Lab started tracking it with actual benchmarks, charting degradation over time. Opus 4.5 performance was sliding — measurably, consistently, in one direction.
The models got dumber. Not metaphorically. Quantifiably.
Model degradation isn’t new, and it isn’t a conspiracy. There are mechanical reasons. These models absorb new data constantly; we demanded that, because nobody wanted to hear “I was trained on January 2023 data, sorry” anymore. But when the data changes dramatically, performance shifts. There’s the synthetic data problem: AI generating training data for AI, and the quality of that loop is still an open question. And there’s plain old compute throttling: finite hardware, an exploding user base, the juice gets spread thinner. When they gave us extra capacity over the holidays, it was partly because half their users were on vacation. January brought everyone back. The math changed.
But none of that is the point. The point is what it means for anyone building products on top of these models.
Intelligence Level Is Infrastructure. We Don’t Treat It That Way
We’ve been telling our clients: build AI at the core, not at the edge. Still the right call in 2026. But if you’ve done that — if the value your product delivers depends on model intelligence — you’ve made intelligence level a dependency. Not compute. Not storage. IQ.
Unlike every other dependency you’ve ever managed, this one comes with no contractual guarantee that it stays at the level you bought it at.
We have uptime SLAs. Latency SLAs. Throughput guarantees. There is no intelligence SLA. Your enterprise contract covers availability — last I checked, it says nothing about capability. The thing you’re actually paying for, how smart the model is, has zero protection.
If you manufacture physical products and your supplier starts shipping parts that don’t meet spec, you send them back. There’s a tolerance, it’s written down, everyone agrees before a single part ships. With LLMs, there is no spec for intelligence. You just hope today’s model performs like last week’s.
Or try this: imagine you outsourced your entire engineering team to a staffing agency, and every morning, different people show up with different skills, no memory of yesterday. You didn’t contract with the engineers. You contracted with the agency. The agency swaps anyone out anytime. That’s building on LLMs right now. You don’t know who’s going to show up tomorrow.
Even enterprise rates — dedicated clusters, private instances, the whole deal — don’t fully insulate you from drift. You still want model updates, which come with downsides. You’re still exposed.
So if your core functionality depends on intelligence — analysis, generation, reasoning, whatever — and the intelligence drops, what happens? Does your product just get worse? Do your customers notice before you do? Are there legal consequences?
These aren’t hypothetical questions anymore.
The Three-Layer Defense
What do you actually do? Three things, and you need them now.
Model abstraction
Consider a middleware layer that lets you swap providers based on capability and cost in real time. I architected a version of this at Typeform in 2023: a routing layer that could plug in Anthropic, OpenAI, or Google and switch between them. Redundancy and cost optimization. Calculate how much intelligence a feature actually needs, then route to whoever supplies it cheapest. Flash for the simple stuff. Claude for complex reasoning. Dynamic switching when something degrades.
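Here’s a minimal sketch of what that abstraction can look like. It’s not any vendor’s actual SDK; the tier names, cost figures, and the call wrapper are placeholders you’d swap for your own integrations.

```python
# Hypothetical provider-abstraction layer. Tier names, cost figures, and the
# call() wrapper are placeholders, not any vendor's actual SDK.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    name: str
    cost_per_1k_tokens: float     # rough price signal used for routing
    call: Callable[[str], str]    # wraps whatever SDK the vendor ships

class ModelRouter:
    """Routes each request to the cheapest registered provider for its tier."""

    def __init__(self):
        self.tiers: dict[str, list[Provider]] = {}

    def register(self, tier: str, provider: Provider) -> None:
        self.tiers.setdefault(tier, []).append(provider)
        # Cheapest first, so simple work never burns premium capacity.
        self.tiers[tier].sort(key=lambda p: p.cost_per_1k_tokens)

    def complete(self, tier: str, prompt: str) -> str:
        last_error = None
        for provider in self.tiers[tier]:
            try:
                return provider.call(prompt)
            except Exception as exc:  # degraded or down: fall through to the next one
                last_error = exc
        raise RuntimeError(f"all providers for tier '{tier}' failed") from last_error
```

The useful property is that demoting a degraded model becomes a registration change, not a code change.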
This week, we designed a system for a client that uses OpenAI’s cheapest model for spam filtering, then escalates borderline cases to Claude. This kind of tiered routing used to be nice-to-have. Now it’s table stakes. If your code is tied to one model and it degrades, you’re one bad week from a production incident.
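The escalation logic itself is small. A rough sketch, building on the router above; the UNSURE string check stands in for whatever structured output a real system would use:

```python
# Hypothetical escalation flow on top of the ModelRouter sketched above.
# The UNSURE check is illustrative; use structured output in practice.
def classify_spam(router: ModelRouter, message: str) -> str:
    prompt = f"Is this message spam? Reply SPAM, HAM, or UNSURE.\n\n{message}"
    cheap_verdict = router.complete("cheap", prompt)
    if "UNSURE" in cheap_verdict.upper():
        # Borderline case: pay for the stronger model only when it earns its cost.
        return router.complete("reasoning", f"Classify this message as SPAM or HAM:\n\n{message}")
    return cheap_verdict
```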
Portable memory
Context and memory are becoming the real differentiator in agentic systems. But if your vectorized data, your process memory, and your accumulated context are locked to one provider, you’re stuck. Model gets dumber? Your memory gets less useful, because the model can’t reason over it as well. Provider goes down? Your memory goes with it. Intelligence should be portable. Memory should be portable. Two separate problems, both urgent.
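One way to keep memory portable, sketched under an assumption the article doesn’t spell out: treat the raw text as the source of truth and regenerate vectors on demand. The function names and the embed() hook below are illustrative.

```python
# Minimal sketch of provider-neutral memory. Keep raw text as the source of
# truth so embeddings can be rebuilt with any provider's embedder.
import json
from typing import Callable

def export_memory(records: list[dict], path: str) -> None:
    """Persist memory as plain JSON: text plus metadata, no provider-specific vectors."""
    portable = [{"text": r["text"], "metadata": r.get("metadata", {})} for r in records]
    with open(path, "w") as f:
        json.dump(portable, f)

def rebuild_index(path: str, embed: Callable[[str], list[float]]) -> list[dict]:
    """Re-embed exported memory with whichever embedding function you hand it."""
    with open(path) as f:
        records = json.load(f)
    return [{**r, "embedding": embed(r["text"])} for r in records]
```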
Intelligence monitoring
Continuous eval batteries testing your model against the specific thresholds your product requires. Not leaderboard benchmarks — yours. The ones mapped to features your customers pay for. Understand the minimum intelligence threshold to deliver value for each piece of your product, then monitor against it. Catch degradation before your customer catches it for you — because when they catch it, it’s a brand problem, and nobody cares that it was your provider’s fault.
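In practice that can be as unglamorous as a scheduled job running your own cases against your own thresholds. A hedged sketch; the cases, thresholds, and model_call hook are placeholders for whatever your product actually requires:

```python
# Hedged sketch of a per-feature eval gate. Cases, thresholds, and the
# model_call hook are placeholders; the point is alerting on your thresholds,
# not a public leaderboard's.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]   # did the output clear the bar for this feature?

def run_eval(feature: str, cases: list[EvalCase],
             model_call: Callable[[str], str], threshold: float) -> bool:
    passed = sum(1 for case in cases if case.check(model_call(case.prompt)))
    pass_rate = passed / len(cases)
    if pass_rate < threshold:
        # Wire this into real alerting; catching it here beats hearing it from a customer.
        print(f"[ALERT] {feature}: pass rate {pass_rate:.0%} below threshold {threshold:.0%}")
        return False
    return True
```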
Same discipline you’d apply to any critical dependency. The question used to be “Is the server up?” Now it’s “Is the server still smart enough?”
What’s Next: Monitoring, Contracts, Insurance
Every piece of critical infrastructure eventually gets three things: monitoring, contracts, and insurance. Compute did. Bandwidth did. Uptime did. Intelligence will.
Monitoring is already emerging. Margin Lab tracks and quantifies degradation today. The category will grow. Every company building on LLMs will need something like it.
Contracts follow. Someone will write the first real intelligence SLA into an enterprise deal — not uptime, not throughput, but capability. A battery of evals benchmarked at contract signing, with guarantees that performance stays above a threshold. Doesn’t exist yet. Should. There’s a case for an independent standards body maintaining a shared eval battery — something enterprises run against providers the way you’d run an audit. Sounds early, but so did SOC 2 once.
Insurance comes last, as always. Rented intelligence drops below threshold; someone makes you whole. Sounds like a stretch until you remember every other infrastructure risk got an insurance product eventually.
Monday Morning Implications
Abstract your model. Make your memory portable. Monitor your intelligence thresholds. Start the contract conversation with your providers.
We’ve been saying build AI at the core. We still mean it. But building at the core means this is critical infrastructure now, and critical infrastructure demands rigor. The companies that treat their LLM dependency with the same seriousness they treat their cloud dependency will be fine. The rest will find out the hard way that rented intelligence comes with rented risk.
🔍 Want to dive deeper?
Check out our book BUILDING ROCKETSHIPS 🚀 and continue this and other conversations in our 💬 ProductMind Slack community and our LinkedIn community.
🎧 Prefer to listen? Check our podcast below ↓
🎥 YouTube → Click Here
🎵 Spotify → Click Here
🎙️ Apple Podcasts → Click Here
📣 Live: Escaping the AI Build Trap
After two standout AMAs, we’re excited to welcome Melissa Perri, author of Escaping the Build Trap and CEO of Product Institute, for “Escaping the AI Build Trap.”
We hope you will join us!
📅 Tuesday, Feb 17
⏰ 9am CST
Register ➡️ HERE ⬅️





