VirtueSig
Latest on Artificial Intelligence, Large Language Models, Transformers, and Chip Innovations.
Frontier Models

How OpenAI and others throttle GPU compute and what to do about it

OpenAI and others quietly throttle GPU compute behind stable model names. Here's how you can detect it and defend against it (if you're big enough).

You have probably noticed it. ChatGPT has felt dumber. Across Reddit, X, and the comment sections of every flagship-model release thread, the complaint follows the same arc: rapture at launch, then a slow drift downwards in user-reported quality. The phrase that keeps coming up is that the model “has taken stupid pills”. The phenomenon is real. What consumers feel as a vague capability slip is what larger companies running production AI workloads should be measuring with a dedicated harness.

The economics underneath every consumer AI subscription are getting worse before they get better. OpenAI's chief executive said publicly in early 2025 that the company was losing money on ChatGPT Pro, its $200-a-month consumer tier. ChatGPT image generation alone is plausibly a huge money sink, costing OpenAI hundreds of millions in compute. The pressure has only intensified since: aggressive rate limits, "fair use" tightening, and quiet adjustments to inference behaviour have rippled across both consumer and API products. As frontier-model gross margins compress and usage goes up, providers retain a long list of levers they can pull without changing the name on the model, and most of those levers are invisible to the customer paying the bill.

The industry shorthand for what comes next is "compute choking": the gradual, undisclosed degradation of inference quality that lets a provider keep a flagship product line live while slashing the GPU cost of serving more users. Quantization is the most direct mechanism. A weights file moves from FP16 to INT8 to INT4. Throughput rises, accuracy on long-context, multi-step, and low-frequency-token tasks falls, and the model still answers to the same name in the API response.

For larger companies that have built workflows on a specific model version, this is not a hypothetical. It is a slow-burning operational risk that almost no buyer at any scale is actively monitoring for. The intuitive starting points (pinned model versions, automated test questions, an LLM-as-judge for quality scoring) sound like reasonable defences, and a thoughtful procurement team's first instinct will reach for them. Each, on its own, leaks in identifiable ways. What follows is where the obvious moves fall short, what larger companies should actually be doing instead, and how to translate those defences into contract language that gives a customer a remedy rather than regret.

Why providers will quietly degrade quality, and how

The cost of running a frontier model is dominated by GPU-hours per token served. Every operational lever that reduces those hours per token improves the provider's gross margin without raising the customer's invoice. The lever set is wider than most non-specialist buyers realise.

Quantization. Weights compressed to lower bit-widths fit in less HBM and execute faster on the same accelerator. INT4 weights occupy roughly a quarter of the memory of FP16, which makes them markedly cheaper to serve on most stacks, with quality losses concentrated in long-context reasoning and rare-token tasks where the customer is least likely to notice in median traffic.

Speculative decoding and KV-cache compression. Inference optimisations applied post-hoc to a deployed model. Lossy variants, such as relaxed acceptance rules for drafted tokens or compressed caches, shift output token distributions subtly, even for identical inputs. Throughput improves; behaviour changes.

Routing and cascading. A customer's request is silently steered to a smaller model that handles "easy" cases, with only "hard" cases falling through to the named flagship. The model name in the API response describes the routed service, not necessarily the model that produced the answer.

Hardware migration. A move from H100s to lower-memory L40s, or from one accelerator generation to another, changes the kernels, the supported precisions, and the floating-point accumulation order. The resulting small numerical differences compound through long generations.

Tokeniser and system-prefix changes. The hidden system prompt or the tokeniser table is updated. Behaviour changes even though the weights file is nominally unchanged.

Each lever can be applied to a model that retains its public name and version label. The customer notices the bill staying flat. The provider notices the gross margin improving. Whether the customer notices the quality drop depends entirely on how they monitor, and very few enterprises currently monitor at all.
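To make the quantization lever concrete, the arithmetic below estimates weight storage at each precision named above. It is a minimal sketch: the 70-billion-parameter model size is a hypothetical chosen for illustration, and the estimate ignores quantization scales, activations, and the KV cache, so real savings will differ.

```python
# Bits per weight for each storage format named in the text.
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores quantization scales, activations and the KV cache.
    """
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


# Hypothetical model size, for illustration only.
PARAMS = 70e9

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {weight_memory_gb(PARAMS, fmt):.0f} GB of weights")
```

Under these assumptions the weights alone shrink from 140 GB to 35 GB, which is the difference between needing several accelerators and fitting on one. That is why the incentive exists even when the quality cost is known.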

The obvious starting points, and why they aren't enough

Each of the three intuitive moves is the right instinct. None of them is sufficient on its own.

Pinning a model version

Calling a dated snapshot like gpt-4o-2024-08-06 or claude-opus-4-6-20260301 instead of the floating alias is a sensible first step. It is also a thin guarantee. The snapshot freezes only the weights label, not the inference stack. Quantization can change without renaming. Speculative decoding can be turned on. A router can be inserted upstream. Tokeniser tables can be updated. Eventually the snapshot is deprecated, forcing migration to a successor on the provider's calendar, not the customer's. Pinning is a useful baseline, but only when paired with active monitoring of what the pinned name actually returns over time.

Live automated test questions

Running a fixed eval harness against the API on a regular cadence is the next obvious move, and it is also where the easiest holes appear. If the harness runs from a known API key, on a predictable schedule, with structured prompts, it is trivially fingerprintable. A provider that wanted to could route eval-shaped traffic to the high-quality inference path while degrading everything else. Even without active malice, leaked test sets get implicitly optimised against (the Goodhart problem), output variance hides small dips behind statistical noise, and canned tests rarely cover the long tail of real customer queries, which is where quantization hurts first.
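The point about variance hiding small dips can be checked with basic statistics. The sketch below uses a pooled two-proportion z-test and a standard sample-size approximation. The pass counts, the 200-prompt suite size, and the 3-point dip are hypothetical, and the function names are illustrative rather than taken from any library.

```python
import math


def two_proportion_z(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """z statistic for the difference between two pass rates (pooled SE)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se


def samples_needed(p_base: float, drop: float,
                   z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate prompts per run needed to detect an absolute `drop`.

    Defaults correspond to a two-sided 5% test with roughly 80% power.
    """
    p_new = p_base - drop
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / drop ** 2)


# Hypothetical runs: 180/200 passes at baseline, 174/200 a month later.
z = two_proportion_z(180, 200, 174, 200)
print(f"z = {z:.2f} (|z| < 1.96 means the dip is not significant)")
print("prompts per run to detect a 3-point dip:", samples_needed(0.90, 0.03))
```

With these numbers a 3-point dip on a 200-prompt suite is statistically indistinguishable from noise, and detecting it reliably takes well over a thousand prompts per run. Small canned suites will therefore miss exactly the gradual degradation this article is concerned with.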

Quality scoring with LLM-as-judge

Pointing a stronger model at the smaller model's outputs to grade them is a common LLM-ops pattern, and it has a structural problem. The judge is usually a model from the same provider whose outputs are being audited. Direct conflict of interest. Judges also drift, hallucinate, and exhibit known biases including length, position, and self-preference. Without periodic human-graded ground truth, an "objective" quality score is two correlated black boxes voting on each other.

None of these starting points is useless. They are the right beginnings, just not the whole answer. Each was designed for a relationship of much higher implicit trust between provider and customer than the current commercial environment supports.

What enterprises should actually be doing

A serious monitoring posture treats the provider as adversarial, the API surface as untrusted, and combines several independent signals so that no one of them can be neutralised in isolation. Eight tactics are worth combining, and almost no enterprise has all of them in place today:

1. Camouflage your evals. Mix probes into real user traffic. Vary the API key, region, time of day, and prompt formatting. If a test request is indistinguishable from a customer request, it cannot be selectively routed. Rotate prompts so fingerprints do not accumulate over time.

2. Monitor characteristics, not just correctness. Track distributions of response length, logprob entropy where exposed, refusal rates, latency, and lexical diversity. Quantization shows up in these signals before it shows up in answer quality. A sudden latency drop with apparently stable outputs is the textbook tell that the model behind the customer's name has shrunk.

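3. Hash deterministic outputs. With temperature set to zero and a fixed seed where the API supports it, the same prompt should yield the same tokens in nearly every run. Batched serving can still introduce occasional floating-point differences, so hash a daily canary set and measure the normal divergence rate first. A step change in that rate on a "pinned" model means the inference stack changed regardless of what the version label says.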

4. Cross-provider triangulation. Run the same evals against two or three providers plus an open-weights baseline (Llama, Qwen, DeepSeek) the customer controls. A drop in one model relative to the others is a signal that does not depend on any single judge, and the parallel deployment supplies a credible exit threat, which is itself the cheapest form of defence.

5. Independent judges, anchored to humans. Use a different provider's model as judge, or require agreement across multiple judges. Anchor quarterly to a small human-graded set so the team can tell when the judges themselves have drifted.

6. Stress-test the edges. Quantization degrades long-context reasoning, low-frequency tokens, and multi-step chain-of-thought first. Over-weight these in any eval suite rather than the easy median cases, which are the last to break and therefore the worst signal.

7. Semantic drift detection. Embedding-based comparison of responses against a known-good baseline, using SBERT cosine similarity or an equivalent, catches the case where the model is technically still right but its reasoning style has shifted. That is a classic post-quantization symptom.

8. Portable abstraction layer. A routing layer such as LiteLLM, OpenRouter, or an in-house provider adapter. The goal is to be two days from swapping providers, not two months. A credible exit changes the negotiation before any monitoring evidence is in.

A ninth move is worth considering even when it does not match the closed providers on quality: a small fine-tuned open-weights model, deployed on the customer's own infrastructure, on the customer's specific task. Even at 80% of the quality of the closed alternative, it functions as an immovable reference point. When the closed provider's quality starts converging downwards toward the local model rather than pulling further ahead, the team has unambiguous evidence and an immediate fallback.
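Of the tactics above, characteristic monitoring is the most mechanical to sketch. The example below compares a recent window of per-request measurements against a baseline window and flags any signal whose mean has moved by more than a chosen number of standard errors. The signal names, the sample values, and the threshold of 3 are all assumptions for illustration; a production monitor would use far larger windows and probably compare full distributions, not just means.

```python
import math
import statistics


def mean_shift_z(baseline: list[float], recent: list[float]) -> float:
    """Welch-style z score for the change in mean between two windows."""
    se = math.sqrt(
        statistics.variance(baseline) / len(baseline)
        + statistics.variance(recent) / len(recent)
    )
    return (statistics.mean(recent) - statistics.mean(baseline)) / se


def drifted_signals(baseline: dict[str, list[float]],
                    recent: dict[str, list[float]],
                    threshold: float = 3.0) -> dict[str, float]:
    """Return signals whose mean moved by more than `threshold` standard errors."""
    scores = {name: mean_shift_z(baseline[name], recent[name]) for name in baseline}
    return {name: z for name, z in scores.items() if abs(z) > threshold}


# Hypothetical measurements: latency falls sharply, response length is stable.
baseline = {
    "latency_ms": [820, 790, 845, 810, 800, 830],
    "response_tokens": [412, 398, 405, 420, 401, 409],
}
recent = {
    "latency_ms": [610, 640, 600, 625, 615, 630],
    "response_tokens": [408, 400, 411, 415, 399, 407],
}

for name, z in drifted_signals(baseline, recent).items():
    print(f"{name}: mean shifted by {z:.1f} standard errors")
```

In this made-up data only latency is flagged, and it is flagged as a drop. That is the pattern described in tactic 2: outputs that look stable while the serving cost behind them has visibly changed.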

Writing it into the contract

Technical monitoring tells a customer when something has gone wrong. The contract is what gives them a remedy. Most standard MSAs are silent on the levers described above, and silence in a contract is a licence for the counterparty to do whatever they want.

The clauses below translate the monitoring posture into enforceable obligations. None of this is legal advice. A customer's counsel must draft the actual language and negotiate it. Treat what follows as a structural guide.

Define the "Model" precisely, not by name

The single most important clause. The contract should define what the customer is paying for as a technical specification, not a marketing label. Useful elements:

  • Weights identifier. A hash, checksum, or at minimum a frozen version label tied to a specific release date.
  • Numerical precision floor. A minimum bit-width for both weights and activations, with explicit prohibition on quantization below that floor.
  • Inference stack disclosures. Whether speculative decoding, KV-cache compression, attention or expert pruning, or routing and cascading is in use, with a commitment that any addition triggers customer notice.
  • Hardware class commitment, or at minimum notice of any material change to the accelerator class on which the customer's inference runs.

Indicative language: Provider shall not apply weight quantization below [X bits], introduce inference-time optimisations that materially alter output distributions, or route Customer requests to a smaller or distilled model labelled as the Pinned Model, without [90] days prior written notice and Customer's right to terminate without penalty during the notice period.

Snapshot retention and deprecation notice

Pin a minimum support window for any model version a workflow depends on. Twelve to twenty-four months is a reasonable enterprise ask. Combine that with a deprecation notice period of 90 to 180 days, and ideally a paid extended-support option beyond it. Without these clauses, a provider can force migration onto a degraded successor at a time of their choosing.

Performance baseline and tolerance band

This is where the eval suite enters the contract:

  • At signing, both parties run an agreed benchmark suite and record baseline scores. The benchmark must include edge-case categories: long-context tasks, multi-step reasoning, structured output adherence, and low-frequency-token tasks. These are the categories that degrade first under quantization.
  • Define a tolerance band. A workable starting point: no category may drop more than 5% from baseline on a rolling 30-day window before remedies trigger.
  • Specify cadence, sample size for statistical significance, and dispute resolution for results that disagree across parties.
  • Reserve the right to engage a third-party benchmarking firm, with provider cooperation, on request.

Critical detail: the benchmark should be jointly agreed but not fully shared with the provider's inference path. A held-out portion is essential. Otherwise the benchmark becomes training data the provider quietly optimises against, and the customer ends up validating a synthetic floor rather than measuring real behaviour.
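A tolerance band only works if both parties compute it the same way, so it helps to pin the calculation down. The sketch below reads "5% from baseline" as a relative drop in the rolling 30-day mean score per category. That reading is an assumption (a contract could equally specify absolute percentage points), and the category names and scores are hypothetical.

```python
from statistics import mean


def breached_categories(baseline: dict[str, float],
                        daily_scores: dict[str, list[float]],
                        window_days: int = 30,
                        tolerance: float = 0.05) -> dict[str, float]:
    """Return categories whose rolling mean fell more than `tolerance` below baseline.

    The drop is relative: (baseline - rolling_mean) / baseline.
    """
    breaches = {}
    for category, base in baseline.items():
        window = daily_scores[category][-window_days:]  # most recent days only
        drop = (base - mean(window)) / base
        if drop > tolerance:
            breaches[category] = round(drop, 4)
    return breaches


# Hypothetical baseline recorded at signing, and 30 days of daily scores.
baseline_scores = {"long_context": 0.80, "structured_output": 0.95}
daily = {
    "long_context": [0.74] * 30,
    "structured_output": [0.94] * 30,
}

print(breached_categories(baseline_scores, daily))
```

Here long-context scores have fallen by about 7.5% relative to baseline and breach the band, while structured output has slipped by about 1% and does not. Tracking categories separately matters because an aggregate score would average the edge-case loss away.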

Determinism clause

For requests with temperature set to zero and a fixed seed, output should be reproducible within stated bounds. This is the check that is hardest for a provider to fake. If a "pinned" model produces different tokens for identical deterministic inputs over time, something in the stack changed regardless of what the version label says.
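The determinism check reduces to hashing and comparison. The sketch below assumes the completions have already been fetched with temperature zero and a fixed seed. How they are fetched is outside its scope, and the prompts, completions, and function names are hypothetical. It reports a divergence rate instead of a pass/fail result because "within stated bounds" implies that some small rate may be tolerated.

```python
import hashlib
import json


def canary_digest(prompt: str, completion: str) -> str:
    """Stable SHA-256 digest of one prompt/completion pair."""
    record = json.dumps(
        {"prompt": prompt, "completion": completion},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def divergence_rate(baseline: dict[str, str], todays_outputs: dict[str, str]) -> float:
    """Fraction of canary prompts whose digest differs from the stored baseline."""
    if not todays_outputs:
        raise ValueError("no canary outputs supplied")
    changed = sum(
        1
        for prompt, completion in todays_outputs.items()
        if canary_digest(prompt, completion) != baseline.get(prompt)
    )
    return changed / len(todays_outputs)


# Hypothetical canary set recorded on day one.
day_one = {
    "List the first three primes.": "2, 3, 5",
    "Spell 'cat' backwards.": "tac",
}
stored = {p: canary_digest(p, c) for p, c in day_one.items()}

# Hypothetical later run in which one completion has changed.
later = dict(day_one)
later["Spell 'cat' backwards."] = "t-a-c"

print(f"divergence rate: {divergence_rate(stored, later):.0%}")
```

Storing digests, not raw completions, keeps the baseline small and lets the record be shared with a third-party auditor without exposing prompt content.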

Tiered remedies

Single all-or-nothing breach clauses rarely trigger. Tier them so the contract's enforcement curve matches the severity of the underlying degradation:

  • Minor degradation (a drop of 1% to under 5% on any tracked metric): service credits proportional to the drop, applied automatically without a claim.
  • Material degradation (a drop of 5% to under 15%, or any silent quantization or routing change): a larger service credit (typical structures use 25% to 50% of monthly fees for the affected period), plus an obligation to remediate within a defined window.
  • Severe degradation (a drop of 15% or more, or undisclosed model substitution): right to terminate without penalty, full refund of prepaid amounts, and explicit migration cooperation obligations.
  • Repeated breach: termination with liquidated damages.

Service credits should be automatic where measurable, not contingent on the customer filing a claim. Providers count on most customers never bothering. A clause that requires automatic credit when a defined metric crosses a defined threshold turns the eval pipeline into an enforcement mechanism instead of a forensics exercise.
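Automatic credits require the tiers to be machine-checkable. The sketch below maps a measured drop and two event flags onto the tiers listed above. Treating each boundary value (5%, 15%) as belonging to the higher tier is an assumption made here, and the "repeated breach" tier is left out because it depends on breach history, not on a single measurement.

```python
def remedy_tier(drop: float,
                silent_change: bool = False,
                undisclosed_substitution: bool = False) -> str:
    """Classify one measurement into a remedy tier.

    `drop` is the fractional decline on a tracked metric (0.07 means 7%).
    `silent_change` covers undisclosed quantization or routing changes.
    """
    if drop >= 0.15 or undisclosed_substitution:
        return "severe"
    if drop >= 0.05 or silent_change:
        return "material"
    if drop >= 0.01:
        return "minor"
    return "none"


# Hypothetical measurements from a monitoring pipeline.
for drop, silent in [(0.005, False), (0.03, False), (0.03, True), (0.20, False)]:
    print(f"drop={drop:.1%} silent_change={silent} -> {remedy_tier(drop, silent)}")
```

Note that a silent stack change escalates a small measured drop to the material tier, matching the list above: the breach is the undisclosed change itself, not only its measured effect.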

Disclosure and audit rights

  • A material-change notification clause that defines what counts: quantization changes, hardware migration, RLHF or post-training updates, system-prompt changes, routing-layer modifications.
  • A public or customer-accessible changelog.
  • Audit rights to receive inference metadata (logprobs, latency distributions, model fingerprint signatures) on request, under NDA.
  • A clean right to use third-party monitoring tools without provider interference or rate-limiting.

Anti-silent-update clause

Specifically prohibit further training, RLHF, or weight modification of the pinned snapshot. Providers sometimes "patch" pinned models for safety reasons. Acceptable, but the patch must be disclosed, and the customer should retain rollback rights if it materially affects their use case.

Exit and portability

  • Cooperation in migrating any fine-tunes, embeddings, or cached context to a successor provider.
  • Continued read access to customer data and any artifacts for 180 or more days post-termination.
  • No lock-in clauses tied to proprietary formats the customer cannot export.

Negotiating reality, by spend tier

What is achievable depends almost entirely on annual spend.

  • Under approximately $500,000 per year. Standard terms only. The best a customer can do is choose providers whose standard terms are more transparent, and self-insure with the technical defences above.
  • $500,000 to $5 million. Deprecation notice periods, basic performance SLAs, and service credits become negotiable. Custom benchmark clauses are still a stretch.
  • $5 million to $10 million. Model versioning commitments, custom benchmark clauses, dedicated capacity, and a meaningful refund structure are realistic.
  • $10 million-plus, or strategic partner status. Bespoke terms including hardware specifications, dedicated inference clusters, and meaningful performance guarantees.

One high-leverage move regardless of size: ask for the standard MSA's change-notification clause and the deprecation clause specifically, in writing, before signing. Many providers' standard contracts are silent on both. Forcing the question often produces better terms than the customer would otherwise expect, because the provider's legal team would rather grant a 60-day notice clause than open the rest of the agreement to negotiation.

The deeper point

Trust no single signal that lives entirely inside the provider's stack. A counterparty with full information asymmetry can defeat any defence that runs on their side of the API. The robust posture combines external comparison (other providers and open weights), statistical monitoring of characteristics rather than just outputs, a credible threat to leave, and a human-graded ground-truth anchor that cannot be quietly co-opted.

The contract is the floor, not the ceiling. Camouflaged evals, cross-provider triangulation, deterministic hashing, and characteristic monitoring are what catch the silent move. The contract is what gives the customer a remedy when their monitoring catches it.

Almost no enterprise has any of this in place today. That is the gap worth closing. As the gap between consumer-tier subsidies and enterprise margin requirements widens, the incentive on the provider side to silently improve their unit economics will only grow. Anthropic's own forecast that AI systems will be training their successors by 2028 implies an even steeper compute curve ahead, with the same margin pressure cascading through every tier of every provider's customer book. The customers who will keep getting the model they signed up for are the ones who put this monitoring and contract architecture in place before the squeeze starts to show.
