OPEN SIGNAL
Deep Signals

The Deflation Engine: AI Inference Costs Are Falling Faster Than Anyone Modeled

Quantization, speculative decoding, hardware competition, and model distillation are compounding into an inference cost decline that will reshape which AI applications are economically viable — and when.

The Most Important Curve in AI

If you want to understand where the AI industry is heading, do not look at benchmark scores. Look at the cost curve for inference.

The price of generating a token from a frontier-class AI model has fallen dramatically over the past two years. OpenAI’s API pricing for GPT-4 class models dropped multiple times through 2024 and 2025. Anthropic has followed similar pricing trajectories. Google’s Gemini pricing has been aggressive from launch. And open-source alternatives running on optimized infrastructure have pushed effective per-token costs even lower.

But the headline price cuts understate what is actually happening. The real story is not that companies are cutting prices. It is that the underlying cost of inference — the actual compute, memory, and energy required to generate a token — is declining across multiple simultaneous vectors. Quantization, speculative decoding, model distillation, hardware improvements, and serving software optimization are all improving at the same time. Their effects compound.

This compounding cost decline is the most economically significant trend in AI. It determines which applications become viable, which business models work, and which companies capture value. Understanding its mechanics — and its limits — is essential for anyone making decisions in the AI economy.

The Compounding Optimization Stack

Quantization: Less Precision, More Throughput

Quantization reduces the numerical precision of a model’s weights, from 16-bit floating point (FP16) to 8-bit integers (INT8), 4-bit integers (INT4), or even lower. Because token-by-token generation at small batch sizes is typically memory-bandwidth-bound (each new token requires reading the model’s weights from memory), shrinking each weight by half or more translates almost directly into proportional throughput improvements in that regime.

The early concern with quantization was quality degradation. A model running at 4-bit precision was expected to produce noticeably worse outputs than the same model at 16-bit. This concern was valid for early quantization techniques, but the field has advanced rapidly.

Modern quantization methods — GPTQ, AWQ (Activation-aware Weight Quantization), and their successors — calibrate the quantization process against representative data, distributing precision loss across the model in ways that minimize the impact on output quality. Research has consistently shown that 8-bit quantization produces outputs that are essentially indistinguishable from full precision for most practical tasks. 4-bit quantization introduces measurable degradation on certain benchmarks but remains acceptable for many production use cases.
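
To make the precision trade-off concrete, here is a minimal sketch of plain group-wise round-to-nearest quantization. It is deliberately simpler than GPTQ or AWQ (there is no calibration data), and the weight distribution, group size, and function names are illustrative assumptions rather than anything from a real library.

```python
import random

def quantize_group(weights, bits):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Each weight becomes a small integer code; the scale is stored once per group.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_group(codes, scale):
    """Recover approximate weights from integer codes."""
    return [c * scale for c in codes]

def mean_abs_error(weights, bits, group_size=64):
    """Average reconstruction error after a quantize/dequantize round trip."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale = quantize_group(group, bits)
        restored = dequantize_group(codes, scale)
        total += sum(abs(a - b) for a, b in zip(group, restored))
    return total / len(weights)

# Synthetic weights (assumption): Gaussian with a small standard deviation.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
err8 = mean_abs_error(weights, bits=8)
err4 = mean_abs_error(weights, bits=4)
print(f"mean abs error, 8-bit: {err8:.6f}")
print(f"mean abs error, 4-bit: {err4:.6f}")
```

The 4-bit error comes out noticeably larger than the 8-bit error, which is the gap that calibration-based methods try to shrink by spending precision where it matters most.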

The economic impact is substantial. A model quantized from FP16 to INT4 needs roughly one-quarter of the memory traffic for weights per token, which in bandwidth-bound serving can translate to as much as 3-4x more tokens per second on the same hardware, after accounting for the overhead of dequantization operations. Gains are smaller when a workload is compute-bound, for example at large batch sizes. For an inference provider serving millions of requests per day, this optimization alone can reduce compute costs by more than half.
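
The bandwidth arithmetic can be sketched in a few lines. The model size and memory bandwidth below are hypothetical round numbers, and the estimate is an idealized ceiling for a single decoding stream: it ignores dequantization overhead, KV-cache traffic, and batching.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights at a given precision."""
    return num_params * bits_per_weight / 8

def max_tokens_per_second(num_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound when every weight is read from memory once per token."""
    return bandwidth_bytes_per_s / weight_bytes(num_params, bits_per_weight)

PARAMS = 70e9          # hypothetical 70B-parameter model
BANDWIDTH = 3.0e12     # hypothetical accelerator with 3 TB/s of memory bandwidth

for bits in (16, 8, 4):
    gb = weight_bytes(PARAMS, bits) / 1e9
    tps = max_tokens_per_second(PARAMS, bits, BANDWIDTH)
    print(f"{bits:>2}-bit: {gb:6.1f} GB of weights, at most {tps:5.1f} tokens/s per stream")
```

Under these assumptions the ceiling scales exactly 4x from 16-bit to 4-bit; real deployments land below that ceiling once overheads are included.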

Speculative Decoding: Parallelizing the Sequential

Language model inference has an inherent sequential bottleneck: each token is generated one at a time, with each new token depending on all previous tokens. This sequential nature means that even with unlimited compute, generation speed is limited by the time to produce each individual token.

Speculative decoding partially circumvents this limitation. The approach uses a small, fast “draft” model to propose several future tokens cheaply. The larger, more capable “verifier” model (often called the target model) then checks all of these proposals in parallel, in a single forward pass that is only marginally more expensive than generating a single token. When the draft model’s predictions are correct (which happens frequently for common phrases and predictable text), multiple tokens are effectively generated for the cost of one verification pass. When a prediction is wrong, the verifier’s own token replaces it and the remaining draft tokens are discarded.

The speedup varies with the task and the quality match between draft and verifier models, but improvements of 2-3x in token generation speed are common, with some configurations achieving higher multiples on predictable text. The key insight is that speculative decoding improves speed without changing the model’s outputs: under greedy decoding the final text is identical to what the verifier would have produced on its own, and under sampling the acceptance rule preserves the verifier’s output distribution.
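
The draft-then-verify loop can be sketched with toy models. This is a greedy-decoding illustration only: `target_next` and `draft_next` are hypothetical stand-ins for real models, tokens are plain integers, and the "single forward pass" of the large model is simulated by scoring each drafted position in turn.

```python
def target_next(context):
    """Toy 'large' model: deterministic next token from the context (hypothetical)."""
    return (sum(context) * 31 + len(context)) % 50

def draft_next(context):
    """Toy 'small' model: agrees with the target except at every 4th position."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50

def plain_decode(prompt, n_tokens):
    """Baseline: one large-model pass per generated token."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(target_next(seq))
    return seq[len(prompt):]

def speculative_decode(prompt, n_tokens, k=4):
    """Greedy speculative decoding; returns tokens and the number of large-model passes."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < n_tokens:
        # 1. The draft model proposes k tokens, one after another (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. One large-model pass scores every drafted position.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(seq + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)   # the target's token replaces the miss
                break
        else:
            # All k drafts accepted: the same pass also yields one extra token.
            accepted.append(target_next(seq + draft))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_tokens], target_passes

prompt = [1, 2, 3]
tokens, passes = speculative_decode(prompt, 32)
print("identical to plain decoding:", tokens == plain_decode(prompt, 32))
print("large-model passes:", passes, "instead of 32")
```

Every accepted token is checked against what the large model would have chosen, so the output matches plain decoding while the number of large-model passes drops. Sampling-based variants need a probabilistic acceptance rule in place of the equality check used here.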

Speculative decoding has moved from a research technique to a production optimization. Major inference providers have integrated variants of speculative decoding into their serving infrastructure, and open-source serving frameworks like vLLM and TensorRT-LLM support it as a standard feature.

Model Distillation: Smaller Models, Preserved Capability

Distillation trains a smaller “student” model to replicate the behavior of a larger “teacher” model. The student learns from the teacher’s outputs rather than from the original training data, effectively compressing the teacher’s capabilities into a more compact and efficient architecture.
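
One common way to express "learning from the teacher’s outputs" is a loss that compares the student’s predicted token distribution with the teacher’s. The sketch below assumes the soft-label variant with a temperature; the logits and the temperature value are made up for illustration, and production pipelines may instead train on teacher-generated text.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [4.0, 1.5, 0.2, -1.0]            # hypothetical logits over a 4-token vocabulary
good_student = [3.8, 1.6, 0.1, -0.9]
poor_student = [0.0, 3.0, 1.0, 2.0]
print(f"loss, close student: {distillation_loss(good_student, teacher):.4f}")
print(f"loss, poor student:  {distillation_loss(poor_student, teacher):.4f}")
```

Training lowers this loss, pulling the student’s distribution toward the teacher’s across the whole vocabulary rather than only toward a single correct token.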

Distillation has accelerated as frontier model providers release families of models at different size points. OpenAI’s GPT-4o mini, Anthropic’s Claude Haiku, Google’s Gemini Flash, and Meta’s smaller LLaMA variants are smaller, cheaper siblings of larger models. Whether they get there through distillation or through other architecture and training optimizations, they sacrifice some capability in exchange for dramatically lower inference costs.

For many production applications, these smaller models are sufficient. A customer service chatbot, a document summarization pipeline, or a code completion tool often does not need the full capability of a frontier model. Routing requests to the smallest model that can handle them effectively — a technique increasingly called “model routing” — can reduce average inference costs by large margins compared to using a single large model for all requests.
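
A rough sketch of the routing arithmetic follows. The prices, the traffic mix, and the routing rule are all hypothetical; a real router would more likely use a classifier or a confidence signal than a fixed rule.

```python
# Hypothetical prices in dollars per million tokens.
PRICE_PER_M_TOKENS = {"small": 0.50, "large": 10.00}

def route(request):
    """Hypothetical rule: short, routine requests go to the small model."""
    hard = request["needs_reasoning"] or request["tokens"] > 4000
    return "large" if hard else "small"

def cost(requests, chooser):
    """Total cost when `chooser` picks a model for each request."""
    return sum(r["tokens"] / 1e6 * PRICE_PER_M_TOKENS[chooser(r)] for r in requests)

# Illustrative traffic: 90 routine requests and 10 hard ones.
requests = (
    [{"tokens": 800, "needs_reasoning": False}] * 90
    + [{"tokens": 6000, "needs_reasoning": True}] * 10
)
always_large = cost(requests, lambda r: "large")
routed = cost(requests, route)
print(f"always large: ${always_large:.3f}")
print(f"routed:       ${routed:.3f}")
print(f"savings:      {1 - routed / always_large:.0%}")
```

With this mix, routing roughly halves the bill. The savings depend entirely on how much traffic the small model can handle acceptably, which is an empirical question for each application.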

Hardware Competition: Breaking the NVIDIA Monopoly

NVIDIA’s dominance in AI accelerators has been challenged by a growing field of competitors, and this competition is driving down the cost of inference compute.

AMD’s MI300X GPU entered the market as a direct competitor to NVIDIA’s H100, offering competitive performance at aggressive pricing. Cloud providers including Microsoft Azure and Oracle Cloud have deployed MI300X instances, giving developers an alternative to NVIDIA hardware for inference workloads.

Custom silicon from cloud providers (AWS’s Inferentia, Google’s TPUs, Microsoft’s Maia) is designed around AI workloads rather than general-purpose compute. Inferentia targets inference specifically, while TPUs and Maia serve both training and inference. By optimizing for the specific characteristics of transformer workloads, these chips can achieve lower cost-per-token than general-purpose GPUs.

Purpose-built inference accelerators from companies like Groq, which has demonstrated inference speeds dramatically faster than GPU-based alternatives, apply competitive pressure from a different direction — not just on cost, but on latency.

The net effect of hardware competition is that inference compute is becoming more of a commodity. As multiple viable hardware options emerge, cloud providers can negotiate better pricing, and end users benefit from competitive pressure across the stack.

Serving Software: The Invisible Optimizer

Perhaps the least visible but most impactful cost reduction has come from improvements in inference serving software.

The vLLM project, originally developed at UC Berkeley, introduced PagedAttention, a memory management technique that dramatically improves GPU memory utilization during inference by borrowing concepts from operating system virtual memory. Before PagedAttention, serving systems typically reserved the key-value (KV) cache for each request as one large contiguous block sized for the longest possible output, leading to significant waste. PagedAttention splits the cache into small fixed-size blocks that are allocated on demand and can be shared between requests, increasing the number of concurrent requests a single GPU can serve.
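
The memory-management idea can be illustrated with a toy block allocator. This is not vLLM’s implementation: the block size, class name, and request lengths are assumptions, and block sharing between requests is left out for brevity.

```python
import math

BLOCK_TOKENS = 16  # tokens per cache block (illustrative)

def contiguous_slots(request_lengths, max_seq_len):
    """Token slots reserved when each request pre-allocates for the maximum length."""
    return len(request_lengths) * max_seq_len

def paged_slots(request_lengths, block_tokens=BLOCK_TOKENS):
    """Token slots reserved when fixed-size blocks are handed out on demand."""
    return sum(math.ceil(n / block_tokens) * block_tokens for n in request_lengths)

class BlockAllocator:
    """Minimal free-list allocator with a per-request block table."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}    # request id -> list of physical block ids
        self.lengths = {}   # request id -> tokens stored so far

    def append_token(self, req_id):
        n = self.lengths.get(req_id, 0)
        if n % BLOCK_TOKENS == 0:           # current block is full, or none exists yet
            if not self.free:
                raise MemoryError("no free cache blocks")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Return a finished request's blocks to the free list."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

lengths = [100, 37, 512, 9]                 # hypothetical tokens held per request
print("contiguous:", contiguous_slots(lengths, max_seq_len=2048), "slots")
print("paged:     ", paged_slots(lengths), "slots")
```

For these four requests, pre-reserving 2,048 slots each uses 8,192 slots, while on-demand blocks use 688. The freed memory is what lets more requests run concurrently on the same GPU.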

Continuous batching replaced the static batching approach where all requests in a batch must wait for the longest to complete. With continuous batching, requests enter and exit the serving pipeline independently, which increases throughput substantially for workloads with variable sequence lengths.
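
A small simulation shows why this matters for workloads with uneven output lengths. It counts decode steps only, treats every step as equally expensive, and uses made-up request lengths, so it is a sketch of the scheduling effect rather than a performance model.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued request."""
    queue = deque(lengths)
    active = []            # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots before the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1                                   # one decode step for all active requests
        active = [r - 1 for r in active if r > 1]    # drop requests that just finished
    return steps

# Hypothetical output lengths (tokens) for eight requests, served four at a time.
lengths = [10, 200, 15, 12, 180, 8, 20, 14]
print("static batching:    ", static_batching_steps(lengths, 4), "steps")
print("continuous batching:", continuous_batching_steps(lengths, 4), "steps")
```

Here static batching needs 380 steps because short requests wait behind the two long ones, while continuous batching finishes the same work in 200 steps. When all requests have the same length, the two approaches take the same number of steps.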

Kernel-level optimizations — custom CUDA kernels for attention computation, fused operations that reduce memory roundtrips, and optimized memory access patterns — collectively squeeze additional efficiency from existing hardware. Libraries like FlashAttention have made attention computation both faster and more memory-efficient, enabling longer context windows without proportional cost increases.
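
The idea behind tiled attention kernels such as FlashAttention can be sketched in plain Python: process keys and values one tile at a time while keeping a running maximum and running sum, so the full score vector never has to be stored. The sketch handles a single query, omits the usual scaling by the square root of the head dimension, and runs on the CPU; the function names and inputs are illustrative.

```python
import math

def attention_reference(q, keys, values):
    """Single-query attention that materializes every score first."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def attention_tiled(q, keys, values, tile=2):
    """Same result, processing keys and values tile by tile with running statistics."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)   # rescale earlier partial results
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0, 0.25]
keys = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5], [-1.0, 2.0, 0.0], [3.0, 1.0, -1.0], [0.0, 0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_reference(q, keys, values))
print(attention_tiled(q, keys, values, tile=2))
```

Both functions produce the same output up to floating-point rounding. On a GPU, the tiled form lets each tile stay in fast on-chip memory, which is where the savings in memory traffic come from.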

These software optimizations are particularly significant because they apply retroactively to existing hardware. A GPU that was installed a year ago becomes more productive through software updates alone, extending the useful economic life of AI infrastructure.

The Compounding Effect

Each of these optimization techniques provides a meaningful cost reduction individually. But they are not mutually exclusive — they compound.

A model that is quantized to 4-bit precision, served with speculative decoding, running on hardware optimized for inference, using continuous batching and PagedAttention, generates tokens at a fraction of the cost of the same model running naively on general-purpose hardware. Because several of these techniques relieve the same memory-bandwidth bottleneck, their gains do not multiply perfectly, but the combined effect can still represent a 10x or greater reduction in cost-per-token compared to an unoptimized deployment.
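
As a back-of-the-envelope illustration of compounding, the sketch below multiplies a few cost-reduction factors. The individual factors are hypothetical, and multiplying them assumes the gains are independent, which makes the result an optimistic estimate.

```python
import math

# Hypothetical cost-reduction factors; none of these are measurements.
factors = {
    "4-bit quantization": 3.0,
    "speculative decoding": 2.0,
    "serving software (batching, paging)": 2.0,
}

# Independent gains multiply: cost falls by the product of the factors.
combined = math.prod(factors.values())
baseline_cost = 10.00                      # hypothetical dollars per million tokens
optimized_cost = baseline_cost / combined

for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x")
print(f"combined: {combined:.1f}x, ${baseline_cost:.2f} -> ${optimized_cost:.2f} per million tokens")
```

Three moderate improvements of 2-3x each already combine to 12x under these assumptions, before any hardware price differences are counted.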

And critically, each optimization vector is still improving. Quantization techniques are advancing toward lower bit widths with better quality preservation. Speculative decoding algorithms are becoming more effective. New hardware generations deliver better performance per dollar. Serving software continues to find efficiency gains. The cost curve is not flattening — if anything, the rate of decline is accelerating as more engineering talent and investment flow into inference optimization.

What Happens When Intelligence Gets Cheap

The trajectory of inference cost deflation has profound implications for the AI ecosystem.

Applications that were economically impossible become viable. When inference costs were high, AI was economical only for high-value tasks — legal document review, financial analysis, enterprise software. As costs decline, AI becomes viable for lower-value but higher-volume applications: real-time content moderation at scale, AI-powered search across every query, continuous monitoring and analysis systems, personalized education at global scale.

The thin-wrapper problem intensifies. If the cost of the model itself trends toward near-zero, applications that provide a thin wrapper around a model API have no defensible margin. Value accrues to applications that add genuine product value beyond the model: proprietary data, domain-specific workflows, distribution advantages, or user experience that the model alone cannot provide.

Open-source models become more competitive. One of the key advantages of proprietary API models has been that the provider handles the complexity and cost of optimized inference infrastructure. As inference optimization tools mature and become accessible through open-source projects, the operational overhead of self-hosting shrinks, and the cost advantage of proprietary APIs narrows. For organizations with sufficient technical capability, running optimized open models on commodity hardware can be cheaper than API access.

Inference volume becomes the scaling axis. When per-token costs are high, companies economize on inference — shorter prompts, fewer model calls, more client-side processing. When costs drop, the natural response is to use more inference: longer context windows, chain-of-thought reasoning, multiple model calls per user interaction, AI agents that iterate on tasks. This is the classic Jevons paradox applied to AI — lower costs per unit lead to higher total consumption, potentially faster than costs decline.
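
The Jevons effect reduces to simple arithmetic: total spend is price times volume. The prices and volumes below are invented purely to show the shape of the effect.

```python
def total_spend(price_per_m_tokens, tokens_millions):
    """Total inference spend: unit price times volume."""
    return price_per_m_tokens * tokens_millions

# Hypothetical before/after: price falls 10x while usage grows 15x.
before = total_spend(price_per_m_tokens=10.0, tokens_millions=100)
after = total_spend(price_per_m_tokens=1.0, tokens_millions=1500)

print(f"before: ${before:,.0f}")
print(f"after:  ${after:,.0f}")
print(f"spend changed by {after / before:.1f}x despite a 10x price cut")
```

Whenever volume grows by more than the price falls, total spending on inference rises even as each token gets cheaper.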

The Limits of Deflation

The inference cost trajectory is not without limits.

Energy costs represent a floor. Every inference operation requires electricity for computation and cooling. As AI workloads grow, competition for data center power capacity is intensifying, and energy prices for data centers are rising in many markets. Even perfectly optimized inference still consumes energy.

Quality expectations are rising. Users and applications demand longer context windows, more complex reasoning, and more capable models over time. A model that is 10x cheaper per token may not reduce costs in practice if each interaction now consumes far more tokens: users get more value per interaction, but at a higher cost per interaction.

Memory bandwidth, as discussed elsewhere in this publication, imposes a physics-based constraint on inference throughput that cannot be fully solved through software optimization alone.

Still, the dominant trend is clear. The cost of a unit of AI inference is falling rapidly, the decline is driven by multiple independent and compounding factors, and the trajectory shows no signs of reversing. The companies, applications, and business models that will define the next phase of the AI industry are the ones being designed for a world where intelligence is abundant and cheap — not scarce and expensive.

The deflation engine is running. The question is no longer whether AI inference will be affordable. It is what happens to every industry when it is.
