The plan was elegant: build a system so capable it could reason, write, code, and converse at superhuman speed, then charge a fraction of a cent per token and watch the future arrive. The future arrived. The invoice arrived with it. Nobody had budgeted for the invoice.
What the industry is now calling a “cost crisis” is, at its structural core, the discovery that inference — the part where the model actually does the thing — costs money every single time it does the thing. This is not a plot twist. This is arithmetic. The arithmetic was available before the keynotes. The keynotes went ahead anyway.
Now the scramble is on. Hyperscalers are quietly throttling context windows. Startups are “right-sizing” their token budgets, which is the industry's way of saying the unlimited free tier was always a coupon with an expiration date printed in the same color as the background. Prompt engineers — a job title that did not exist four years ago and may not survive four more — are being asked to write shorter prompts. The oracle has been asked to give shorter oracles.
The boardrooms are now full of people holding “stakeholder sessions” about “sustainable inference economics,” which is a phrase that means: we told everyone the meter wasn't running, and the meter was running. The sessions are themselves, of course, billed by the token.
The definitional gymnastics are something to behold. When the product is expensive to run, it becomes a “premium compute experience.” When usage must be capped, the cap is a “responsible deployment guardrail.” When a model that was promoted as the end of human cognitive labor can no longer afford to finish your email, that is a “cost-efficiency initiative aligned with long-term value creation.” The language is doing an enormous amount of physical labor for an industry that just discovered physical labor has a cost.
To be fair, the companies that built these systems did solve a genuinely hard problem. They built machines that can do almost anything you ask. The only thing the machines could not do was make that free. But that particular limitation was never demoed at the conference.
The real innovation, it turns out, was not the transformer architecture. It was the pitch deck that treated the electricity bill as a rounding error — right up until the moment the rounding error needed its own line item, its own task force, and its own emergency offsite, billed, presumably, to the cloud.