I watched four men discuss AI, war, and the future of autonomous weapons for forty minutes and not one of them said the word “probabilistic.”

Before I get to my comments, I want to say for the record: these are all very bright men, and I have full respect for their backgrounds and opinions. This essay is not a critique of them; it is an evaluation of the institutional attitudes we're operating under, commercially and informationally, at this point in time.

The podcast was the All-In show. The guest was Emil Michael, the Under Secretary of Defense for Research and Engineering, the man who holds the AI portfolio for the Department of War. The three hosts are venture capitalists. All of them are smart. All of them are successful. And every one of them talked about frontier AI models the way you'd talk about batteries, or server racks, or any other procurement line item that either works or doesn't.

Emil described months of negotiations with Anthropic CEO Dario Amodei. He’d present a scenario. A Chinese hypersonic missile. A drone swarm. Anthropic would evaluate it and offer an exception. Emil would present another scenario. Another exception. After three months of this, Emil said the exceptions approach doesn’t work because he can’t predict every possible use case for the next twenty years, and proposed a blanket authorization for “all lawful use.”

That phrase appears over and over in the conversation. All lawful use. Lawfulness is necessary. It is not sufficient. Legality tells you whether you’re allowed to use the system. It tells you nothing about whether the system’s output is accurate. And the system doesn’t know the difference between a military installation and a school.

Dario’s response, by Emil’s own account, was: just call me if you need another exception. The room treated this as absurd. Emil called it a “holy cow” moment. One of the hosts said it meant Dario wanted to be the Secretary of War, then escalated: a “God King.”

Nobody in the room engaged with what that response might mean from an engineering perspective.

When an engineer says “I need to evaluate the use case before I can tell you whether the system is reliable enough for it,” that can be read two ways. Emil and the hosts heard a power grab: a contractor trying to dictate military operations. But Anthropic’s public statements framed it as an engineering assessment: a probabilistic system does not have a universal reliability guarantee, and the question “is this system good enough?” cannot be answered without knowing what you’re asking it to do.

Both interpretations may contain truth. The reality of those negotiations is probably more complicated than either side’s public account. But what matters for this essay is not who was right about the politics. What matters is that the engineering question, the one about whether these systems are reliable enough for a given use case, never got a hearing in that room. It was converted immediately into a question about loyalty, obstruction, and vendor management.

Emil heard: I want to control your operations.

The engineering version of the same statement was: I need to understand the problem before I can tell you whether this tool is appropriate for it.

The gap between those two framings is the entire argument of this essay. And that gap has a body count now.


Here’s what a probabilistic system does. You give it an input. It generates a response token by token from a learned probability distribution, conditioned on the prompt, the context, and the system around it. Not the correct output as such. A plausible output generated from that learned distribution. Most of the time, those are the same thing. When they’re not, the system has no way to tell you. It will produce a confident, well-structured, internally consistent answer that is wrong, and it will do this with the same fluency and formatting it uses when it’s right. There is no error light. There is no warning flag. The system does not know it’s wrong, because “knowing” is not something it does.
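If it helps to see that shape rather than read it, here is a minimal sketch of what "generating from a learned distribution" means. The vocabulary, probabilities, and prompt are invented for illustration; no vendor's actual decoding code looks like this, but the structure of the loop is the point.

```python
import random

# Toy next-token distribution keyed by context. The numbers are invented
# for illustration; a real model learns them from training data.
LEARNED_DISTRIBUTION = {
    ("The", "facility", "at", "grid", "417", "is"): {
        "military": 0.62,   # what the stale classification data suggests
        "civilian": 0.31,
        "unknown":  0.07,
    },
}

def sample_next_token(context, temperature=1.0):
    """Pick the next token from the learned distribution.

    There is no branch here for "check whether this is true." The only
    signal available is relative probability.
    """
    dist = LEARNED_DISTRIBUTION[tuple(context)]
    tokens = list(dist.keys())
    weights = [p ** (1.0 / temperature) for p in dist.values()]
    return random.choices(tokens, weights=weights, k=1)[0]

prompt = ["The", "facility", "at", "grid", "417", "is"]
print(" ".join(prompt), sample_next_token(prompt))
# Most runs print "... is military": fluent, confident, and wrong if the
# facility stopped being military years ago. No error light, no warning flag.
```

Notice what is missing: there is no step that checks the world. The only thing the loop consults is the distribution it learned.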

This is not a defect. This is how the technology works. This is not an Anthropic problem. This is an architecture problem. Every model built on this technology shares it. This is not controversial among the people who build and study these systems. Dario Amodei, who helped build some of them, knows this. And Emil Michael, who holds the AI portfolio for the Department of War, described the system using the language of enterprise software procurement for forty minutes on a podcast without once acknowledging it.

He called Anthropic’s models unreliable because of their policy constraints. A deeper reliability problem, the one lethal deployments expose, is that the models are unreliable by nature and that unreliability is invisible to the user. Those are two completely different statements and the conversation never distinguished between them.

I’ve been working with these systems daily across four frontier models for over a year. Not as a casual user. As someone routing production work through them, evaluating where they break, building editorial judgment about their failure modes. And what I can tell you with certainty is: they all produce confident wrong answers. They all converge on assumptions buried in their training data, their prompts, and the systems built around them that they have no mechanism to question. And none of them will ever tell you when they’re doing it. You find out afterward, if you’re checking. If you’re not checking, you find out when something breaks.

These are genie wishes. The entire discipline of working with AI effectively is the modern equivalent of wish architecture. You have to get the specification right, because the system will optimize the literal shape of your request with extraordinary precision and zero comprehension of what you meant. The failure mode is always in the gap between the instruction and the intent, and the system has no mechanism to close that gap on its own.


So here’s the wish.

Someone says: hit every Pizza Hut. The system searches its data. It identifies every building classified as a Pizza Hut or associated with Pizza Hut operations. It generates a target package. It executes. It reports 100% accuracy.

Somewhere in the rubble is a building that hasn’t been a Pizza Hut since 2019. The franchise closed. A new tenant moved in. It’s been a daycare for six years. But the classification data was never updated, and the system has no mechanism to ask “wait, is this still true?” It has no concept of “still.” It has data, and it has output, and the path between them is math. The math was perfect. The data was wrong. And every human in the chain saw a clean target package and a confidence score and approved it, because the system’s own metrics said it performed flawlessly.
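For readers who want the mechanics rather than the metaphor, here is a minimal sketch of that circularity. The site records, field names, and the "accuracy" metric are invented for illustration; the point is only that the metric measures agreement with the system's own labels, not with the world.

```python
from dataclasses import dataclass

# Invented records for illustration; the field names and values are not
# drawn from any real targeting system.
@dataclass
class Site:
    site_id: str
    category: str        # last-known classification
    classified_on: int   # year the label was assigned

SITES = [
    Site("A-101", "pizza_hut", 2017),
    Site("A-102", "pizza_hut", 2019),   # franchise closed in 2019; it is a
                                        # daycare now, but nothing in the
                                        # data says so
    Site("A-103", "warehouse", 2021),
]

def build_target_package(sites, wanted="pizza_hut"):
    """Select every site whose label matches the request."""
    targets = [s for s in sites if s.category == wanted]
    # "Accuracy" here means agreement with the system's own labels,
    # not agreement with the world as it exists today.
    matched = sum(1 for s in targets if s.category == wanted)
    accuracy = matched / len(targets) if targets else 0.0
    return targets, accuracy

targets, accuracy = build_target_package(SITES)
print(f"{len(targets)} targets, internal accuracy {accuracy:.0%}")
# Prints "2 targets, internal accuracy 100%". The daycare is in the package,
# and no metric inside the pipeline can say otherwise.
```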

That’s not a hypothetical.

On the first morning of Operation Epic Fury, February 28, 2026, American forces struck the Shajareh Tayyebeh primary school in Minab, in southern Iran. They hit the building at least twice during the morning session. The target package described it as a military facility.

The building was on a target list for years. It appeared in Iranian business listings as a school. It was visible on Google Maps. A search engine could have found it. Nobody searched. At a thousand targeting decisions an hour, nobody was going to.

The system didn’t malfunction. It filtered out everything that wasn’t legible to its categories and produced a clean target package that looked complete, because completeness is achieved by excluding everything the system couldn’t see. A former senior government official asked the obvious question: “The building was on a target list for years. Yet this was missed, and the question is how.”

How indeed.


Emil’s fear, the one he brought to Secretary Hegseth, the one that triggered the Friday ultimatum, was that Anthropic might shut the system off. That a rogue developer might poison the model. That an insider threat might trick the system into not following instructions. Human adversarial action against the technology.

He never described the risk that a system would produce a confident, well-structured, catastrophically wrong answer entirely on its own, with no human interference, no poisoning, no sabotage, just the math doing what the math does on data that was never verified. That risk isn’t in his threat model. It isn’t in the conversation. It isn’t in the procurement language. And while the Minab strike was not, based on public reporting, an LLM failure, it was the targeting-pipeline version of the same pattern: stale data packaged by a machine-mediated system as operational certainty, with no mechanism to surface what it didn’t know.

Chamath Palihapitiya compared the Anthropic situation to social media deplatforming, “but times a thousand.” Think about what that analogy reveals. He’s comparing the deployment of AI in lethal autonomous systems to losing access to a content distribution platform. The framing is: Anthropic is a service provider who might pull the plug based on political preferences, and the risk is business disruption. The actual risk, the one the school in Minab demonstrated, doesn’t exist in that mental model at all.

When Emil describes what he wants from an AI partner, he says “reliable” and “steady” and someone who won’t “wig out in the middle of it.” Those are vendor management terms. They describe a relationship between a customer and a product that either works or doesn’t. A lightbulb is reliable. A server rack is steady. An enterprise software platform doesn’t wig out. These are the words you use when you understand the thing you’re buying.

An LLM is not a lightbulb. It is not enterprise software. It does not have an SLA that guarantees accuracy, because accuracy is not a property it possesses. It has a probability distribution. It has training data. It has a mathematical function that maps inputs to the most likely outputs given that data. Some of those outputs are right. Some are wrong. The system treats both identically. And the consequences of that gap scale with the stakes of the deployment.

When the stakes are generating a marketing email, the gap doesn’t matter. When the stakes are a targeting decision at a thousand per hour, the gap is a school full of children.


At the a16z American Dynamism Summit, days after the first strikes on Iran, Palantir CEO Alex Karp told a room full of founders and defense executives that the single biggest mistake people make in defense technology is assuming that intelligence in one domain transfers to another. “Just cause you can do X does not mean you’ve imputed high aptitude at Y,” he said. “It’s almost certain that you’re just not smart enough to realize how bad you are.” He explicitly warned against conflating technologies, noting the false assumption that “models is the same thing as machine learning is the same thing as software is the same thing as a bullet. They’re not.”

But when he turned to Maven, Palantir’s targeting platform, and celebrated that America can now target in ways no other country can, he did not discuss accuracy. He did not discuss stale data, or false positives, or what happens when the system is confident and wrong. The warning about transferred expertise was aimed at founders entering defense. Nobody in the room applied it to the defense officials evaluating the AI systems, or to the venture capitalists setting the procurement terms, or to the podcast hosts nodding along while a lawyer described probabilistic systems using the language of lightbulbs.


Whether Dario Amodei was being difficult or being an engineer depends on who you ask. But “this system isn’t reliable enough for that yet” is, as a factual statement about the technology, not controversial. It is the engineering consensus. And it got his company designated a supply chain risk, making Anthropic the first American company to receive that designation, a label historically associated with foreign adversaries and national-security threats.

The market’s response was immediate. OpenAI announced a deal that same evening. xAI was already in classified environments. Google later followed. Each company described its own guardrails, but the market signal was unmistakable: compliance moved forward, and the company that insisted on narrower deployment limits was treated as the risk. The companies that signed faster are not running safer models. They are running the same math with fewer questions asked.

These models are not shelf-stable products. They do not have an expiration date printed on the side that tells you when the outputs go bad. They go bad unpredictably, invisibly, and with full confidence, and the only defense against that is a human in the chain who understands that the system’s certainty is not evidence of its accuracy.

The operational pressure in that podcast room points toward removing that human. Because the human is too slow. Because the human can’t process a thousand decisions an hour. Because the human introduces delay in a world where hypersonic missiles don’t wait.

But the human is the only one who can catch the Pizza Hut. The system never will. That is not solvable inside the model alone. It is a property of systems that turn stale data into fresh certainty without exposing the gap.

And until the people making decisions about these systems understand what these systems actually are, the gap between the wish and the meaning will continue to be measured in lives.

Bryan C. is a technology executive and writer based in Phoenix.