You Evaluate an AI Company by Its Boredom

Here is a due-diligence rule that has never failed me: you cannot evaluate an AI company by its AI.

The model is the commodity, and the demo is the easy part. What you actually want to know is whether the thing survives contact with reality. Reality is boring, so you evaluate the company by its boredom.

The demo is the easy part

88% of AI pilots never reach production, and a 2025 report from MIT's NANDA initiative put generative AI pilot failure at 95%. The median time from pilot approval to shutdown is 14 months.

None of these are model failures: the model that runs the dazzling demo is the same one that dies in production. The gap is not capability. It is everything the model depends on: the messy data, the legacy systems, the adversarial users, and the failures that don't look like failures until a customer finds them first.

A demo runs on clean data in a controlled room. Production runs on a Tuesday, when the upstream API changed its schema and nobody told you. The demo tells you the company can build a demo, and that is all it tells you.

What boredom looks like

So when I am asked to look at an AI company, I don't watch the demo. I ask boring questions.

Do you have a golden dataset, a versioned set of inputs and expected outputs you score against? How often do you run evals, and what happens when a number drops? What is your hallucination rate, and how do you measure it? When the model fails, does the system fall back, or does it confidently invent something? Who owns monitoring in production, and how did they find out about the last regression?
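For readers who have not seen one, here is roughly what the boring layer behind those questions can look like. This is a minimal sketch, not anyone's production system: the names (GoldenCase, GOLDEN_V1, score, gate, answer_with_fallback, toy_model) are all hypothetical, the exact-match scoring is a stand-in for whatever metric a real team would use, and the 0.02 tolerance is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    expected: str


# Hypothetical golden set. In practice it would be versioned alongside the code.
GOLDEN_V1 = [
    GoldenCase("2 + 2", "4"),
    GoldenCase("capital of France", "Paris"),
    GoldenCase("3 * 3", "9"),
]


def score(model: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Fraction of cases where the output exactly matches the expected answer."""
    hits = sum(1 for c in cases if model(c.prompt).strip() == c.expected)
    return hits / len(cases)


def gate(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the score has not dropped more than `tolerance` below baseline."""
    return current >= baseline - tolerance


def answer_with_fallback(
    model: Callable[[str], str], prompt: str, fallback: str = "I don't know."
) -> str:
    """Return the model's answer, or an explicit fallback if it errors or is empty."""
    try:
        out = model(prompt)
    except Exception:
        return fallback
    return out if out.strip() else fallback


def toy_model(prompt: str) -> str:
    # Stand-in for a real model call (assumption): knows two answers, blanks on the rest.
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "")


if __name__ == "__main__":
    current = score(toy_model, GOLDEN_V1)
    print(f"golden-set score: {current:.2f}")
    print("release allowed:", gate(current, baseline=1.0))
    print(answer_with_fallback(toy_model, "3 * 3"))
```

The point is not the twenty lines of code. It is that "what happens when a number drops" has an answer someone wrote down: the gate blocks the release, and the fallback says "I don't know" instead of inventing something.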

The serious companies light up at these questions, because they have answers, and the answers are specific and a little tedious, the way a surgeon is tedious about handwashing. The wrapper shops get uncomfortable and want to go back to the demo. (That discomfort is the entire report.)

Reliability is not capability

There is research behind this, not just instinct. Narayanan and Kapoor have spent years showing that reliability is an axis independent of capability. A more capable model does not hand you a more reliable product, because reliability gets built, with evals and monitoring and discipline, and a better foundation model does not give it to you for free.

Which means the interesting question in due diligence has nothing to do with which model they use. That part is settled the moment you walk in. What you came to see is everything wrapped around the model.

"We use the latest frontier model" is a restaurant bragging about its oven. Fine. Now show me the kitchen.

The honest version

The boring company looks worse in the room: the demo is shorter, the founder is less charismatic, the deck has a slide about evaluation infrastructure that glazes every eye in the building. And that company is the one still alive in three years.

If you are a founder reading this, build the boring layer and learn to make it visible, because the people writing the checks have finally started asking. If you are an investor, the demo is negative evidence, and the more it dazzles, the harder you should look at the kitchen.

Thirty years in this work taught me one durable thing: the exciting part is never where the value is. The value is in the boredom, and boredom is the hardest thing to fake.

— Pan