The tech world has a favorite essay. Walk into any AI strategy meeting, and someone will inevitably reference Rich Sutton’s “The Bitter Lesson” to justify why they need more compute, more data, more hands. It’s become the go-to intellectual cover for brute-force approaches to AI, a sophisticated way of saying “just throw more resources at it.”

But here’s what separates serious AI practitioners from the pseudo-intellectuals: understanding that Sutton wrote another, equally important essay called “Verification”. If someone quotes the Bitter Lesson without acknowledging the Verification essay, you’re likely dealing with a person who hasn’t fully grokked what’s actually required to deploy AI successfully in the real world.

The Real Bottleneck Isn’t What You Think

Here’s something most practitioners don’t want to say out loud:

We are no longer limited by an LLM’s ability to consume and generate. We are now limited by our ability to review and verify what LLMs are generating.

This fundamental challenge faces every enterprise trying to deploy AI at scale. While the industry has converged on “LLM-as-Judge” as a stopgap solution, this approach won’t solve the underlying problem. The real answer lies in using environments and simulation to verify agents. This same paradigm has already enabled some of the most successful autonomous deployments affecting our physical world.
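To make the distinction concrete, here is a minimal sketch of environment-based verification: instead of asking a second LLM to grade the agent’s output, the environment itself checks the result against ground truth. Everything here (the toy `TicketTriageEnv`, the keyword-based `run_agent`) is an illustrative assumption, not a real system.

```python
import random

class TicketTriageEnv:
    """Toy environment: the agent must route a support ticket to the right queue."""

    def __init__(self, seed: int):
        rng = random.Random(seed)
        # Each episode samples a ticket with a known correct answer.
        self.ticket, self.expected_queue = rng.choice([
            ("Invoice charged twice", "billing"),
            ("API returning 500s for all users", "outage"),
            ("Suspicious login from new country", "security"),
        ])

    def verify(self, chosen_queue: str) -> bool:
        # Ground-truth check by the environment itself: no LLM judge involved.
        return chosen_queue == self.expected_queue

def run_agent(ticket: str) -> str:
    # Stand-in for a real LLM agent; here a trivial keyword policy.
    text = ticket.lower()
    if "invoice" in text:
        return "billing"
    if "login" in text:
        return "security"
    return "outage"

# Verification = many rollouts scored against the environment's ground truth.
results = [TicketTriageEnv(seed) for seed in range(100)]
outcomes = [env.verify(run_agent(env.ticket)) for env in results]
success_rate = sum(outcomes) / len(outcomes)
```

The point of the sketch: the environment, not another model, owns the definition of success, which is exactly the pattern the physical-world deployments below rely on.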

Learning from Real-World Autonomous Systems

While we struggle with prompt engineering and agent architectures, neural network-based systems are already delivering massive efficiency gains in high-stakes physical environments. Let’s look at three groundbreaking examples:

| Domain | What the autonomous system controls | Real-world payoff |
|---|---|---|
| Data-center cooling (Google, Meta) | Fan speeds, chiller set-points, pump flow rates | Cut cooling energy use by up to 40% and total facility power by ~15%. Google lets the agent act directly, while a human operator watches the safety constraints. (deepmind.google, wired.com, engineering.fb.com) |
| Nuclear-fusion plasma (EPFL TCV tokamak) | Currents in 19 magnetic coils every millisecond | Maintains complex X-point and “snowflake” plasmas for seconds and even creates shapes no human expert has tried before, an essential step toward steady-state fusion. (nature.com, deepmind.google) |
| Chip floor-planning (Google TPUs) | Placement order of ~10,000 blocks on silicon | Produces production-ready layouts in under 6 hours that match or beat senior engineers on power, performance, and area; the method has shipped in several TPU generations. (nature.com, wired.com) |

The Common Patterns

What made these deployments successful? Four critical patterns:

1. High-Stakes, High-Dimensional Control: These systems juggle thousands of interacting variables in real time, something rule-based controllers can’t handle.

2. Sample-Efficient Training: None of these agents learned by “just trying things” in production. They relied on simulators, offline logs, and domain randomization before being fine-tuned on-site.

3. Human-Plus-AI Workflows: The RL policy handles microsecond control; humans set goals and safety constraints.

4. Generalization Beyond Training: These systems demonstrated transferable learning. For example, the tokamak controller created entirely novel plasma configurations.
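Pattern 2 deserves a quick illustration. Domain randomization means every training episode samples a different environment configuration, so the policy cannot overfit to one simulator setup. The parameter names and ranges below are invented for the sketch; a real cooling simulator would expose far more state.

```python
import random

def make_cooling_env(rng: random.Random) -> dict:
    # Hypothetical randomized parameters for a data-center cooling simulator.
    return {
        "ambient_temp_c": rng.uniform(15.0, 40.0),   # outside weather varies
        "server_load_pct": rng.uniform(20.0, 95.0),  # IT load varies
        "sensor_noise_std": rng.uniform(0.0, 0.5),   # sensors are imperfect
    }

rng = random.Random(0)
# Each training episode gets its own configuration drawn from these ranges.
envs = [make_cooling_env(rng) for _ in range(1000)]
temps = [e["ambient_temp_c"] for e in envs]
```

A policy that performs well across the whole sampled distribution has a far better chance of surviving contact with the real facility than one tuned to a single nominal configuration.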

The Critical Insight: Where Human Expertise Actually Matters

Here’s where the Bitter Lesson gets interesting. While these successful deployments used general-purpose learning methods (neural networks), human expertise wasn’t eliminated. Instead, it was relocated. Domain experts invested their knowledge into:

  • High-fidelity simulators that accurately model the problem space
  • Reward engineering that encodes what success actually looks like
  • Safety constraints that prevent catastrophic failures

The neural networks themselves remain uninterpretable black boxes. But by embedding human knowledge into the verification and simulation layer, we achieve both the scalability of learned systems and the reliability enterprises require.

| Real-world deployment | Core simulator / toolkit | What it models |
|---|---|---|
| Data-center cooling (Meta) | Physics-based building-energy model (custom, similar to DOE-2/EnergyPlus) | Air flow, heat, humidity, fan power, water usage |
| Tokamak plasma control (Google DeepMind + EPFL) | High-fidelity magnetic-equilibrium codes (EFIT/SPIDER) plus TORAX, a fast differentiable transport model written in JAX | Full plasma current-density evolution and coil circuit dynamics at ~10 kHz |
| Chip floor-planning (AlphaChip / “Circuit Training”) | C++ cost binary (PLC wrapper) + DREAMPlace placer + netlist protobuf | Wire length, congestion, density in sub-10 nm blocks |

From Physical Systems to LLM Agents

This brings us to today’s challenge with LLM-based agents. There is no perfect prompt. There is no perfect agent architecture. But there exists a prompt and architecture that will get you in the ballpark, and that’s what matters.

We need to leave the confines of deterministic unit tests and enter a world of probability distributions. The question becomes “Is the likelihood of success high enough for my use case?” rather than demanding perfection every time. This requires robust evaluation systems backed by diverse sample sets. Think of it as good old statistics, reimagined for the age of AI.
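The statistical framing above can be made concrete with a standard tool: a confidence interval on the agent’s success rate over an evaluation set. The sketch below uses the Wilson score interval; the acceptance bar of 85% and the sample counts are made-up numbers for illustration.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score confidence interval for a success probability."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)
    )
    return (centre - margin, centre + margin)

# "Is the likelihood of success high enough for my use case?"
# Suppose the agent passed 183 of 200 evaluation scenarios.
lo, hi = wilson_interval(successes=183, trials=200)
meets_bar = lo >= 0.85  # require >=85% success, with 95% confidence
```

Instead of a brittle pass/fail unit test, the deployment decision becomes a statement about a distribution: the lower bound of the interval must clear the bar your use case demands.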

The Playbook for Enterprise AI

As we move toward increasingly autonomous systems, our efforts are best spent building verification systems and high-fidelity simulators that align models to our definition of utility and intelligence. This approach provides the scalable way to overcome our current limitation: our inability to review and verify AI outputs at the pace they’re generated.

The playbook is clear:

  1. Invest in defining effective metrics, criteria, and evaluations
  2. Build simulation as the scalable way to align prompts, agent architectures, and models
  3. Leverage domain expertise in the verification layer, not the policy itself
  4. Design reward systems that encode your actual business objectives
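Step 4 is where most teams underinvest, so here is a minimal sketch of what “encoding business objectives” can look like: a reward function combining the objective with hard safety constraints. The field names and weights are illustrative assumptions, not a prescribed schema.

```python
def reward(outcome: dict) -> float:
    """Score one agent episode against business objectives and safety rules."""
    # Hard safety constraint: any policy violation dominates everything else.
    if outcome["violated_policy"]:
        return -1.0
    # Business objective: resolution, penalized by cost and latency proxies.
    return (
        1.0 * outcome["resolved"]
        - 0.1 * outcome["tool_calls"]   # each tool call costs money
        - 0.05 * outcome["latency_s"]   # slow responses hurt users
    )

good = reward({"violated_policy": False, "resolved": 1,
               "tool_calls": 3, "latency_s": 2.0})
bad = reward({"violated_policy": True, "resolved": 1,
              "tool_calls": 1, "latency_s": 1.0})
```

Note the design choice: safety is a constraint that short-circuits the reward, not just another weighted term, mirroring how the physical-world deployments keep humans in charge of the safety envelope.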

To be precise, here’s how this playbook stacks up against the warnings in the Bitter Lesson:

| Aspect | Bitter Lesson view | Actionable playbook step |
|---|---|---|
| Reward & constraints | Fine: they specify what we want, not how to achieve it. | Invest in defining effective metrics, criteria, and evaluations. |
| High-fidelity simulator | Fine: it generates data so general methods can learn; it is not part of the deployed policy. | Build simulation as the scalable way to align prompts, agent architectures, and models. |
| Hand-tuned features or decision rules inside the NN | Fails the lesson. | Invest domain expertise upstream, in the design of the autonomous system, not inside the learned policy. |

Building the Future at Kashikoi

At Kashikoi, we’re making this playbook accessible to every enterprise. We’re building the simulation and evaluation infrastructure that will enable the next generation of reliable, high-impact AI deployments.

The companies that win in the AI era won’t be those with the most compute or data. They’ll be those who master the art of verification through simulation. The Bitter Lesson is only half the story. The sweet victory comes from knowing where to apply human insight in the age of autonomous systems.

Stay tuned for exciting announcements about our enterprise collaborations and product releases. The future of verified AI is closer than you think.


Want to learn more about building robust AI systems for your enterprise? Contact us at founders [at] getkashikoi.com to discuss how simulation and verification can unlock AI’s potential for your organization.

