Bruno Gavranović


From Alchemy to Chemistry

Software testing can only prove the presence of bugs, not their absence.

This asymmetry limits how far we can scale. As software grows, each new component introduces failure modes that become impractical to detect by testing, making it difficult to ensure our intent is met. This is why software engineering teams have come to rely on a complicated hierarchy of trust and best practices. But these only go so far.

The things we build are made out of clay. Clay houses need constant maintenance, and can only rise so high before collapsing under their own weight.1

But testing is not the only way.

Programming languages like Idris and Lean let you express your intent in code, then verify that the intent is satisfied simply by compiling. Let me show you what I mean.

Say you want to sort a list. In most languages, you’d write a function with a type like:

sort : List Int -> List Int

The requirement that the output is sorted is left out. The compiler will reject your code if you return a string instead of a list, but if you return an unsorted list, it stays silent. You have encoded your intent only partially. You have to test because you know something you never told the compiler.
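To make the gap concrete, here is a deliberately buggy implementation (my sketch, not from the post) that the type above happily accepts:

```idris
-- A "sort" that satisfies the type List Int -> List Int
-- while doing no sorting at all: the compiler stays silent.
sort : List Int -> List Int
sort xs = xs
```

The type promises only that a list of integers comes back, so even the identity function is a valid implementation.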

This is how most programmers have been taught to program. But type theory has advanced enough to change this.

In Idris, you can write

sort : (xs : List Int) -> SortedList xs

Here SortedList xs contains not just a list, but also a proof that the list is sorted.2 To implement this function you must have it return both the list and the proof. The bug of returning an unsorted list is rendered impossible: the compiler simply will not let you run that program. If you’ve written a function that satisfies your spec, you don’t need to test it, or have any uncertainty about what it does. If the spec matches your intent, there can be no bugs.3
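One minimal way such a type could be written down is sketched below. This is my simplified illustration: the names IsSorted, SortedList, and their fields are hypothetical, and the real definition (see footnote 2 and the linked gist) also carries a proof that the output is a permutation of xs.

```idris
import Data.So

-- Evidence that a list of Ints is in ascending order.
data IsSorted : List Int -> Type where
  SortedNil  : IsSorted []
  SortedOne  : IsSorted [x]
  SortedCons : So (x <= y) -> IsSorted (y :: rest)
            -> IsSorted (x :: y :: rest)

-- A simplified SortedList: an output list paired with a sortedness proof.
record SortedList (xs : List Int) where
  constructor MkSortedList
  result : List Int
  sorted : IsSorted result
```

An implementation of sort must construct the proof alongside the list; returning an unsorted list leaves the sorted field impossible to fill in.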

Languages like Idris and Lean make us rethink programming. Instead of writing code and then manually verifying it, you can encode your intent first, then implement it. The burden shifts from “does my code have bugs?” to “does my specification capture what I actually want?”. If it does, verification becomes mechanical: if it compiles, it’s correct. No testing needed.

We’re no longer building with clay. We’re building with bricks and structural engineering. The reliability of our materials and the precision of our theory let us build things people a few centuries ago could not have dreamt of: space stations and rockets. Dependent types, through the same reliability and precision, do the same for software.


So, what’s the catch?

There are two. First, specifying intent is hard. We often don’t know what we want. Writing code, like writing prose, reveals what we actually meant to say.

Second, specifications have a surprising amount of detail. If you’re building a database under tight deadlines, you seldom have the luxury to internalise the mathematics needed to formally state the invariants of databases over concrete data, write them up in Lean, and translate shifting business requirements into such a system. This takes incredible effort and is extremely error-prone. No human can do it fast enough, and even with a PhD in this topic, the basics take months.

It is much easier to just use MySQL and make your boss happy.

Mathematics required to state some database invariants. From Algebraic Databases.

This is why proofs have taken a backseat in mainstream programming. Languages optimised for human convenience, such as Python or JavaScript, have taken centre stage, while languages optimised for correctness remain niche.

But here’s the thing: more and more code is written by AI. Are we sure AI will keep writing code in languages optimised for human convenience?


Just like new paradigms overturned the once-dominant COBOL, the mechanics of scale and proofs will eventually overturn languages made for human convenience. The difference is that most of that code won’t be written by humans.

We’re already seeing it. Transformer-based models are helping mathematicians solve open research problems. It’s only a matter of time before people realise that the same approaches used for theorem proving are applicable to general programming. That’s when we’ll see the next AI wave.

But there are obstacles in the way. LLM-based models excel in data-rich domains, and correctness-first programming is not one of them.

By design, these languages are concise: a few hundred lines of specification can define possibly hundreds of thousands of lines of implementation. This makes them inherently data scarce, and starves LLMs of the training data they require. Is the Bitter Lesson actually bittersweet?

LLM-based models excel only in data-rich domains. From the paper The Era of Experience.

It’s clear that puzzle pieces are still missing. And data scarcity is only part of the problem. We also have:

  • Blindness to structure: LLMs treat code as an extension of natural language, lacking mechanisms to leverage its structure. Any structure our code has – usually an inductive, dependently-typed abstract syntax tree – has to be flattened into a linear representation before the LLM can consume it. Outputs, too, are produced in a strictly linear, autoregressive order (left to right), with no ability to backtrack or revise earlier output: models are append-only.
  • Compiler integration as an afterthought: When LLMs generate code in dependently-typed languages, they do it vacuum-sealed from the programming environment. They suggest nonexistent imports and call functions that do not exist. They can generate a hundred lines, only to be told by the compiler that there is an error on line one. That’s not how Idris and Lean programmers write code. They write code step by step, in a tight interaction loop with the compiler. The code they write is elaborated, its full terms are exposed back to the programmer, who then continues writing code conditioned on, and constrained by, type and structural information they would otherwise have to infer manually. Types aren’t used just to filter out invalid outputs; they’re used to help programmers reason. But the output of existing LLMs is neither conditioned on the current proof state nor constrained by it.
  • Lack of infrastructure: Novel architectures, for instance those that produce correct-by-construction dependent types, are difficult to prototype when the tooling doesn’t support structure. Type-driven development, and the level of generic programming possible in dependently-typed languages, simply isn’t possible in Python. Building genuinely new architectures requires new tools, and those tools, such as automatic differentiation and tensor processing frameworks, are missing from dependently-typed languages. Most attempts to build them only replicate what already exists, without reimagining what could be.
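
The tight interaction loop described in the second point can be sketched with typed holes. This is a hypothetical Idris session (append and ?rest are my illustrative names, not from the post):

```idris
import Data.Vect

-- Type-driven development: write the easy case, leave a hole for the rest.
append : Vect n a -> Vect m a -> Vect (n + m) a
append []        ys = ys
append (x :: xs) ys = ?rest
```

Asking the compiler about ?rest reports its expected type, roughly Vect (S (plus len m)) a, together with everything in scope; the programmer then continues, conditioned on and constrained by that information, rather than guessing and being corrected afterwards.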

These problems aren’t blocking. But solving them might unlock orders of magnitude more progress than pure scaling.


My research is centered on making that happen. I work on three fronts:

  1. New architectures: Understanding existing neural networks, and designing new ones that natively consume and produce structured data
  2. Mathematical Foundations: Building the mathematics necessary to state precisely what it means to generalise, especially on data structures recursive in nature
  3. Infrastructure: Building the stack required to train such networks in dependently-typed languages: tensor processing, automatic differentiation and elaborator integration

In all of these, I use category theory, the mathematics of structure and composition, as a central glue.


Why category theory?

Because the organisational principles needed to build better AI systems turn out to be the same principles needed to build better programming languages, and to do science in general.

Any system that generalises well has discovered compositional structure in its domain. To generalise is to discover rules that apply parametrically, regardless of the size or shape of the input. And the more consistent and modular you make your rules, the more it looks like you’re following categorical principles.

I suspect any sufficiently capable reasoning system has, whether by design or discovery, converged on these principles.

This raises a question: what might a formal theory of reasoning look like?


To answer this, it helps to look at how other fields found their foundations. Consider alchemy.

For centuries, alchemists mixed compounds through trial and error, chasing the elixir of life and the philosopher’s stone. But alchemy wasn’t bad: it was the best they knew at the time. Alchemists made genuine discoveries. They discovered phosphorus and gunpowder4, and developed lab equipment. But they couldn’t synthesise penicillin. They couldn’t build chemical manufacturing plants. They couldn’t cure diseases at scale.

Something else was required. Over time, we discovered the atom. We built the periodic table. We understood the idea of molecules, bonds, reactions, and thermodynamics.

The result was chemistry, which is a field so different we gave it a new name. Chemistry is reliable, controllable, and scalable in ways alchemy never could be. We predicted new elements before we ran the experiments. We now have petrochemical refineries, automated synthesis with minimal human oversight, and we can capture carbon from the atmosphere. These are incredible advancements, and they weren’t created by scaling up alchemy. They required discovering the foundational principles first, and the development of a new science, one alchemists could not imagine.

Chemistry today is unimaginable to alchemists.

Deep learning today feels like alchemy. We’re making remarkable strides, and LLMs are impressive feats of engineering. But we have very little understanding of their internals. Nobody predicted phenomena like double descent or the lottery ticket hypothesis – phenomena that seemingly explain parts of their behaviour. We’re collecting an abundance of experimental data, but we have not been able to glue it all into a cohesive theory.


In this pursuit, there is something self-referential.

To find a concise way to express how intelligent systems find good models of the world, it may be necessary to first reason better ourselves: to structure our own thoughts more clearly. If we’re poor learners ourselves, we won’t be able to describe learning well.

Category theory is a tool for exactly this. It’s a way to structure thought that is general enough to let us reason about anything, yet precise enough that we don’t get lost in ambiguity.

The notion of intelligence has often carried a mythical quality, resisting precise formalisation alongside related concepts like consciousness, autopoiesis, and the notion of life itself. This was largely because the tools for formal specification were inadequate. But with the advent of 21st century mathematics, I believe we have the right abstractions to ask the right questions.

Could category theory be the formal theory of reasoning we’re after?


  1. I am a fan of the following quote from Bartosz Milewski’s book Category Theory for Programmers: “There is an unfinished gothic cathedral in Beauvais, France, that stands witness to this deeply human struggle with limitations. It was intended to beat all previous records of height and lightness, but it suffered a series of collapses. Ad hoc measures like iron rods and wooden supports keep it from disintegrating, but obviously a lot of things went wrong. From a modern perspective, it’s a miracle that so many gothic structures had been successfully completed without the help of modern material science, computer modelling, finite element analysis, and general math and physics. I hope future generations will be as admiring of the programming skills we’ve been displaying in building complex operating systems, web servers, and the internet infrastructure. And, frankly, they should, because we’ve done all this based on very flimsy theoretical foundations. We have to fix those foundations if we want to move forward.”↩︎

  2. To sort a list xs we need to provide a list ys, together with proofs that the elements of ys are in ascending order, and that ys is a permutation of xs. For details see this gist.↩︎

  3. You might ask: “What if the function sorts a list, but is not efficiently implemented?” In that case, your intent isn’t to sort a list, it’s to sort a list efficiently, and your specification needs to be refined. That is, the implementation matches the specification, but the specification does not match your intent.↩︎

  4. Did you know that gunpowder was invented as an accidental byproduct from experiments seeking to create the elixir of life?↩︎
