Notes from trying to build Thea on European AI

Build European. Sovereign AI. Get off the US stack. We agree with the sentiment, and we tried to build on it.

Thea, our litigation agent, ran on Mistral from the first commit. We expected trade-offs around price or quality. What we actually ran into was something else, and it’s the part of the sovereignty conversation we don’t see discussed much: whether the infrastructure underneath can carry a real product under real load.

Here’s what we found.

What we chose, and what we expected

Thea turns a case file into a sourced timeline. We wanted that work running on European infrastructure, and Mistral was the natural choice.

Two things worth putting on record before we get to the problem:

Cost was not the trade-off. Mistral is cheaper than OpenAI for our workload.

Quality was not the trade-off. Our own evaluations on Thea’s extraction tasks showed Mistral matching OpenAI. Pulling dated events from legal documents, attributing them to the right people, linking them to source.

So the variable we were left holding was something else.

What “trying” actually looked like

Every workaround on the list below took days to land. Most started with a production incident.

We replaced the official SDK with raw fetch() because it hung indefinitely in production. We extended timeouts to forty-five minutes because legal extraction regularly ran seven to eighteen minutes per window. We retreated from concurrent processing to sequential after hitting rate limits, then reintroduced concurrency with adaptive backoff. We added JSON sanitization because the model returned malformed responses often enough to break the parser. We built recall-based retries after the orchestrator missed dates the extraction windows had already found.

Upstream 500s came back with no diagnostic information, just silence. We added error logging. We doubled the retry-after value on rate-limit errors because the unmodified one kept firing too early. Eventually we removed a whole retry layer because it doubled cost and runtime while only catching 2.5% of edge cases.

Weeks of work that should have gone into the product went into keeping the pipeline standing up. None of it made Thea better. All of it was the cost of staying on the stack we wanted to be on.

Three hours to one minute

After all of it, we ran the same corpus through both stacks. Identical prompt. Identical documents. 528K tokens in, Dutch legal timeline out.

Mistral: two hours and eighteen minutes. OpenAI: one hundred and four seconds. Roughly eighty times faster.

The more interesting number is the split. We isolated the two variables — swap the model while holding the config constant, then tune the config while holding the model constant. The model swap alone accounted for a 39× speedup. The config tuning on top was worth another 2×. Almost all of the gap was infrastructure, not prompt engineering or batch sizing.

Put plainly: nothing about how we wrote our code or structured the requests was the problem. The bottleneck was the infrastructure serving the model, and switching providers was a one-line change.

One environment variable.

This is not a model problem

Mistral the model is fine. What’s missing is everything underneath: chips, bandwidth, capacity, headroom under real load.

There’s a quiet irony in that — Europeans are part of the problem we’re complaining about. The surge of users seeking AI sovereignty after the political shifts of the last year is itself part of what’s pushing European providers past their capacity. Demand has run ahead of the physical infrastructure. More of us want the sovereign option. The sovereign option doesn’t yet have the warehouses to serve us.

Reliability, not speed

Lawyers will wait. Submit at five, find a timeline in your inbox the next morning. That’s not a worse product, it’s often a better one. We built for exactly that: background processing, email when done, close the tab and come back.

The wall we hit wasn’t latency. It was reliability. Even at hours-long timeouts, even running overnight, jobs would still fail. Silent timeouts. Malformed responses. Rate limits that appeared only under load. Retries that doubled cost without improving the success rate.

The worst failure mode wasn’t the one that errored. It was the one that didn’t. On the same benchmark run, thirty percent of the case file silently went missing from the final timeline. No error surfaced. The output looked complete. You wouldn’t catch it unless you compared the timeline against the source documents yourself. Underneath, six of twenty extraction windows had returned zero events — the model hit its output limit and the fallback path retried against itself, producing nothing each time. On the OpenAI run, same config: zero failures.

A user who submits a case and finds their work waiting will submit the next one. A user who submits a case and gets an error after the meeting they needed it for will not.

That’s the trust line. No interface design saves you from it.

Where this leaves us

Mino’s job is to take mechanical work off a litigator’s desk. We can wrap a long-running job in a thoughtful background flow. We can’t wrap an unreliable result in anything useful.

So we made the call. Mistral where it delivers: Feitlijn for case law search, Nina for lighter reasoning, workloads where failure modes are bounded. OpenAI where throughput and reliability matter: Thea, Garry, the heavy work.

A note on what “moving to OpenAI” means here. We run it through Azure’s EU deployments, not the US endpoints — the data stays in European datacenters. Sovereignty is layered: model origin, inference infrastructure, data residency. This move changes the first; the data layer doesn’t move.

We’re not writing this to complain. We’ll be among the first customers back when the infrastructure catches up to the model. We want that day to arrive. It’s why we tried in the first place.

The honest case

The model is good. The price is competitive. What stands between European builders and a sovereign stack they can actually ship on isn’t a model or a vendor. It’s permits, grid capacity, datacenter approvals, chip supply, and political appetite to build all of that at speed.

That work happens at municipal councils approving substations. It doesn’t trend. But it’s the actual bottleneck, and the one worth naming if we want a sovereign stack we can ship on.

And if you’re a legal team trying to make calls like this without spending weeks on trial-and-error: that’s what we built Mino for.