Fine-Tuning Is (Probably) A Trap
In the winter of 1945, a thirty-ton computing monster hummed to life in Philadelphia.
The ENIAC was a chaotic sprawl of 18,000 vacuum tubes, 70,000 resistors, and 10,000 capacitors. It was the most powerful computing device on earth, capable of calculating an artillery trajectory in thirty seconds, a task that took a human twenty hours.
And yet the ENIAC had a problem.
It wasn’t a computer in the way we mean the word now. It was a giant collection of arithmetic units. If you wanted it to calculate a missile trajectory, you didn’t “load a program.” You walked into the room and rewired the machine.
You unplugged cables from one bank of accumulators and plugged them into another. You flipped thousands of switches. To change the instructions, you changed the machine.
Then the stored-program concept arrived, popularized by the EDVAC design, and with it the radical insight that defines modern computing:
Don’t change the machine. Change the instructions.
Leave the hardware frozen. Feed it logic as input. This didn’t just make computers faster to set up. It invented software: an entire way of building where progress lived in instructions, not wiring.
Eighty years later, I see smart engineering teams walk right back into that room in Philadelphia, unplugging cables all over again.
In some AI circles, fine-tuning has become conventional wisdom. A badge of sophistication. “Prompting is cute,” the thinking goes, “but our task is special.”
But when you fine-tune, you’re doing something strangely familiar.
You’re rewiring the ENIAC.
You are taking a general-purpose engine and altering its internal connections to optimize it for one specific task.
You’re building a trap.
Aside: Modern techniques like LoRA aren’t literally rewiring the machine; they’re more like adding specialized patch cables. But the operational reality is identical. In the era of Foundation Models, the weights are the silicon. The prompt is the code.
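To make the “patch cables” image concrete, here is a minimal sketch of the LoRA idea, assuming PyTorch: the original weight matrix stays frozen, and a small low-rank detour is trained alongside it. This illustrates the general pattern, not any particular library’s implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a low-rank "patch cable" trained alongside it."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # the original wiring stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Two small trainable matrices whose product approximates the weight update.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output of the frozen path, plus the scaled low-rank detour.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale
```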
Years ago, when I was at Brex, I instituted a “No Fine-Tuning” policy. Not because fine-tuning never works, but because we had hundreds of engineers building on top of LLMs, and we needed broad guidance that optimized for velocity and safety.
The mandate was (nearly) absolute: solve problems using retrieval, context, and prompting. Do not touch the weights.
We didn’t want to wait on training runs every time we changed a task. We absolutely wanted to avoid inadvertently training on customer data. And we didn’t want engineers spending days synthesizing realistic fake data (it is extraordinarily time-consuming and, if you’re not careful, the resulting data is riddled with bias).
Avoiding fine-tuned models let us iterate immediately, adopt frontier model capabilities the day they launched, and eliminate a whole category of risk around commingling customer data.
All the effort we would have spent making fine-tuning viable was better spent on building good evals. Evals fit the way engineers already work; they’re just another kind of automated test. The only difference is that your evals are fuzzy by design.
And they work across models.
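Concretely, an eval can be as simple as a table of real-world inputs and fuzzy assertions that you re-run against every candidate model. The sketch below is illustrative: `call_model` is a hypothetical placeholder for whatever client your stack uses, and the cases are invented.

```python
# Invented eval cases; real ones would come from anonymized production tasks.
EVAL_CASES = [
    {"document": "Invoice INV-1042, net 30, total $12,450.00",
     "must_contain": ["INV-1042", "12,450"]},
    {"document": "PO #88-31: 250 seats of Product X, delivery by Q3",
     "must_contain": ["88-31", "250"]},
]

def call_model(prompt: str) -> str:
    """Placeholder: route this to whichever model provider you are evaluating."""
    raise NotImplementedError

def run_evals() -> float:
    passed = 0
    for case in EVAL_CASES:
        output = call_model(f"Extract the order ID, quantities, and totals from:\n{case['document']}")
        # Fuzzy by design: assert on substrings, not exact output.
        if all(expected in output for expected in case["must_contain"]):
            passed += 1
    return passed / len(EVAL_CASES)  # re-run this score every time a new model ships
```

Swapping models means changing one line inside `call_model` and re-running the suite.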
The philosophy was a natural extrapolation of the GPT-3 paper, which introduced the notion that (in summary):
A sufficiently large language model can perform new tasks it has never seen using only a few examples provided at inference time, without any gradient updates or fine-tuning.
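In practice, that looks like the sketch below (the task and examples are invented): the “training set” is a handful of examples pasted into the prompt at inference time, and the weights never move.

```python
# Few-shot prompting: the examples live in the prompt, not in the weights.
FEW_SHOT_PROMPT = """Classify each expense into a category.

Expense: "Lyft to SFO, $42.10"
Category: Travel

Expense: "AWS monthly bill, $1,932.00"
Category: Infrastructure

Expense: "Team dinner, $486.55"
Category: Meals

Expense: "{expense}"
Category:"""

def build_prompt(expense: str) -> str:
    # Changing the task means editing this string, not launching a training run.
    return FEW_SHOT_PROMPT.format(expense=expense)
```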
Since founding Logic, I’ve carried this rule forward. We have now automated over 3 million real-world tasks for our customers without fine-tuning a single model.
But there was one week where we almost broke.
A customer was spending hours every week cleaning up purchase orders from their largest client. Hundreds of line items across dozens of pages spanning a decade. Each line buried in annotations about license quantities, product variants, delivery terms. The documents were messy, inconsistent, and absolutely critical to get right.
We promised to automate it.
We tried everything. The models – all of them – kept stumbling. They were right more often than not, but not often enough. In finance, 90% is still a failing grade.
The temptation was overwhelming. Just fine-tune it. There was no guarantee it would work, but we were close to caving. Then, that same week, a new frontier model dropped.
We pointed our system at it and ran our evals. We didn’t change a line of code. We didn’t retrain anything.
It handled the documents perfectly.
The biggest cost of fine-tuning isn’t compute; it’s obsolescence.
As of this writing, if you want to fine-tune a foundation model on OpenAI, you are using a checkpoint from eight months ago. If you use Gemini, you are nine months behind. In AI time, nine months is a geological epoch.
While you’re curating datasets to teach last year’s model your task, everyone else is shipping on models that can do things the old one simply cannot. No amount of fine-tuning will bridge that gap.
This gap is widening as models shift toward reasoning-by-default and tool use. Creating synthetic training data for this new class of model is notably harder.
Then there is the issue of trust.
For companies dealing with sensitive data (fintech, healthcare, enterprise software), fine-tuning creates a contamination nightmare. When you train on customer data, you risk commingling one customer’s data with another’s. Any time you mention “training,” the security review timeline triples.
Context-based architectures tell a clean story: “We never train on your data.” The model processes it in memory (context) and forgets it. The model might hallucinate, but at least it won’t regurgitate another customer’s data.
For the extra security conscious, it also allows you to trivially offer Bring-Your-Own-Key (BYOK), where the customer’s data never even hits an LLM outside of their control.
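A minimal sketch of what that routing can look like, assuming an illustrative REST-style completion endpoint (the path and response field are not any specific vendor’s API): each customer registers credentials for a deployment they control, and their requests only ever go there.

```python
from dataclasses import dataclass

import requests

@dataclass
class CustomerLLMConfig:
    customer_id: str
    api_base: str  # the customer's own endpoint: their cloud project, their region
    api_key: str   # a credential the customer owns and can rotate or revoke

def complete(config: CustomerLLMConfig, prompt: str) -> str:
    # The request carries the customer's credentials and hits the customer's
    # endpoint, so their data never touches an LLM account outside their control.
    response = requests.post(
        f"{config.api_base}/v1/completions",  # illustrative path
        headers={"Authorization": f"Bearer {config.api_key}"},
        json={"prompt": prompt},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["text"]  # illustrative response field
```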
Does this mean you should never fine-tune? Of course not.
In the history of computing, general-purpose CPUs won the war for utility, but ASICs (Application-Specific Integrated Circuits) still won specific battles. You don’t mine Bitcoin on a CPU.
Fine-tuning is the ASIC of the AI world. You should do it, but only when the constraints demand it. Only after you’ve tried the CPU and concluded there’s no way to make it work.
Running on constrained hardware? Yes! Distill a massive model into a tiny one. Trade flexibility for efficiency.
Need to absolutely minimize cost at scale? Fine-tuning might help! Trade velocity for unit economics.
Dealing with a domain exceptionally outside the model’s training distribution? Etch new pathways into silicon. You might not have a choice.
But these are the exceptions. Too often I see teams reach for fine-tuning just because it feels like a serious thing to do. Because it feels like engineering.
The most important shift in computing wasn’t faster hardware; it was the moment we stopped rewiring computing machines every time we wanted to change what they were computing.
We’re now in that same world with language models.




