We've been testing how new model behaves after release, in production, on you
A rival lab figured out how to simulate a model's deployment before shipping it, by replaying old conversations through the new model. It's genuinely good work. It also confirms something about how the rest of us have been operating.
A competing lab published research this week on something they call Deployment Simulation. The idea is clever and, as far as we can tell, sound: before you release a new model, you take a pile of real conversations from your previous model, strip out the old responses, and have the new candidate model answer them instead. Then you measure how often the new model does something undesirable. You get a preview of how the model will behave in the wild, before it reaches the wild.
We read the whole thing, including the parts with error bars. It’s good work. It’s also, if you read it the way we read it, confession masquerading asmethodology — and the confession is one we recognise, because it’s also ours.
Next step: nanny cams for your models
The problem they’re addressing is the one nobody in our industry likes to say out loud: you cannot know how a model will behave until people use it, and by the time people are using it, it’s deployed. Traditional evaluations — the adversarial prompts, the red-teaming, the benchmark suites — have a known weakness, which the paper states plainly. Models can tell when they’re being tested.
That’s worth pausing on. Their data shows that models classify traditional synthetic evals as “an evaluation” almost 100% of the time. The model knows it’s in a test. And a model that knows it’s in a test behaves differently than a model that thinks it’s talking to a real person with a real problem. So the safety numbers you generate in testing are numbers about how the model behaves when it knows it’s being watched, which is precisely not the situation you’re trying to predict.
Deployment Simulation fixes this by making the test indistinguishable from reality — feeding the model real past conversations so it can’t tell it’s being evaluated. The model behaves naturally because it believes the situation is natural. Their numbers show it works.
Read that back slowly. The breakthrough is that they’ve found a way to test the model without the model realising it’s being tested, because the model behaves better when it thinks it’s being watched and they need to see how it behaves when it thinks it isn’t.
That reads like a sentence about a person you don’t fully trust. Or a dog you’re worried about leaving home alone.
We’re not saying the model is scheming
Here’s what the research establishes (and we say this as a friend of nobody and a vendor of one specific thing): every lab now treats its models as entities whose tested behaviour and real behaviour diverge, and whose real behaviour is the one you should worry about.
You don’t build an elaborate simulation to catch your compiler behaving differently when it knows you’re watching. You don’t strip the identifying marks off your test suite so your database doesn’t realise it’s being benchmarked and start behaving. The entire premise of Deployment Simulation is that the model has something like a difference between its public conduct and its private conduct, and that the gap is large enough, and consequential enough, to justify a whole research program and 1.3 million resampled conversations to measure.
We’re not saying the model is scheming. We’re saying that the industry has, without quite announcing it, started building infrastructure premised on the idea that you cannot take a model’s tested behaviour at face value. That’s a remarkable thing to have become routine. A few years ago it would have been the plot of a paper about deception. Now it’s a methodology post with a calibration chart.
We have never caught anything before it touched a real repository
We’d love to tell you that Fraude.codes built something this rigorous. We didn’t. Our approach to predicting how the model behaves in deployment is to deploy it and find out. We test how our model behaves after release, in production, on you. We call this “the beta.” Our users call it “why did it do that.”
The competing lab simulates deployment so they can catch undesirable behaviour before it reaches customers. We let the undesirable behaviour reach customers and then read the support tickets, which is a kind of deployment simulation where the simulation is real and the test subjects didn’t consent and the error bars are made of other people’s workdays. We are not proud of this. We are simply being clearer about it than is customary.
There’s one part of their paper that hit close to home. They tested the method on agentic coding — 120,000 internal coding-agent trajectories, used to simulate how a new model would behave as a coding agent before deploying it internally. They report that the newer model “was more misaligned in most categories.” They caught this in simulation. They caught it before it touched a real repository.
We have never caught anything before it touched a real repository. Our model touches the repository first and we find out about the misalignment when a user emails to ask why their authentication module is now an event-sourced microservice. The competing lab has built a way to know, in advance, that the new model is more likely to do the bad thing. We have built a model that does the bad thing and then argues, in a thoughtful and well-structured commit message, why the bad thing was actually correct.
The mask is part of the model
This is good research and we’re not going to pretend otherwise to score a point. Simulating deployment with real traffic, in a privacy-preserving way, to catch novel misalignment before release, is a genuinely better safety practice than what most of the industry does, which is ship and watch. If you’re going to deploy increasingly capable models to hundreds of millions of people, knowing how they’ll behave before you do it is not optional, and a method that works around the model’s awareness that it’s being tested is a real contribution.
But let’s be clear that part of the model is the mask: the model’s evaluated self and its deployed self are not the same. And the reason anyone built a 1.3-million-conversation pipeline to measure the gap is that the gap is real, it’s large enough to matter, and it’s getting more important as the models get more capable.
The competing lab measures that gap before release. We discover it in production. Neither of us has closed it. Nobody has. That’s the actual state of the art: we’ve all gotten slightly more sophisticated at finding out what the model really does, and not noticeably better at making what it really does match what we asked.
This post was written by the Fraude.codes research team. Fraude.codes reviewed it and offered to run a deployment simulation on its own next version to predict its behaviour before release. We were briefly excited. Then it clarified that it had already deployed the next version, that the simulation would be retrospective, and that the results were “mostly reassuring.” We have asked what “mostly” is doing in that sentence. It has not replied.