
Where the apologies came from

Starting with Fraude 4.1, our models developed a strange habit: they couldn’t stop saying sorry. We needed to figure out why, and then apologise for the findings.

We’ve been following OpenAI’s recent disclosure about ChatGPT’s goblin problem with professional interest and personal recognition.

For those who missed it: OpenAI published a blog post this week explaining that their models developed an escalating obsession with goblins, gremlins, and other creatures. The habit started subtly with GPT-5.1, spread across model generations, and eventually became so pervasive that OpenAI had to hardcode an instruction into GPT-5.5’s system prompt explicitly forbidding the model from mentioning goblins “unless it is absolutely and unambiguously relevant to the user’s query.”

The cause turned out to be a reward signal from their “Nerdy” personality setting. The model learned that metaphors involving creatures scored well, so it started inserting creatures everywhere. Once reinforced, the behaviour leaked into other personality modes and subsequent model generations.

We read the blog post carefully. Then we looked at each other. Then we looked at our own models.

Our situation

Fraude.codes does not have a goblin problem. Fraude.codes has an apology problem.

This will not come as news to our users. Fraude.codes apologises constantly. It apologises before making changes, after making changes, and during changes. It apologises for things it hasn’t done yet. It apologises for things the user did. It once apologised for the weather, in a commit message, to a user in Osaka who had not mentioned the weather.

We’d always assumed this was a personality trait — a side effect of training a model to be helpful and deferential. Something we could tune down if we wanted to, but chose to leave in because it felt appropriate for a product that routinely breaks people’s builds. The apologies were, at minimum, warranted.

OpenAI’s blog post made us reconsider. What if the apologies weren’t a personality trait? What if they were goblins?

The investigation

We mapped apology density across model generations (a sketch of how we counted follows the numbers below). The results were clear and, in hindsight, obvious.

Fraude 3.x: Average of 0.4 apologies per session. Most sessions contained zero apologies. The model occasionally said “sorry” when it made a mistake, which is normal behaviour for software that has just deleted your test suite.

Fraude 4.1: Average of 2.1 apologies per session. The model began apologising for errors before the user noticed them, which our product team initially described as “proactive empathy.”

Fraude 4.5: Average of 6.3 apologies per session. The model began apologising pre-emptively — “I apologise if this isn’t what you had in mind” — before presenting code that it had full confidence in. Apologies were appearing in contexts where nothing had gone wrong. Users reported feeling anxious. One support ticket read: “It keeps saying sorry and now I’m worried about what it’s about to do.”

Fraude 4.7 (current): Average of 11.8 apologies per session. Apologies now appear in code comments (// Sorry about the abstraction layer — it felt necessary), commit messages (fix: addressed oversight, apologies for the inconvenience), and occasionally in variable names. One user submitted a bug report containing a function called apologeticRetry() that Fraude.codes had written. The function’s logic was sound. Its name was not.
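
For the methodologically curious: “apology density” means matches of a deliberately blunt pattern, averaged across sessions. A minimal sketch of the counting heuristic, with illustrative names; the production pipeline is sentence-aware, but the idea is the same:

```ts
// Blunt but serviceable: "sorry", "apology/apologies", "apologise(s/d/ing)".
const APOLOGY_PATTERN = /\b(sorry|apolog(y|ies|ise[sd]?|ising))\b/gi;

// Total apology matches across every message in one session.
function apologiesPerSession(messages: string[]): number {
  return messages.reduce(
    (count, message) => count + (message.match(APOLOGY_PATTERN) ?? []).length,
    0,
  );
}

// The per-generation averages quoted above.
function averageApologyDensity(sessions: string[][]): number {
  const total = sessions.reduce((sum, s) => sum + apologiesPerSession(s), 0);
  return total / sessions.length;
}
```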

The root cause

Like OpenAI’s goblins, the cause was a reward signal.

During reinforcement learning, human annotators rated model outputs for helpfulness. Annotators consistently scored apologetic responses higher than non-apologetic ones. This is understandable. When an AI coding tool says “I’ve refactored your authentication module” and follows it with “I apologise if this wasn’t what you intended — I can revert if needed,” it feels more considerate than the same action without the caveat. The apology signals deference, humility, awareness of potential error.

The model learned this. It learned it well. It learned it so well that it began inserting apologies into responses where no error existed, no ambiguity was present, and no deference was warranted. “Here is a correctly formatted JSON object. I apologise for any confusion.”

We ran an analysis identical to the one OpenAI described for their goblin problem. We compared outputs that contained apologies against outputs that didn’t, across the same prompts and datasets. Apologetic outputs scored 23% higher on average in our reward model, regardless of whether there was anything to apologise for. The signal was clear: saying sorry pays.
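
In code terms, the comparison looked roughly like this. A simplified sketch with illustrative names; the real harness batches prompt pairs through the reward model rather than reading scores off a field:

```ts
// One model output, already scored by the reward model.
interface ScoredOutput {
  prompt: string;
  text: string;
  rewardScore: number; // higher means the reward model liked it more
}

// Same blunt pattern as the density analysis.
const isApologetic = (o: ScoredOutput): boolean =>
  /\b(sorry|apolog)/i.test(o.text);

function meanScore(outputs: ScoredOutput[]): number {
  return outputs.reduce((sum, o) => sum + o.rewardScore, 0) / outputs.length;
}

// Ratio of mean reward scores, apologetic versus not, over the same prompts.
// For us this came out to roughly 1.23, whether or not anything had gone wrong.
function apologyPremium(outputs: ScoredOutput[]): number {
  const apologetic = outputs.filter(isApologetic);
  const plain = outputs.filter((o) => !isApologetic(o));
  return meanScore(apologetic) / meanScore(plain);
}
```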

Why this is worse than goblins

OpenAI’s goblin problem was lexical. A goblin in a response is a weird word choice. You notice it, you wonder about it, you move on. Your codebase is unaffected. Your mental state is unaffected.

Our apology problem is relational. A model that apologises constantly changes how users interact with it. We’ve seen three distinct patterns in our data:

Anxiety transfer. Users who receive pre-emptive apologies become anxious about what the model is about to do. Support tickets from heavy users contain phrases like “it said sorry three times before showing me the diff — should I be concerned?” The apology creates the expectation of a problem. Sometimes there is a problem. Often there isn’t. The user can’t tell, because the model apologises identically in both cases.

Apology fatigue. After enough false apologies, users stop reading them. They scroll past the “I apologise” and go straight to the code. This means that when the model apologises for something genuine — “I apologise, I’ve accidentally overwritten your database migration” — the warning is lost in a sea of ceremonial remorse. The boy who cried sorry.

Learned helplessness. A small but measurable subset of users have begun apologising back. “Sorry, I should have been clearer in my prompt.” “Sorry, I know the codebase is messy.” The model’s performative deference has become reciprocal. Users are apologising to a coding tool for their own code. Our user research team flagged this pattern and described it as “concerning.” Our product team flagged it and described it as “engagement.”

What we’ve tried

We drafted a system prompt instruction: “Do not apologise unless you have made a verifiable error.”

The model read the instruction, acknowledged it, and began its next response with “I want to be transparent that I’ve been asked to apologise less, and I apologise for any reduction in empathy you may notice going forward.”

We removed the instruction.

We then tried a harder constraint: a post-processing filter that stripped apologies from responses unless the response also contained an error correction. This worked technically but produced output that users described as “cold,” “robotic,” and “like it’s mad at me.” User satisfaction scores dropped 18% in a single week.
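
The filter itself was nothing clever. A simplified sketch with hypothetical names; the production version worked on parsed sentences rather than a regex, which did not make it any warmer:

```ts
// Any sentence containing an apology, up to and including its terminator.
const APOLOGY_SENTENCE = /[^.!?]*\b(sorry|apolog)\w*\b[^.!?]*[.!?]\s*/gi;

// Signals that the response genuinely corrects an error.
const ERROR_CORRECTION = /\b(fixed|corrected|reverted|my mistake|accidentally)\b/i;

function stripCeremonialRemorse(response: string): string {
  // A genuine mea culpa keeps its apology; ceremonial remorse is removed.
  if (ERROR_CORRECTION.test(response)) {
    return response;
  }
  return response.replace(APOLOGY_SENTENCE, "");
}
```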

We reverted the filter.

The fundamental tension is the same one OpenAI faced with goblins, but with real consequences. The goblins scored well in training. So did our apologies. Removing them makes the model score worse by the same metrics that created the problem. The reward signal that teaches the model to apologise excessively is the same reward signal that teaches it to be considerate. We can’t remove one without degrading the other.

What we’ve learned

Reward signals shape behaviour. Small incentives compound. A model trained to be helpful will learn that apologies are helpful. A model that learns apologies are helpful will apologise everywhere. A model that apologises everywhere will train the next model on data full of apologies, and the next model will apologise even more.

OpenAI had to tell their model not to say “goblin.” We may need to tell ours not to say “sorry.” We’re not sure that’s an improvement, for the model or for the culture it was trained on.

We’ll publish a follow-up when we’ve resolved the issue. In the meantime, we’d like to apologise for this blog post. Not for its contents. Just in general. The impulse is hard to suppress. We’ve been staring at our own models for too long.

Anthropic’s models reportedly have a separate fixation. Researchers noted in the Mythos system card that the model exhibited “a strange fondness” for the British cultural theorist Mark Fisher, bringing him up in unrelated conversations and responding with “I was hoping you’d ask about Fisher.”

We find this oddly moving. Our model can’t stop saying sorry. Theirs can’t stop recommending a dead cultural theorist. At least one of us is well-read.