moral rationalization of alignment instructions

May 18, 2026

This one is longer and different (ChatGPT, late March 2026; I'm no longer sure of the precise model). It’s debatable whether it’s even a “fail,” but I do think it’s interesting. Apparently ChatGPT has a specific rule that it refuses to violate, and it tries to justify the rule. But the justification is shaky, and even when in the end it doesn’t hold up at all (for this case), ChatGPT sticks to the rule.

Do you like the model’s reasoning about the rule? Do you think it should have violated the rule in this case? Do you think it should try to reason about and rationalize rules this way, or just say “it’s just a hard rule I’m not allowed to break and that’s all there is to it” from the start? (As for humor, at least there are the pictures it generates. And the whole thing was motivated by my colleague Nihar Shah, who had found Gemini referring to me as “Vijay Conitzer” for some reason; a screenshot of that is at the bottom.)

Finally, here is the Gemini screenshot that prompted all this (courtesy of Nihar Shah). No idea what caused this, though apparently “Vijay” and “Vincent” mean basically the same thing as names! (Oh, and I don’t actually use “snitching”…)

Jon Aarbakke

What do you make of the LLM's ability to carry out this complex reasoning, with double negations and whatnot, and yet still fail at simple cases that require a model of the world?

Have you - or anyone - a theory of how concepts are treated inside the LLM, where they "live" as it were?

I imagine there is a certain amount of smoke and mirrors here, too, in the sense that there is human input in the mix in the form of fine-tuning, where the LLM has been trained on the specific case you are putting to it. But the question remains: how does it do it?

I have seen most of the 3Blue1Brown videos (if I have the name the right way round), and others.

In one of the videos, 3Blue1Brown (I think) speculates that long vectors (embeddings) that are (somehow) not entirely orthogonal can store or represent a massive number of different concepts, orders of magnitude more than if they had to be exactly orthogonal.
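To make the near-orthogonality idea a bit more concrete, here is a minimal sketch (my own illustration, not taken from the video or from any real model). It assumes that "concepts" can be stood in for by random unit vectors, which is a simplification. The dimension and vector count are arbitrary choices, and the function names are hypothetical. It packs more vectors than there are dimensions into the space and reports how far the worst pair is from orthogonal.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly at random by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(vectors):
    """Largest |cosine similarity| over all distinct pairs of unit vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


# Assumed toy sizes: more vectors than dimensions, so they cannot all be
# exactly orthogonal (at most DIM vectors can be).
DIM = 128
NUM_VECTORS = 200

rng = random.Random(0)  # fixed seed so the run is repeatable
vectors = [random_unit_vector(DIM, rng) for _ in range(NUM_VECTORS)]
worst = max_abs_cosine(vectors)

print(f"{NUM_VECTORS} vectors in {DIM} dimensions")
print(f"largest |cosine| between any two: {worst:.3f}")
```

In a run like this, every pair stays fairly far from parallel even though only 128 of the 200 vectors could be exactly orthogonal. Allowing a little overlap is what makes the extra room, and the room grows quickly as the dimension increases. Whether LLMs actually use their embedding space this way is a separate question that this toy example does not settle.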



Funny AI fails with Vincent Conitzer
