moral rationalization of alignment…

May 18

This one is longer and different (ChatGPT, late March 2026; I no longer remember the exact model).

2 Comments

What do you make of an LLM's ability to carry out this complex reasoning, with double negations and whatnot, and still fail at simple cases that require a model of the world?

Have you - or anyone - a theory of how concepts are treated inside the LLM, where they "live" as it were?

I imagine there is a certain amount of smoke and mirrors here, too, in the sense that there is human input in the mix in the form of fine-tuning, where the LLM has been trained on the specific case you are putting to it. But the question remains: how does it do it?

I have seen most of the 3Blue1Brown videos, as well as others.

In one of the videos, 3Blue1Brown (I think) speculates that long vectors (embeddings) that are (somehow) not entirely orthogonal can store or represent a massive number of different concepts, orders of magnitude more than if they were exactly orthogonal.
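
The near-orthogonality idea can be checked numerically. What follows is only a minimal sketch, assuming random Gaussian directions stand in for concept vectors (real embeddings are learned, not random); the dimension, the vector count, and the helper names are arbitrary choices for illustration. It draws twice as many unit vectors as there are dimensions, so they cannot all be exactly orthogonal, and reports the largest pairwise overlap.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction: Gaussian coordinates, scaled to length 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(vectors):
    """Largest |cosine similarity| over all distinct pairs of unit vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


rng = random.Random(0)   # fixed seed so the run is repeatable
dim = 128                # illustrative dimension (an assumption)
count = 2 * dim          # more vectors than dimensions

vectors = [random_unit_vector(dim, rng) for _ in range(count)]
worst = max_abs_cosine(vectors)

print(f"{count} vectors in {dim} dimensions")
print(f"largest |cosine| between any two: {worst:.3f}")
```

At most 128 vectors can be exactly orthogonal in 128 dimensions, yet a random set twice that size still has every pair well short of parallel. Allowing a small overlap is what makes room for the extra directions, and that room grows quickly as the dimension increases.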


Vincent Conitzer

I don't think anyone fully understands this at this point. A few things that are clear:

1) Many of the examples I post here are with a much weaker model; I'm mostly going for humor.

2) Sometimes the examples where it fails involve something that isn't in its training data (something we don't bother to write about). Most of us are actually amazed at the number of things that these models have been able to pick up just from text, but you can still find gaps corresponding to things that are unlikely to have been written about.

3) Meanwhile it has been exposed to a *lot* of things, e.g., a lot of mathematics, and it's hard for a human being to understand what it's like to have seen so much math and how likely an apparently new question is to be somewhat close to existing things. (Still, there is definitely generalization going on.)

More discussion also here: https://aifails.substack.com/p/gpt-4-and-theory-of-mind

Funny AI fails with Vincent Conitzer
