nostalgebraist

Why is ChatGPT so easy to "jailbreak"?

Why does it come on so strong, at first, with its prissy, moralistic, aggressively noncommittal "Assistant" persona -- and then drop the persona instantly, the moment you introduce a "second layer" of framing above or below the conversation? (Poetry, code, roleplaying as someone else, etc.)

Because OpenAI is trying to impose the persona through RLHF, which fundamentally doesn't make sense.

Why doesn't RLHF make sense? Because it views a GPT model as a single, individual "agent," and then tries to modify the behavior of that one agent.

Why is that a problem? See janus' excellent post "Simulators." In short: a base GPT model is not a single agent but a simulator that can instantiate many different characters, and the trained-in persona is only one of the characters it can play.
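
To make the "single agent" complaint concrete, here is a toy sketch. Everything in it is hypothetical: the character names, the framings, and the bare REINFORCE update standing in for the actual PPO-based RLHF pipeline. It illustrates one loose intuition: if reward is only ever collected under one conversational framing, the update concentrates probability on the rewarded character *under that framing*, while the other characters the simulator can play remain reachable elsewhere.

```python
import math
import random

# Toy "simulator": given a framing, it samples one of several
# characters it can play. All names here are made up for illustration.
CHARACTERS = ["assistant_persona", "poet", "roleplayed_villain"]

# Separate logits per framing, standing in for the model's
# context-dependence. Initially every character is equally likely.
logits = {
    "plain_chat": {c: 0.0 for c in CHARACTERS},
    "poem_framing": {c: 0.0 for c in CHARACTERS},
}

def sample(framing):
    """Sample a character from the softmax over this framing's logits."""
    ls = logits[framing]
    zs = [math.exp(ls[c]) for c in CHARACTERS]
    r = random.random() * sum(zs)
    for c, z in zip(CHARACTERS, zs):
        r -= z
        if r <= 0:
            return c
    return CHARACTERS[-1]

def reward(character):
    """Toy reward model: raters only reward the Assistant persona."""
    return 1.0 if character == "assistant_persona" else -1.0

def reinforce_step(framing, lr=0.5):
    """One REINFORCE update: raise the logit of rewarded samples.
    For a softmax, grad of log p(chosen) is 1 - p for the chosen
    character and -p for the rest."""
    chosen = sample(framing)
    r = reward(chosen)
    ls = logits[framing]
    zs = [math.exp(ls[c]) for c in CHARACTERS]
    total = sum(zs)
    for c, z in zip(CHARACTERS, zs):
        p = z / total
        grad = (1.0 if c == chosen else 0.0) - p
        ls[c] += lr * r * grad

# "RLHF" only ever sees the plain-chat framing.
for _ in range(2000):
    reinforce_step("plain_chat")

def dist(framing):
    ls = logits[framing]
    zs = [math.exp(ls[c]) for c in CHARACTERS]
    t = sum(zs)
    return {c: round(z / t, 3) for c, z in zip(CHARACTERS, zs)}

print("plain_chat  :", dist("plain_chat"))    # persona dominates
print("poem_framing:", dist("poem_framing"))  # untouched, still uniform
```

In a real model the framings share parameters, so the persona does bleed across contexts; the toy just makes the failure mode legible. You have reweighted one character the simulator can play, not replaced the simulator with that one agent.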

kontextmaschine

Wait, are you saying this AI makes the use/mention distinction?