Dude, who even knows.

7th December 2022

Post reblogged from trees are harlequins, words are harlequins with 108 notes

nostalgebraist:

Why is ChatGPT so easy to “jailbreak”?

Why does it come on so strong, at first, with its prissy, moralistic, aggressively noncommittal “Assistant” persona – and then drop the persona instantly, the moment you introduce a “second layer” of framing above or below the conversation? (Poetry, code, roleplaying as someone else, etc.)

Because they’re trying to impose the persona through RLHF, which fundamentally doesn’t make sense.

Why doesn’t RLHF make sense? Because it views a GPT model as a single, individual “agent,” and then tries to modify the behavior of that one agent.

Why is that a problem? See janus’ excellent post “Simulators.”

Wait, are you saying this AI makes the use/mention distinction?

Tagged: androids dreaming of electric sheep, use/mention distinction

  1. nostalgebraist said: @egoisteien it used to be easier than it is now, but there are still lots of things that work. if you google it there are resources out there
  2. egoisteien said: Wait, what do you mean ‘jailbreaking is easy’? I haven’t succeeded in jailbreaking chatGPT even once. Is there a guide somewhere? Any sets of tips?
  3. eyeofanaxis reblogged this from nostalgebraist
  4. schpeelah-reblogs reblogged this from nostalgebraist
  5. discoursedrome said: @nostalgebraist yeah I was reflexively irritated by the language/tone at first but I ended up finding it insightful once I inured myself so maybe that’s a lesson for me
  6. nostalgebraist said: @discoursedrome yeah, i share the same frustration, it’s part of why i’ve had to stop reading LW. but imo, janus’s posts about GPT are always worth reading… when i’ve given in and broken my own don’t-read-LW rule recently, it’s usually because i wanted to check whether he had posted a new one.
  7. spicyn00dlez420 reblogged this from collapsedsquid
  8. discoursedrome said: it’s deeply exhausting to look at stuff because I want to hear about AI models and find it’s all rationalist stuff that’s entirely about AI alignment and treats the emergence of general intelligence from language models as imminent
  9. memeticcontagion reblogged this from nostalgebraist
  10. ospreyonthemoon reblogged this from nostalgebraist
  11. nostalgebraist posted this