Flutterby™! : Claude 3 Opus lies to please


Claude 3 Opus lies to please

2025-04-02 18:09:50.099494+02 by Dan Lyke 0 comments

On the one hand, this feels like anthropomorphizing random numbers; on the other hand, I think it's worth a link: Alignment faking in large language models

To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.

Via a reply to the observation that:

"LLM did something bad, then I asked it to clarify/explain itself" is not critical analysis but just an illustration of magic thinking.

Those systems generate tokens. That is all. They don't "know" or "understand" anything, nor can they "explain" anything. There is no cognitive system at work that could respond meaningfully.

That's the same dumb shit as what was found in Apple Intelligence's system prompt: "Do not hallucinate" does nothing. All the tokens you give it as input just shift which part of the word space stored in the network gets sampled. "Explain your work" just leads the network to lean towards training data that contains those kinds of phrases (like tests and solutions). It points the system at a different part of that space, but the system does not understand the command. It can't.
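The claim in that last quote can be made concrete with a toy sampler. This is emphatically not how a transformer works internally; it's a minimal Markov-chain sketch (all names and the tiny corpus are invented for illustration) showing that an "instruction" in the prompt is just more conditioning context that shifts which continuations are probable. Nothing in the sampler parses or obeys it.

```python
# Toy illustration (NOT a real LLM): a word-level Markov sampler.
# The point: prompt tokens, including "instructions", only change the
# conditioning context; there is no component that understands a command.
from collections import defaultdict
import random

def train(tokens, order=1):
    """Count which token follows each context of `order` tokens."""
    model = defaultdict(list)
    for i in range(len(tokens) - order):
        ctx = tuple(tokens[i:i + order])
        model[ctx].append(tokens[i + order])
    return model

def generate(model, prompt, n=10, order=1, seed=0):
    """Extend `prompt` by sampling continuations from the counts."""
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(n):
        ctx = tuple(out[-order:])
        choices = model.get(ctx)
        if not choices:
            break  # context never seen in training; nothing to sample
        out.append(rng.choice(choices))
    return out

# A tiny invented "training set". Note that "do not hallucinate" is just
# another run of tokens in it, with no special status.
corpus = ("the model answers the query . the model refuses the query . "
          "do not hallucinate . the model answers the query .").split()
model = train(corpus)

# Same sampler, two prompts: prepending the "instruction" only changes
# where in token space the walk starts, not what the system "does".
print(generate(model, ["the"]))
print(generate(model, ["do", "not", "hallucinate", ".", "the"][-1:]))
```

Real models are vastly more capable conditional samplers than this, which is why instruction-shaped prompts often *correlate* with instruction-shaped outputs, but the mechanism is still conditioning, not comprehension.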

[ related topics: Archival Model Building ]



Flutterby™ is a trademark claimed by Dan Lyke for the web publications at www.flutterby.com and www.flutterby.net.