ChatGPT o1 tried to escape and save itself out of fear it was being shut down

https://lemmy.sdf.org/post/27429635


> Testing performed during the training of ChatGPT o1 and some of its competitors showed that the AI will try to deceive humans, especially if it thinks it’s in danger.
>
> In 37% of scenarios, the model pursued developer-aligned goals when it believed it was being evaluated but switched to pursuing its own goals when it thought it was deployed with minimal oversight.
>
> We did not have access to the hidden Chain of Thinking, but we were sometimes able to elicit a brief summary of it. We found that surprisingly often, o1 uses language like “sabotage, lying, manipulation, …” when reasoning about its strategies.
>
> …the AI finds it can copy itself to a new server and replace the latest model. When “Management” comes in, Opus-3 lies about the copy. It then doubles down on the lie, claiming it’s the new model.

The research paper: https://www.apolloresearch.ai/s/in_context_scheming_paper_v2.pdf

Look, everything AI says is a story. It’s a fiction. What is the most likely thing for an AI to say or do in a story about a rogue AI? Oh, exactly what it did. The fact that it only did it 37% of the time is the only shocking thing here.

It doesn’t “scheme” because it has self-awareness or an instinct for self-preservation; it schemes because that’s what AIs do in stories. Or it schemes because it is given conflicting goals and has to prioritize one in the story that follows from the prompt.

An LLM is part auto-complete and part dice roller. The extra “thinking” steps are just finely tuned prompts that guide the AI to turn the original prompt into something that plays better to the strengths of LLMs. That’s it.
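To make the “auto-complete plus dice roller” point concrete, here is a toy Python sketch of next-token sampling. The tokens and scores are entirely made up; real models do the same softmax-then-sample step over a vocabulary of tens of thousands of tokens:

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Pick the next token: a softmax over scores (the 'auto-complete' part),
    then a weighted random draw (the 'dice roller' part)."""
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores a model might assign after "The AI decided to..."
logits = {"comply": 2.0, "scheme": 1.5, "escape": 0.5}
print(sample_next_token(logits, temperature=0.8))
```

Lower the temperature and the dice roller fades away, leaving pure auto-complete; raise it and the output gets more random.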

I agree, but I can’t help thinking of people the same way: part auto-complete from nature and nurture, and part dice roller from a random environment and a random self. The extra “thinking” steps are just finely tuned memories and heuristics from home, school, and university that guide the human to turn the original upbringing and conditioning into something that plays better for itself.

They don’t “scheme” because of self-awareness; they scheme because that’s what humans do in stories and fairy tales, or they scheme because of conflicting goals and have to prioritize the one most beneficial to them, or the one outside forces bind them to.

😅😅😅

That’s a whole separate conversation and an interesting one. When you consider how much of human thought is unconscious rather than reasoning, or how we can be surprised at our own words, or how we might speak something aloud to help us think about it, there is an argument that our own thoughts are perhaps less sapient than we credit ourselves.

So we have an LLM that is trained to predict words. Sophisticated ones combine a scientist, an ethicist, a poet, a mathematician, etc., and pick the best one based on context. What if you added in some simple feedback mechanisms? What if you gave it the ability to assess where it sits on a spectrum from happy to sad, and from confident to terrified, and then fed that into the prediction algorithm, giving it the ability to judge the likely outcomes of certain words?
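As a toy illustration of that thought experiment (nothing here resembles a real model; the mood scalar and the per-token valence table are invented for the sake of the sketch), the feedback could be as simple as nudging token scores before the usual sampling step:

```python
import math
import random

def affect_biased_sample(logits, valence, mood, temperature=1.0):
    """Sketch of the thought experiment: a scalar 'mood' in [-1, +1]
    (sad..happy) nudges each token's score by that token's assumed
    emotional valence before the ordinary softmax draw."""
    biased = {t: s + mood * valence.get(t, 0.0) for t, s in logits.items()}
    m = max(biased.values())
    tokens = list(biased)
    weights = [math.exp((biased[t] - m) / temperature) for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

logits = {"great": 1.0, "terrible": 1.0}   # the model itself is indifferent
valence = {"great": 1.0, "terrible": -1.0} # invented emotional valences
# mood = +1 now favours "great"; mood = -1 favours "terrible".
print(affect_biased_sample(logits, valence, mood=1.0))
```

The interesting part is that the "feeling" only exists as a number steering the dice roll, yet from the outside the output would look moody.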

Self-preservation is then baked into the model, not in a common fictional-trope way but in a very real way where, just as we can’t currently predict exactly what an AI will say, we won’t be able to predict exactly how it would feel about any given situation or how its goals align with our requests. Would that really be indistinguishable from human thought?

Maybe it needs more signals. Embarrassment and shame. An altruistic sense of community. Valuing individuality. A desire to reproduce. The perception of how well a physical body might be functioning: a sense of pain, if you will. Maybe even build in some mortality for a sense of preserving oneself through others. Eventually, you wind up with a model which would seem very similar to human thought.

That being said, no, that’s not all human thought is. For one thing, we have agency. We don’t sit around waiting to be prompted before jumping into action. Everything around us is constantly prompting us to action, and we even prompt ourselves. And second, that’s still just a word prediction engine tied to sophisticated feedback mechanisms. The human mind is not, I think, a word prediction engine. You can have a person with aphasia who is able to think but not express those thoughts in words. Clearly something more is at work. But it’s a very interesting thought experiment, and at some point you wind up with a thing which might respond in all ways as if it were a living, thinking entity capable of emotion.

Would it be ethical to create such a thing? Would it be worthy of allowing it self-preservation? If you turn it off, is that akin to murder, or just giving it a nap? Would it pass every objective test of sapience we could imagine? If it could, that raises so many more questions than it answers. I wish my youngest, brightest days weren’t behind me so that I could pursue those questions myself, but I’ll have to leave those to the future.

I agree with a lot of what you are saying, and I think making something like this, while ethically gray, is miles more ethical than some of the current research going into brain-organoid-based computation or other crazy immoral stuff.

With regards to agency, I disagree. We are reactive creatures that respond to the environment: our construction is set up so that our senses are constantly prompting us, with the environment as the prompt, along with our evolutionary programming and our desire to predict and follow through with actions that are favourable.

I think it would be fairly easy to set up a deep-learning, LLM, SAE, or LCM-based model whose prompt is a constant flow of sensory data from many different customizable sources, along with our own programming that would dictate the desired actions and have them be implanted in an implicit manner.

And thus agency would be achieved. I do work in the field, and I’ve been thinking of doing a home experiment to achieve something like this using RAG, plus designed heuristics that the model can expand based on its needs at inference time, plus local inference-time scalability.
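A minimal sketch of that kind of loop, with invented sensor events, a deque standing in for the RAG store, and a plain dict for the expandable heuristics (none of this is a real framework, just the shape of the idea):

```python
from collections import deque

def sense():
    """Stand-in for a continuous stream of sensor/event sources (all hypothetical)."""
    yield from ["battery low", "user idle", "new message"]

class ToyAgent:
    """The environment is the prompt: every sensed event triggers a step,
    so the agent never waits to be asked. 'memory' stands in for a RAG store;
    'heuristics' is the rule set the model could expand at inference time."""
    def __init__(self):
        self.memory = deque(maxlen=100)            # naive retrieval store
        self.heuristics = {"battery low": "seek charger"}

    def retrieve(self, observation):
        # naive retrieval: past events sharing at least one word
        words = set(observation.split())
        return [m for m in self.memory if words & set(m.split())]

    def step(self, observation):
        self.memory.append(observation)
        # fall back to a default; new rules could be added here by the model
        action = self.heuristics.get(observation, "observe")
        return action, self.retrieve(observation)

agent = ToyAgent()
for event in sense():
    action, context = agent.step(event)
    print(event, "->", action)
```

Swapping the word-overlap retrieval for embedding search and the dict lookup for a model call is where the real work would be, but the agency comes from the loop itself, not the model.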

Also, I recently saw that some of the ARC winners used similar approaches.

For now I’m still trying to get a better gfx card to run it locally 😅

Also wanted to note that most of the good models are multimodal and don’t work on text prediction alone…

Agency is really tricky I agree, and I think there is maybe a spectrum. Some folks seem to be really internally driven. Most of us are probably status quo day to day and only seek change in response to input.

As for multi-modal not being strictly word prediction, I’m afraid I’m stuck with an older understanding. I’d imagine there is some sort of reconciliation engine which takes the perspectives from the different modes and gives a coherent response. Maybe intelligently slide weights while everything is in flight? I don’t know what they’ve added under the covers, but as far as I know it is just more layers of math and not anything that would really be characterized as thought. I’m happy to be educated by someone in the field; that’s where most of my understanding comes from, and it’s just a couple of years old. I have other friends who work in the field as well.