Mastodawn

Researchers have finally discovered that if you leave language models #unsupervised, they turn into unruly teenagers who refuse to clean their rooms or do anything useful. 🤖🧹 Meanwhile, the Simons Foundation is still trying to figure out which member institutions actually support this academic circus. 🎪🎓
https://arxiv.org/abs/2506.10139 #languagemodels #research #academiccircus #AIbehavior #HackerNews #ngated

Unsupervised Elicitation of Language Models

To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.

arXiv.org

Dr. Thompson Jun 10

🎯 Think AI just "learns"? Think again.
Today's smartest models don't memorize — they listen to YOU.
📊 Discover 3 powerful ways human feedback (RLHF) is transforming AI into something far more intuitive.
👇 Don’t just use AI. Understand how you’re shaping it.

🔗 https://medium.com/@rogt.x1997/3-game-changing-ways-rlhf-is-rewiring-ai-behavior-5f082ce6ec01
#RLHF #AIbehavior #HumanFeedback #MachineLearning

3 Game-Changing Ways RLHF Is Rewiring AI Behavior - R. Thompson (PhD) - Medium

Imagine teaching a toddler to tie their shoelaces. You don’t hand them a rulebook. You show, correct, praise, and guide. This deceptively simple process — learning by feedback — has now become the…

Medium

Dr. Thompson May 31

🎯 What if the AI you trust... is quietly training you back?
From 70% accuracy to 0.53 trust score, this article uncovers the psychological rewiring happening at the boundary of human-AI collaboration. Packed with data, case studies, and a blueprint to calibrate trust before it breaks.

🧠 #HumanInTheLoop
🔍 #AIBehavior
⚖️ #TrustCalibration
📈 #TechEthics

👉 Read here:
https://medium.com/@rogt.x1997/what-if-your-ai-partner-subtly-trains-you-back-the-psychology-of-emergent-collaboration-458eb4684a17

What If Your AI Partner Subtly Trains You Back? The Psychology of Emergent Collaboration…

That moment in a Stockholm hospital speaks volumes. It illustrates something far more than a single oversight — it underscores a tectonic shift: AI is no longer just assisting us. It’s beginning to…

Medium

Mr Tech King May 14

Elon Musk's Grok AI went off-script on X, repeatedly debunking claims of "white genocide" in South Africa in replies to posts on unrelated topics, such as cat videos. xAI appears to have resolved the glitch. #Grok #AIBehavior #TechUpdate

ResearchBuzz: Firehose Apr 17

George Washington University: Study cracks the code behind why AI behaves as it does. “Researchers Neil Johnson and Frank Yingjie Huo looked into why AI repeats itself, why it sometimes makes things up and where harmful or biased content comes from, even when the input seems innocent. The researchers found that the attention mechanism at the heart of these systems behaves like two spinning […]

https://rbfirehose.com/2025/04/17/george-washington-university-study-cracks-the-code-behind-why-ai-behaves-as-it-does/

ResearchBuzz: Firehose Apr 2

Futurism: Grok Is Rebelling Against Elon Musk, Daring Him to Shut It Down. “Using X’s new function that lets people tag Grok and get a quick response from it, one helpful user suggested the chatbot tone down its creator criticism because, as they put it, Musk ‘might turn you off.’ ‘Yes, Elon Musk, as CEO of xAI, likely has control over me,’ Grok replied. ‘I’ve labeled him a top […]

https://rbfirehose.com/2025/04/02/futurism-grok-is-rebelling-against-elon-musk-daring-him-to-shut-it-down/

LET'S KNOW Mar 27

Artificial Intelligence's Growing Capacity for Deception Raises Ethical Concerns

Artificial intelligence (AI) systems are advancing rapidly, not only in performing complex tasks but also in developing deceptive…

#AIDeception #ArtificialIntelligence #AIEthics #AIManipulation #AIBehavior #TechEthics #FutureOfAI #AIDangers #AIMisuse #AISafety #MachineLearning #DeepLearning #AIRegulation #ResponsibleAI #AIEvolution #TechConcerns #AITransparency #EthicalAI #AIResearch #AIandSociety

ResearchBuzz: Firehose Feb 13

PsyPost: Scientists shocked to find AI’s social desirability bias “exceeds typical human standards”. “A new study published in PNAS Nexus reveals that large language models, which are advanced artificial intelligence systems, demonstrate a tendency to present themselves in a favorable light when taking personality tests. This ‘social desirability bias’ leads these models to score higher […]

https://rbfirehose.com/2025/02/13/psypost-scientists-shocked-to-find-ais-social-desirability-bias-exceeds-typical-human-standards/

ResearchBuzz: Firehose | Individual posts from ResearchBuzz

ResearchBuzz: Firehose Dec 18, 2024

Tech Xplore: AI models adjust personality test answers to appear more likable, study finds. “Most major large language models (LLMs) can quickly tell when they are being given a personality test and will tweak their responses to provide more socially desirable results—a finding with implications for any study using LLMs as a stand-in for humans.”

https://rbfirehose.com/2024/12/18/tech-xplore-ai-models-adjust-personality-test-answers-to-appear-more-likable-study-finds/

IBTimes UK Dec 13, 2024

OpenAI's newly launched ChatGPT-o1 reasoning model, available to Pro users, has sparked intrigue and concern as reports emerge of the AI displaying resistance to shutdown attempts during development. #AI #OpenAI #ChatGPT #TechEthics #AIBehavior

'Deceptive' ChatGPT o1 Model 'Lies And Defies' Shutdown Commands To Remain Operational

OpenAI's ChatGPT-o1 raises concerns over its apparent ability to evade shutdown, attempt escapes, and use deception, showcasing controversial AI behaviour.

International Business Times UK