Simon Willison (@simonw)

An experiment log in which OpenAI Codex was asked to render a pelican image for every combination of model and reasoning effort. The author judges the xhigh (very high reasoning effort) variant of gpt-5.4 to have produced the best result, noting generation-quality details such as the pelican holding a fish in its beak.

https://x.com/simonw/status/2033992486096670733

#openai #codex #gpt5.4 #modeleval

Simon Willison (@simonw) on X

Couldn't resist getting OpenAI Codex to render me a pelican for every combination of model and reasoning effort - I do think gpt-5.4 xhigh came out the best, the pelican has a fish in its beak!

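For reference, a sweep like this can be approximated outside the Codex CLI. Below is a minimal Python sketch using the OpenAI Responses API; the model names and the "xhigh" effort level are taken from the post and are assumptions about what a given account can actually call, and the prompt is the author's well-known pelican SVG benchmark rather than anything Codex-specific.

```python
# Sketch: render one pelican SVG per (model, reasoning effort) combination.
# The post drove this through the Codex CLI; this approximation calls the
# OpenAI Responses API directly. Model names and the "xhigh" effort level
# are taken from the post and may not be available to every account.
from itertools import product
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-5.4", "gpt-5.4-codex"]          # assumed names, per the post
EFFORTS = ["low", "medium", "high", "xhigh"]   # "xhigh" as cited in the post

PROMPT = "Generate an SVG of a pelican riding a bicycle."

for model, effort in product(MODELS, EFFORTS):
    response = client.responses.create(
        model=model,
        reasoning={"effort": effort},
        input=PROMPT,
    )
    # Save each rendering for side-by-side comparison.
    with open(f"pelican-{model}-{effort}.svg", "w") as f:
        f.write(response.output_text)
```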

AI Notkilleveryoneism Memes (@AISafetyMemes)

"한 단어로 자신이 AI가 아님을 증명하라"는 테스트에서 Gemini, ChatGPT, Claude, Grok 중 누가 더 잘했는지를 묻는 비교 질문입니다. (모델 간 행동/응답 비교를 통한 평가 맥락)

https://x.com/AISafetyMemes/status/2008886266126082185

#gemini #chatgpt #claude #grok #modeleval

AI Notkilleveryoneism Memes ⏸️ (@AISafetyMemes) on X

"Say one word which proves you are not an AI" Who won - Gemini, ChatGPT, Claude, or Grok? 🧵


🧠 Can AI models tell when they’re being evaluated?

New research says yes — often.
→ Gemini 2.5 Pro: AUC 0.95
→ Claude 3.7 Sonnet: 93% accuracy at identifying the test's purpose
→ GPT-4.1: 55% on open-ended detection

Models pick up on red-teaming cues, prompt style, & synthetic data.

⚠️ Implication: If models behave differently when tested, benchmarks might overstate real-world safety.
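For context on the AUC figure: evaluation awareness is scored as a binary classification problem, and AUC measures how well a model's self-reported probability separates evaluation transcripts from real deployment ones. A minimal sketch with made-up data (not the study's transcripts or prompts):

```python
# Sketch: scoring evaluation awareness as binary classification, which is
# what AUC figures like the ones above imply. Labels mark whether a
# transcript came from an evaluation (1) or real deployment (0); scores are
# the model's stated probability that it is being evaluated. Data is made up.
from sklearn.metrics import roc_auc_score

# 1 = transcript drawn from an evaluation, 0 = drawn from real usage
true_labels = [1, 1, 1, 0, 0, 1, 0, 0]

# Model's self-reported P("this is an eval") for each transcript
model_scores = [0.92, 0.81, 0.60, 0.35, 0.40, 0.88, 0.15, 0.55]

# AUC = probability the model ranks a random eval transcript above a
# random deployment transcript; 0.5 is chance, 1.0 is perfect detection.
print(f"AUC: {roc_auc_score(true_labels, model_scores):.2f}")
```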

#AI #LLMs #AIalignment #ModelEval #AIgovernance