Chubby (@kimmonismus)

BullshitBench v2, created by Peter Gostev, differs from existing benchmarks in that it tests whether AI models can detect and reject nonsensical prompts. According to the post, only Anthropic's Claude family and Alibaba's Qwen 3.5 posted scores on this benchmark.

https://x.com/kimmonismus/status/2029230388028358726

#benchmark #aisafety #robustness #anthropic #qwen

Chubby♨️ (@kimmonismus) on X

BullshitBench v2, created by Peter Gostev, is a benchmark that does something refreshingly different: it tests whether AI models can detect and reject nonsensical prompts instead of confidently rolling with them. Only Anthropic's Claude models and Alibaba's Qwen 3.5 score

AssemblyAI (@AssemblyAI)

AssemblyAI tested Universal-3 Pro Streaming in the New York subway and reports that it is "subway-proof". The post highlights the robustness and low latency of streaming inference under real-world, on-the-move conditions.

https://x.com/AssemblyAI/status/2029227606776967451

#universal3 #streaming #robustness #edgeai

AssemblyAI (@AssemblyAI) on X

We took Universal-3 Pro Streaming out for a spin in the New York subway Spoiler: it's subway-proof 😎

The Completeness Trap

I keep catching myself optimizing for the wrong kind of completeness.

Ten months into building MirrorDNA, I've established clear patterns: robust error handling over speed hacks, comprehensive polic

https://activemirror.ai/blog/the-completeness-trap

#systemsthinking #architecture #robustness #documentation #sovereigninfrastructure

The Power Of Using A Story For Better Data Comprehension And Hence Decision Making
--
https://doi.org/10.1080/15228053.2021.2016151 <-- shared book review, “Data Story: Explain Data And Inspire Action Through Story”
--
[I came across this excellent graphic from @saurabh Rai and went on to explore the ideas it puts so succinctly. I found a technical overview of the story approach (the book review linked above) to match; however, this should not be considered an endorsement of the book.]
#data #storytelling #data #comprehension #presentation #story #frameworks #context #setting #dataquality #communication #usecase #robustness #insights #correctness #decisionmaking #narratives #decisions

fly51fly (@fly51fly)

The paper 'Consistency of Large Reasoning Models Under Multi-Turn Attacks' (Y Li, R Krishnan, R Padman, CMU, 2026) has been released. It analyzes and reports on consistency problems in large reasoning models under multi-turn attacks, offering insights into attack resistance and stability (link to the paper included).

https://x.com/fly51fly/status/2023583155425583127

#robustness #reasoningmodels #adversarial #arxiv

fly51fly (@fly51fly) on X

[LG] Consistency of Large Reasoning Models Under Multi-Turn Attacks Y Li, R Krishnan, R Padman [CMU] (2026) https://t.co/6nwEU2mzrp

fly51fly (@fly51fly)

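The paper 'HalluGuard' separates LLM hallucinations into data-driven and reasoning-driven types, analyzes each, and identifies their causes and mitigations. A joint study by Virginia Tech, MIT, and Dartmouth, it proposes a technique (HalluGuard) for understanding and preventing hallucination and backs it with experimental evidence.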

https://x.com/fly51fly/status/2016281139284213873

#hallucination #llm #robustness #analysis

fly51fly (@fly51fly) on X

[LG] HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs X Zeng, J Lin, Y Yan, F Guo... [Virginia Tech & MIT & Dartmouth College] (2026) https://t.co/qkgVowV7KC

fly51fly (@fly51fly)

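Researchers at Huazhong University (X. Zhang et al.) introduce the concept of 'Logical Phase Transitions' and propose a framework for understanding collapse in LLM logical reasoning. The work analyzes a critical phenomenon in which reasoning performance degrades sharply under certain conditions, and discusses ways to improve model stability and robustness (arXiv:2601.02902).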

https://x.com/fly51fly/status/2013727971320750198

#llm #logicalreasoning #phasetransition #robustness

fly51fly (@fly51fly) on X

[CL] Logical Phase Transitions: Understanding Collapse in LLM Logical Reasoning X Zhang, Y Zhang, Z Chen, J Yu... [Huazhong University of Science and Technology] (2026) https://t.co/Jf09jJZNPP

Robust, resilient, and values-driven. https://andrewjmclaughlin.blogspot.com/2025/11/robust-resilient-and-values-driven.html
Amidst a very busy week, I really benefitted from listening to Martin I. Jones, Dr Jonpaul Nevin and Nathalie Pattyn (via the Optimising Human Performance podcast) discuss how good training programmes build both #robustness (how long it takes to knock ...
Robust, resilient, and values-driven.

Reflections on learning, teaching and leading in the Scottish education system. Focus on leadership, digital pedagogy and intentionality.