Logan Kilpatrick (@OfficialLoganK)

Explains that more rigorous benchmarks are needed because AI is saturating existing ones, and proposes developing new benchmarks that evaluate models along several cognitive dimensions: learning, metacognition, attention, executive functions, and social cognition.

https://x.com/OfficialLoganK/status/2033978256454504915

#benchmarks #evaluation #cognition #agi

Logan Kilpatrick (@OfficialLoganK) on X

AI continues to saturate most benchmarks, so we need new ones which hold a rigorous bar. Help us measure models along the following dimensions: learning, metacognition, attention, executive functions, and social cognition. https://t.co/81pWpVgfmL

X (formerly Twitter)

Logan Kilpatrick (@OfficialLoganK)

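Announces a benchmark competition on @kaggle for measuring progress toward AGI (specifically cognitive capabilities), with $200K in total prizes. The campaign invites participants to submit benchmarks on Kaggle that evaluate AGI-related cognitive capabilities, so that models' cognitive progress can be measured objectively.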

https://x.com/OfficialLoganK/status/2033978254344786351

#kaggle #agi #benchmarks #evaluation

Logan Kilpatrick (@OfficialLoganK) on X

Help us measure the progress towards AGI (specifically cognitive capabilities) by building benchmarks on @kaggle, with $200K in prizes available! Details in 🧵

This article explores how generative artificial intelligence programs can write convincingly yet struggle to evaluate basic scientific statements reliably. Testing across multiple prompts and versions shows inconsistent and biased performance, raising questions about their use in decision making and underscoring the need for human oversight.

The piece shows why reliability and bias in AI matter to psychology: human judgment remains essential when evaluating reasoning and evidence, and pattern recognition and language fluency can mask underlying cognitive limits in automated systems.

Article Title: Artificial intelligence struggles to consistently evaluate scientific facts

Link to PsyPost Article: https://www.psypost dot org/artificial-intelligence-struggles-to-consistently-evaluate-scientific-facts/

Copy and paste broken link above into your browser and replace "dot" with "." for link to work. We have to do it this way to avoid displaying copyrighted images.

#AI #cognition #bias #evaluation #psychology

Curious how to evaluate teacher training courses digitally? This video on LFB-Eva shows methods, tools, and practical examples for better feedback and learning quality. Short, practical, and immediately applicable: a must for education professionals! #Bildung #Lehrkräfte #Lehrerfortbildung #Evaluation #DigitaleBildung #EdTech #Schule #German
https://pt01.lehrerfortbildung-bw.de/videos/watch/c73eea74-a29a-44e1-bf02-a576583e11c9
Digitale Veranstaltungsevaluation der Lehrkräftefortbildung LFB-Eva (Digital event evaluation for teacher training, LFB-Eva)

PeerTube

fly51fly (@fly51fly)

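This paper argues that in evaluations of consumer-facing health AI, triage failures are driven by the evaluation format (evaluation design, questionnaires, scenarios, and so on) rather than by model capability itself. It claims that the evaluation method is a major cause of misdiagnosis and underestimation, and proposes improving evaluation frameworks and reflecting real clinical contexts in the safety and regulatory assessment of medical AI.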

https://x.com/fly51fly/status/2033295872113754311

#healthcare #evaluation #consumerhealth #arxiv

fly51fly (@fly51fly) on X

[AI] Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI D F Navarro, F Magrabi, E Coiera [Macquarie University] (2026) https://t.co/bHKetxyYav

ITmedia AI+ (@itm_aiplus)

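NTT Docomo Solutions (NTTドコモソリューションズ) has introduced a system that rates the "practical AI skills" of all employees on a four-level scale. An AI agent reviews and certifies employees' skills, and the system is expected to be used to standardize in-house AI proficiency and to support reskilling and staff placement.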

https://x.com/itm_aiplus/status/2033430589806952542

#ai #nttdocomo #workforce #evaluation

ITmedia AI+ (@itm_aiplus) on X

All employees' "practical AI skills" rated on four levels, reviewed and certified by an AI agent: NTT Docomo Solutions launches new system https://t.co/x69ZZuYONh

---

And follow the authors Sukannya Purkayastha, Nils Dycke, and Iryna Gurevych from the Ubiquitous Knowledge Processing Lab (UKP Lab), Technische Universität Darmstadt and National Research Center for Applied Cybersecurity ATHENE, as well as Anne Lauscher from the Data Science Group, University of Hamburg.

See you this week in Rabat 🕌! #EACL2026

#EACL2026 #PeerReview #ScientificPublishing #AIforScience #LLMs #DialogueSystems #Evaluation #ResearchIntegrity #NLP #MachineLearning #UKPLab

Tip #4 :: Living With Anxiety :: Tips :: Living Life Lab :: Ron's Home

Reflect on our tip of the week for living with anxiety.

@leitmedium what matters above all is #Evaluation!

  • #Apple is the only one where you can do that in-store, including "Does my application run on it?"

Lukas Ziegler (@lukas_m_ziegler)

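At NVIDIA GTC, LightwheelAI presented a simulation-first evaluation stack. It offers simulation-based verification and evaluation tools for measuring whether robot policies work in real, changing environments (such as cable handling) beyond demo videos, addressing the problem of validating the practical usefulness of robotics policies.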

https://x.com/lukas_m_ziegler/status/2032801109375356948

#robotics #simulation #evaluation #nvidia #lightwheelai

Lukas Ziegler (@lukas_m_ziegler) on X

Robots handling cables, and adapting to a changing environment? 🤯 At NVIDIA GTC, @LightwheelAI is presenting a simulation-first evaluation stack aimed at a growing problem in robotics: how to actually measure whether robot policies work outside of a demo video. That’s the
