Mastodawn

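HiL-Bench is the first benchmark to test whether AI agents recognize what they are missing and know when to ask. Frontier models do well with complete specifications, but when a few key details are removed they confidently ship plausible wrong answers. GPT-5.5, Opus 4.7, and Kimi K2.6 were recently added to the leaderboard, and the same behavior is currently being observed.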

https://x.com/ScaleAILabs/status/2051333688798097567

#benchmark #aiagents #uncertainty #modelevaluation #gpt5.5

Scale Labs (@ScaleAILabs) on X

We recently built HiL-Bench, the first benchmark to test a critical question: do AI agents know what they’re missing and when to ask? Frontier models perform well with perfect specs. But remove a few key details, and they confidently guess and ship plausible wrong answers. We

X (formerly Twitter)

Harald Sack 5d ago

So, what did we learn in last week's lecture?
(1) The bounty log (history of AI)
(2) Symbolic vs subsymbolic (The two schools)
(3) The mechanics of the chase (ML types)
(4) The black box evaluation
Stay tuned for this week's lecture on traditional ML technologies (k-Means, linear regression, decision trees)

#AI #machinelearning #cowboybebop #HistoryOfAI #modelEvaluation @fiz_karlsruhe @fizise #lecture #KDAI2026

sayzard Apr 27

Mojofull (@furoku)

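A proposal that, whenever a new model is released, METI should lead verification of it in each industry. It calls for building safety and performance evaluation frameworks across industries as AI adoption spreads, and is highly relevant to future discussions of AI policy and model verification infrastructure.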

https://x.com/furoku/status/2048739550818906421

#ai #modelevaluation #meti #policy #verification

Mojofull (@furoku) on X

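Excellent practical-level verification. It would be good if, whenever a new model comes out, this kind of verification were carried out in each industry under METI's leadership. A convoy system for AI.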


sayzard Apr 2

Microsoft Research (@MSFTResearch)

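ADeLe is a framework that compares an AI model's core abilities against the requirements of various tasks, so that performance can be predicted accurately even on tasks the model has not yet encountered. Published in Nature, the research is useful for model evaluation and for predicting generalization performance.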

https://x.com/MSFTResearch/status/2039387314204406125

#ai #modelevaluation #research #nature #machinelearning

Microsoft Research (@MSFTResearch) on X

ADeLe profiles AI models across a set of core abilities and compares them to task requirements. Published in @Nature, this framework enables accurate prediction of model performance on tasks they have not encountered before: https://t.co/zkvHQjfUNI



Eric Maugendre about data Mar 22

On metrics for measuring agreement in regression on continuous data:
Reasons to avoid R² and use RMSE instead: https://feat.engineering/03-Review_of_the_Modeling_Process.html#sec-reg-metrics

From Max Kuhn @topepo, Kjell Johnson (2026), "Feature Engineering and Selection: A Practical Approach for Predictive Models"
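
A minimal sketch of one commonly cited reason to prefer RMSE, not necessarily the argument made in the linked chapter: R² depends on the spread of the outcome, so identical absolute errors can look poor or excellent depending on the data, while RMSE stays in the outcome's own units. The datasets and the helper names `rmse` and `r_squared` below are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the outcome."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: the same absolute errors applied to a narrow
# and a wide range of outcomes.
errors = [1.0, -1.0, 1.0, -1.0]
narrow = [10.0, 11.0, 12.0, 13.0]
wide = [10.0, 110.0, 210.0, 310.0]
narrow_pred = [y + e for y, e in zip(narrow, errors)]
wide_pred = [y + e for y, e in zip(wide, errors)]

for name, y, pred in [("narrow", narrow, narrow_pred), ("wide", wide, wide_pred)]:
    print(f"{name}: RMSE={rmse(y, pred):.2f}  R2={r_squared(y, pred):.5f}")
# narrow: RMSE=1.00  R2=0.20000
# wide: RMSE=1.00  R2=0.99992
```

Both sets of predictions are off by exactly one unit, yet R² ranges from 0.2 to nearly 1 purely because the outcome variance differs.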

#prediction #dataDev #modelEvaluation #regression #modelling #linearRegression #modeling #probability #probabilities #statistics #stats #gotcha

3 A Review of the Predictive Modeling Process – Feature Engineering and Selection: A Practical Approach for Predictive Models

sayzard Mar 12

Cursor (@cursor_ai)

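Cursor has published a new scoring method for evaluating models on agentic coding tasks. The post shares how several models in Cursor compare on intelligence and efficiency under this method, amounting to a proposal for a new benchmark or metric for coding-agent performance.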

https://x.com/cursor_ai/status/2032148125448610145

#cursor #modelevaluation #agenticcoding #aicoding

Cursor (@cursor_ai) on X

We're sharing a new method for scoring models on agentic coding tasks. Here's how models in Cursor compare on intelligence and efficiency:


sayzard Mar 12

Artificial Analysis (@ArtificialAnlys)

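Artificial Analysis reports that the Grok 4.20 Beta shows three major improvements over Grok 4. In particular, it recorded the lowest hallucination rate ever on the AA-Omniscience evaluation: when the model did not know the answer, it gave an incorrect answer only 22% of the time. The summary characterizes this as a model update with substantially improved response accuracy and stability.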
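
Read literally, the 22% figure is a conditional rate: among questions the model did not know, the share it answered incorrectly rather than declining to answer. A minimal sketch of that reading follows. The function name and the counts are hypothetical, and Artificial Analysis's exact definition (for example, how partial answers are treated) may differ.

```python
def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Share of not-known questions answered incorrectly instead of abstaining.

    Assumption: every question the model did not answer correctly counts
    as "not known", and is either answered incorrectly or abstained on.
    """
    not_known = incorrect + abstained
    if not_known == 0:
        raise ValueError("no not-known questions to compute a rate over")
    return incorrect / not_known


# Hypothetical counts chosen only to reproduce the quoted figure.
rate = hallucination_rate(incorrect=22, abstained=78)
print(f"{rate:.0%}")  # 22%
```

Under this reading a model lowers its hallucination rate by abstaining more often when unsure, independently of how many questions it answers correctly.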

https://x.com/ArtificialAnlys/status/2032190330783875147

#grok #llm #modelevaluation #hallucination

Artificial Analysis (@ArtificialAnlys) on X

The Grok 4.20 Beta shows three major improvements over Grok 4: ➤ Our lowest ever hallucination rate on the AA-Omniscience evaluation. When Grok did not know the answer, it hallucinated an incorrect answer 22% of the time - this is the lowest hallucination rate of any model we


sayzard Mar 10

swyx (@swyx)

Quoting @sama's "Build a company that benefits from the models getting better and better", the author reports that Devin (worded as 'devin brain') uses a couple dozen modelgroups, extensively evaluates every model for inclusion in the harness, and does a complete rewrite every few months. The author also notes hearing a lot of "devin is good now" feedback.

https://x.com/swyx/status/2030853776136139109

#samaltman #modelevaluation #mlops #llm

swyx (@swyx) on X

"Build a company that benefits from the models getting better and better" — @sama devin brain uses a couple dozen modelgroups and extensively evals every model for inclusion in the harness, doing a complete rewrite every few months. hearing a lot of "devin is good now" feedback


sayzard Feb 28

MRLN (@mrlnonai)

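A critical comment on models or systems that inflate their performance on specific benchmarks such as SVG output ('benchmaxxed') and then claim that training cost only 10 million dollars. It points out that the cost figure may be distorted, for example by counting only energy costs while excluding the cost of acquiring hardware such as graphics cards, and claims that the model consumes a large number of reasoning tokens in use.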

https://x.com/mrlnonai/status/2027891857942831218

#benchmarking #trainingcosts #modelevaluation #svg #ml

MRLN (@mrlnonai) on X

@kimmonismus and it will be benchmaxxed on svg outputs and other benchmarks and then claim ONLY 10 MILLION dollar used for training even though they acquired for all money they got the graphics cards and they only pay energy costs. and then if you use it you have reasoning tokens like hell


sayzard Feb 27

Deeban R, PhD (@Deeban)

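Anthropic ran 'sabotage evaluations' on its own deployed models and published the results. The key finding is that the risk of misaligned autonomous actions that contribute significantly to later catastrophic outcomes is "very low, but not completely negligible". This is an important evaluation result from an AI safety perspective.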

https://x.com/Deeban/status/2027329314577125596

#anthropic #aisafety #modelevaluation #sabotageevaluation

Deeban R, PhD (@Deeban) on X

Worth recalling: @AnthropicAI ran sabotage evaluations on their own deployed models and published the results. The finding: "Very low, but not completely negligible, risk of misaligned autonomous actions that contribute significantly to later catastrophic outcomes." First
