Mastodawn

Dan McAteer (@daniel_mac8)

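A tweet announcing that he is moving more of his writing to a new publication, "Attention Heads", where he plans to cover deeper AI discussion: AI agents, model shifts, research translation, evals, and philosophy. It reads as a declaration of a shift toward content focused on AI research and development insights.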

https://x.com/daniel_mac8/status/2052713398605897924

#aiagents #evals #research #aithoughtleadership

Dan McAteer (@daniel_mac8) on X

I'm moving more of my writing to "Attention Heads". X is still where I'll test ideas. "Attention Heads" is where I want to build the deeper version: AI agents, model shifts, research translation, evals, attention, philosophy, and what it means to build wisely.

X (formerly Twitter)

sayzard 2d ago

Xinwei Feng (@bbgasj)

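Open-sourcing a dense model at this scale is assessed as a realistic middle ground. The key point to watch, according to the post, is what differences the upcoming benchmarks and documentation reveal in agentic coding performance and reliability.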

https://x.com/bbgasj/status/2051898253420143000

#opensource #aimodel #llm #codingagent #evals

sayzard May 1

Dan McAteer (@daniel_mac8)

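Reports that applying RSI across the entire agent engineering lifecycle (workspace, tools, middleware, memory, evals) raised Terminal-Bench 2 pass@1 from 69.7% to 77%. It appears to be a significant technical result for improving agent development tooling and evaluation.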

https://x.com/daniel_mac8/status/2049806527851114651

#agents #evals #benchmark #aiengineering #automation

Dan McAteer (@daniel_mac8) on X

RSI applied to every aspect of the agentic engineering lifecycle. > workspace > tools > middleware > memory > evals Lifts Terminal-Bench 2 pass@1 from 69.7% to 77%. Agent RSI is not some far-off, mystical dream. It's happening right now.


sayzard Apr 29

AshutoshShrivastava (@ai_for_success)

Plurai has launched "vibe training" for AI agent evals and guardrails. It appears to be a new evaluation approach aimed at better detecting and catching the failures that occur when running agents in production.

https://x.com/ai_for_success/status/2049146171168555116

#aiagents #evals #guardrails #llm #production

AshutoshShrivastava (@ai_for_success) on X

Plurai just launched vibe training for AI agent evals and guardrails. Running agents in production means accepting that failures will happen. The question is whether you catch them. Most teams don't, not really. LLM as a judge is the default eval approach and it has two obvious


Habr Apr 28

How to Evaluate Agent Performance

As agentic systems develop rapidly, more and more companies, large and small alike, are considering integrating agents into their workflows. Unsurprisingly, many decision-makers at these companies view agent reliability with a healthy dose of skepticism. A negligent employee can face disciplinary action and other measures, but what do you do about a negligent AI?

https://habr.com/ru/companies/raft/articles/1028832/

#evals #agentic_evaluation #ai_evaluation #agent_eval #ai_evals

How to Evaluate Agent Performance

Habr

sayzard Apr 27

Ben Cohen (@blc_16)

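He stresses that evals are the most important part of a product and that almost everything else is replaceable. The tweet strongly signals the importance of benchmarks and evaluation frameworks in AI product development.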

https://x.com/blc_16/status/2048594772290568693

#evals #product #benchmark #ai

Ben Cohen (@blc_16) on X

@OfficialLoganK Your product is your evals. Everything else is disposable


sayzard Apr 26

Mikeysee (@mikeysee)

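Notes that a @convex evals run has become the most expensive to date, and that he made a change to reduce how often overly expensive models are called. A practical update on the cost of using AI development tools and on operational optimization.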

https://x.com/mikeysee/status/2048460970188533956

#convex #evals #llm #cost #aiops

Mikeysee (@mikeysee) on X

Ooof... we have a new winner for the most expensive @convex evals run, I had to make a change to reduce the frequency models this expensive run, I dont want to bankrupt us!


sayzard Apr 26

BOOTOSHI (@KingBootoshi)

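Looking for reference material on writing useful evals when building custom agents and workflows. The post asks for advice on how to design good evals for a range of tasks, including memory, personal assistants, coding, and writing.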

https://x.com/KingBootoshi/status/2048122254530404694

#evals #aiagents #promptengineering #coding #llm

BOOTOSHI 👑 (@KingBootoshi) on X

Does anyone have a good reference on creating working evals? In general, ex. how can I create good memory evals, personal assistant evals, coding evals, writing evals etc I am creating custom agents for custom workflows so any guides would be amazing for directing prompts!


Itamar Medeiros Apr 21

We can trace everything our AI systems do—but can we tell if it’s actually good?

Evaluation in agentic AI is not about more metrics. It’s about defining what “good” means—and testing it continuously.

Are your #evals designed—or just observed?

#AgenticAI #ProductStrategy #UXStrategy #ProductDesign

https://www.designative.info/2026/04/21/from-behavior-to-judgment-designing-evaluation-for-agentic-systems/

From Behavior to Judgment: Designing Evaluation for Agentic Systems » { design@tive } information design

Learn how to evaluate agentic AI systems using dual evaluation, LLM-as-a-judge, and hybrid methods that go beyond observability.

{ design@tive } information design

Judith van Stegeren Apr 20

Generative AI apps have their own version of the training-serving skew from classical ML: the eval-production gap.

You create an eval dataset, optimize your LLM flows against it, hit great performance on your metrics, and ship. Then real users show up and:
- Write inputs that are multiple pages long
- Ask in Spanish, Russian or Chinese when you tested in English
- Upload file types you never considered
- Ask questions from domains your product wasn't designed for
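One way to make this gap visible is to record what the eval dataset actually covered and flag production requests that fall outside it. The sketch below is only an illustration under stated assumptions: `EvalCoverage`, `ProductionRequest`, and `coverage_gaps` are hypothetical names, the thresholds are made up, and the language label is assumed to come from some upstream detector. It covers the first three cases in the list; out-of-domain questions would need a separate classifier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EvalCoverage:
    """What the eval dataset exercised (illustrative values, not real data)."""
    max_chars: int
    languages: frozenset
    file_types: frozenset


@dataclass(frozen=True)
class ProductionRequest:
    text: str
    language: str                    # assumed to come from an upstream language detector
    file_type: Optional[str] = None  # None when no file was uploaded


def coverage_gaps(req: ProductionRequest, cov: EvalCoverage) -> List[str]:
    """Return the reasons a production request falls outside eval coverage."""
    gaps = []
    if len(req.text) > cov.max_chars:
        gaps.append("input longer than any eval example")
    if req.language not in cov.languages:
        gaps.append(f"language '{req.language}' not in eval set")
    if req.file_type is not None and req.file_type not in cov.file_types:
        gaps.append(f"file type '{req.file_type}' not in eval set")
    return gaps


# Hypothetical coverage: short English inputs with PDF uploads only.
cov = EvalCoverage(
    max_chars=2000,
    languages=frozenset({"en"}),
    file_types=frozenset({"pdf"}),
)

# A long Spanish request with a spreadsheet attached.
req = ProductionRequest(text="hola " * 1000, language="es", file_type="xlsx")
gaps = coverage_gaps(req, cov)
for reason in gaps:
    print(reason)
```

Logging the flagged requests gives a running list of candidates to add to the eval dataset, so coverage can grow with real usage.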

#mlops #genai #llms #evals