Mastodawn

Rohan Paul (@rohanpaul_ai)

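Argues that Anthropic is using users' angry prompts as training data, and that a detector for cursing or blame can serve as a cheap signal for catching expectation mismatches. The AI evaluation and training insight here is that collecting user feedback right at the point of failure may be more useful.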

https://x.com/rohanpaul_ai/status/2039278355154182265

#anthropic #trainingdata #feedback #llm #alignment
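
A rough sketch of the "curse detector as a cheap proxy" idea from the post, in Python. This is not Anthropic's method; the marker list, the `FailureSignal` record, and both function names are hypothetical and chosen only for illustration. It flags a user turn that contains a frustration marker and pairs it with the assistant output that came right before it, so the signal is captured in context, at the failure point.

```python
import re
from dataclasses import dataclass

# Hypothetical marker list, for illustration only. A real detector would
# need a far broader, validated vocabulary or a trained classifier.
FRUSTRATION_MARKERS = {"damn", "wtf", "useless", "garbage", "stupid"}


@dataclass
class FailureSignal:
    turn_index: int        # index of the frustrated user turn
    assistant_output: str  # the output the user was reacting to
    user_reply: str        # the frustrated reply itself


def is_frustrated(text: str) -> bool:
    """Return True if any word in the text is a known frustration marker."""
    words = re.findall(r"[a-z']+", text.lower())
    return any(word in FRUSTRATION_MARKERS for word in words)


def collect_failure_signals(conversation):
    """Scan a list of (role, text) turns for frustrated user replies.

    Each hit is paired with the assistant turn immediately before it.
    """
    signals = []
    for i in range(1, len(conversation)):
        role, text = conversation[i]
        prev_role, prev_text = conversation[i - 1]
        if role == "user" and prev_role == "assistant" and is_frustrated(text):
            signals.append(FailureSignal(i, prev_text, text))
    return signals


conversation = [
    ("user", "Write a sort function"),
    ("assistant", "def sort(x): return x"),
    ("user", "wtf, this is useless, it does not sort anything"),
    ("assistant", "Sorry, here is a corrected version."),
    ("user", "thanks, that works"),
]
signals = collect_failure_signals(conversation)
```

Unlike a thumbs-down button, nothing here depends on the user choosing to rate the answer; the signal comes out of the reply they were going to write anyway. A keyword match is crude, though, and will miss polite frustration and misread quoted or joking profanity.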

Anthropic is reading every angry prompt as training data. A curse detector is a cheap proxy for expectation breach. It can be better than a thumbs-down because it arrives in context, at the failure point, after the user has actually tried to use the output rather than casually