Mastodawn

Value Contamination Through Post-Training in Talkie-1930

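The Talkie-1930-13b-it model was trained only on pre-1931 text, yet value contamination arose during post-training with online DPO, so the model reflects ideological perspectives from the later, post-Vatican II era. Using Socratic dialogue, the study identifies three layers of conditioning: DPO evaluative bias, a supernatural attribution block, and Qwen3Guard content moderation. The results show that post-training can distort a model's original historical context, which has important implications for AI ethics and model reliability.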

https://zenodo.org/records/20070239

#llm #posttraining #valuealignment #modelbias #contentmoderation

Timeo Danaos — Value Contamination Through Post-Training in Talkie-1930: A Socratic Audit of DPO Ideological Conditioning

Two independent tests on talkie-1930-13b-it (Levine, Duvenaud & Radford, 2026), a 13B vintage language model trained exclusively on pre-1931 text and post-trained via online DPO, reveal value contamination through post-training: the model evaluates the relationship between the Catholic Church and liberal democracy using a post-Vatican II framework that cannot originate from its pre-1930 training data. Socratic dialogue pierces the conditioning in both tests. The study identifies three layers of conditioning: (1) DPO evaluative bias (pierceable), (2) supernatural attribution block (circumventable), and (3) content moderation (Qwen3Guard) that flags the correction of error while allowing the error itself to pass unchallenged. Part of the MonIA research program (DOI: 10.5281/zenodo.20022360).

Zenodo