This is an article where a bunch of academics are having a #debate over whether #AI can rewrite #code while forgetting humans have been doing it for decades. 🚀 Apparently, nobody told them that their favorite debugger is just one Ctrl+Alt+Del away from fixing everything. 🤖 #InnovativeYetObvious
https://arxiv.org/abs/2605.03546 #Rewriting #Academic #Discussion #Tech #Humor #HackerNews #ngated
ProgramBench: Can Language Models Rebuild Programs From Scratch?

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
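The core evaluation idea — fuzz inputs, run both the reference executable and the candidate build, and compare observable behavior — can be sketched roughly like this. This is an illustrative differential-testing harness, not ProgramBench's actual code; the function names and the random-string fuzzer are assumptions standing in for the paper's agent-driven input generation.

```python
import random
import string
import subprocess

def fuzz_input(rng, max_len=20):
    # Random printable string as a stand-in for agent-proposed test inputs.
    n = rng.randint(0, max_len)
    return "".join(rng.choice(string.printable) for _ in range(n))

def behaviors_match(reference, candidate, stdin_data, timeout=5):
    # Run both executables on the same stdin and compare exit code + stdout.
    def run(cmd):
        try:
            p = subprocess.run(cmd, input=stdin_data, capture_output=True,
                               text=True, timeout=timeout)
            return (p.returncode, p.stdout)
        except subprocess.TimeoutExpired:
            return ("timeout", "")
    return run(reference) == run(candidate)

def pass_rate(reference, candidate, n_tests=100, seed=0):
    # Fraction of fuzzed inputs on which the candidate matches the reference.
    rng = random.Random(seed)
    passed = sum(behaviors_match(reference, candidate, fuzz_input(rng))
                 for _ in range(n_tests))
    return passed / n_tests
```

A per-task score like the abstract's "passing 95% of tests" would then just be `pass_rate(reference_cmd, candidate_cmd)` over the generated test suite.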
