Full paper on Zenodo, code open source on GitHub.
Thread with details ↓
#AIAgents #LLMs #OpenScience
Traditional human beta testing remains a costly, time-intensive, and logistically complex phase of the software development lifecycle. Recruiting 10 to 30 testers, managing feedback cycles over two to four weeks, and triaging unstructured reports typically costs between $2,000 and $10,000 per release while still leaving significant defect classes undiscovered. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code comprehension, reasoning about software behavior, and generating structured analytical output. This paper introduces CAST (Coordinated Agent Swarm Testing), a methodology that substitutes coordinated swarms of LLM-powered agents for human beta testers. CAST organizes 100 independent AI agents into five specialized squads of 20, each agent instantiated with a unique persona that defines its expertise, behavioral archetype, and testing focus. Agents perform static code analysis against the full source of a target application and return structured findings in a controlled JSON schema. An aggregation layer deduplicates findings using content fingerprinting and computes cross-squad confidence scores. We present a case study applying CAST to GingerPen, a 12,500-line React/TypeScript book formatting platform spanning 54 source files. The framework completed a full 100-agent analysis in approximately 63 minutes at a cost of $7.97, producing 772 raw findings that reduced to 687 unique issues after deduplication. Findings spanned accessibility, error handling, state management, security, performance, and data integrity categories. We compare CAST against estimated human beta testing costs and timelines and find that it achieves comparable defect coverage for code-level issues at approximately 1% of the cost. We discuss the methodology's strengths in systematic coverage and reproducibility, its limitations in visual and runtime testing, and future directions including browser-automated visual agents, multi-model ensembles, and CI/CD integration. CAST is proposed not as a full replacement for human testing but as a cost-effective first pass that catches a substantial fraction of pre-launch defects, fundamentally shifting how development teams approach quality assurance.