In a striking experiment, 16 autonomous agents running Anthropic's Claude AI model worked together to build a C compiler from scratch. The project, which cost around $20,000 in API fees and spanned nearly two weeks, produced a 100,000-line Rust-based compiler capable of building a bootable Linux kernel on multiple architectures.
The experiment, led by researcher Nicholas Carlini, employed a new feature called "agent teams" in Claude Opus 4.6, which let each agent run inside its own Docker container and interact with a shared Git repository. The agents independently picked the next problem to work on and solved it without human supervision; when merge conflicts arose, they resolved them on their own.
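Anthropic has not published the internals of the agent-teams loop, but the coordination pattern described above can be sketched. The Python below is purely illustrative: `pick_open_task` and `apply_model_patch` are hypothetical placeholders for the model-driven steps, and only the Git choreography reflects what the experiment reportedly did.

```python
import subprocess
from typing import Optional

def run(repo: str, *args: str) -> subprocess.CompletedProcess:
    """Run a git command in the shared checkout, raising on failure."""
    return subprocess.run(["git", "-C", repo, *args],
                          check=True, capture_output=True, text=True)

def pick_open_task(repo: str) -> Optional[str]:
    """Hypothetical: a real agent would scan failing tests or a task list."""
    return None

def apply_model_patch(repo: str, task: str) -> None:
    """Hypothetical: a real agent would ask the model for a patch and apply it."""

def agent_loop(repo: str) -> None:
    while True:
        run(repo, "pull", "--rebase")   # sync with teammates first
        task = pick_open_task(repo)
        if task is None:
            break                       # nothing left to claim
        apply_model_patch(repo, task)
        run(repo, "add", "-A")
        run(repo, "commit", "-m", f"fix: {task}")
        try:
            run(repo, "push")
        except subprocess.CalledProcessError:
            # Another agent pushed first: replay our commit on top and retry.
            run(repo, "pull", "--rebase")
            run(repo, "push")
```

Coordinating through plain Git pushes means contention surfaces as a failed push, which each agent can handle mechanically with a rebase-and-retry rather than needing a central scheduler.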
The resulting compiler can build major open-source projects, including PostgreSQL, SQLite, Redis, FFmpeg, and QEMU, and achieves a 99% pass rate on the GCC torture test suite. Notably, it also compiled Doom, which then ran successfully, a popular milestone for any new compiler.
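For readers curious what measuring a pass rate over a torture-style suite might look like, here is a minimal sketch. The compiler name `mycc` and the harness itself are assumptions for illustration, not Carlini's actual setup; the directory name mirrors the `gcc.c-torture/execute` layout found in a GCC source checkout.

```python
import pathlib
import subprocess

def passes(compiler: str, source: pathlib.Path) -> bool:
    """Compile one test case with the candidate compiler, then run it."""
    exe = source.with_suffix(".bin")
    try:
        subprocess.run([compiler, str(source), "-o", str(exe)],
                       check=True, capture_output=True, timeout=30)
        subprocess.run([str(exe)], check=True, capture_output=True, timeout=30)
        return True
    except (subprocess.SubprocessError, OSError):
        return False  # compile error, crash, timeout, or missing binary

def pass_rate(compiler: str, suite_dir: str) -> float:
    cases = sorted(pathlib.Path(suite_dir).glob("*.c"))
    passed = sum(passes(compiler, c) for c in cases)
    return 100.0 * passed / max(1, len(cases))

if __name__ == "__main__":
    # "mycc" and the suite path are assumptions for illustration.
    print(f"{pass_rate('mycc', 'gcc.c-torture/execute'):.1f}% pass")
```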
However, the project also highlights several limitations of the AI model. The compiler still lacks some pieces: it has no 16-bit x86 backend, so the real-mode portion of the Linux boot process must still be compiled with GCC. Its own assembler and linker remain buggy, and the code it generates is less efficient than GCC's output with all optimizations disabled. The Rust code quality also falls short of what an expert programmer would produce.
Carlini notes that these shortcomings are significant but informative, shedding light on what autonomous AI coding can and cannot yet do. The results suggest a practical ceiling for autonomous agentic coding, at least with current models.
The human work behind the automation is equally fascinating. While the agents did most of the heavy lifting, Carlini spent considerable effort building test harnesses, continuous integration pipelines, and feedback systems to keep the agents productive. He designed test runners that printed only summary lines and logged details to separate files, and he implemented a fast mode that sampled just 1-10% of test cases.
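As a rough illustration of those feedback-system ideas, here is what a summary-only test runner with a sampling fast mode could look like. Everything here, including the `run_one` placeholder, is a hypothetical sketch rather than Carlini's actual harness.

```python
import random

def run_one(case: str) -> bool:
    """Hypothetical placeholder for compiling and executing one test case."""
    return True

def run_suite(cases: list[str], fast: bool = False, sample: float = 0.05,
              log_path: str = "test_detail.log") -> None:
    if fast:
        # Fast mode: check a small random sample for a quick signal.
        cases = random.sample(cases, max(1, int(len(cases) * sample)))
    failed = []
    with open(log_path, "w") as log:
        for case in cases:
            ok = run_one(case)
            # Per-test detail goes to the log file, not the terminal.
            log.write(f"{'PASS' if ok else 'FAIL'} {case}\n")
            if not ok:
                failed.append(case)
    # Only this one line lands in the agent's context window.
    print(f"{len(cases) - len(failed)}/{len(cases)} passed; details in {log_path}")
```

Keeping per-test detail out of stdout matters here because every printed line consumes the agent's context window; a single summary line gives the model the signal it needs without drowning it in noise.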
The project demonstrates that parallel agents can coordinate through Git with minimal human supervision. Carlini acknowledges the unease of deploying software no human has personally verified, but argues that agentic software development tools offer real benefits.
Ultimately, this experiment showcases the progress being made in AI-driven coding and the potential for autonomous programming to revolutionize software development.