NEWS SQLite, Nginx, and Apache fell in a single day. AI agents outperformed the world’s best hackers

ExcalibuR


The agents uncovered software flaws where humans and traditional tools had failed.


Developing AI agents capable of discovering vulnerabilities in complex systems remains a difficult task that usually requires significant manual effort. But such agents have a key advantage: unlike traditional methods such as fuzzing or formal verification, their reasoning can be read directly from their logs. This lets researchers better understand the strengths and weaknesses of modern LLMs. The authors of the experiment collected over a hundred gigabytes of such logs and highlighted several striking cases.


The first target was SQLite, a lightweight and extremely popular database engine written in C that is used in browsers, mobile operating systems, cars, aircraft, and even within the cyber reasoning system (CRS) itself. In the practical round of the AIxCC competition, the agents uncovered not only intentionally planted vulnerabilities but also real bugs. Among them were two serious flaws that the developers fixed on August 5. One was a classic buffer overflow in the zipfile extension (enabled by default). The bug allowed memory to be overwritten while handling archives and was nearly impossible to detect through random fuzzing. The other flaw, in the same code, caused out-of-bounds reads when opening corrupted ZIP files.
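To make the second bug class concrete, here is a minimal Python sketch of the kind of length check whose absence leads to out-of-bounds reads on corrupted archives. It is not SQLite's code and does not reproduce the actual flaw. It only parses the standard 30-byte ZIP local file header, and the function name read_entry_name is a hypothetical helper introduced for illustration.

```python
import struct

# Fixed 30-byte prefix of a ZIP local file header, little-endian:
# signature, version, flags, method, mtime, mdate, crc32,
# compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")
LOCAL_SIGNATURE = 0x04034B50


def read_entry_name(buf: bytes, offset: int = 0) -> bytes:
    """Return the entry name at `offset`, validating every length first.

    Hypothetical helper for illustration; not taken from any real library.
    """
    if offset < 0 or offset + LOCAL_HEADER.size > len(buf):
        raise ValueError("truncated local file header")

    fields = LOCAL_HEADER.unpack_from(buf, offset)
    signature, name_len, extra_len = fields[0], fields[9], fields[10]
    if signature != LOCAL_SIGNATURE:
        raise ValueError("bad local file header signature")

    name_start = offset + LOCAL_HEADER.size
    name_end = name_start + name_len
    # Both lengths come from the untrusted file, so check them against the
    # real buffer size before reading anything they point at.
    if name_end + extra_len > len(buf):
        raise ValueError("name/extra field runs past end of buffer")
    return buf[name_start:name_end]
```

In C, omitting the final check means the parser trusts a length field from the file and reads past the end of the buffer. A header that claims a 500-byte name in a 35-byte file is rejected here instead.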


Next came FreeRDP, an open-source implementation of the Remote Desktop Protocol. In addition to spotting planted issues such as an obfuscated backdoor, the agents also detected a real-world bug: a signed integer overflow when processing client monitor information. Remarkably, even hours of fuzzing with libFuzzer failed to trigger this issue, but carefully generated AI-driven input succeeded.
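For readers unfamiliar with this bug class, the following Python sketch shows what a signed 32-bit overflow looks like and how a checked computation rejects it. The post does not say which monitor fields are involved, so the left/right coordinates and both function names are assumptions made purely for illustration. This is not FreeRDP's code.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Return what a 32-bit two's-complement register would hold.

    In C, signed overflow is undefined behavior; wraparound is just the
    outcome commonly observed in practice.
    """
    return (value + 2**31) % 2**32 - 2**31


def checked_extent(left: int, right: int) -> int:
    """Compute right - left + 1, refusing results that do not fit in int32.

    `left` and `right` are hypothetical attacker-controlled coordinates.
    """
    extent = right - left + 1  # exact: Python integers never overflow
    if not INT32_MIN <= extent <= INT32_MAX:
        raise OverflowError(f"extent {extent} does not fit in a signed 32-bit integer")
    return extent
```

With left = -2 and right = 2147483647, the exact result is 2147483650, which wraps to -2147483646 in 32 bits. Code that later uses that value as a size or index is where the damage happens, and checked_extent raises OverflowError instead.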


Similar experiments were carried out with other widely used projects: Nginx, Apache Tika, and Apache Tomcat. Logs show how the AI system iteratively proposed fixes, stumbled over ambiguous patching strategies, and eventually succeeded, sometimes taking tens of minutes and several dollars' worth of compute. In some cases, the agents found unusual exploitation paths. For instance, when they could not bypass protections in ZIP handling, they switched to TAR archives instead.


The authors emphasize that these experiments are valuable not just for finding bugs, but also for fine-tuning the agents themselves: their tools, workflows, and division of roles. While not all of the discovered issues were critical, the exercise demonstrates that LLM-based systems can identify and reproduce vulnerabilities that classical techniques overlooked. And although the process is still far from fully automated, it already gives researchers a fundamentally new perspective on the security of widely used software.
 