Debugging the Linux kernel remains one of the most challenging problems in systems software. With over 20 million lines of code, thousands of contributors, and hardware-level concurrency, crashes can be subtle, nondeterministic, and cryptic to interpret. Traditional debugging techniques often fail to scale across this diversity and complexity.
To address this, we built an agentic framework that leverages Large Language Models (LLMs) to reason about, reproduce, and fix kernel-level bugs. This ecosystem — kBench, kGym, and kAgent — integrates structured datasets, experimental automation, and intelligent reasoning to close the loop between bug observation and repair.
| Name | Description |
|---|---|
| kBench | A curated benchmark of Linux kernel bugs, each paired with developer-provided fixes and deterministic reproduction scripts. Enables systematic evaluation of patching strategies. |
| kGym | A sandboxed, large-scale kernel experimentation platform capable of booting and testing thousands of kernel configurations in parallel. Provides execution traces, crash states, and verification environments for patches. |
| kAgent | An LLM-based autonomous agent that runs experiments in kGym, interprets crash logs, hypothesizes code changes, and iteratively validates patches until a verified fix is achieved. |
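To make the division of labor concrete, the sketch below shows how a single benchmark case might be represented and replayed. The record fields, the JSON layout, and the `submit_job` client call are illustrative assumptions, not the actual schemas or APIs of kBench and kGym.

```python
# Hypothetical sketch: loading one kBench entry and replaying it on kGym.
# Field names and the kGym client interface are assumptions for illustration.
import json
from dataclasses import dataclass

@dataclass
class KBenchEntry:
    bug_id: str            # identifier of the kernel bug report
    kernel_commit: str     # commit hash the crash was observed on
    config: str            # kernel .config used to build the crashing kernel
    reproducer: str        # deterministic reproduction script
    crash_log: str         # original crash report (e.g. an oops or KASAN splat)
    developer_patch: str   # ground-truth fix, used only for evaluation

def load_entry(path: str) -> KBenchEntry:
    """Read one benchmark case from a JSON file (assumed layout)."""
    with open(path) as f:
        return KBenchEntry(**json.load(f))

def reproduce(entry: KBenchEntry, kgym_client) -> dict:
    """Ask the (hypothetical) kGym client to build the kernel at the recorded
    commit, boot it in a sandbox, and run the reproducer, returning the
    execution trace and crash state."""
    job = kgym_client.submit_job(
        commit=entry.kernel_commit,
        config=entry.config,
        reproducer=entry.reproducer,
    )
    return job.wait()
```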
The system adopts a hypothesis-driven debugging workflow:

1. **Reproduce** — kGym builds the affected kernel configuration and runs the bug's deterministic reproduction script to confirm the crash.
2. **Interpret** — kAgent reads the resulting crash log and execution traces to localize the suspect code.
3. **Hypothesize** — kAgent proposes a candidate patch intended to explain and eliminate the failure.
4. **Validate** — kGym rebuilds the patched kernel and re-runs the reproducer; if the crash persists, the new evidence feeds the next iteration (see the loop sketch below).
Through iterative feedback between execution results and model reasoning, kAgent narrows its search space, reducing spurious edits and improving both precision and convergence speed.
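The repair loop itself can be summarized in a few lines. The sketch below builds on the `reproduce` helper from the earlier example; `build_triage_prompt`, `build_patch_prompt`, and the `llm.complete` call are hypothetical stand-ins for kAgent's prompting and kGym interactions, not the released interfaces.

```python
# Hypothetical sketch of kAgent's hypothesis-driven repair loop.
# Helper names and the client/LLM interfaces are illustrative assumptions.
def repair_loop(entry, kgym_client, llm, max_iters: int = 5):
    """Iterate interpret -> patch -> validate until the reproducer no longer crashes."""
    result = reproduce(entry, kgym_client)          # baseline run: confirm the crash
    if not result["crashed"]:
        return None                                 # nothing to fix
    history = []                                    # feedback carried across attempts
    for _ in range(max_iters):
        # 1. Interpret the crash log and localize suspect code.
        hypothesis = llm.complete(build_triage_prompt(entry, result, history))
        # 2. Turn the hypothesis into a concrete patch (e.g. a unified diff).
        patch = llm.complete(build_patch_prompt(hypothesis, entry))
        history.append((hypothesis, patch, result))
        # 3. Validate: rebuild the patched kernel and re-run the reproducer in kGym.
        result = kgym_client.submit_job(
            commit=entry.kernel_commit,
            config=entry.config,
            reproducer=entry.reproducer,
            patch=patch,
        ).wait()
        if not result["crashed"]:
            return patch                            # verified fix
    return None                                     # no verified fix within the budget
```

Carrying `history` back into each triage prompt is what lets the agent rule out failed hypotheses and narrow its search over successive attempts.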
This platform demonstrates the first end-to-end LLM-based repair loop for real Linux kernel failures. It shows that structured experimentation and domain-specific agent design can overcome limitations of general-purpose code models in system-level debugging.
Key outcomes: