Debugging the Linux kernel remains one of the most challenging problems in systems software. With over 20 million lines of code, thousands of contributors, and hardware-level concurrency, crashes can be subtle, nondeterministic, and cryptic to interpret. Traditional debugging techniques often fail to scale across this diversity and complexity.
To address this, we built an agentic framework that leverages Large Language Models (LLMs) to reason about, reproduce, and fix kernel-level bugs. This ecosystem — kBench, kGym, and kAgent — integrates structured datasets, experimental automation, and intelligent reasoning to close the loop between bug observation and repair.
| Name | Description |
|---|---|
| kBench | A curated benchmark of Linux kernel bugs, each paired with developer-provided fixes and deterministic reproduction scripts. Enables systematic evaluation of patching strategies. |
| kGym | A sandboxed, large-scale kernel experimentation platform capable of booting and testing thousands of kernel configurations in parallel. Provides execution traces, crash states, and verification environments for patches. |
| kAgent | An LLM-based autonomous agent that runs experiments in kGym, interprets crash logs, hypothesizes code changes, and iteratively validates patches until a verified fix is achieved. |
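To make the division of labor concrete, the sketch below shows how a single benchmark case might be represented and replayed. The record fields, the JSON layout, and the `submit_job` client call are illustrative assumptions, not the actual schemas or APIs of kBench and kGym.

```python
# Hypothetical sketch: loading one kBench entry and replaying it on kGym.
# Field names and the kGym client interface are assumptions for illustration.
import json
from dataclasses import dataclass

@dataclass
class KBenchEntry:
    bug_id: str            # identifier of the kernel bug report
    kernel_commit: str     # commit hash the crash was observed on
    config: str            # kernel .config used to build the crashing kernel
    reproducer: str        # deterministic reproduction script
    crash_log: str         # original crash report (e.g. an oops or KASAN splat)
    developer_patch: str   # ground-truth fix, used only for evaluation

def load_entry(path: str) -> KBenchEntry:
    """Read one benchmark case from a JSON file (assumed layout)."""
    with open(path) as f:
        return KBenchEntry(**json.load(f))

def reproduce(entry: KBenchEntry, kgym_client) -> dict:
    """Ask the (hypothetical) kGym client to build the kernel at the recorded
    commit, boot it in a sandbox, and run the reproducer, returning the
    execution trace and crash state."""
    job = kgym_client.submit_job(
        commit=entry.kernel_commit,
        config=entry.config,
        reproducer=entry.reproducer,
    )
    return job.wait()
```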
The system adopts a hypothesis-driven debugging workflow:

1. **Reproduce** — kGym builds the affected kernel configuration and runs the bug's deterministic reproduction script to confirm the crash.
2. **Interpret** — kAgent reads the resulting crash log and execution traces to localize the suspect code.
3. **Hypothesize** — kAgent proposes a candidate patch intended to explain and eliminate the failure.
4. **Validate** — kGym rebuilds the patched kernel and re-runs the reproducer; if the crash persists, the new evidence feeds the next iteration (see the loop sketch below).
Through iterative feedback between execution results and model reasoning, kAgent narrows its search space, reducing spurious edits and improving both precision and convergence speed.
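The repair loop itself can be summarized in a few lines. The sketch below builds on the `reproduce` helper from the earlier example; `build_triage_prompt`, `build_patch_prompt`, and the `llm.complete` call are hypothetical stand-ins for kAgent's prompting and kGym interactions, not the released interfaces.

```python
# Hypothetical sketch of kAgent's hypothesis-driven repair loop.
# Helper names and the client/LLM interfaces are illustrative assumptions.
def repair_loop(entry, kgym_client, llm, max_iters: int = 5):
    """Iterate interpret -> patch -> validate until the reproducer no longer crashes."""
    result = reproduce(entry, kgym_client)          # baseline run: confirm the crash
    if not result["crashed"]:
        return None                                 # nothing to fix
    history = []                                    # feedback carried across attempts
    for _ in range(max_iters):
        # 1. Interpret the crash log and localize suspect code.
        hypothesis = llm.complete(build_triage_prompt(entry, result, history))
        # 2. Turn the hypothesis into a concrete patch (e.g. a unified diff).
        patch = llm.complete(build_patch_prompt(hypothesis, entry))
        history.append((hypothesis, patch, result))
        # 3. Validate: rebuild the patched kernel and re-run the reproducer in kGym.
        result = kgym_client.submit_job(
            commit=entry.kernel_commit,
            config=entry.config,
            reproducer=entry.reproducer,
            patch=patch,
        ).wait()
        if not result["crashed"]:
            return patch                            # verified fix
    return None                                     # no verified fix within the budget
```

Carrying `history` back into each triage prompt is what lets the agent rule out failed hypotheses and narrow its search over successive attempts.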
This platform demonstrates the first end-to-end LLM-based repair loop for real Linux kernel failures. It shows that structured experimentation and domain-specific agent design can overcome limitations of general-purpose code models in system-level debugging.
Key outcomes: