Projects

PALs
Process-Aware Learning Agents for Human-AI Alignment
AI agents can generate impressive outputs (code, slide decks, documents), but they fail to follow the trusted processes that humans rely on. Everyone has an established workflow honed over years of practice. If agents cannot understand or adapt to these workflows, people are forced to change how they work in order to use AI, which causes friction, mistrust, and low adoption. How can we get agents not just to produce good outputs but to follow the trusted processes a user already knows? We propose building process-aware agents that learn and adapt to a user's workflow. Our approach is to gather version history across tools such as Google Docs, Google Slides, GitHub commit histories, and chat logs. We then use process mining techniques to extract workflows, such as iteration cycles and checkpoints, and represent each process as a structured workflow graph. Agents follow this mined workflow, ask clarifying questions, and adapt on the fly to the user's working style in order to build a collaborative partnership over the course of a long-term project. We aim to show that process-aware agents are more trusted and produce code, slides, and documents personalized to that person's working style, demonstrating that AI can adapt to people rather than forcing people to adapt to AI.
Workflow-alignment Human Computer Interaction Agentic Systems Code Agents
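A minimal sketch of the process-mining step described above, not PALs' actual implementation: it treats version-history entries from different tools as timestamped events and counts transitions between consecutive actions, which could then form the edges of a structured workflow graph. The `Event` fields and tool/action names are illustrative assumptions.

```python
# Illustrative sketch (not PALs' actual code): mine a coarse workflow graph
# from timestamped revision events gathered across tools.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Event:
    tool: str      # e.g., "gdocs", "slides", "github", "chat" (assumed labels)
    action: str    # e.g., "draft", "comment", "commit", "review"
    timestamp: float

def mine_workflow(events: list[Event]) -> Counter:
    """Count transitions between consecutive actions; frequent transitions
    become candidate edges of a structured workflow graph."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    transitions = Counter()
    for prev, curr in zip(ordered, ordered[1:]):
        transitions[(f"{prev.tool}:{prev.action}", f"{curr.tool}:{curr.action}")] += 1
    return transitions

if __name__ == "__main__":
    history = [
        Event("gdocs", "draft", 1.0),
        Event("chat", "review", 2.0),
        Event("gdocs", "draft", 3.0),
        Event("github", "commit", 4.0),
    ]
    for edge, count in mine_workflow(history).most_common():
        print(edge, count)
```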
LumOS: LLM Agents for Always‑On OS Tuning
Safe, explainable control loops that continuously optimize Linux in real time
Overview
LumOS builds an expert‑in‑residence agent that safely tunes Linux kernel schedulers online. The agent reasons over live telemetry, proposes settings, and applies changes with guards (transactional apply/commit/revert, approvals), delivering faster convergence and lower tail latency than classical tuners or manual tuning, and adapting quickly to workload shifts.

Why it matters
Traditional BO/RL tuners often explore blindly, need brittle reward engineering, and adapt slowly. By emulating human expert reasoning, the agent interprets system state, chooses safe steps, and explains what it's doing, making OS auto‑tuning governable and auditable in production environments.

What's new here
- LLM‑guided control loop: An LLM proposes scheduler settings (e.g., min_granularity_ns, latency_ns) from structured telemetry and trend summaries; host‑side guardrails ensure safety (typed tools, policy, approval gates, apply/commit/revert).
- Zero‑mod deployment path: A userspace agent learns from existing counters (perf, /proc, /sys) with no app changes and no kernel patches. It avoids brittle single‑metric proxies by using pairwise ranking and archetype selection for robust decisions across contexts.
- Latency‑aware speculation (optional): A fast speculator makes immediate, reversible OS adjustments while the slower actor deliberates, improving reaction time without sacrificing correctness (last‑write‑wins).

Results at a glance
- Beats classical tuners and a human expert on Linux CFS tuning: −5.0% p99 vs. Bayesian (1‑param), −7.1% vs. Bayesian (2‑param), and −2.98% vs. a kernel‑savvy human overall; converges faster and adapts smoothly to rate changes.
- Robust with zero app instrumentation: single‑metric proxies (e.g., IPC) are brittle across workloads; pairwise ranking over diversified counters stays predictive even under noise and antagonists.
- Faster reaction via speculation: during recovery, p95 latency is ~37.9 ms with actor+speculator vs. 54.0 ms actor‑only (untuned ~103 ms); convergence in ~10–15 s vs. ~200 s for actor‑only.

How it works
- Observe: Collect a compact performance signature from system counters (CPU/core, scheduler, memory/VM, LLC, I/O, network).
- Reason:
  - LLM expert: summarizes trends and proposes safe next settings; guardrails (MCP‑style tool schemas, policy, approvals) keep actions auditable and reversible.
  - Zero‑mod ML: if app metrics are absent, a pairwise ranker selects better configs based on the signature; archetype gating picks a specialized ranker or falls back to a global one.
- Act: Apply staged changes, measure, and commit or roll back; optionally let a fast speculator react every second while the actor finalizes.

Safety & governance
We adopt MCP‑style interfaces: discoverable, typed tools; semantic validation (units/ranges/cross‑field checks); two‑phase apply–commit–revert; policy/approval gates; and structured audit logs for forensics and replay.

Broader thrusts
Beyond OS control, we generalize speculative actions to agentic environments (gameplay, e‑commerce, web QA): a fast model predicts likely next steps while authoritative components verify. This lossless pattern yields substantial wall‑clock savings, up to ~30% time reduction in end‑to‑end agent runs under representative settings, and integrates cleanly with tool‑ and human‑in‑the‑loop pipelines.
Operating Systems LLM Agents Scheduling Auto‑tuning MCP
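The sketch below illustrates the guarded apply/measure/commit-or-revert pattern described in "How it works" and "Safety & governance"; it is not the LumOS code. The sysctl path, the sleep interval, and the improvement criterion are assumptions, and the telemetry and LLM-proposal steps are left as placeholders.

```python
# Minimal sketch of a guarded apply/commit/revert tuning step (not LumOS).
# The knob path and thresholds are assumptions; propose_setting() stands in
# for the LLM reasoning step described above.
import pathlib
import time

KNOB = pathlib.Path("/proc/sys/kernel/sched_min_granularity_ns")  # assumed path

def read_p99_latency_ms() -> float:
    """Placeholder for telemetry collection (perf, /proc, /sys counters)."""
    raise NotImplementedError

def propose_setting(current: int, baseline_p99: float) -> int:
    """Placeholder for the LLM expert; returns a candidate knob value."""
    raise NotImplementedError

def tune_once() -> None:
    old_value = int(KNOB.read_text())
    baseline = read_p99_latency_ms()
    candidate = propose_setting(old_value, baseline)

    KNOB.write_text(str(candidate))     # two-phase: apply the staged change...
    time.sleep(5)                       # ...and measure under the new setting
    observed = read_p99_latency_ms()

    if observed <= baseline:            # commit only if tail latency improved
        print(f"commit {candidate} (p99 {baseline:.1f} -> {observed:.1f} ms)")
    else:                               # otherwise revert to the prior value
        KNOB.write_text(str(old_value))
        print(f"revert to {old_value} (p99 regressed to {observed:.1f} ms)")
```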
Cortex
Workflow-Aware Resource Pooling and Scheduling for Agentic Serving
Cortex is a prototype workflow-aware serving platform designed for agentic workloads. The core principle of Cortex is stage isolation: it provisions dedicated resource pools for each distinct stage of an agentic workflow. This simple yet powerful strategy mitigates inter-stage interference in compute and memory, leading to better KV cache utilization, higher throughput, and more predictable performance. By customizing resource allocation and scheduling within each distinct stage of agentic workflows, Cortex lays the groundwork for more advanced, agent-native serving paradigms, including malleable resource management, speculative execution of workflow branches, and a shared, multi-tiered cache for “agentic state.”
Workflow-awareness Agentic orchestration Stage isolation
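A minimal sketch of the stage-isolation idea, assuming a simple semaphore-bounded pool per stage; Cortex's actual pool sizing, routing, and cache management are not shown here, and the stage names are illustrative.

```python
# Illustrative sketch of stage isolation (not Cortex's implementation): each
# workflow stage gets a dedicated, bounded pool so stages do not compete for
# the same compute and KV-cache capacity.
import asyncio

class StagePool:
    """A bounded worker pool dedicated to one workflow stage."""
    def __init__(self, name: str, max_concurrency: int):
        self.name = name
        self.semaphore = asyncio.Semaphore(max_concurrency)

    async def run(self, request_id: str) -> str:
        async with self.semaphore:        # only this stage's requests share slots
            await asyncio.sleep(0.01)     # placeholder for model execution
            return f"{self.name} handled {request_id}"

POOLS = {                                 # hypothetical stages and pool sizes
    "plan":    StagePool("plan", max_concurrency=2),
    "tool":    StagePool("tool", max_concurrency=8),
    "respond": StagePool("respond", max_concurrency=4),
}

async def serve(stage: str, request_id: str) -> str:
    return await POOLS[stage].run(request_id)

if __name__ == "__main__":
    async def demo():
        results = await asyncio.gather(
            serve("plan", "r1"), serve("tool", "r2"), serve("respond", "r3"))
        print("\n".join(results))
    asyncio.run(demo())
```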
SAGE
A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation
Evaluating multi-turn interactive agents is challenging due to the need for human assessment. Evaluation with simulated users has been introduced as an alternative; however, existing approaches typically model generic users and overlook the domain-specific principles required to capture realistic behavior. We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts. SAGE incorporates top-down knowledge rooted in business logic, such as ideal customer profiles, grounding user behavior in realistic customer personas. We further integrate bottom-up knowledge drawn from business agent infrastructure (e.g., product catalogs, FAQs, and knowledge bases), allowing the simulator to generate interactions that reflect users' information needs and expectations in a company's target market. Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool that supports bug-finding and iterative agent improvement.
User Simulator Agent Evaluation
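A hypothetical sketch of how top-down and bottom-up knowledge could be combined into the prompt that drives one simulated user; the `Persona` fields and prompt wording are assumptions, not SAGE's actual interfaces.

```python
# Hypothetical sketch of SAGE-style grounding (not the actual framework): a
# simulated user is conditioned on a top-down customer persona plus bottom-up
# business artifacts such as the product catalog and FAQs.
from dataclasses import dataclass

@dataclass
class Persona:               # top-down: drawn from an ideal customer profile
    segment: str
    goal: str
    tone: str

def build_user_prompt(persona: Persona, catalog: list[str], faqs: list[str]) -> str:
    """Assemble the system prompt for one simulated user."""
    return "\n".join([
        f"You are a {persona.segment} customer. Goal: {persona.goal}. Tone: {persona.tone}.",
        "Only ask about products the business actually offers:",
        *(f"- {item}" for item in catalog),          # bottom-up grounding
        "Questions the agent should be able to answer:",
        *(f"- {q}" for q in faqs),
        "Interact turn by turn; end the chat once your goal is met or blocked.",
    ])

if __name__ == "__main__":
    p = Persona(segment="small-business owner", goal="upgrade a support plan", tone="hurried")
    print(build_user_prompt(p, ["Basic plan", "Pro plan"], ["How do refunds work?"]))
```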
ACONIC
Design your workflow with CONstraint-Induced Complexity
Large Language Models (LLMs) suffer from reliability issues on complex tasks, and existing decomposition methods are heuristic, relying on agent-driven or manual decomposition. This work introduces a novel, systematic decomposition framework that we call Analysis of CONstraint-Induced Complexity (ACONIC), which models the task as a constraint problem and leverages formal complexity measures to guide decomposition. On combinatorial tasks (SATBench) and LLM database querying tasks (Spider), we find that by decomposing tasks according to this complexity measure, agents perform considerably better (by 10-40 percentage points).
Workflow Agents
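ACONIC's actual complexity measure is not spelled out above, so the sketch below is only an illustration of the general pattern: model the task as a constraint problem, compute a complexity score, and split the problem until each piece falls below a threshold. The stand-in measure (constraints per variable), the threshold, and the naive halving split are all assumptions.

```python
# Illustrative sketch only (not ACONIC's measure or algorithm): decide when a
# constraint problem should be decomposed before handing pieces to an LLM.
from dataclasses import dataclass

@dataclass
class ConstraintProblem:
    variables: list[str]
    constraints: list[tuple[str, str]]   # pairwise constraints between variables

def complexity(p: ConstraintProblem) -> float:
    """Stand-in measure: average number of constraints touching each variable."""
    return 2 * len(p.constraints) / max(len(p.variables), 1)

def decompose(p: ConstraintProblem, threshold: float = 2.0) -> list[ConstraintProblem]:
    """Split the variable set in half until each subproblem is simple enough.
    Cross-partition constraints are dropped in this naive split."""
    if complexity(p) <= threshold or len(p.variables) <= 2:
        return [p]
    mid = len(p.variables) // 2
    left, right = set(p.variables[:mid]), set(p.variables[mid:])
    restrict = lambda keep: ConstraintProblem(
        [v for v in p.variables if v in keep],
        [c for c in p.constraints if c[0] in keep and c[1] in keep])
    return decompose(restrict(left), threshold) + decompose(restrict(right), threshold)

if __name__ == "__main__":
    p = ConstraintProblem(
        ["a", "b", "c", "d"],
        [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")])
    print([len(part.variables) for part in decompose(p)])   # -> [2, 2]
```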
AgentDynEx
Nudging the Mechanics and Dynamics of Multi-Agent Simulations
Multi-agent large language model simulations have the potential to model complex human behaviors and interactions. If the mechanics are set up properly, unanticipated and valuable social dynamics can surface. However, it is challenging to consistently enforce simulation mechanics while still allowing for notable and emergent dynamics. We present AgentDynEx, an AI system that helps set up simulations from user-specified mechanics and dynamics. AgentDynEx uses LLMs to guide users through a Configuration Matrix to identify core mechanics and define milestones to track dynamics. It also introduces a method called nudging, where the system dynamically reflects on simulation progress and gently intervenes if it begins to deviate from intended outcomes. A technical evaluation found that nudging enables simulations to have more complex mechanics and maintain their notable dynamics compared to simulations without nudging. We discuss the importance of nudging as a technique for balancing the mechanics and dynamics of multi-agent simulations.
Multi-Agent Simulations Nudging Generative Agents
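A hypothetical sketch of the nudging loop, not AgentDynEx's code: the runner periodically checks milestone progress and injects a gentle corrective instruction when the simulation drifts. The milestone check and nudge text are placeholder assumptions.

```python
# Hypothetical sketch of nudging: check milestones during the run and inject a
# gentle corrective instruction when the simulation deviates from intent.
from dataclasses import dataclass, field

@dataclass
class Simulation:
    log: list[str] = field(default_factory=list)

    def step(self) -> None:
        self.log.append("tick")                    # placeholder simulation step

    def inject(self, msg: str) -> None:
        self.log.append(f"NUDGE: {msg}")           # message surfaced to agents

def milestone_missed(sim: Simulation, step_index: int) -> bool:
    """Placeholder check, e.g., 'agents should have formed teams by step 10'."""
    return step_index == 10 and "teams formed" not in sim.log

def run_with_nudging(sim: Simulation, steps: int = 20) -> None:
    for i in range(steps):
        sim.step()
        if milestone_missed(sim, i):
            sim.inject("Remind agents to form teams before continuing.")

if __name__ == "__main__":
    s = Simulation()
    run_with_nudging(s)
    print([entry for entry in s.log if entry.startswith("NUDGE")])
```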
Data Flow Control
Empowering agents by controlling data flow
The promise of Large Language Model (LLM) agents is to perform complex, stateful tasks. This promise is stunted by significant risks (policy violations, process corruption, and prompt injection) that stem from the lack of visibility and mechanisms to manage undesirable data flows produced by agent actions. Today, agent workflows are responsible for enforcing these policies in ad hoc ways. Just as data validation and access controls shifted from the application to the DBMS, freeing application developers from these concerns, we argue that systems should support Data Flow Controls (DFCs) and enforce DFC policies natively.
Agents Governance Privacy Security
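A hypothetical sketch of what a native data flow control check might look like, not a specific system's API: every flow between a source and a sink is evaluated against declarative policies before the agent action proceeds. The `Flow` fields, labels, and policies are illustrative assumptions.

```python
# Hypothetical sketch of a data flow control check: flows are matched against
# declarative policies and the first matching decision wins.
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    source: str      # e.g., "crm.customer_record" (assumed naming scheme)
    sink: str        # e.g., "email.external"
    label: str       # e.g., "pii", "public", "validated"

POLICIES = [
    # (predicate over a flow, decision)
    (lambda f: f.label == "pii" and f.sink.startswith("email.external"), "deny"),
    (lambda f: f.sink.startswith("db.prod") and f.label != "validated", "require_approval"),
]

def check_flow(flow: Flow) -> str:
    """Return the first matching policy decision, defaulting to allow."""
    for predicate, decision in POLICIES:
        if predicate(flow):
            return decision
    return "allow"

if __name__ == "__main__":
    print(check_flow(Flow("crm.customer_record", "email.external", "pii")))   # deny
    print(check_flow(Flow("wiki.page", "chat.internal", "public")))           # allow
```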
checkpoint-lite
A minimal checkpoint/restore tool using CRIU and OverlayFS for fast process state management.
checkpoint-lite provides a simple interface to checkpoint and restore running processes while capturing both their memory state and filesystem changes. Unlike heavyweight container solutions, this tool focuses on minimal overhead by directly orchestrating existing kernel features.

Key Features
- Hybrid State Capture: Combines filesystem (OverlayFS) and memory (CRIU) checkpointing
- Multi-Session Support: Concurrent usage by multiple applications with isolated sessions
- Minimal Overhead: Direct system calls without unnecessary container abstractions
- Simple CLI: Straightforward command-line interface for checkpoint operations
- Session Management: Automatic cleanup and resource management
Checkpoint Systems AI
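A sketch of the underlying mechanism only; checkpoint-lite's own CLI and session layout are not reproduced here. It shows how CRIU can capture process memory state and how OverlayFS can capture filesystem changes as a writable upper layer, invoked directly via subprocess (requires root and CRIU installed).

```python
# Sketch of the CRIU + OverlayFS mechanism (not checkpoint-lite's actual CLI).
import subprocess

def mount_overlay(lower: str, upper: str, work: str, merged: str) -> None:
    """Expose `lower` read-only at `merged`, with all writes landing in `upper`."""
    subprocess.run(
        ["mount", "-t", "overlay", "overlay",
         "-o", f"lowerdir={lower},upperdir={upper},workdir={work}", merged],
        check=True)

def checkpoint(pid: int, image_dir: str) -> None:
    """Dump the process tree's memory/state with CRIU, leaving it running."""
    subprocess.run(
        ["criu", "dump", "-t", str(pid), "-D", image_dir,
         "--shell-job", "--leave-running"],
        check=True)

def restore(image_dir: str) -> None:
    """Restore the process tree from a previous CRIU dump."""
    subprocess.run(["criu", "restore", "-D", image_dir, "--shell-job"], check=True)
```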
stracetools
A modern Python library for parsing, analyzing, and visualizing strace output with ease.
System debugging and performance analysis often rely on strace to understand application behavior. However, existing tools typically fall short:
- Limited scope: Most tools only provide basic statistics or file access lists
- No programmability: Fixed output formats with no API for custom analysis
- Poor multi-threading support: Difficult to analyze concurrent syscall execution
- No visualization: Raw text output is hard to interpret for complex applications

StraceTools bridges these gaps by providing:
- ✨ Comprehensive parsing with full syscall detail extraction
- 🔧 Programmable API for custom analysis workflows
- 📊 Interactive visualizations for timeline and process analysis
- 🧵 Multi-threading support with process relationship tracking
Tracing Systems Python
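Since the library's API is not shown above, the sketch below is a stand-alone illustration of the kind of parsing it automates: turning one line of `strace -f -tt` output into structured fields. It is not stracetools' actual interface, and real strace output has corner cases (unfinished/resumed calls, signals) this regex does not cover.

```python
# Minimal stand-alone sketch of strace-line parsing (not stracetools' API).
import re
from dataclasses import dataclass
from typing import Optional

LINE = re.compile(
    r"^(?P<pid>\d+)\s+(?P<time>[\d:.]+)\s+"
    r"(?P<name>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>-?\d+|\?)")

@dataclass
class Syscall:
    pid: int
    time: str
    name: str
    args: str
    ret: str

def parse_line(line: str) -> Optional[Syscall]:
    m = LINE.match(line)
    if not m:
        return None   # unfinished/resumed lines, signals, etc. need extra handling
    return Syscall(int(m["pid"]), m["time"], m["name"], m["args"], m["ret"])

if __name__ == "__main__":
    sample = '1234  12:03:45.678901 openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3'
    print(parse_line(sample))
```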
kGym & kAgent
A Platform and Agent to patch Linux kernel bugs
Overview
Debugging the Linux kernel remains one of the most challenging problems in systems software. With over 20 million lines of code, thousands of contributors, and hardware-level concurrency, crashes can be subtle, nondeterministic, and cryptic to interpret. Traditional debugging techniques often fail to scale across this diversity and complexity. To address this, we built an agentic framework that leverages Large Language Models (LLMs) to reason about, reproduce, and fix kernel-level bugs. This ecosystem (kBench, kGym, and kAgent) integrates structured datasets, experimental automation, and intelligent reasoning to close the loop between bug observation and repair.

Components
- kBench: A curated benchmark of Linux kernel bugs, each paired with developer-provided fixes and deterministic reproduction scripts. Enables systematic evaluation of patching strategies.
- kGym: A sandboxed, large-scale kernel experimentation platform capable of booting and testing thousands of kernel configurations in parallel. Provides execution traces, crash states, and verification environments for patches.
- kAgent: An LLM-based autonomous agent that runs experiments in kGym, interprets crash logs, hypothesizes code changes, and iteratively validates patches until a verified fix is achieved.

Approach
The system adopts a hypothesis-driven debugging workflow:
1. Observe a kernel crash and extract structured traces and logs.
2. Generate hypotheses for potential fault locations using LLM reasoning.
3. Propose and apply plausible patches to the codebase.
4. Validate candidate patches within kGym until the crash is fully resolved.
Through iterative feedback between execution results and model reasoning, kAgent narrows its search space, reducing spurious edits and improving both precision and convergence speed.

Impact
This platform demonstrates the first end-to-end LLM-based repair loop for real Linux kernel failures. It shows that structured experimentation and domain-specific agent design can overcome limitations of general-purpose code models in system-level debugging. Key outcomes:
- Reproducing complex kernel bugs automatically and deterministically.
- Validating LLM-proposed patches at scale within kGym.
- Reducing incorrect edit rates via structured, state-aware search.
- Providing a reproducible benchmark for future research on agentic debugging.
Linux Kernel LLM Agents Automated Crash Repair Systems
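A hypothetical sketch of the observe-hypothesize-patch-validate loop described under Approach; the kGym interactions and the LLM step are placeholders, not the real kGym or kAgent interfaces.

```python
# Hypothetical sketch of kAgent's hypothesis-driven repair loop (placeholders
# stand in for the real kGym and LLM interfaces).
from typing import Optional

def reproduce_crash(bug_id: str) -> str:
    """Placeholder: boot the kernel in kGym and return the crash report/trace."""
    raise NotImplementedError

def propose_patch(crash_report: str, prior_attempts: list[str]) -> str:
    """Placeholder: LLM reads the crash report and emits a candidate diff."""
    raise NotImplementedError

def apply_and_test(bug_id: str, patch: str) -> bool:
    """Placeholder: apply the diff, rebuild, and rerun the reproducer in kGym."""
    raise NotImplementedError

def repair(bug_id: str, max_rounds: int = 5) -> Optional[str]:
    """Observe -> hypothesize -> patch -> validate, feeding failures back in."""
    attempts: list[str] = []
    crash = reproduce_crash(bug_id)
    for _ in range(max_rounds):
        patch = propose_patch(crash, attempts)
        if apply_and_test(bug_id, patch):
            return patch              # verified fix: the reproducer no longer crashes
        attempts.append(patch)        # failed attempt narrows the next hypothesis
    return None
```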
