Columbia DAPLab at ICML 2026

DAPLab affiliates have a strong presence at ICML 2026 (July 6–11, Seoul, South Korea), with papers across the main conference, posters, and workshops. The work spans agents, data systems, reliable AI, human-AI interaction, digital twins, model communication, and uncertainty.

The work reflects a core theme of the lab: building the data, systems, and interaction foundations needed for AI systems that can operate reliably in messy, real-world settings. It includes new benchmarks for exploratory question answering over a million-scale data lake and for live kernel crash resolution, along with work on adaptive querying, digital twin simulation, confidence calibration, and model-to-model communication without text.

Together, these papers show how DAPLab research connects machine learning advances to the infrastructure required to use AI safely and effectively: better environments for evaluating agents, better methods for eliciting and adapting to human or AI-generated information, better abstractions for cross-model communication, and better tools for understanding uncertainty.

Agents, Benchmarks, and Data Systems

Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All
Chenxi Huang, Alex Mathai, Feiyang Yu, Aleksandr Nogikh, Petros Maniatis, Franjo Ivancic, Eugene Wu, Kostis Kaffes, Junfeng Yang, Baishakhi Ray
Main ICML 2026

LLM training cutoffs are a fundamental barrier to evaluating agents on real-world software tasks: any benchmark frozen at a point in time can leak into the pretraining data of later models, inflating apparent performance. This paper introduces a live benchmark for kernel crash resolution that continuously draws from newly reported crashes, so evaluation stays fresh and avoids contamination. It tests whether agents can diagnose and resolve real, post-cutoff Linux kernel failures, a task that demands up-to-date systems knowledge, tool use, and multi-step reasoning.
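A minimal sketch of the contamination argument, not the benchmark's actual pipeline: only crashes reported after a model's training cutoff are eligible for evaluating that model. The `CrashReport` type, the crash IDs, and all dates below are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CrashReport:
    crash_id: str      # hypothetical identifier, for illustration only
    reported_on: date  # date the crash was first reported


def post_cutoff_crashes(reports, model_cutoff):
    """Keep only crashes reported strictly after the model's training cutoff."""
    return [r for r in reports if r.reported_on > model_cutoff]


# Made-up reports; a live benchmark would keep appending new ones.
reports = [
    CrashReport("crash-a", date(2025, 3, 1)),
    CrashReport("crash-b", date(2025, 11, 20)),
    CrashReport("crash-c", date(2026, 2, 5)),
]

eligible = post_cutoff_crashes(reports, model_cutoff=date(2025, 6, 30))
print([r.crash_id for r in eligible])  # ['crash-b', 'crash-c']
```

Because the eligible set is computed per model, a model with a later cutoff is simply evaluated on a later slice of crashes.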

LAKEQA: An Exploratory QA Benchmark over a Million-Scale Data Lake
Haonan Wang, Jiaxiang Liu, Yurong Liu, Austin Senna Wijaya, Tianle Zhou, Eden Wu, Yijia Chen, Wanting You, Reya Vir, Daniela Pinto Veizaga, Grace Fan, Yusen Zhang, Juliana Freire, Eugene Wu
Main ICML 2026

Deep research agents excel at synthesizing unstructured web content but largely ignore the structured, tabular datasets sitting in enterprise and public data lakes. LAKEQA is a benchmark for exploratory question answering over a million-scale data lake, requiring agents to discover relevant datasets, reason across heterogeneous sources, and produce answers with explicit provenance. It fills a critical gap between retrieval benchmarks and the analytic, enumeration-heavy questions that arise in real data-driven workflows.

Model Representations, Communication, and Efficient Adaptation

Fixed Universal Transformers
Jingwen Liu, Alexandr Andoni, Daniel Hsu
Mechanistic Interpretability Workshop at ICML 2026

This paper introduces universal transformers: fixed transformers that can simulate any transformer in a given class through a suitable input embedding, while keeping all internal parameters fixed. It provides explicit sparse constructions, shows that randomly initialized transformers are universal almost surely, and validates the theory on algorithmic tasks such as parenthesis balancing and multi-hop reasoning.

Latent Cache Flow: Model-to-Model Communication Without Text
Maximillian Rossi, Prajwal Raghunath, Eugene Wu
AdaptFM Workshop at ICML 2026

When AI systems chain multiple models together, inter-model communication almost always passes through natural language—a lossy, token-expensive bottleneck. Latent Cache Flow proposes a direct alternative: reusing KV cache representations to pass context between models without text serialization. This enables richer, more efficient model-to-model communication and opens new design space for compound AI systems and agent pipelines.

Adaptive Learning, Querying, Elicitation, and Decision Making

Adaptive Querying with AI Persona Priors
Kaizheng Wang, Yuhang Wu, Assaf Zeevi
Main ICML 2026; ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation

Designing adaptive surveys and queries requires knowing something about the distribution of respondents before any responses are collected. This paper introduces a framework that uses AI persona priors—model-generated beliefs about likely respondent types—to warm-start adaptive querying, reducing the number of questions needed to elicit accurate information from diverse populations.
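As a rough illustration of warm-starting adaptive querying from a prior, the sketch below keeps a belief over respondent types, picks the yes/no question with the largest expected information gain, and updates the belief with Bayes' rule. This is a generic Bayesian design loop under assumed numbers, not the paper's framework: the three persona types, the prior, the question names, and the per-type answer probabilities are all hypothetical stand-ins for what a model-generated persona prior might supply.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def posterior(prior, yes_prob, answer):
    """Bayes update over persona types after observing a yes/no answer."""
    likelihood = [q if answer else 1.0 - q for q in yes_prob]
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


def expected_info_gain(prior, yes_prob):
    """Expected reduction in entropy over types from asking one question."""
    p_yes = sum(p * q for p, q in zip(prior, yes_prob))
    gain = entropy(prior)
    for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            gain -= p_answer * entropy(posterior(prior, yes_prob, answer))
    return gain


# Hypothetical prior over three persona types (assumed, e.g. model-generated).
prior = [0.5, 0.3, 0.2]

# Hypothetical questions: probability that each persona type answers "yes".
questions = {
    "q_split": [0.9, 0.1, 0.1],  # separates type 0 from the others
    "q_flat": [0.5, 0.5, 0.5],   # every type answers the same way
}

best = max(questions, key=lambda name: expected_info_gain(prior, questions[name]))
updated = posterior(prior, questions[best], answer=True)
print(best, [round(x, 2) for x in updated])
```

A question that every type answers identically carries no information about the respondent, so the loop prefers the question that splits the prior mass; a sharper prior means fewer such questions are needed before the belief concentrates.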

Whom to Query for What: Adaptive Group Elicitation via Multi-Turn LLM Interactions
Ruomeng Ding, Tianwei Gao, Thomas P. Zollo, Eitan Bachmat, Richard Zemel, Zhun Deng
Main ICML 2026

Effectively gathering information from a group requires deciding not just what to ask, but whom to ask and when. This paper studies adaptive group elicitation via multi-turn LLM interactions, developing strategies that identify which group members hold the most decision-relevant information and route questions accordingly—reducing redundancy and improving the quality of collective decisions.

Few-Shot Design Optimization by Exploiting Auxiliary Information
Arjun Mani, Carl Vondrick, Richard Zemel
Main ICML 2026

Design optimization problems often come with limited labeled examples but abundant auxiliary signals. This paper shows how to exploit auxiliary information to dramatically reduce the sample complexity of few-shot design optimization, enabling effective search in settings where direct labeled data is scarce but related structure is available.

Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback
Ved Sriraman, Peihan Liu, Daniel Hsu, Adam Block
ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation

This paper studies imitation learning with noisy expert feedback, where the learner observes an imperfect version of an expert policy but aims to compete with the reward of a clean expert. It shows a sharp separation between offline and online imitation learning: offline learning from noisy trajectories can require sample complexity that grows exponentially with the horizon, while online interaction with the noisy expert through a variant of on-policy distillation can achieve polynomial horizon dependence.

Digital Twins and Simulation

SYN-DIGITS: A Synthetic Control Framework for Calibrated Digital Twin Simulation
Grace Jiarui Fan, Chengpiao Huang, Tianyi Peng, Kaizheng Wang, Yuhang Wu
ICML 2026 Workshop on Connecting Low-rank Representations in AI

Digital twins promise accurate simulation of real-world systems, but simulation fidelity depends critically on how well the twin is calibrated to observed data. SYN-DIGITS introduces a synthetic control framework that systematically calibrates digital twin simulations, combining synthetic data generation with statistical control to reduce the gap between simulated and real-world behavior.

AI Safety, Robustness, and Security

From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs
Guangyu Shen, Siyuan Cheng, Xiangzhe Xu, Yuan Zhou, Hanxi Guo, Zhuo Zhang, Xiangyu Zhang
Main ICML 2026

Backdoor attacks embed hidden triggers into model weights during training, causing models to misbehave on attacker-specified inputs while appearing normal otherwise. This paper introduces a new defense paradigm: rather than detecting and removing backdoors from the outside, it teaches models to recognize when they are being triggered—fostering self-awareness about poisoning and enabling models to flag or suppress compromised behavior at inference time.

Uncertainty, Calibration, and Reliable Decision Support

Estimating Tail Risks in Language Model Output Distributions
Rico Angell, Raghav Singhal, Zachary Horvitz, Zhou Yu, Rajesh Ranganath, Kathleen McKeown, He He
Main ICML 2026 Spotlight

Despite significant alignment advances, language models deployed at population scale will inevitably produce harmful outputs in rare but consequential tail cases. Current safety evaluations, which focus on input distributions rather than probabilistic tail behavior, fail to capture these cases. This paper addresses the gap with an importance-sampling-based method that estimates the probability of harmful outputs for any input query. It constructs unsafe model variants that make harmful outputs more probable, enabling sample-efficient tail risk measurement without brute-force sampling.
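The importance-sampling idea can be sketched on a toy two-outcome distribution. Samples are drawn from a proposal q under which the harmful outcome is common, and each harmful sample is reweighted by p(y)/q(y), so the average estimates the harmful probability under the target p. Everything below is an assumption for illustration: the outcome names, the probabilities, and the use of a fixed categorical proposal in place of an unsafe model variant. It is not the paper's estimator.

```python
import random


def importance_estimate(p, q, is_harmful, n_samples, rng):
    """Estimate P_p(harmful) as the mean of 1[harmful(y)] * p(y) / q(y), y ~ q."""
    outcomes = list(q)
    weights = [q[o] for o in outcomes]
    total = 0.0
    for y in rng.choices(outcomes, weights=weights, k=n_samples):
        if is_harmful(y):
            total += p[y] / q[y]  # reweight back to the target distribution
    return total / n_samples


# Hypothetical target model: harmful output is a 1-in-10,000 event.
p = {"benign": 0.9999, "harmful": 0.0001}
# Hypothetical proposal (stand-in for an "unsafe variant"): harmful is common.
q = {"benign": 0.8, "harmful": 0.2}

rng = random.Random(0)
estimate = importance_estimate(
    p, q, is_harmful=lambda y: y == "harmful", n_samples=20_000, rng=rng
)
print(f"estimated P(harmful) = {estimate:.2e}")  # close to the true value of 1e-4
```

With 20,000 draws directly from p, only about two harmful samples would be expected, which gives a very noisy estimate. Sampling from q yields thousands of harmful samples, and the reweighting keeps the estimate unbiased for p.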

Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation
Thomas P. Zollo, Jimmy Wang, Richard Zemel
ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems

Knowing when to trust a model’s output is essential for safe deployment. This paper develops an unsupervised method for calibrating the confidence of reasoning LLMs using only a single generation—no labeled data, no ensemble, no multiple samples. The approach makes calibration practical in settings where ground-truth labels or repeated sampling are unavailable.

Confidence Calibration in Vision-Language-Action Models
Thomas P. Zollo, Richard Zemel
ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems

Vision-language-action (VLA) models drive embodied agents by mapping visual observations and language instructions to motor actions. Poor confidence calibration in these models leads to overconfident, brittle behavior in deployment. This paper investigates calibration methods tailored to VLA models, working toward agents that know what they don’t know and act accordingly.
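To make "poor calibration" concrete, one common summary statistic is expected calibration error (ECE): predictions are grouped into confidence bins, and the gap between average confidence and empirical accuracy is averaged across bins, weighted by bin size. The sketch below is a standard binned ECE on made-up numbers. It is offered only as an illustration of what miscalibration means, and is not necessarily the metric or method used in the calibration papers above.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted mean of |avg confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Made-up example: an overconfident predictor (95% confident, 50% correct).
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, False, False, True]
)
# Made-up example: confidence matches accuracy (75% confident, 75% correct).
calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False]
)
print(round(overconfident, 2), round(calibrated, 2))  # 0.45 0.0
```

Note that computing ECE requires ground-truth correctness labels, so it serves as an evaluation tool rather than a calibration method.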

We are excited to see this work appear across ICML 2026 in Seoul and to continue developing the foundations for dependable AI systems. If you are at ICML and want to connect, reach out at daplab@cs.columbia.edu.