LakeAgent

Deep Research over Data Lakes

Posted: November 01, 2025

Tags: Agents

One of the most exciting LLM applications has been deep research agents that search the web and other unstructured sources to synthesize answers. Unfortunately, current deep research agents ignore the abundant tabular datasets in public and enterprise data lakes, which leaves them unable to answer analytic questions that require enumeration, aggregation, or causal reasoning. We propose LakeAgent, a system that builds the missing infrastructure for deep research agents to operate over terabytes of structured and unstructured data. Given a natural language question, the system automatically discovers relevant datasets, integrates heterogeneous sources, and produces verifiable answers with explicit provenance. For instance, estimating “How likely is Magnus Carlsen to win the next world championship cycle?” requires both structured data (match histories, opponent pools) and unstructured data (participation intent, interviews).
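To make the discover, integrate, and answer-with-provenance flow concrete, here is a minimal Python sketch over a toy in-memory "lake". It is an illustration only, not LakeAgent's implementation: the `Table`, `Answer`, `discover`, and `win_rate` names, the keyword-overlap ranking, and all rows are hypothetical, and integration is reduced to a union over tables that happen to share a schema.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    description: str
    rows: list  # list of dicts, one per record


@dataclass
class Answer:
    value: float
    provenance: list  # (table name, row index) pairs that support the value


def discover(question, lake):
    """Keep tables whose description shares at least one word with the question.

    Keyword overlap is a stand-in assumption for a real dataset-discovery step.
    """
    words = set(question.lower().replace("?", "").split())
    scored = [(len(words & set(t.description.lower().split())), t) for t in lake]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [t for score, t in ranked if score > 0]


def win_rate(player, tables):
    """Aggregate over all discovered tables and record every row that was used."""
    wins, games, provenance = 0, 0, []
    for t in tables:
        for i, row in enumerate(t.rows):
            if row.get("player") == player:
                games += 1
                wins += row["result"] == "win"
                provenance.append((t.name, i))
    value = wins / games if games else float("nan")
    return Answer(value, provenance)


# Made-up rows, for illustration only.
lake = [
    Table("classical_matches", "chess match results by player", [
        {"player": "player_a", "result": "win"},
        {"player": "player_a", "result": "draw"},
        {"player": "player_b", "result": "win"},
    ]),
    Table("rapid_matches", "rapid chess match results", [
        {"player": "player_a", "result": "win"},
    ]),
    Table("city_weather", "daily temperature readings by city", [
        {"city": "Oslo", "temp_c": 4},
    ]),
]

question = "What share of chess match games did player_a win?"
relevant = discover(question, lake)
answer = win_rate("player_a", relevant)
```

Here `relevant` contains the two match tables and skips `city_weather`, and `answer.provenance` lists the exact rows behind the aggregate, so a reader can check the number against its sources.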

Contributors

Haonan Wang*

Jiaxiang Liu*

Tianle Zhou

Eugene Wu

Publications

Suna: Scalable Causal Confounder Discovery over Relational Data

VLDB - 2025

DynoClass: A Dynamic Table-Class Detection System Without the Need for Predefined Ontologies

TRL@NeurIPS - 2024