LakeAgent
One of the most exciting LLM applications has been deep research agents, which search the web and other unstructured sources to synthesize answers. However, current deep research agents ignore the abundant tabular datasets in public and enterprise data lakes, leaving them unable to answer analytic questions that require enumeration, aggregation, or causal reasoning. We propose LakeAgent, a system that builds the missing infrastructure for deep research agents to operate over terabytes of structured and unstructured data. Given a natural language question, the system automatically discovers relevant datasets, integrates heterogeneous sources, and produces verifiable answers with explicit provenance. For instance, estimating “How likely is Magnus Carlsen to win the next world championship cycle?” requires both structured data (match histories, opponent pools) and unstructured data (participation intent, interviews).
Contributors
- Haonan Wang*
- Jiaxiang Liu*
- Tianle Zhou
- Eugene Wu
Publications
- Suna: Scalable Causal Confounder Discovery over Relational Data. VLDB, 2025.
- DynoClass: A Dynamic Table-Class Detection System Without the Need for Predefined Ontologies. TRL@NeurIPS, 2024.