LakeAgent

Deep Research over Data Lakes
Posted: November 01, 2025
Tags: Agents

One of the most exciting LLM applications has been deep research agents, which search the web and other unstructured sources to synthesize answers. Current deep research agents, however, ignore the abundant tabular datasets in public and enterprise data lakes, which leaves them unable to answer analytic questions that require enumeration, aggregation, or causal reasoning. We propose LakeAgent, a system that builds the missing infrastructure for deep research agents to operate over terabytes of structured and unstructured data. Given a natural language question, the system automatically discovers relevant datasets, integrates heterogeneous sources, and produces verifiable answers with explicit provenance. For instance, estimating “How likely is Magnus Carlsen to win the next world championship cycle?” requires both structured data (match histories, opponent pools) and unstructured data (participation intent, interviews).

Contributors

  • Haonan Wang*
  • Jiaxiang Liu*
  • Tianle Zhou
  • Eugene Wu

Publications

  • Suna: Scalable Causal Confounder Discovery over Relational Data
    VLDB - 2025
  • DynoClass: A Dynamic Table-Class Detection System Without the Need for Predefined Ontologies
    TRL@NeurIPS - 2024