Build an Agent That Thinks Like a Data Scientist: How We Hit #1 on DABStep with Reusable Tool Generation

Community Article Published March 13, 2026

The world of data is vast, but quantitative information is often sparse or unavailable as text online, which poses a significant challenge for deep research agents. This post presents NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer, an architecture for building autonomous data analysis agents, developed by the NVIDIA Kaggle Grandmasters (KGMON) LLM Agent Research Team. The project introduces an agent specialized for dataset exploration and analysis, designed to handle the complexities of multi-step reasoning, tool calling, and iterative data analysis. Notably, our approach establishes new state-of-the-art (SOTA) performance on the Data Agent Benchmark for Multi-step Reasoning (DABStep), ranking 1st with a 30x speedup over the Claude Code baseline.


Motivation: Bridging the Gap in Data Analysis

Deep research agents, especially those relying on internet text search, fall short when dealing with structured, tabular data that requires complex, multi-step queries.

Our core motivation is to create an agent that can:

  • Iterate faster on analysis through automatic code generation and execution.
  • Crack complex tabular questions with multi-step reasoning and tool use.
  • Make sense of large unstructured contexts using semantic search.
  • Stay oriented during experiments by generating and interpreting visualizations automatically.

NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer aims to deliver capabilities including automatic open-ended exploratory data analysis, tabular data Q&A, predictive modeling, and forecasting.

The NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer Architecture

In NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer, we implement different agent loops for different use cases. The architecture leverages the NVIDIA NeMo Agent Toolkit to drive these loops, utilizing tools designed specifically from a data scientist's perspective. For open-ended exploratory data analysis, the system pairs a ReAct agent with a Jupyter Notebook tool, allowing for continuous, bi-directional interaction. Alternatively, for multi-step rule-based tabular data QA, the architecture utilizes a Tool Calling Agent. This agent interacts with a distinct, multi-part suite of specialized tools to accomplish its structured tasks: a stateful Python interpreter, a retriever, and a file structure detector.
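The stateful Python interpreter mentioned above can be approximated with a persistent namespace shared across tool calls. This is a minimal illustrative sketch, not the toolkit's actual implementation, which also handles sandboxing, timeouts, and rich output capture:

```python
import io
import contextlib

class StatefulPythonInterpreter:
    """Minimal stateful Python tool: variables persist across calls.

    Illustrative sketch only; the real tool in the agent's suite is
    assumed to add sandboxing, timeouts, and richer output handling.
    """

    def __init__(self):
        self.namespace = {}  # shared across every run() call

    def run(self, code: str) -> str:
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, self.namespace)
        except Exception as exc:
            return f"Error: {exc!r}"
        return buf.getvalue()

# State carries over between tool calls, like cells in a notebook:
interp = StatefulPythonInterpreter()
interp.run("x = [1, 2, 3]")
print(interp.run("print(sum(x))"))  # prints "6"; x survived the first call
```

Statefulness matters because a multi-step QA agent loads data once and then issues many small follow-up computations against it.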

[Figure: Data Explorer agent loops and their tool suites]

Open-ended Exploration and Tabular Data QA

Currently the NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer focuses on two primary applications:

1. Open-ended Exploratory Data Analysis (EDA)

The figure below illustrates the architecture for open-ended exploratory data analysis driven by a ReAct Agent. The workflow begins with the user mounting a dataset and sending questions or instructions to the ReAct Agent, which translates these inputs into specific tool calls. These calls are sent to the Notebook Manipulation Tools, a suite capable of standard operations like creating notebooks, adding code, and running cells. Once the tools execute the commands, the raw output flows into the Tool Output Handler. A critical feature of this handler is its integration with a Vision-Language Model (VLM); if the tool output includes a visual plot, the handler sends it to the VLM to generate a textual description and suggestions for improving the plot's aesthetics and information richness. The handler then replaces the visual plot with this text-based analysis, sending the processed tool output back to the ReAct Agent so it can formulate an informed response to the user.
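The plot-to-text step of the Tool Output Handler can be sketched as follows. Here `describe_plot` stands in for the real Vision-Language Model call; its name and signature are assumptions made for this sketch:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ToolOutput:
    text: str
    plot_png: Optional[bytes] = None  # set when the cell produced a figure

def handle_tool_output(output: ToolOutput,
                       describe_plot: Callable[[bytes], str]) -> str:
    """Route any visual plot through a VLM and hand the agent text only.

    `describe_plot` is a stand-in for the actual VLM call; everything
    about its interface is hypothetical.
    """
    if output.plot_png is None:
        return output.text
    # Replace the image with the VLM's description and suggestions,
    # so the ReAct agent never sees raw image bytes.
    return f"{output.text}\n[Plot description] {describe_plot(output.plot_png)}"

# Stub VLM for demonstration:
fake_vlm = lambda png: "Bar chart of monthly fees; add axis labels for clarity."
result = handle_tool_output(ToolOutput("Cell executed.", b"\x89PNG"), fake_vlm)
```

Keeping the agent loop text-only lets the same ReAct policy reason over plots and tables uniformly.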

[Figure: Open-ended EDA workflow driven by the ReAct Agent]

2. Multi-Step Rule-based Tabular Data QA

This use case addresses hard questions that require multi-step reasoning and tool calling against a tabular dataset. We focus on the Data Agent Benchmark for Multi-step Reasoning (DABStep), which comprises 450 tasks focused on the financial payments sector. The benchmark is structured into three main components:

[Figure: DABStep benchmark components: context and query, tasks, and evaluation]

The Context & Query component includes questions and heterogeneous data sources (such as CSV and JSON files), alongside a markdown manual detailing domain logic and rules. The Benchmark Tasks component categorizes the workload into Easy Tasks (16%), which are basic single-dataset queries, and Hard Tasks (84%), which require complex, multi-step, tool-augmented reasoning. These hard tasks involve reading documentation, generating code (such as SQL or Pandas), and cross-referencing data to calculate an answer, where web search offers little to no help. Finally, the Evaluation phase measures success using an exact text match with strict formatting requirements, expecting JSONL output that includes both the agent_answer and the reasoning_trace.
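The submission format and scoring rule above can be sketched in a few lines. The `task_id` field name and the example values are assumptions; the `agent_answer`/`reasoning_trace` fields and exact-match scoring come from the benchmark description:

```python
import json

def format_record(task_id, agent_answer, reasoning_trace):
    """One DABStep-style JSONL submission line.

    The `task_id` field name is an assumption for this sketch; the
    other two fields are required by the benchmark.
    """
    return json.dumps({
        "task_id": task_id,
        "agent_answer": agent_answer,
        "reasoning_trace": reasoning_trace,
    })

def exact_match(agent_answer, gold):
    # Scoring is a strict exact text match, so "5.00" and "5.0" differ.
    return agent_answer.strip() == gold.strip()

line = format_record(7, "64", "Loaded payments data; counted unique merchants.")
```

The strict match is why answer formatting (rounding, list ordering, separators) matters as much as the computation itself.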

Cracking DABStep: A Multi-Phase Approach

To achieve state-of-the-art (SOTA) results on DABStep, we separate the heavy lifting from the fast execution. The system is split into three distinct phases: a Learning phase where the agent uses general skills and ground truth data to forge reusable, specialized tools; an Inference phase that applies these tools to solve new questions rapidly; and an Offline Reflection phase that reviews the outputs to generate deeper insights. This mimics how a human data scientist operates: spending significant effort upfront to build a robust toolkit so that future tasks become efficient and scalable.

[Figure: the three-phase approach: learning, inference, and offline reflection]

Phase 1: The Learning Loop

In the Learning phase, we deploy a heavyweight model (like Opus 4.5/4.6) in a multi-pass loop equipped with a full arsenal of tools, including a stateful Python interpreter, bash tools, and file structure detectors. By tackling a batch of representative tasks (e.g., Tasks 1 through 10) and validating them against ground truth answers, the agent builds a comprehensive mental model of the dataset. It then synthesizes the per-task Python scripts into one master solution, ultimately distilling it into a highly optimized library of reusable functions (helper.py) and a concise set of few-shot examples demonstrating how the helper functions solve the questions in the dev split (training set).
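The multi-pass validation loop can be sketched as below. `solve(task, feedback)` stands in for the heavyweight agent call, and the retry/feedback mechanics are illustrative assumptions, not the actual implementation:

```python
def learning_loop(tasks, solve, ground_truth, max_passes=3):
    """Multi-pass Learning phase sketch: solve each dev task, validate
    against ground truth, and keep the scripts that pass.

    `solve(task, feedback)` is a hypothetical stand-in for the
    heavyweight agent; it returns (script, answer).
    """
    validated = {}
    for task in tasks:
        feedback = None
        for _ in range(max_passes):
            script, answer = solve(task, feedback)
            if answer == ground_truth[task]:
                validated[task] = script  # later distilled into helper.py
                break
            # Failed attempts feed an error message back into the next pass.
            feedback = f"Expected {ground_truth[task]!r}, got {answer!r}"
    return validated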
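The multi-pass validation loop can be sketched as below. `solve(task, feedback)` stands in for the heavyweight agent call, and the retry/feedback mechanics are illustrative assumptions, not the actual implementation:

```python
def learning_loop(tasks, solve, ground_truth, max_passes=3):
    """Multi-pass Learning phase sketch: solve each dev task, validate
    against ground truth, and keep the scripts that pass.

    `solve(task, feedback)` is a hypothetical stand-in for the
    heavyweight agent; it returns (script, answer).
    """
    validated = {}
    for task in tasks:
        feedback = None
        for _ in range(max_passes):
            script, answer = solve(task, feedback)
            if answer == ground_truth[task]:
                validated[task] = script  # later distilled into helper.py
                break
            # Failed attempts feed an error message back into the next pass.
            feedback = f"Expected {ground_truth[task]!r}, got {answer!r}"
    return validated
```

Only validated scripts enter the distillation step, so helper.py is built exclusively from logic already proven against the dev split.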

[Figure: the Learning phase loop]

Recognizing Interconnected Tasks & Optimizing Sub-Solutions Across the Board

The core insight driving this approach is that complex data questions rarely exist in isolation. As shown in the merchant fee examples, different tasks often share the exact same foundational data operations. For instance, computing a specific transaction fee for a specific month (Task 2) requires the exact same initial steps—fetching merchant info and finding fee data—as simply listing the applicable fee IDs (Task 1). Recognizing and mapping this overlap is the key to building a modular, DRY (Don't Repeat Yourself) system.
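The overlap can be made concrete with toy data. The merchant and fee schema below is purely illustrative (the real DABStep data is richer), but it shows how Task 1 and Task 2 share the same foundational steps:

```python
# Toy tables standing in for the merchant and fee data; the names,
# fields, and fee rule below are illustrative assumptions only.
MERCHANTS = {"Crossfit_Hanna": {"mcc": 7997, "account_type": "F"}}
FEES = [
    {"id": 17, "mcc": 7997, "rate": 0.019},
    {"id": 29, "mcc": 7997, "rate": 0.024},
    {"id": 36, "mcc": 5411, "rate": 0.011},
]

# Shared foundational steps, used by both tasks:
def get_merchant_info(name):
    return MERCHANTS[name]

def applicable_fees(merchant):
    return [f for f in FEES if f["mcc"] == merchant["mcc"]]

# Task 1: list the applicable fee IDs.
def task1(name):
    return sorted(f["id"] for f in applicable_fees(get_merchant_info(name)))

# Task 2: compute a fee amount; it reuses the exact same prefix, then
# applies one extra step (lowest-rate rule chosen just for illustration).
def task2(name, volume):
    fees = applicable_fees(get_merchant_info(name))
    return round(volume * min(f["rate"] for f in fees), 2)
```

Once `get_merchant_info` and `applicable_fees` exist, every new fee question reduces to a one- or two-line composition of them.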

[Figure: overlapping sub-steps across the merchant fee tasks]

Instead of writing isolated, brittle scripts for every new question, the agent actively searches for the most robust logic. If "Version 1" of a function works perfectly for Task 1 but fails when applied to the slightly different constraints of Task 2, the agent recognizes the flaw. By actively testing candidate functions via the Python interpreter against the ground truth of multiple interconnected tasks, the agent iteratively discovers a "Version 2" that successfully generalizes across the entire batch.
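The selection step can be sketched as a simple tournament over candidate implementations. The candidates and test cases here are hypothetical; the point is that a crash or a wrong answer on any interconnected task disqualifies a version:

```python
def passes_all(fn, test_cases):
    for args, expected in test_cases:
        try:
            if fn(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failed case
    return True

def pick_generalizing_version(candidates, test_cases):
    """Keep the first candidate that passes every interconnected task's
    ground truth. A sketch of the selection idea, not the agent's code."""
    for name, fn in candidates.items():
        if passes_all(fn, test_cases):
            return name
    return None

# "Version 1" works for Task 1 but crashes on Task 2's edge case;
# "Version 2" generalizes across both.
v1 = lambda rows: rows[0]
v2 = lambda rows: rows[0] if rows else None
cases = [((["fee_17"],), "fee_17"), (([],), None)]
best = pick_generalizing_version({"v1": v1, "v2": v2}, cases)
```

In the real loop the "test harness" is the stateful Python interpreter executing candidates against the dev split's ground truth answers.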

Refactoring and Packaging

[Figure: refactoring independent task scripts into the helper.py library]

Once the optimal, generalized logic is found, the agent refactors the bulky independent scripts into a clean, unified architecture. The complex data extraction and computation steps are packaged into the centralized helper.py library. Consequently, the actual code needed to answer any specific question shrinks dramatically. The final task solutions transform from long, complex scripts into lightweight instructions that simply import and execute the right tools from the helper library.

Phase 2: Fast and Lean Inference

[Figure: the lean single-pass Inference phase]

With the foundational code written, the Inference phase shifts to a smaller, faster model (like Haiku 4.5) running a single-pass loop. Because the complex domain logic is already securely housed in helper.py, the inference agent only needs a basic Python interpreter to do its job. To keep token costs and latency to an absolute minimum, the context window is aggressively pruned: the agent is fed only the function signatures (not the underlying code) alongside a streamlined system prompt, allowing it to efficiently orchestrate the pre-built tools to solve unseen tasks.
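The signature-only context can be generated mechanically from the helper library. This sketch uses Python's `inspect` module; the helper function shown is a hypothetical stand-in for helper.py contents:

```python
import inspect

def helper_signatures(namespace):
    """Build the lean inference context: function signatures plus the
    first docstring line, but never the bodies. Sketch of the idea only."""
    lines = []
    for name, obj in namespace.items():
        if inspect.isfunction(obj) and not name.startswith("_"):
            doc = (inspect.getdoc(obj) or "").splitlines()
            summary = doc[0] if doc else ""
            lines.append(f"{name}{inspect.signature(obj)}  # {summary}")
    return "\n".join(lines)

# Hypothetical stand-in for a helper.py function:
def applicable_fees(merchant: str, month: int) -> list:
    """Return fee records applicable to a merchant in a given month."""
    ...

context = helper_signatures({"applicable_fees": applicable_fees})
```

Feeding signatures instead of bodies keeps the prompt small while still telling the lightweight model exactly which tools it can call and how.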

Phase 3: Unsupervised Offline Reflection

[Figure: the unsupervised Offline Reflection phase]

To ensure high quality without bottlenecking the live inference loop, we move critical quality control entirely offline. This phase relies on two powerful LLM evaluation techniques—reflection and group-consistency—driven by a heavyweight model (like Opus or Sonnet 4.6) acting as an unsupervised reviewer.

Reflection is the process where the model looks back at the agent's generated code and reasoning to audit its performance. It asks the tough questions: Did the agent effectively utilize the helper.py library? Did it follow the prompt faithfully? Are there any obvious mistakes in the code?

Group-consistency, on the other hand, involves analyzing multiple candidate solutions across groups of similar test questions to ensure the agent's logic remains stable. If the agent solves the exact same type of question using conflicting methods, the offline model flags the discrepancy and reasons through which approach is actually correct. By moving these computationally heavy checks offline, we can deeply analyze the data without sacrificing the speed of the Inference phase.
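A minimal version of the group-consistency check groups solved tasks by question template and flags any template answered with more than one distinct method. The grouping key and field names here are assumptions for this sketch:

```python
from collections import defaultdict

def group_consistency_flags(solutions):
    """Flag groups of similar questions solved with conflicting methods.

    `solutions` maps task_id -> (question_template, method_used); both
    field names are assumptions made for this sketch.
    """
    by_template = defaultdict(set)
    for task_id, (template, method) in solutions.items():
        by_template[template].add(method)
    # A template solved via more than one distinct method is suspicious
    # and gets escalated to the offline reviewer model.
    return {t for t, methods in by_template.items() if len(methods) > 1}

flags = group_consistency_flags({
    "q1": ("fee_for_month", "applicable_fees+min"),
    "q2": ("fee_for_month", "manual_pandas_filter"),  # conflicts with q1
    "q3": ("count_merchants", "unique_count"),
})
```

Flagged groups are then handed to the heavyweight reviewer, which reasons through which of the conflicting approaches is actually correct.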

Closing the Loop: Injecting Insights for Faster Inference

The insights generated during this offline reflection aren't just for analytics—they are actively fed back into the architecture to close the learning loop. By extracting key patterns, edge cases, and potential pitfalls from the test data, the heavy model compiles these learnings and injects them directly into the system prompt for future Inference phases. Because the lightweight inference agent already holds these pre-calculated insights in its starting prompt, we completely eliminate the need for slow, computationally expensive online reflection or consistency checks. The result is an Inference phase that remains blazingly fast and token-efficient, while continuously compounding its accuracy with every offline review.
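Injecting the learnings is then just prompt composition. The base prompt text and insight strings below are invented for illustration; only the mechanism (folding offline insights into the next run's system prompt) comes from the architecture:

```python
BASE_PROMPT = "You answer DABStep questions using only helper.py functions."

def build_system_prompt(base, insights):
    """Fold offline-reflection insights into the next inference run's
    system prompt, so no online reflection is needed. Sketch only."""
    if not insights:
        return base
    bullet_list = "\n".join(f"- {tip}" for tip in insights)
    return f"{base}\n\nLessons from offline review:\n{bullet_list}"

prompt = build_system_prompt(BASE_PROMPT, [
    "Round monetary answers to 2 decimals.",
    "Fees with a null account_type apply to all account types.",
])
```

Because the insights are precomputed, their cost is amortized across every future inference call instead of being paid per question.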

Results

| System | Easy | Hard | Time/Task | Code Length (chars) |
|---|---|---|---|---|
| NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer + Haiku 4.5 | 87.5 | 89.95 | 20 s | 1,870 |
| Claude Code + Opus 4.5 | 90.2 | 66.93 | 10 min | 5,011 |
| DataPilot from AntGroup | 86.11 | 87.57 | unknown | unknown |
| DS-STAR from Google AI | 87.5 | 45.24 | unknown | unknown |

To validate this architecture, we benchmarked our three-phase NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer approach (using the lightweight Haiku 4.5 for inference) against a standard baseline using Claude Code with the heavyweight Opus 4.5, which attempts to solve every task from scratch. The results highlight the massive efficiency gains of our methodology. Because our inference agent relies on the pre-built helper.py library, it solves tasks at blazing speed, taking only 20 seconds per task and generating only 1,870 characters of code. In stark contrast, the from-scratch approach takes a painstaking 10 minutes per task and bloats the code length to 5,011 characters. Most impressively, this 30x speedup doesn't compromise complex reasoning. While the heavy Opus model slightly edged us out on Easy tasks (90.2 vs. 87.5), our approach completely dominated the Hard tasks, scoring 89.95 compared to the baseline's 66.93. This shows that investing time in upfront learning and code abstraction allows even smaller, faster models to outperform heavier models on complex, multi-step problems.

This performance secured our architecture 1st place on the official DABStep leaderboard. The NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer approach significantly outperformed AntGroup's DataPilot and Google AI's DS-STAR on complex problems. With a score of 89.95 on Hard tasks, our system surpassed DataPilot (87.57) and nearly doubled DS-STAR's score (45.24). Given that 84% of the benchmark consists of hard-level tasks, our lead in this category directly secures our position as the best overall solution. These results establish our three-phase methodology as the current state-of-the-art for both efficient and rigorous tabular reasoning.

Conclusion: A New Paradigm for Data-Intensive Research

Building on top of NVIDIA NeMo Agent Toolkit, the Data Explorer agent represents a significant step forward in automated data analysis for structured tabular data. By employing flexible agent loops—a ReAct loop for open-ended exploratory data analysis and a multi-phase system for rule-based tabular QA—the agent is uniquely positioned to handle complex, multi-step reasoning tasks. The success of the multi-phase approach on the challenging DABStep benchmark, particularly the proactive learning loop that generates reusable, generalized functions, validates the strategy of separating foundational knowledge building from rapid inference. Data Explorer moves beyond simple query-answering to embody the operational workflow of a seasoned data scientist, delivering scalable, high-quality insights and establishing a new paradigm for data-intensive research driven by LLM-powered agents.

Ready to build your own data exploration agent? Get started with NVIDIA Launchable. Examples will be released soon!
