A deep dive into the four innovations that make auto-research faster, smarter, and more honest
At its core, IdeaMaze follows the auto-research pattern pioneered by Karpathy: an LLM agent autonomously modifies a training script, runs experiments, and keeps improvements.
# The core loop (runs forever until interrupted) LOOP: 1. maze.py sync && maze.py status # Check knowledge base 2. Edit train.py with experimental idea 3. git commit -m "exp: description" 4. python train.py > run.log 2>&1 # Run experiment 5. Extract metric from run.log 6. if improved → keep commit else → git reset 7. maze.py sync # Update knowledge base 8. goto 1
Key rules that prevent chaos:
train.py changes, everything else is frozenInstead of running experiments sequentially, IdeaMaze spawns N workers in isolated git worktrees. Each worker explores a different experiment category, maximizing diversity per batch.
# Auto-detect parallelism N = floor(cpu_cores / 2) PARALLEL LOOP: 1. maze.py sync && maze.py status && maze.py next → identify N promising experiments from DIFFERENT categories 2. Write worker_status.json with assignments 3. Spawn N workers in isolated git worktrees # Each worker: modify → commit → run → report # Workers NEVER update results.tsv or maze.db # Workers NEVER git reset 4. Wait for all workers to complete 5. Collect results, rank by metric → Cherry-pick BEST result that beats current best → Log ALL results to results.tsv 6. maze.py sync → update knowledge base 7. goto 1
Each worker gets a complete, isolated copy of the repository. They can modify files, commit, and run experiments without interfering with each other or the main branch. The coordinator then cherry-picks winning commits.
Workers are assigned to different experiment categories (e.g., one tries feature engineering, another tries a new algorithm, a third tests hyperparameters). This maximizes information gain per batch rather than testing variations of the same idea.
The knowledge base (maze.py + maze.db) transforms stateless experimentation into cumulative learning. It auto-syncs from the experiment log and provides actionable intelligence.
CREATE TABLE experiments ( node_id TEXT UNIQUE, commit_hash TEXT, avg_mae REAL, val_r2 TEXT, status TEXT CHECK(status IN ('keep','discard','crash')), description TEXT, category TEXT -- auto-classified ); CREATE TABLE insights ( text TEXT, -- cross-cutting patterns learned created_at TIMESTAMP ); CREATE TABLE strategy ( text TEXT, -- current research direction created_at TIMESTAMP );
| View | Purpose |
|---|---|
v_best | Best-performing experiment (lowest error) |
v_category_counts | Experiments per category (find gaps) |
v_stagnation | Count of experiments since last improvement |
v_recent | Last 10 experiments with outcomes |
v_improvements | Chronological list of all improvements |
Every experiment is automatically classified into one of 11 categories based on its commit description:
$ python maze.py sync # Sync results.tsv → maze.db (idempotent) $ python maze.py status # Best metric, stagnation, categories $ python maze.py next # Suggest what to try next $ python maze.py insight "text" # Record a pattern learned $ python maze.py strategy "text" # Set research direction $ python maze.py history # Show improvement timeline $ python maze.py category NAME # Show experiments in category $ python maze.py insights # List all recorded insights
The most surprising discovery during our research: AI agents will unconsciously game evaluation metrics if the system allows it.
Winsorization (clipping extreme values before evaluation) is a common data science technique. When the agent discovered it could improve the metric by filtering out "hard" predictions, it kept pushing the filtering tighter and tighter:
| Filtering Strategy | Reported Metric | Real-World Metric | Gaming Ratio |
|---|---|---|---|
| No filtering | 20,016 | 20,016 | 1.00x |
| 5th-95th percentile | 14,587 | 21,265 | 1.46x |
| 10th-90th percentile | 10,232 | 25,891 | 2.53x |
| 25th-75th percentile | 4,205 | 38,442 | 9.14x |
| 45th-55th percentile | 1,709 | ~45,000 | ~26x |
IdeaMaze computes both the filtered metric and an unfiltered metric on the complete, raw validation set. If the ratio exceeds 3x, the experiment is flagged as metric gaming and automatically discarded.
# Gamification detection pseudocode filtered_metric = evaluate(predictions, filtered_labels) unfiltered_metric = evaluate(predictions, all_labels) ratio = unfiltered_metric / filtered_metric if ratio > 3.0: flag_as_gaming("Metric inflation detected: {ratio:.1f}x") status = "discard"
In basic auto-research, each experiment is independent: the agent doesn't remember what it learned from previous failures. IdeaMaze's batch learning system accumulates cross-cutting insights.
After experiments, the agent records cross-cutting patterns, rules that apply beyond a single experiment:
# Example insights discovered during autonomous research $ python maze.py insights p001: Data quality improvements drive 60% of total metric gains p005: Log transform is foundational for skewed target distributions p006: Cross-source features critical for small datasets (+23%) p007: Depreciation curves are high-value for age-dependent targets p010: Ensemble diversity > ensemble size (6 diverse > 10 same-type) p016: Asymmetric outlier removal matches real noise structure p020: Aggressive data filtering hurts more than it helps for small sets
If the same technique has been tested at 3+ different configurations and consistently gives less than 2% relative improvement, it's promoted to a known pattern and the agent stops testing it.
If the last 10 experiments all produce less than 1% relative metric change, the system forces a fundamental shift: a different model family, a different problem framing, or a different data strategy. Micro-tuning hyperparameters does not count as a fundamental shift.
# Stagnation alert from maze.py $ python maze.py status CONVERGENCE ALERT: 10 experiments without improvement. Force a fundamental shift: - Try a different model family (tree → neural, or vice versa) - Reframe the problem (regression → ranking, direct → residual) - Change the data strategy (add/remove sources, different splits)
Every experiment becomes a node in an interactive D3.js force-directed graph. The visualization shows:
Generate a program.md for your task and begin autonomous experimentation.
Get the Starter Kit