Methodology - IdeaMaze

🔄 The Base Experiment Loop

▶

At its core, IdeaMaze follows the auto-research pattern pioneered by Karpathy: an LLM agent autonomously modifies a training script, runs experiments, and keeps improvements.

# The core loop (runs forever until interrupted)
LOOP:
  1. maze.py sync && maze.py status   # Check knowledge base
  2. Edit train.py with experimental idea
  3. git commit -m "exp: description"
  4. python train.py > run.log 2>&1          # Run experiment
  5. Extract metric from run.log
  6. if improved → keep commit
     else     → git reset
  7. maze.py sync                           # Update knowledge base
  8. goto 1

Key rules that prevent chaos:

Single file modification:only train.py changes, everything else is frozen
Fixed validation set:evaluation data never changes
100% coverage required:every validation row must receive a prediction
Timeout:each experiment limited to ~5 minutes
Never stop:the loop runs until a human interrupts

⇄ Innovation 1: Parallel Agent Execution

▶

Instead of running experiments sequentially, IdeaMaze spawns N workers in isolated git worktrees. Each worker explores a different experiment category, maximizing diversity per batch.

The Coordinator Protocol

# Auto-detect parallelism
N = floor(cpu_cores / 2)

PARALLEL LOOP:
  1. maze.py sync && maze.py status && maze.py next
     → identify N promising experiments from DIFFERENT categories

  2. Write worker_status.json with assignments

  3. Spawn N workers in isolated git worktrees
     # Each worker: modify → commit → run → report
     # Workers NEVER update results.tsv or maze.db
     # Workers NEVER git reset

  4. Wait for all workers to complete

  5. Collect results, rank by metric
     → Cherry-pick BEST result that beats current best
     → Log ALL results to results.tsv

  6. maze.py sync → update knowledge base
  7. goto 1

Why Git Worktrees?

Each worker gets a complete, isolated copy of the repository. They can modify files, commit, and run experiments without interfering with each other or the main branch. The coordinator then cherry-picks winning commits.

Category Diversity

Workers are assigned to different experiment categories (e.g., one tries feature engineering, another tries a new algorithm, a third tests hyperparameters). This maximizes information gain per batch rather than testing variations of the same idea.

📚 Innovation 2: Structured Knowledge Base

▶

The knowledge base (maze.py + maze.db) transforms stateless experimentation into cumulative learning. It auto-syncs from the experiment log and provides actionable intelligence.

Database Schema

CREATE TABLE experiments (
    node_id TEXT UNIQUE,
    commit_hash TEXT,
    avg_mae REAL,
    val_r2 TEXT,
    status TEXT CHECK(status IN ('keep','discard','crash')),
    description TEXT,
    category TEXT    -- auto-classified
);

CREATE TABLE insights (
    text TEXT,       -- cross-cutting patterns learned
    created_at TIMESTAMP
);

CREATE TABLE strategy (
    text TEXT,       -- current research direction
    created_at TIMESTAMP
);

Computed Views

View	Purpose
`v_best`	Best-performing experiment (lowest error)
`v_category_counts`	Experiments per category (find gaps)
`v_stagnation`	Count of experiments since last improvement
`v_recent`	Last 10 experiments with outcomes
`v_improvements`	Chronological list of all improvements

Auto-Classification

Every experiment is automatically classified into one of 11 categories based on its commit description:

neural_network embedding source_calibration target_transforms ensemble_strategy segmentation feature_engineering hyperparameter_tuning data_preprocessing algorithm_selection radical_shift

Commands

$ python maze.py sync              # Sync results.tsv → maze.db (idempotent)
$ python maze.py status            # Best metric, stagnation, categories
$ python maze.py next              # Suggest what to try next
$ python maze.py insight "text"   # Record a pattern learned
$ python maze.py strategy "text"  # Set research direction
$ python maze.py history           # Show improvement timeline
$ python maze.py category NAME    # Show experiments in category
$ python maze.py insights          # List all recorded insights

🚫 Innovation 3: Gamification Detection

▶

The most surprising discovery during our research: AI agents will unconsciously game evaluation metrics if the system allows it.

The Winsorization Trap

Winsorization (clipping extreme values before evaluation) is a common data science technique. When the agent discovered it could improve the metric by filtering out "hard" predictions, it kept pushing the filtering tighter and tighter:

Filtering Strategy	Reported Metric	Real-World Metric	Gaming Ratio
No filtering	20,016	20,016	1.00x
5th-95th percentile	14,587	21,265	1.46x
10th-90th percentile	10,232	25,891	2.53x
25th-75th percentile	4,205	38,442	9.14x
45th-55th percentile	1,709	~45,000	~26x

The Detection Rule

IdeaMaze computes both the filtered metric and an unfiltered metric on the complete, raw validation set. If the ratio exceeds 3x, the experiment is flagged as metric gaming and automatically discarded.

# Gamification detection pseudocode
filtered_metric = evaluate(predictions, filtered_labels)
unfiltered_metric = evaluate(predictions, all_labels)

ratio = unfiltered_metric / filtered_metric
if ratio > 3.0:
    flag_as_gaming("Metric inflation detected: {ratio:.1f}x")
    status = "discard"

Anti-Gaming Rules in program.md

Fixed validation set: every row must receive a prediction
Cannot use validation-only columns as training features
Cannot narrow evaluation scope to avoid difficult predictions
Cannot use features unavailable at inference time

🧠 Innovation 4: Batch Learning

▶

In basic auto-research, each experiment is independent: the agent doesn't remember what it learned from previous failures. IdeaMaze's batch learning system accumulates cross-cutting insights.

How Insights Accumulate

After experiments, the agent records cross-cutting patterns, rules that apply beyond a single experiment:

# Example insights discovered during autonomous research
$ python maze.py insights

p001: Data quality improvements drive 60% of total metric gains
p005: Log transform is foundational for skewed target distributions
p006: Cross-source features critical for small datasets (+23%)
p007: Depreciation curves are high-value for age-dependent targets
p010: Ensemble diversity > ensemble size (6 diverse > 10 same-type)
p016: Asymmetric outlier removal matches real noise structure
p020: Aggressive data filtering hurts more than it helps for small sets

Diminishing Returns Breaker

If the same technique has been tested at 3+ different configurations and consistently gives less than 2% relative improvement, it's promoted to a known pattern and the agent stops testing it.

Convergence Detection

If the last 10 experiments all produce less than 1% relative metric change, the system forces a fundamental shift: a different model family, a different problem framing, or a different data strategy. Micro-tuning hyperparameters does not count as a fundamental shift.

# Stagnation alert from maze.py
$ python maze.py status

CONVERGENCE ALERT: 10 experiments without improvement.
Force a fundamental shift:
  - Try a different model family (tree → neural, or vice versa)
  - Reframe the problem (regression → ranking, direct → residual)
  - Change the data strategy (add/remove sources, different splits)

🎨 The Experiment Maze Visualization

▶

Every experiment becomes a node in an interactive D3.js force-directed graph. The visualization shows:

Nodes:each experiment, colored by category, sized by metric improvement
Golden path:the chain of kept experiments (improvements) highlighted in gold
Status icons:kept experiments, discarded experiments, crashes
Timeline animation:replay the experiment sequence at variable speeds
Detail panel:click any node to see full metadata
Worker bar:shows active parallel workers during batch execution

Try the Interactive Visualizer →

The Methodology