Research → Production

The first model trained to
orchestrate, not just code

Every AI tool asks “What code should I write?” BrainstormLLM asks “How should this feature be built?” — and routes each phase to the right model at the right cost.

Trained on real sessions from Brainstorm CLI. Deployed inside BrainstormRouter. Powers every platform in the portfolio.

0.796

Mean F1 Score

2,203

Training Trajectories

<2ms

ONNX Inference

68%

Cost Reduction

0.587

Baseline (Plan Cache)

/ The Insight(01)

Not every task needs every phase

A simple bug fix doesn't need architecture or documentation. A new feature needs all 9 phases. A refactor skips specification but needs verification. Every AI coding tool today treats all tasks the same — send the full prompt to one model and hope for the best.

BrainstormLLM predicts which phases a task requires, in what order, and which model should handle each one. The result: 68% cost reduction compared to running every task through the full pipeline with a quality model.

The prediction runs in under 2 milliseconds (ONNX) inside BrainstormRouter's hot path. By the time the first model receives the prompt, the optimal pipeline is already planned.
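Turning nine per-phase probabilities into an ordered plan can be as simple as thresholding in canonical pipeline order. A minimal sketch, assuming a 0.5 cutoff and illustrative names (`plan_from_probs` is not the shipped implementation):

```python
# Canonical pipeline order; the predictor emits one inclusion
# probability per phase (names and threshold are illustrative).
PHASES = ["specification", "architecture", "implementation", "review",
          "verification", "refactoring", "deployment", "documentation",
          "reporting"]

def plan_from_probs(probs: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Keep phases whose predicted probability clears the threshold,
    preserving canonical pipeline order."""
    return [p for p in PHASES if probs.get(p, 0.0) >= threshold]

# A simple bug fix: implementation + verification only.
bugfix = {"implementation": 0.97, "verification": 0.88, "review": 0.31}
print(plan_from_probs(bugfix))  # → ['implementation', 'verification']
```

Because the plan is pure post-processing over the model's outputs, it adds effectively nothing to the sub-2ms inference budget.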

Predicted phase plans

Task | Cost
Fix the null pointer in auth.ts | $0.003
Add JWT middleware to all API routes | $0.008
Design the notification system | $0.02
Refactor the database layer to use connection pooling | $0.015
Build the complete checkout flow with Stripe | $0.05

Phase columns (spe, arc, imp, rev, ver, ref, dep, doc, rep): amber = phase included, gray = skipped. Cost = estimated per task.

/ The 9-Phase Pipeline(02)
01

Specification

Define requirements, constraints, acceptance criteria

e.g. New feature, API contract

02

Architecture

Design system structure, interfaces, data flow

e.g. New service, major refactor

03

Implementation

Write the code, create files, modify existing

e.g. Every coding task

04

Review

Check for correctness, security, patterns

e.g. After implementation

05

Verification

Run tests, build, type-check

e.g. After every change

06

Refactoring

Improve code quality without changing behavior

e.g. Tech debt, cleanup

07

Deployment

Ship to staging/production

e.g. Feature complete

08

Documentation

Update docs, README, API reference

e.g. Public API changes

09

Reporting

Summarize what was done, outcomes, metrics

e.g. End of session

/ How It Works(03)

Training Data

2,203 real development trajectories from three sources — not synthetic data. Every trajectory records which phases were executed, their order, the models used, cost, and whether the outcome succeeded.

800+

Orchestration Pipelines

9-phase runs captured from Brainstorm CLI production sessions

233 sessions

Claude Code Sessions

90K messages from real development work across 6 projects

400K

RouterBench Dataset

Routing decision datapoints from HuggingFace for cold-start calibration
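Per the description above, each trajectory records phases in order, the models used, cost, and outcome. A hypothetical schema sketch (field names are assumptions, not the actual storage format):

```python
from dataclasses import dataclass, field

@dataclass
class PhaseRun:
    phase: str        # e.g. "implementation"
    model: str        # which model handled this phase
    cost_usd: float   # spend attributed to this phase

@dataclass
class Trajectory:
    task: str                                             # original prompt
    phases: list[PhaseRun] = field(default_factory=list)  # execution order
    succeeded: bool = False                               # outcome label

    @property
    def total_cost(self) -> float:
        return sum(p.cost_usd for p in self.phases)

t = Trajectory(task="Fix the null pointer in auth.ts",
               phases=[PhaseRun("implementation", "fast-model", 0.002),
                       PhaseRun("verification", "fast-model", 0.001)],
               succeeded=True)
print(round(t.total_cost, 3))  # → 0.003
```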

Architecture

Not a transformer. Not a fine-tuned LLM. Nine per-phase gradient-boosted machines (GBMs), each conditioned on the outcomes of prior phases: phase 3's model sees whether phases 1 and 2 were included. This sequential dependency is what makes the predictions accurate.
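The conditioning amounts to feature augmentation: each phase's classifier receives the task features plus the 0/1 decisions already made for earlier phases. A toy illustration with stub classifiers standing in for the trained GBMs (structure only; the real feature set and models differ):

```python
PHASES = ["specification", "architecture", "implementation", "review",
          "verification", "refactoring", "deployment", "documentation",
          "reporting"]

def predict_plan(task_features: list[float], classifiers: dict) -> list[int]:
    """Run the nine per-phase classifiers in order; each one sees the
    task features plus the decisions of every earlier phase."""
    decisions: list[int] = []
    for phase in PHASES:
        x = task_features + decisions          # augment with prior outcomes
        decisions.append(int(classifiers[phase](x)))
    return decisions

# Stubs: implementation and verification always run; review is
# conditioned on a prior outcome (the implementation decision, which
# is the last element appended before review's turn).
stubs = {p: (lambda x: 0) for p in PHASES}
stubs["implementation"] = lambda x: 1
stubs["verification"] = lambda x: 1
stubs["review"] = lambda x: x[-1]

plan = predict_plan([0.2, 0.7], stubs)
print(plan)  # → [0, 0, 1, 1, 1, 0, 0, 0, 0]
```

Review fires here only because implementation did, which is exactly the dependency a flat per-phase classifier cannot express.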

Baseline

Plan cache — keyword template matching

0.587 F1

Sequential GBMs

9 models, each conditioned on prior phases (+36% over baseline)

0.796 F1

Inference

ONNX export, runs in BrainstormRouter hot path

<2ms

Kill Gates

Pass/fail criteria stop training early on failed experiments

Per-phase
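A per-phase kill gate is just a pass/fail check against a metric floor: if any phase's score falls below its floor, the experiment stops instead of burning more training budget. A sketch with hypothetical F1 floors (the actual gate criteria are not specified here):

```python
# Hypothetical per-phase F1 floors for a training experiment.
KILL_GATES = {"implementation": 0.85, "verification": 0.75, "review": 0.60}

def passes_gates(per_phase_f1: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (ok, failed_phases); training halts early if ok is False."""
    failed = [p for p, floor in KILL_GATES.items()
              if per_phase_f1.get(p, 0.0) < floor]
    return (not failed, failed)

ok, failed = passes_gates({"implementation": 0.91,
                           "verification": 0.70, "review": 0.66})
print(ok, failed)  # → False ['verification']
```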

/ The Flywheel(04)

Every session trains the next generation

This is not a static model. Every task run through Brainstorm CLI generates a trajectory. Every trajectory flows to BrainstormRouter. The Router aggregates outcomes and periodically retrains the predictor.

The production platforms — BrainstormMSP (37 agents), Brainstorm-GTM (70 agents), Peer10, OurBookNook, FinishStrong, BrainstormEvent — all generate real-world trajectories that validate and improve predictions.

More users → more trajectories → better predictions → lower costs → more users. The flywheel compounds.

01

Developer uses CLI

Prompt classified, routed, executed. Full session trajectory captured.

02

Trajectory → Router

Which phases ran, which models, what cost, did it succeed?

03

Router aggregates

Per-task-type × model performance tracked. Thompson sampling updates.

04

LLM retrains

GBMs retrained on accumulated trajectories. New ONNX exported.

05

Predictions improve

Next task predicted faster, cheaper, more accurately. <2ms overhead.

06

Cycle repeats

Better predictions → better routing → better outcomes → better training data.
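Step 03's Thompson sampling can be sketched with one Beta posterior per candidate model: sample a plausible success rate from each posterior, route to the highest draw, then update the winner's posterior with the observed outcome. A minimal Beta-Bernoulli sketch (the Router's actual per-task-type state and reward signal are assumptions here):

```python
import random

class ThompsonRouter:
    """Beta-Bernoulli Thompson sampling over candidate models."""
    def __init__(self, models):
        # One Beta(successes + 1, failures + 1) posterior per model.
        self.stats = {m: [1, 1] for m in models}

    def choose(self) -> str:
        # Sample a success rate from each posterior; route to the
        # model with the highest draw (explores while uncertain).
        draws = {m: random.betavariate(a, b)
                 for m, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, model: str, succeeded: bool) -> None:
        self.stats[model][0 if succeeded else 1] += 1

router = ThompsonRouter(["fast-model", "quality-model"])
for _ in range(200):  # simulated outcomes: quality-model succeeds more often
    m = router.choose()
    router.update(m, random.random() < (0.9 if m == "quality-model" else 0.4))
```

As outcomes accumulate, the draws concentrate on the better-performing model, which is why per-task-type tracking improves routing over time.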

/ Trajectory Growth(05)

The acceleration curve

Sprint 1~10/day

Manual CLI usage, single developer (JJ + Claude)

Sprint 2~50/day

Better context → more effective sessions → users do more

Sprint 3~200/day

Docgen generates 50+ trajectories per codebase. Recipes multiply usage.

Sprint 4~1,000/day

Unattended mode, task queues, background agents — parallel trajectory generation

Sprint 5~10,000/day

Cloud agents, enterprise teams, CI/CD integration — continuous stream

Each 10x increase in trajectory volume makes predictions materially smarter — not because of a better model, but because of better data.

/ Roadmap(06)

What's next for BrainstormLLM

v2 (Current)

  • Sequential GBMs (9 per-phase)
  • 0.796 F1 on phase prediction
  • ONNX export, <2ms inference
  • Deployed in BrainstormRouter

v3 (Training)

  • Transformer-based sequence model
  • Cross-project transfer learning
  • Model-specific phase routing
  • Cost-quality Pareto optimization

v4 (Research)

  • Natural language → full pipeline plan
  • Multi-agent orchestration prediction
  • Real-time trajectory streaming
  • Community model (federated learning)

Every task you run makes the system smarter

Install the CLI. Run your first task. Your trajectory joins the dataset that trains the next generation of BrainstormLLM — and improves routing for everyone.