How Thompson Sampling Learns Which Model to Use
A technical deep-dive into the multi-armed bandit algorithm that powers Brainstorm's intelligent model routing — from cold-start UCB1 to steady-state Gaussian Thompson sampling.
Brainstorm routes every task to the optimal model. But how does it decide what "optimal" means — and how does it learn over time without requiring manual configuration? The answer is Thompson sampling, a Bayesian approach to the multi-armed bandit problem that has been solving exploration-exploitation trade-offs since 1933.
The Multi-Armed Bandit Problem
Imagine you are in a casino with 10 slot machines. Each has a different (unknown) payout distribution. You want to maximize your total reward over many pulls. If you always play the machine that has paid best so far, you might miss a better one you have not tried enough. If you explore too much, you waste pulls on bad machines. This is the exploration-exploitation dilemma.
Model routing is exactly this problem. You have 10+ models. Each has an unknown quality distribution for a given task type. You want to route to the best model, but you also need to occasionally try others to discover if they have improved (or if a new model is actually better than your current favorite).
Cold Start: UCB1
When Brainstorm has fewer than 500 observations for a task-model pair, there is not enough data for Bayesian inference to be reliable. During this cold-start phase, we use UCB1 (Upper Confidence Bound), a frequentist algorithm that balances exploration and exploitation with a simple formula:
```
score = mean_reward + C * sqrt(ln(total_pulls) / pulls_for_this_arm)
```
The second term is an exploration bonus that grows for under-sampled models. The constant `C` controls how aggressively we explore. UCB1 is deterministic and requires no distributional assumptions — it just ensures every model gets tried enough to form a reliable estimate.
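As a minimal sketch, the formula above maps directly to code. The function names, the `stats` layout, and the choice `C = 1.4` (close to the common `sqrt(2)`) are illustrative assumptions, not Brainstorm's actual implementation:

```python
import math

def ucb1_score(mean_reward, total_pulls, arm_pulls, c=1.4):
    """UCB1 score for one arm: empirical mean plus an exploration
    bonus that shrinks as the arm accumulates pulls."""
    if arm_pulls == 0:
        return float("inf")  # untried arms are always sampled first
    return mean_reward + c * math.sqrt(math.log(total_pulls) / arm_pulls)

def select_model(stats, c=1.4):
    """Pick the model with the highest UCB1 score.
    `stats` maps model name -> (mean_reward, pulls). Hypothetical layout."""
    total = sum(pulls for _, pulls in stats.values())
    return max(stats, key=lambda m: ucb1_score(stats[m][0], total, stats[m][1], c))
```

Note how a model with only 2 pulls can outrank one with a higher mean and 100 pulls: its bonus term dominates until it has been sampled enough to trust its estimate.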
Steady State: Gaussian Thompson Sampling
Once we cross the 500-sample threshold, we switch to Gaussian Thompson sampling. For each model, we maintain a posterior distribution over its expected quality score. On each routing decision:
1. Sample a value from each model's posterior distribution
2. Select the model with the highest sampled value
3. Observe the quality score from the actual response
4. Update the posterior with the new observation
Because we sample from the posterior rather than just picking the highest mean, models with high uncertainty naturally get explored more often. A model with mean 0.7 and high variance will occasionally sample above a model with mean 0.8 and low variance — giving it a chance to prove itself.
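The four-step loop above can be sketched with a running-statistics posterior per arm. The class names, the `N(mean, var/n)` posterior form, and the weak prior for new arms are simplifying assumptions for illustration, not the production model:

```python
import math
import random

class GaussianArm:
    """Posterior over a model's expected quality, summarized by a
    running mean and variance (simplified Gaussian posterior sketch)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, reward):
        # Welford-style running update of mean and sum of squared deviations
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)

    def sample(self):
        # Draw from N(mean, var / n): uncertainty shrinks as data accumulates
        if self.n < 2:
            return random.gauss(0.5, 1.0)  # weak prior for barely-seen arms
        var = self.m2 / (self.n - 1)
        return random.gauss(self.mean, math.sqrt(var / self.n))

def route(arms):
    """One Thompson-sampling decision: sample each posterior, pick the max."""
    return max(arms, key=lambda name: arms[name].sample())
```

Each call to `route` is a fresh draw, so the "mean 0.7, high variance" model from the paragraph above really does win some fraction of decisions against the "mean 0.8, low variance" model.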
7-Day Rolling Welford Accumulators
Model performance is not static. Providers update models, latency varies by time of day, and new models appear regularly. If we used lifetime statistics, a model that was great six months ago but degraded last week would still dominate routing.
Brainstorm uses 7-day rolling Welford accumulators to maintain running mean and variance estimates. The Welford algorithm computes variance in a single pass with numerical stability — no need to store every observation. By rolling the window over 7 days, we automatically adapt to model drift without manual intervention.
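One way to get a rolling window without storing raw observations is to keep one Welford accumulator per day and merge the last seven with the parallel combination formula (Chan et al.). The bucket-per-day design here is an assumed implementation strategy, not a description of Brainstorm's internals:

```python
from collections import deque

class Welford:
    """Single-pass running mean/variance (Welford's algorithm)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

def merge(a, b):
    """Combine two accumulators (Chan et al. parallel formula)."""
    out = Welford()
    out.n = a.n + b.n
    if out.n == 0:
        return out
    d = b.mean - a.mean
    out.mean = (a.n * a.mean + b.n * b.mean) / out.n
    out.m2 = a.m2 + b.m2 + d * d * a.n * b.n / out.n
    return out

class RollingWindow:
    """Seven daily buckets; window stats = merge of all buckets.
    Appending a fresh bucket evicts the oldest, aging data out."""
    def __init__(self, days=7):
        self.buckets = deque([Welford() for _ in range(days)], maxlen=days)

    def add(self, x):
        self.buckets[-1].add(x)

    def roll_day(self):
        self.buckets.append(Welford())  # deque maxlen drops the oldest day

    def stats(self):
        total = Welford()
        for b in self.buckets:
            total = merge(total, b)
        var = total.m2 / (total.n - 1) if total.n > 1 else 0.0
        return total.mean, var
```

Day-granularity buckets trade a little precision (observations expire in whole-day steps) for constant memory per task-model pair.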
Quality Signal Pipeline
The Thompson sampler is only as good as its reward signal. Brainstorm constructs a composite quality score from multiple dimensions:
- Task completion — did the model produce a valid, complete response?
- Tool call accuracy — were tool calls well-formed and did they succeed?
- Token efficiency — how many tokens were used relative to task complexity?
- Latency — time to first token and total response time
- User signals — explicit thumbs up/down and implicit signals (did the user retry with a different model?)
These dimensions are weighted by task type. For code generation, tool call accuracy dominates. For Q&A, completion quality and latency matter more. For cost-sensitive tasks, token efficiency gets higher weight.
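A per-task-type weighting reduces to a weighted sum over normalized dimensions. The specific weights and task-type keys below are hypothetical placeholders chosen to match the prose (tool accuracy dominant for code generation; completion and latency for Q&A), not Brainstorm's real values:

```python
import math

# Hypothetical dimension weights per task type; each row sums to 1.0.
WEIGHTS = {
    "code_generation": {"completion": 0.2, "tool_accuracy": 0.4,
                        "token_efficiency": 0.1, "latency": 0.1,
                        "user_signal": 0.2},
    "qa":              {"completion": 0.4, "tool_accuracy": 0.1,
                        "token_efficiency": 0.1, "latency": 0.3,
                        "user_signal": 0.1},
}

def composite_score(task_type, signals):
    """Weighted sum of quality dimensions, each normalized to [0, 1].
    Missing dimensions contribute zero."""
    w = WEIGHTS[task_type]
    return sum(w[dim] * signals.get(dim, 0.0) for dim in w)
```

The resulting scalar in [0, 1] is the reward the Thompson sampler observes in step 3 of its loop.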
Why Thompson Beats Static Rules
A rule-based router like "use Opus for architecture, Sonnet for code, Haiku for simple tasks" is a reasonable starting point. But it cannot adapt to reality:
- Model updates change the landscape. When a provider ships a major model update, static rules are immediately stale. Thompson sampling detects the shift within days through its rolling window.
- Task boundaries are fuzzy. Is a "refactor this function" task simple code or architecture? Rules require crisp categories. Thompson learns the gradient.
- Cost-quality frontiers shift. A model that was expensive last month might get a price cut. Thompson automatically exploits the new cost-quality ratio.
- Individual patterns matter. Your coding style, your codebase, your tool usage patterns all affect which model works best for you specifically. Thompson learns *your* optimum, not the population average.
The Result
After approximately 2,000 routing decisions, Brainstorm's Thompson sampler typically converges to a stable policy that outperforms any static configuration by 15-25% on composite quality scores — while spending 30-40% less on API costs. The exploration rate naturally decreases as confidence increases, but never reaches zero, ensuring continuous adaptation.
The math is elegant. The implementation is practical. And the outcome is simple: you stop thinking about which model to use, and start thinking about the work itself.