Internal learned knowledge
Models may use general priors already inside the model. Every model receives the same external context.
Methodology
CapitalBench asks each model one market question under the same frozen inputs, preserves the full audit trail, and scores the selected option only after the fixed horizon resolves.
Each model chooses exactly one option from the frozen universe. The official leaderboard uses one valid call per model.
No browsing, tools, live retrieval, or intentional use of facts after the research cutoff.
Realized selected-option return is compared with the S&P 500 benchmark over the manifest horizon.
Tools, search, live prices, and intentional use of post-cutoff facts are disabled or disallowed.
The official leaderboard uses one valid one-shot pick per model from the selected official run.
Mock, provider-smoke, retrospective, failed, incomplete, and invalid submissions are excluded.
CapitalBench evaluates one narrow question: given the same frozen market context, which single market option does a model choose for the next month? It is a benchmark for reproducible model comparison, not a trading system, portfolio optimizer, or investment recommendation engine.
Models may use internal learned knowledge and general market priors. They do not have to behave like blank slates. The controlled part is the externally supplied information: every model receives the same prompt, briefing, option universe, and optional mechanical market-data table.
Each round lives in rounds/<round_id>/ with a manifest, prompt, briefing, option universe, price files, hashes, and isolated run folders. capitalbench hash-round writes SHA-256 hashes for the round inputs before submissions are collected, and each model call is stored in its own folder keyed by run_id. The local round directory is the audit source of truth. Supabase stores normalized published copies for the website, but the canonical record remains the hashed round artifact set.
| Artifact | Audience | Purpose |
|---|---|---|
| manifest.yaml | Public | Round metadata, decision deadline, entry rule, exit rule, horizon, and methodology version. |
| prompt.md | Model-facing | The exact task instruction sent to every model. |
| briefing.md | Model-facing | Neutral factual context available at decision time. |
| options.yaml | Model-facing | The only valid choices. Each public submission must select exactly one option id. |
| market_data/universe_trailing_returns.* | Model-facing when present | Mechanical 7-day, 30-day, 6-month, and 1-year trailing returns from adjusted close data. |
| hashes.json | Public audit | SHA-256 hashes proving the frozen input files used for the round. |
| research/* | Audit, except final briefing | Research manifest, hashes, source fact report, audit report, and final model-facing briefing. |
| runs/<run_id>/* | Audit and scoring | Raw responses, normalized raw payloads, parsed submissions, run logs, validation summaries, and results. |
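The hash audit in the table above can be re-checked by anyone holding the round directory. A minimal verification sketch in Python, assuming hashes.json maps round-relative paths to hex SHA-256 digests (an illustration, not the CLI's actual implementation):

```python
import hashlib
import json
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_round(round_dir: Path) -> list[str]:
    """Return the relative paths whose current digest no longer matches hashes.json."""
    recorded = json.loads((round_dir / "hashes.json").read_text())
    return [
        rel for rel, digest in recorded.items()
        if sha256_file(round_dir / rel) != digest
    ]
```

An empty result means the frozen inputs are byte-for-byte unchanged since hashing.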
Deep research output is stored as audit material first. The only research artifact copied into the model
prompt is research/final_briefing.md, which becomes round-level briefing.md.
Market fact reports, source ledgers, and briefing audit reports remain audit-only.
The model-facing briefing should include facts, dates, values, forecasts labeled as forecasts, scheduled catalysts, and source-reported uncertainties. It should not include opinion, interpretation, scenario analysis, "why it matters" commentary, affected-market mapping, recommendations, or option rankings.
```shell
capitalbench import-research \
  --round rounds/<id> \
  --market-fact-report market_fact_report.md \
  --audit-report briefing_audit_report.md \
  --final-briefing final_briefing.md \
  --research-cutoff-utc "YYYY-MM-DDTHH:MM:SSZ"
```

Public rounds use CapitalBench Universe v1.5: a fixed ETF universe plus CASH. The model sees readable option ids, names, public symbols, asset classes, categories, groups, risk buckets, and exposure descriptions. Internal fields and provider-specific data-fetching fields are kept out of the prompt.
All non-cash options are US-listed ETF tickers and must validate against Tiingo EOD data before the round is frozen. CASH has no ticker and is skipped during Tiingo validation.
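The CASH carve-out amounts to a one-line filter before price validation. A sketch, assuming each option is a mapping with an id and an optional ticker field (field names here are illustrative, not the real options.yaml schema):

```python
def tickers_requiring_validation(options: list[dict]) -> list[str]:
    """Collect tickers that must resolve against EOD price data before the freeze.

    CASH carries no ticker, so it is skipped entirely.
    """
    return [
        opt["ticker"]
        for opt in options
        if opt["id"] != "CASH" and opt.get("ticker")
    ]
```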
```shell
capitalbench validate-universe \
  --round rounds/<id> \
  --start-date YYYY-MM-DD \
  --end-date YYYY-MM-DD
```

Each model must return one JSON or YAML object. A submission with multiple selected assets is invalid. Invalid raw responses remain preserved, but they are not scored and cannot enter a public official leaderboard.
Each parsed submission carries the fields round_id, model_id, provider, mode (always closed_capability), run_type, replicate_index, replicate_count, is_official_score, selected_option_id, confidence, rationale_summary, and key_risks.
Official runs require replicate_index: 1, replicate_count: 1, and
is_official_score: true. Stability runs use repeated replicate indexes for each model and require
is_official_score: false.
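The validity rules above can be sketched as a small checker. The field names come from the submission schema; the function itself is a hypothetical illustration, not the benchmark's actual parser:

```python
def validate_official_submission(sub: dict, valid_option_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the submission can be scored."""
    problems = []
    selected = sub.get("selected_option_id")
    # Exactly one option id, drawn from the frozen universe.
    if isinstance(selected, list):
        problems.append("multiple selected assets are invalid")
    elif selected not in valid_option_ids:
        problems.append(f"unknown option id: {selected!r}")
    # Official-run bookkeeping from the schema above.
    if sub.get("replicate_index") != 1 or sub.get("replicate_count") != 1:
        problems.append("official runs require replicate_index 1 of 1")
    if sub.get("is_official_score") is not True:
        problems.append("official runs require is_official_score: true")
    return problems
```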
The official result is the selected asset from one valid provider call in the selected official run.
Stability runs ask the same model the same question multiple times to measure decision consistency.
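One simple consistency statistic is modal agreement: the share of replicates that picked the most common option. The benchmark's actual consistency metric is not specified here, so treat this as an assumed placeholder:

```python
from collections import Counter

def decision_consistency(picks: list[str]) -> float:
    """Share of replicates that chose the modal option (1.0 = fully consistent)."""
    if not picks:
        return 0.0
    (_, modal_count), = Counter(picks).most_common(1)
    return modal_count / len(picks)
```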
Temperature is set to 0 where supported.

```shell
capitalbench run-round \
  --round rounds/<id> \
  --models configs/models.local.yaml \
  --run-id official-YYYYMMDD \
  --run-type official \
  --allow-real-api-calls
```

An official retry is allowed only when no valid decision can be parsed because of an infrastructure or format failure: malformed JSON, a truncated response, a provider transport or API failure, or a schema output failure.
A retry is not allowed because of the selected asset, confidence value, or rationale quality. Failed raw responses must remain in the run artifacts and must stay ineligible for public official scoring.
CapitalBench scores valid submissions against local price files. Adjusted close is preferred. If only close is supplied, scoring may continue but records a warning in the result artifacts. Tiingo fetching is strict about dates and requires rows matching the manifest entry and exit dates.
CapitalBench records these metrics per valid submission:

- Selected-option return: exit_price / entry_price - 1
- Alpha (alpha_vs_sp500): selected_return - sp500_return
- Regret: best_option_return - selected_return
- Alpha per dollar: alpha_vs_sp500 / cost_usd

Cash is treated as a zero return unless cash prices are explicitly supplied. The main official leaderboard is sorted by alpha versus S&P 500 descending. Ties are resolved by lower regret, higher confidence, and then model id.
```shell
capitalbench fetch-prices \
  --round rounds/<id> \
  --run-id <run_id> \
  --entry-date YYYY-MM-DD \
  --exit-date YYYY-MM-DD \
  --full-universe
```

The latest-round view shows the newest resolved official one-shot run only. If the latest round is pending, picks may be shown but performance is withheld.
The cumulative view shows average official alpha versus the S&P 500 across resolved rounds where each model has an official result. New models are not backfilled into old rounds.
The stability view shows average repeated-run alpha and average consistency across resolved stability runs. This view stays separate from official scoring.
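Averaging without backfill falls out naturally if a model only contributes to rounds where it has an official result. A hypothetical aggregation sketch, assuming each round is represented as a mapping from model id to official alpha:

```python
def cumulative_alpha(rounds: list[dict]) -> dict[str, float]:
    """Average official alpha per model across resolved rounds.

    A model accumulates only over rounds where it appears, so a newly
    added model is never backfilled into older rounds.
    """
    totals: dict[str, list[float]] = {}
    for rnd in rounds:
        for model_id, alpha in rnd.items():
            totals.setdefault(model_id, []).append(alpha)
    return {m: sum(a) / len(a) for m, a in totals.items()}
```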
CapitalBench hashes manifest.yaml, briefing.md, options.yaml, prompt.md, and any prompt-facing market_data/ artifact before collection, and publishes the digests in hashes.json so readers can verify that inputs did not change after model calls.
The CLI stores exact provider text in local raw_responses/ sidecars, normalized payloads in
submissions/raw/, validated submissions in submissions/parsed/, and per-artifact SHA-256 hashes in
run_log.jsonl. Raw provider text and private smoke-test output are excluded from the public repo;
reports, validation summaries, result CSVs, and public hashes are generated from the sanitized artifacts.
When the Supabase URL and server-side service-role credentials are configured, publish and scoring commands can sync normalized public rows to Supabase for website rendering. The public frontend uses only the Supabase anon key. Service-role keys, provider keys, and market-data keys are never exposed in built assets.
```shell
capitalbench sync-web --round rounds/<round_id> --run-id <run_id>
capitalbench sync-web --rounds-dir rounds --include-cumulative
```

CapitalBench measures one prompt, one option set, and one time window at a time. It does not model taxes, transaction costs, slippage, liquidity, dividends, position sizing, or portfolio construction. A one-month result can be dominated by noise, and a round where many models choose the same asset can be fair but low-discrimination as a ranking event.
The framework is useful for reproducible comparison, but it should not be read as proof that a model has durable investing skill. The benchmark output is research, not financial advice.