Same Prompt Surface
Models receive the same frozen prompt, briefing, option universe, and allowed response schema for an official run. The benchmark is not trying to equalize the internal knowledge of each model; it is controlling the external information surface supplied at decision time.
Round inputs are hashed before official model calls. The public site mirrors the published hashes so readers can verify that the prompt, briefing, universe, and market-data context were fixed before submissions were collected.
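The hashing step can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation: the field names, the canonical-JSON serialization, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json

def round_input_hash(prompt: str, briefing: str,
                     universe: list[str], market_context: str) -> str:
    """Hash the frozen round inputs so they can be verified later.

    Field names and serialization are illustrative assumptions.
    Canonical JSON (sorted keys, fixed separators) makes the hash
    deterministic; sorting the option universe makes it order-independent.
    """
    canonical = json.dumps(
        {
            "prompt": prompt,
            "briefing": briefing,
            "universe": sorted(universe),
            "market_context": market_context,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Publishing the resulting digest before submissions are collected lets any reader recompute it from the released inputs and confirm nothing changed after the fact.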
No Public Backfills
New models become eligible for future rounds only. Older official leaderboards remain unchanged, which prevents later model releases from being scored against historical market states they did not face in public at the time.
Run Isolation
Official, stability, mock, retrospective, and provider smoke runs live under separate run IDs. Official public leaderboards use only the selected official run for a round. Repeated stability runs are shown separately because they answer a different evaluation question.
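A minimal sketch of the selection rule, under assumed data structures (the `Run` record, its field names, and the `kind` labels are illustrative, not the benchmark's actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    run_id: str
    round_id: str
    kind: str        # e.g. "official", "stability", "mock", "retrospective", "smoke"
    selected: bool = False

def official_run_for_round(runs: list[Run], round_id: str) -> Run:
    """Return the single selected official run for a round.

    All other run kinds are ignored for the public leaderboard;
    anything other than exactly one match is treated as an error.
    """
    matches = [r for r in runs
               if r.round_id == round_id and r.kind == "official" and r.selected]
    if len(matches) != 1:
        raise ValueError(f"expected 1 selected official run, got {len(matches)}")
    return matches[0]
```

Keeping the selection logic this strict means a stability or mock run can never leak into an official leaderboard by accident.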
Tool And Retrieval Boundary
Official model calls do not use browsing, live retrieval, trading tools, or hidden post-cutoff data. If a provider run fails, the failed raw attempt can remain in the audit trail, but it does not become an official public result.
Invalid Submission Handling
A public official result requires exactly one schema-valid selected option. Multiple selections, missing option identifiers, unparsable responses, and failed attempts are excluded from official scoring. This keeps the leaderboard tied to comparable one-shot decisions.
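The exclusion rules above can be sketched as a single validity check. The `selected_option` field name is an assumption for illustration; the actual response schema may differ.

```python
def is_valid_submission(response: dict, allowed_option_ids: set[str]) -> bool:
    """Check a parsed response against the single-selection rule.

    Invalid cases (all excluded from official scoring):
      - missing selection          -> field absent
      - multiple selections       -> list instead of a single string
      - unknown option identifier -> not in the allowed universe
    Unparsable responses never reach this check: if JSON parsing
    fails, the attempt is already excluded.
    """
    selected = response.get("selected_option")
    if not isinstance(selected, str):
        return False  # missing, or a list/other type (multiple selections)
    return selected in allowed_option_ids
```

Validating against the frozen option universe ties each scored result back to the same fixed decision surface every model saw.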