- The paper introduces a temporally granular benchmark that evaluates LLM forecasting agents across five market lifecycle checkpoints, highlighting early strengths and later decline.
- Web search augmentation and simple ensembling demonstrate mixed performance gains, emphasizing the need for selective, context-aware tool use.
- Empirical findings reveal that LLMs perform competitively in uncertain, early market conditions but falter as market consensus emerges, guiding adaptive forecasting strategies.
Temporal Reliability of Agentic Forecasters: An Analysis of TimeSeek
Introduction and Motivation
"TimeSeek: Temporal Reliability of Agentic Forecasters" (2604.04220) presents a rigorous evaluation of state-of-the-art LLM-based forecasting agents, focusing on their predictive reliability over the lifecycle of binary prediction markets. Unlike prior benchmarks that typically assess LLM forecasters at a single temporal snapshot, this work introduces systematic temporal granularity by benchmarking models at five distinct points in each market's evolution. This protocol enables fine-grained analysis of the relative strengths and failure modes of LLM forecasters, especially in comparison with market-implied prices, across information and uncertainty regimes. The study further investigates the impact of web search augmentation and simple ensembling strategies.
Experimental Design
Dataset and Evaluation Protocol
The benchmark consists of 150 high-volume, CFTC-regulated Kalshi binary markets spanning Politics, Sports, Macro-Economics, Science/Tech, and Financial categories. Each market is sampled at five lifecycle checkpoints: from near-open ("Open+1") to near-resolution ("Close-1"). This results in 750 unique tasks per forecasting condition.
For each of 10 leading LLMs, predictions are elicited under two conditions: with access to contemporaneous web search (agentic) and without (parametric knowledge only). Tool access is strictly date-filtered to eliminate leakage from post-cutoff information. In total, the evaluation comprises 15,000 individual forecasts.
Model Pool and Metrics
Models evaluated include Claude Opus 4.5, GPT-5.2, Gemini 3 Pro, Grok 4.1-fast, Kimi-k2, Kimi-k2.5, DeepSeek v3.2, Intellect-3, Trinity Large, and Qwen3-235B. Performance is quantified using the Brier Score (BS) and Brier Skill Score (BSS; relative to market price). Positive BSS indicates outperformance vs. markets at the corresponding temporal checkpoint.
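The two metrics can be sketched in a few lines. This is a minimal illustration that assumes the standard skill-score form, BSS = 1 - BS_model / BS_market; the function names and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    model_probs: Sequence[float],
    market_probs: Sequence[float],
    outcomes: Sequence[int],
) -> float:
    """Assumed form: BSS = 1 - BS_model / BS_market (positive = model beats market)."""
    bs_market = brier_score(market_probs, outcomes)
    if bs_market == 0.0:
        raise ValueError("market Brier score is zero; BSS is undefined")
    return 1.0 - brier_score(model_probs, outcomes) / bs_market


# Hypothetical toy data for one checkpoint (not from the paper).
outcomes = [1, 0, 1, 0]
market = [0.6, 0.4, 0.6, 0.4]   # market-implied probabilities
model = [0.8, 0.2, 0.8, 0.2]    # model forecasts at the same checkpoint
bss = brier_skill_score(model, market, outcomes)
print(f"BSS = {bss:.2f}")
```

Under this assumed definition, the market scored against itself has a BSS of exactly 0, and a BSS of -0.7 corresponds to a model Brier score 1.7 times the market's.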
Key Empirical Findings
Temporal Dynamics of Model Reliability
LLMs with web search achieve their strongest market-relative performance early in market lifecycles, particularly at Open+1, with several models (Claude, Kimi-k2.5, GPT-5.2, Kimi-k2) showing positive BSS. However, by the Close-1 checkpoint, every model is substantially outperformed by the aggregation of market participants, with BSS values collapsing below -0.7 for all models. This temporal degradation is monotonic for 9 of 10 models, consistent with market processes that incorporate public and private signals over time, eventually outpacing the information extractable by static or periodically retrieving LLMs.
This pattern supports the case for time-sensitive evaluation: LLM forecasters cannot be reliably benchmarked from a single market snapshot, because their standing relative to the market depends on when in the lifecycle they are measured.
Web search improves average model performance, with pooled BSS gains ranging from +0.14 to +0.59. However, retrieval hurts performance in 12% of model-checkpoint instances, and the benefit is highly heterogeneous across both time and category. Notably, GPT-5.2's early-checkpoint performance is degraded by search, and Kimi-k2 shows search-induced performance drops at multiple checkpoints. The relationship between query count and model success is non-monotonic, further indicating that indiscriminate or excessive tool use is not beneficial. The implication is that tool-use policies should be selective and potentially category-aware.
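One way to tabulate the effect of search is to compute, for each model-checkpoint cell, the change in BSS when search is enabled and then count the cells where that change is negative. The sketch below assumes BSS values are already available per cell; the model names and every number are hypothetical placeholders, not results from the paper.

```python
from typing import Dict, Tuple

Cell = Tuple[str, str]  # (model, checkpoint)


def search_deltas(
    bss_search: Dict[Cell, float], bss_plain: Dict[Cell, float]
) -> Dict[Cell, float]:
    """Per-cell change in BSS: BSS(with search) - BSS(without search)."""
    shared = bss_search.keys() & bss_plain.keys()
    return {cell: bss_search[cell] - bss_plain[cell] for cell in shared}


def fraction_hurt(deltas: Dict[Cell, float]) -> float:
    """Share of cells in which enabling search lowered BSS."""
    if not deltas:
        raise ValueError("no cells to evaluate")
    return sum(1 for d in deltas.values() if d < 0) / len(deltas)


# Hypothetical placeholder values (not from the paper).
bss_search = {
    ("model_a", "Open+1"): 0.10,
    ("model_a", "Close-1"): -0.80,
    ("model_b", "Open+1"): -0.20,
    ("model_b", "Close-1"): -0.90,
}
bss_plain = {
    ("model_a", "Open+1"): -0.10,
    ("model_a", "Close-1"): -1.20,
    ("model_b", "Open+1"): -0.05,
    ("model_b", "Close-1"): -1.00,
}
deltas = search_deltas(bss_search, bss_plain)
print(f"search hurt in {fraction_hurt(deltas):.0%} of cells")
```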
Market Difficulty Regimes
When market-implied prices indicate high uncertainty ("toss-ups"), LLMs are most competitive, with up to seven models achieving positive BSS. As crowd consensus strengthens, all models are decisively outperformed (BSS: -0.69 to -1.68 for "easy" markets). This aligns with theoretical expectations that markets are strongest when aggregating widely available signals, and that LLMs' comparative advantage lies in surfacing and contextualizing public information not yet fully priced in low-information early regimes.
Contrarian forecasting by LLMs is most effective on toss-up or hard markets: Claude achieves a win rate of 80.9% when making strong counter-market predictions in high-uncertainty conditions, but only 19.8% when market consensus is strong.
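The regime labels can be illustrated by bucketing each task on how far the market-implied price sits from 0.5. The thresholds below are assumptions chosen purely for illustration; the summary does not state the cutoffs the authors use, and the function name is hypothetical.

```python
def difficulty_regime(
    market_price: float,
    toss_up_band: float = 0.10,  # assumed cutoff, not from the paper
    hard_band: float = 0.30,     # assumed cutoff, not from the paper
) -> str:
    """Label a task by the distance of the market price from 0.5."""
    if not 0.0 <= market_price <= 1.0:
        raise ValueError("market_price must be a probability in [0, 1]")
    distance = abs(market_price - 0.5)
    if distance < toss_up_band:
        return "toss-up"  # market is close to 50/50
    if distance < hard_band:
        return "hard"     # market leans one way without strong consensus
    return "easy"         # strong crowd consensus


for price in (0.52, 0.75, 0.95):
    print(price, difficulty_regime(price))
```

Because the same market price serves as both the difficulty proxy and the baseline, any such bucketing is market-relative rather than a measure of intrinsic difficulty.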
Ensembling and Model-Market Error Structure
Error correlations between LLMs and markets are modest (Pearson $r = 0.19$ to $0.41$), suggesting partially independent failure patterns and thus opening potential for ensembling. Simple two-model averages (e.g., Claude + Kimi-k2.5) reduce pooled BSS loss by 40%. However, no LLM ensemble surpasses the market baseline overall, emphasizing that ensemble-based advances currently offer incremental, not transformative, improvement.
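A simple two-model average and an error-correlation check can be sketched as follows. The sketch assumes that "error" means the signed difference between forecast and outcome, which the summary does not specify; all helper names and numbers are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecasts and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def average_ensemble(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Unweighted average of two models' probability forecasts."""
    if len(a) != len(b):
        raise ValueError("forecast lists must have the same length")
    return [(x + y) / 2 for x, y in zip(a, b)]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient, computed directly from its definition."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


# Hypothetical toy forecasts (not from the paper).
outcomes = [1, 0]
model_a = [0.9, 0.2]
model_b = [0.5, 0.6]
ensemble = average_ensemble(model_a, model_b)

# Signed errors (forecast minus outcome) are an assumed error definition.
errors_a = [p - o for p, o in zip(model_a, outcomes)]
errors_b = [p - o for p, o in zip(model_b, outcomes)]
print("ensemble Brier:", brier_score(ensemble, outcomes))
print("error correlation:", pearson_r(errors_a, errors_b))
```

Because squared error is convex, the averaged forecast never scores worse than the mean of the two individual Brier scores, and the gain grows as the models' errors become less correlated. It can still score worse than the better of the two models, as it does in this toy example.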
Category-Level Effects
Statistically significant category heterogeneity is observed. Macro-Economics and Sports markets yield positive BSS for several models, while Politics and Financial markets remain challenging. Web search delivers strong gains in Macro-Economics and Sports (all models see positive ΔBSS, the change in BSS from adding search), is neutral or negative in Politics and Science/Tech, and is mixed in Financial markets. This domain sensitivity motivates the design of category-conditional tool-use and deferral policies.
Theoretical and Practical Implications
The analysis reframes AI forecasting as a selective decision problem, where at each market-time-category slice, agents must choose whether to defer to market prices, rely on parametric knowledge, or invoke tool use. This paradigm aligns with research on selective prediction, abstention, and gating policies. It also motivates post-hoc calibration and uncertainty quantification as essential preprocessing steps for any policy that combines or alternates between AI agents and market baselines.
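The selective decision framing can be made concrete with a toy gating rule: for each checkpoint-category slice, pick whichever source showed the highest BSS on held-out data, and defer to the market when no model source is positive. This is an illustrative sketch and not a policy proposed in the paper; the slice keys, table values, and function name are hypothetical, and it assumes a BSS definition under which the market baseline scores 0.

```python
from typing import Dict, Tuple

Slice = Tuple[str, str]  # (checkpoint, category)


def choose_source(
    slice_key: Slice,
    bss_parametric: Dict[Slice, float],
    bss_search: Dict[Slice, float],
) -> str:
    """Return 'market', 'parametric', or 'search' for the given slice."""
    # The market is the default: its BSS relative to itself is taken to be 0.
    best_source, best_bss = "market", 0.0
    for name, table in (("parametric", bss_parametric), ("search", bss_search)):
        score = table.get(slice_key)
        # Only switch away from the market on strictly positive, better skill.
        if score is not None and score > best_bss:
            best_source, best_bss = name, score
    return best_source


# Hypothetical validation-set BSS values (not from the paper).
bss_parametric = {("Open+1", "Sports"): 0.05, ("Close-1", "Politics"): -1.10}
bss_search = {("Open+1", "Sports"): 0.12, ("Close-1", "Politics"): -0.80}

print(choose_source(("Open+1", "Sports"), bss_parametric, bss_search))
print(choose_source(("Close-1", "Politics"), bss_parametric, bss_search))
```

A slice with no validation data falls back to the market, which matches the conservative default implied by the late-checkpoint results. A practical policy would also need uncertainty estimates on each slice's BSS before trusting small positive values.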
From a practical perspective, the results indicate that LLMs equipped with retrieval are most valuable as early-cycle forecasters and in high-uncertainty or thin-liquidity environments. The declining marginal benefit of retrieval as resolution approaches further suggests that real-world deployment of agentic AI in forecasting should rely on adaptive, time-aware strategies rather than static tool-invocation rules.
Limitations and Future Research Directions
Key limitations include the lack of confidence intervals on subgroup analyses, the use of market prices as both difficulty proxy and baseline (thus analyses are market-relative, not intrinsic), and the exclusion of execution considerations (transaction costs, latency, market impact). Only web search was considered, leaving open questions about the value of richer external tool suites.
Priority directions for future work include learning explicit gating budgets and policies for search and market deference, streaming or online forecasting protocols, mechanism-aware sandboxes for controlled study, hybrid market-model weighting schemes with calibration guarantees, and RL fine-tuning on market-resolved outcomes.
Conclusion
TimeSeek provides a comprehensive, temporally granular benchmark for the evaluation of agentic LLM forecasters against live, financial-stakes prediction markets. The study generates several robust findings: LLMs are most competitive with markets early and under uncertainty, the marginal value of web search decays and even reverses in certain settings, and simple ensembling yields partial but incomplete gains. These insights have direct implications for both the theory and deployment of agentic forecasting AI: optimal reliability will require temporal, category, and possibly instance-level adaptivity, along with a selective, risk-calibrated approach to both tool use and market deference.