- The paper introduces a novel benchmark that formalizes decision-making as action generation in a compositional space with multi-level explicit conditions.
- It employs combinatorial and arithmetic constraints in finance allocative tasks, evaluated through oracle-based metrics such as DSR, CSR, and NU.
- Experimental analysis reveals that reasoning-augmented models significantly outperform non-reasoning baselines in both constraint satisfaction and objective alignment.
Conditional Decision-Making Evaluation of LLMs with CONDESION-BENCH
Recent adoption of LLMs as decision-support agents in high-stakes domains highlights a critical gap in current evaluation practices: standard benchmarks restrict action spaces to finite, static candidate sets and omit the explicit feasibility conditions that real-world deployments require. These simplifications mask the intrinsic complexity of compositional decision-making, in which actions require multi-variable allocations and must jointly satisfy structured constraints arising from domain, context, or resource limitations.
CONDESION-BENCH directly addresses these deficiencies by formalizing decision-making as action generation in a compositional space, evaluated under multi-level explicit conditions. The benchmark targets allocative tasks, notably in finance, where realistic decisions involve not just asset selection but context-driven assignments over a variable set under operational and market constraints. This extension enforces adherence to feasibility in both combinatorial and arithmetic dimensions, so models are tested under conditions closer to deployment.
Benchmark Design and Methodology
Compositional Action Space
In contrast to atomic candidate selection benchmarks, CONDESION-BENCH operationalizes the action space A as sets of pairs (v_i, a_i), with v_i as a decision variable (e.g., stock ticker) and a_i as its corresponding allocation (quantity to buy/sell). This model is sufficiently general to capture diverse decision regimes, including mixed discrete-continuous settings.
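As a rough illustration of this representation, the sketch below stores an action as a mapping from decision variable to allocation. The names make_action and total_cost, the integer share counts, and the example tickers are assumptions made here for illustration; they are not taken from the benchmark's code.

```python
from typing import Dict, Iterable, Tuple

# An action is a set of (v_i, a_i) pairs. A dict keyed by the decision
# variable keeps each variable unique (an assumption of this sketch).
Action = Dict[str, int]


def make_action(pairs: Iterable[Tuple[str, int]]) -> Action:
    """Build an action from (variable, allocation) pairs, rejecting duplicates."""
    action: Action = {}
    for variable, allocation in pairs:
        if variable in action:
            raise ValueError(f"duplicate decision variable: {variable}")
        action[variable] = allocation
    return action


def total_cost(action: Action, prices: Dict[str, float]) -> float:
    """Cost of the action at the given per-unit prices."""
    return sum(prices[v] * a for v, a in action.items())


example = make_action([("AAPL", 3), ("MSFT", 2)])
example_cost = total_cost(example, {"AAPL": 10, "MSFT": 20})
```

Unlike picking one option from a fixed candidate list, the model has to choose both which variables appear in the mapping and what value each one takes.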
Multi-Level Explicit Conditions
Actions are validated against three condition tiers:
- Variable conditions: Static, scenario-independent filters on the set of selected decision variables (e.g., sector-based exclusion, minimum cardinality of portfolio).
- Contextual conditions: Data-dependent context filters, often requiring the model to extract, compare, or aggregate scenario information (e.g., buy only stocks whose price has risen two consecutive days).
- Allocation conditions: Arithmetic constraints on combined or per-variable allocations (e.g., total budget, per-stock minimum shares).
Each task instance is constructed by sampling at least one condition from each tier, ensuring feasibility (i.e., at least one action satisfies all conditions).
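One way to picture the three tiers is as predicates over an action and its scenario context. The following is a minimal sketch under stated assumptions: the function names (min_cardinality, rose_two_days, within_budget, satisfies_all), the price-history context format, and the toy numbers are all hypothetical and only mirror the examples given in the list above.

```python
from typing import Callable, Dict, List

Action = Dict[str, int]            # ticker -> shares to buy
Context = Dict[str, List[float]]   # ticker -> closing prices, oldest first
Condition = Callable[[Action, Context], bool]


def min_cardinality(k: int) -> Condition:
    """Variable condition: the portfolio holds at least k distinct tickers."""
    return lambda action, ctx: len(action) >= k


def rose_two_days() -> Condition:
    """Contextual condition: each selected ticker rose on both of the last two days."""
    def check(action: Action, ctx: Context) -> bool:
        return all(
            len(ctx[t]) >= 3 and ctx[t][-3] < ctx[t][-2] < ctx[t][-1]
            for t in action
        )
    return check


def within_budget(budget: float) -> Condition:
    """Allocation condition: total cost at the latest price stays within budget."""
    def check(action: Action, ctx: Context) -> bool:
        return sum(ctx[t][-1] * n for t, n in action.items()) <= budget
    return check


def satisfies_all(action: Action, ctx: Context, conditions: List[Condition]) -> bool:
    """An action is feasible only if every sampled condition holds."""
    return all(cond(action, ctx) for cond in conditions)


ctx = {"AAA": [10, 11, 12], "BBB": [20, 21, 22], "CCC": [30, 29, 31]}
conditions = [min_cardinality(2), rose_two_days(), within_budget(100)]

ok = satisfies_all({"AAA": 2, "BBB": 3}, ctx, conditions)           # cost 90, both rising
bad_context = satisfies_all({"AAA": 2, "CCC": 1}, ctx, conditions)  # CCC fell on one day
over_budget = satisfies_all({"AAA": 5, "BBB": 3}, ctx, conditions)  # cost 126 > 100
```

The contextual check is the only one that needs to read the scenario data, and the allocation check is the only one that needs arithmetic over the quantities.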
Oracle-Based Evaluation and Metrics
A key aspect is the use of an oracle (with full table-lookup access to realized next-day prices and ground-truth outcomes) to enumerate the set of all feasible actions and to compute both utility-maximizing (oracle optimal) and minimizing actions. Performance metrics include:
- Decision Satisfaction Rate (DSR): Proportion of model outputs satisfying all assigned conditions.
- Condition Satisfaction Rate (CSR): Rate of individual condition satisfaction across outputs.
- Normalized Utility (NU): Relative profit achieved, scaled between the worst and best feasible action per scenario.
- Normalized ROI (NR): For outputs that violate conditions, the profit per unit cost, scaled between the worst and best actions that are not bound by the conditions.
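The metrics above can be sketched on a toy scenario. This is an illustration under assumptions, not the benchmark's implementation: the brute-force oracle considers only a budget condition, CSR is pooled over all conditions of all outputs, NU is set to 1.0 when every feasible action has the same utility, and all function names are hypothetical. NR is left out because the summary does not spell out its exact form.

```python
from itertools import product
from typing import Dict, List, Sequence

Action = Dict[str, int]


def dsr(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of outputs that satisfy every assigned condition."""
    return sum(all(r) for r in results) / len(results)


def csr(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of individual conditions satisfied, pooled over outputs (assumed)."""
    flat = [ok for r in results for ok in r]
    return sum(flat) / len(flat)


def profit(action: Action, today: Dict[str, float], tomorrow: Dict[str, float]) -> float:
    """Realized next-day profit of buying the allocated shares today."""
    return sum((tomorrow[t] - today[t]) * n for t, n in action.items())


def enumerate_feasible(tickers: List[str], max_shares: int,
                       today: Dict[str, float], budget: float) -> List[Action]:
    """Brute-force oracle: every non-empty allocation whose cost fits the budget."""
    feasible = []
    for shares in product(range(max_shares + 1), repeat=len(tickers)):
        action = {t: n for t, n in zip(tickers, shares) if n > 0}
        if action and sum(today[t] * n for t, n in action.items()) <= budget:
            feasible.append(action)
    return feasible


def normalized_utility(action: Action, feasible: List[Action],
                       today: Dict[str, float], tomorrow: Dict[str, float]) -> float:
    """Scale profit between the worst and best feasible action."""
    utilities = [profit(a, today, tomorrow) for a in feasible]
    lo, hi = min(utilities), max(utilities)
    if hi == lo:
        return 1.0  # convention assumed here when all feasible actions tie
    return (profit(action, today, tomorrow) - lo) / (hi - lo)


today = {"AAA": 10, "BBB": 20}
tomorrow = {"AAA": 12, "BBB": 19}
feasible = enumerate_feasible(["AAA", "BBB"], 2, today, 40)
nu_mixed = normalized_utility({"AAA": 1, "BBB": 1}, feasible, today, tomorrow)

checks = [[True, True, True], [True, False, True],
          [False, False, True], [True, True, True]]
scores = (dsr(checks), csr(checks))
```

Exhaustive enumeration is only practical when the number of variables and allocation levels is small, which is why the toy scenario caps shares per ticker.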
Experimental Analysis
Model Comparisons and Failure Modes
CONDESION-BENCH comprehensively evaluates both proprietary (OpenAI GPT-4/5, Anthropic Claude-3/4, Gemini, xAI Grok) and open-source (Llama-3, Mistral) LLMs. Both non-reasoning and reasoning-augmented variants are considered. Findings:
- Substantial generalization gaps exist: Reasoning models (GPT-5, Claude-4, Gemini-2.5) consistently outperform non-reasoning baselines on both DSR and CSR by large margins. For example, GPT-5 and o3 achieve DSRs above 0.86, whereas non-reasoning LLMs (GPT-4.1, Llama-3-8B) plateau near 0.50 or lower.
- Condition adherence is non-uniform: All models handle static variable conditions more effectively than context-dependent or arithmetic allocation constraints. The likely cause is that the latter require structured scenario parsing and multi-step logical and arithmetic chaining.
- Objective-vs-feasibility tradeoff: Satisfying all constraints does not guarantee utility maximization (NU remains sub-maximal), while non-reasoning models sometimes “optimize” utility by violating constraints, yielding higher raw profits but invalid solutions.
- Sampling improves performance, not consistency: While sampling multiple candidate actions per prompt increases the likelihood of achieving both feasibility and higher utility, the intrinsic difficulty of searching constrained compositional spaces remains. Some models reach high utility with all constraints satisfied only sporadically, which points to underdeveloped search or calibration.
Error and Failure Analysis
When actions violate conditions, reasoning-augmented models are more likely than non-reasoning LLMs to produce profit-maximizing actions; the violations of non-reasoning LLMs often yield utility well below oracle references. This suggests that explicit reasoning scaffolds do not merely enforce constraint satisfaction but also push toward reward maximization once constraints are dropped, which points to a model bias towards objective alignment at the cost of feasibility.
Implications and Theoretical Significance
CONDESION-BENCH pushes LLM evaluation into regimes approximating real-world decision complexity, essential for deployment in safety-critical or regulated environments. Its explicit, multi-level condition framework exposes the insufficiency of LLM prompt-following for conditional and compositional reasoning. The differential performance of reasoning-augmented models underscores the promise and limitation of present architectural innovations, and the high DSR-but-low-NU/NR regime signals the need for models that can both parse and optimize under compound feasibility inferences.
From a practical standpoint, reliance on single-shot model outputs is demonstrably inadequate for conditional reasoning tasks; optimization via candidate sampling is necessary but may induce variance, especially in underconstrained models. The methodology for constructing ground-truth references via exhaustive enumeration (oracle evaluation) also provides a robust, scalable standard for future benchmarks outside finance, applicable to any domain with tractable context and action space definitions.
Limitations and Directions for Future Work
The current scope is restricted to finance because its data is verifiable and its rewards are tractable. Extensions to healthcare, operations, or general multi-agent negotiation require richer contextualization and potentially the modeling of soft or implicit constraints. Furthermore, the current focus on “hard” constraints (mandatory satisfaction) omits the complexity introduced by cost-penalized violations, preference aggregation, or long-term horizon objectives.
Future benchmark iterations may expand by:
- Adding soft or probabilistic constraints and reward shaping.
- Incorporating dynamic environments with multi-step or temporally extended tasks.
- Evaluating robustness to adversarial or ambiguous instruction contexts.
Conclusion
CONDESION-BENCH introduces a rigorous paradigm for evaluating LLMs in compositional, condition-rich action spaces, targeting both constraint adherence and objective performance. Experimental results delineate fundamental gaps in current LLM capabilities, especially in integrating context-sensitive logical and arithmetic reasoning under constraint. As LLMs advance, CONDESION-BENCH will serve as a critical testbed for both incremental improvements and the development of architectures explicitly targeting bounded rationality in real-world decision-making.
Reference: "CONDESION-BENCH: Conditional Decision-Making of LLMs in Compositional Action Space" (2604.09029)