AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

Published 18 Apr 2026 in cs.LG and cs.AI | (2604.16804v1)

Abstract: Optimization problems are central to decision-making in manufacturing, logistics, scheduling, and other industrial settings. Translating complicated descriptions of these problems into solver-ready formulations requires specialized operations research (OR) expertise, making it hard to scale. We present AutoOR, a scalable synthetic data generation and reinforcement learning pipeline that trains LLMs to autoformalize optimization problems specified in natural language across linear, mixed-integer, and non-linear categories. AutoOR generates verified training data from standard optimization forms and uses solver execution feedback as the reward signal for RL post-training. AutoOR applied to an 8B model achieves state-of-the-art or competitive results across six established OR benchmarks, matching significantly larger frontier models. For a non-linear problem class involving physical dynamics, where frontier models score near 0%, we introduce a curriculum RL strategy that bootstraps from limited initial training data to make this class tractable for post-training. We believe that methods such as AutoOR can significantly accelerate industrial decision-making with AI.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper introduces a scalable RL post-training framework that uses deterministic solver feedback to improve autoformalization in operations research.
It leverages synthetic data generation through structural asymmetry to create verified training pairs for LP, MILP, and nonlinear problems.
Empirical results show AutoOR outperforms SFT and frontier models by achieving up to 97.2% accuracy and significant gains on complex non-linear optimization tasks.

AutoOR: RL-Based Post-Training of LLMs for Autoformalization in Operations Research

Motivation and Problem Formulation

Autoformalizing optimization problems—transforming unstructured natural language descriptions into solver-readable formulations—is a foundational requirement in operations research (OR) applications spanning manufacturing, logistics, and industrial scheduling. Historically, this translation required considerable human expertise, imposing significant bottlenecks as industrial decision-making scales. Existing LLM-centric approaches leverage prompting, multi-agent orchestration, and supervised fine-tuning (SFT) with synthetic datasets, but these methods lack scalability, accuracy, and coverage for realistic or non-linear problem classes. They are further constrained by the scarcity of verified training data and inability to generalize beyond linear and mixed-integer programs.

AutoOR intervenes by turning the autoformalization pipeline into a scalable, verifiable RL domain. The approach synthesizes high-fidelity training data from standard forms and bootstraps online RL post-training using deterministic solver execution as reward feedback. This unlocks model generalization across LP, MILP, and previously intractable non-linear optimization problems, addressing both practical deployment and theoretical limitations.

Figure 1: AutoOR overview illustrating synthetic problem generation and RL training loop with solver-execution rewards.

Synthetic Data Generation via Structural Asymmetry

AutoOR capitalizes on the structural asymmetry in optimization modeling. For any category where a standard mathematical template exists, fully-instantiated parameters directly yield correct solver code. These instantiations are then backtranslated to natural language descriptions—an operation that is easily verifiable component-wise using LLMs. The pipeline comprises four stages:

Template Selection: Domain-appropriate standard forms for LP, MILP, or NLP.
Instantiation: LLMs are used to populate templates with parameters, generating world descriptors ( $w$ ), ground-truth code ( $c^*$ ), and solutions ( $\mathbf{x}^*$ ).
Backtranslation: LLMs convert $w$ into domain-relevant natural language descriptions $d$ (with control over complexity and incompleteness for multi-turn settings).
Verification: High-throughput LLM-based checks validate description fidelity relative to $w$ , yielding a scalable, verified synthetic dataset.

This strategy ensures nearly every accepted pair is correct-by-construction, mitigating the noise/ambiguities plaguing prior synthetic datasets. For non-linear domains with physical dynamics, the pipeline bootstraps from domain-specific exemplars (e.g., pump network synthesis), scaling via parameter variation while preserving the governing equations.

Reinforcement Learning Training with Verifiable Rewards

AutoOR employs Group Relative Policy Optimization (GRPO) for post-training of Qwen3-8B. The RL loop is grounded in verifiable rewards derived from deterministic solver feedback:

Reward Function: Composite reward for code execution, constraint feasibility, and objective optimality (with increasing weights), computed programmatically and only for fully correct solutions.
Advantage Estimation: Group normalization across sampled code rollouts, which robustly handles the high variance of code generation.
Policy Update: Clipped surrogate objective with per-token importance sampling.

This RLVR setup is particularly suited to the domain, as solver verification eliminates reward model noise and enables robust learning from execution results—a paradigm validated in RL-based formal math and code generation research. Curriculum RL with privileged information (e.g., solver syntax) is introduced for hard non-linear classes, overcoming the cold-start barrier wherein frontier models produce no solvable rollouts.

Figure 2: Curriculum RL stages for non-linear optimization, enabling model bootstrapping and tractable RL learning signal.

Empirical Results and Benchmark Comparison

AutoOR-8B is evaluated across six established and three new benchmarks, compared against baseline and frontier models (Gemini 3 Pro, Gemini 2.5 Pro, Qwen3-8B), as well as published learning-based methods (ORLM, Step-Opt, OptMATH, OR-R1). Main findings:

State-of-the-Art Performance: AutoOR matches or surpasses Gemini 3 Pro (significantly larger scale) in LP/MILP autoformalization with up to 97.2% accuracy (NL4LP) and strong performance in more complex settings (MAMO-Complex, IndustryOR).
Hard Non-Linear Problems: On a non-linear pump network synthesis benchmark, frontier models score near 0%, while AutoOR achieves 48.98% accuracy, enabled by curriculum RL and scalable synthetic data.
Learning-Based Superiority: RL post-training (AutoOR) outperforms SFT-centric methods, particularly on harder constraints and non-linear categories. The gap here is pronounced on benchmarks with longer descriptions and intricate constraint structures.
Figure 3: Illustration of the pump network synthesis problem where AutoOR demonstrates substantial RL-enabled gains over baselines.

Multi-Turn Formalization and Agentic Capabilities

Recognizing the practical necessity for agentic behavior in optimization modeling, AutoOR incorporates a multi-turn RL setup where models are trained to actively query for missing information. This is instantiated as a simulated human-in-the-loop environment, with two-turn GRPO optimizing both clarification query precision and final code correctness.

Experimental Results: Multi-turn RL yields a dramatic accuracy improvement (66.75% vs 18.5% uninformed baseline), closing most of the gap to the full-information regime.
Scalability: The framework is readily extensible to higher-turn interactions and tool integration, setting the stage for broader agent-based optimization assistance.

Practical and Theoretical Implications

AutoOR advances both the practical deployment and theoretical feasibility of scalable autoformalization:

Practical Accessibility: Enables competitive performance with small, open-source LLMs, reducing reliance on expensive proprietary frontier models and supporting privacy-preserving model training on domain-specific distributions.
Generalization Beyond SFT: RL post-training incentivizes solution exploration and verifiable reasoning, overcoming the memorization pitfalls of pure SFT. Particularly in code and optimization contexts, this translates to improved out-of-distribution performance and scalability.
Agentic Industrial Decision-Making: The multi-turn pipeline lays groundwork for LLMs as active participants in real-world modeling workflows, with iterative information gathering, tool use, and model refinement.

Limitations and Future Directions

AutoOR’s non-linear success is demonstrated on pump synthesis but requires extension to broader NLP domains (e.g., quadratic programming, conic optimization, multi-objective settings). Reward coarseness (execution, feasibility, optimality checks) does not yet capture formulation efficiency or solver tractability differences. Multi-turn setups are currently simulated; practical deployment will require richer tool integrations and iterative agent post-training.

Future directions include expanding coverage to more diverse non-linear classes, training for formulation efficiency and solver-agnostic outputs, and scaling agentic behaviors with tool calls and real-world multi-turn interactions. These developments will further advance industrial decision automation and optimization accessibility.

Conclusion

AutoOR transforms the autoformalization of OR problems into a verifiable RL domain by leveraging scalable synthetic data generation and programmatic reward feedback. RL post-training on small open-source LLMs matches or outperforms frontier models on a wide array of benchmarks, achieves tractability on previously unsolved non-linear classes via curriculum, and demonstrates agentic capabilities through multi-turn formalization. This establishes a practical, generalizable paradigm for expert-level optimization modeling and supports the advancement of LLM-powered industrial decision-making.