SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization

Published 2 Feb 2026 in cs.LG | (2602.02383v2)

Abstract: Direct preference optimization methods have emerged as a computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning LLMs. Latest approaches have streamlined the alignment process by deriving implicit reward functions, yet they often suffer from a critical objective mismatch: optimizing the relative margin between chosen and rejected responses does not guarantee the preservation of the chosen response's absolute likelihood. This can lead to unlearning, where the model degrades the probability of high-quality outputs to satisfy margin constraints, and formatting collapse caused by the over-penalization of rejected sequences. In this work, we introduce SLIME (Stabilized Likelihood Implicit Margin Enforcement), a reference-free alignment objective designed to decouple preference learning from generation quality. SLIME incorporates a three-pronged objective: (1) an anchoring term to maximize the likelihood of preferred responses; (2) a stabilizing penalty that prevents the probabilities of rejected tokens from collapsing to zero; and (3) a dual-margin mechanism that combines hard and soft constraints for precise boundary shaping. Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines while maintaining higher generation stability.


Authors (2)

Summary

The paper proposes SLIME, a novel objective that decouples preference enforcement from likelihood maximization to prevent unlearning.
It introduces likelihood anchoring and token-level stabilization, ensuring that preferred outputs maintain high-quality linguistic properties.
Empirical results on models like Gemma3-4B show SLIME significantly boosts instruction-following and robustness compared to DPO and SimPO.


Motivation and Problem Analysis

Direct preference optimization (DPO) and related techniques, such as SimPO and IPO, have been widely adopted as computationally efficient alternatives to RLHF for LLM alignment. These methods reframe preference learning as a classification or ranking task over pairs of model outputs with human feedback. However, the paper identifies a fundamental objective mismatch in margin-based approaches: maximizing the relative gap between chosen and rejected responses does not ensure preservation of the absolute likelihood of high-quality generations. As a result, models can satisfy the margin constraint by reducing the likelihood of both the preferred and rejected outputs, which leads to unlearning of syntax, degraded fluency, and distribution collapse.
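To make the mismatch concrete, the following minimal sketch evaluates a SimPO-style length-normalized margin loss on made-up token log-probabilities. The values of `beta` and `gamma` and all the numbers are illustrative assumptions, not figures from the paper. It shows the loss falling even though the chosen response becomes less likely.

```python
import math

def avg_logprob(token_logprobs):
    """Length-normalized sequence log-likelihood."""
    return sum(token_logprobs) / len(token_logprobs)

def margin_loss(chosen, rejected, beta=2.0, gamma=0.5):
    """SimPO-style loss: -log sigmoid(beta * (chosen - rejected) - gamma)."""
    z = beta * (avg_logprob(chosen) - avg_logprob(rejected)) - gamma
    return math.log1p(math.exp(-z))  # identical to -log(sigmoid(z))

# Hypothetical per-token log-probs before and after some training steps.
chosen_before, rejected_before = [-0.5, -0.5, -0.5], [-1.0, -1.0, -1.0]
chosen_after, rejected_after = [-1.5, -1.5, -1.5], [-4.0, -4.0, -4.0]

loss_before = margin_loss(chosen_before, rejected_before)
loss_after = margin_loss(chosen_after, rejected_after)

print(f"margin loss:        {loss_before:.3f} -> {loss_after:.3f}")
print(f"chosen avg logprob: {avg_logprob(chosen_before):.2f} -> "
      f"{avg_logprob(chosen_after):.2f}")
```

Both sequences lose probability mass, but the rejected one loses more, so the margin widens and the loss improves. Nothing in the objective rewards keeping the chosen response likely.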

The authors formalize two dominant instability modes: (1) unlearning, where the probability of preferred outputs is unnecessarily decreased, and (2) formatting collapse, where token-level over-penalization of rejected sequences erases important linguistic regularities. These phenomena are shown to be exacerbated in reference-free objectives, especially SimPO, which omits the reference model entirely.

The SLIME Objective: Architecture and Methodology

SLIME (Stabilized Likelihood Implicit Margin Enforcement) is proposed as a reference-free preference optimization objective that robustly decouples preference enforcement from generative capacity. The key methodological contributions are:

Likelihood Anchoring: An explicit supervised term on the preferred responses ensures the model maximizes their likelihood, preventing the collapse of beneficial behaviors learned during pretraining or supervised fine-tuning.
Token-Level Stabilization: Instead of indiscriminately suppressing the likelihood of rejected responses, SLIME employs a non-linear, softplus-based penalty at the token level. This term penalizes token probabilities that are forced too close to zero, acting as a filter against over-suppression and maintaining linguistic coverage even among rejected generations.
Dual-Margin Optimization: The preference margin is enforced using a hybrid of hard and soft constraints. The hard margin imposes a strict boundary for optimization, ensuring that once a sufficient margin is reached, gradients vanish. The soft margin utilizes a sigmoidal gating function to maintain precise, effective optimization pressure near the critical boundary, thereby avoiding the vanishing or exploding gradients typical of pure hard/soft objectives.

Formally, the SLIME loss decomposes as $L(\theta) = L_w(\theta) + L_l(\theta) + L_{\text{dist}}(\theta)$, with each term addressing one instability mode observed in prior methods.
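The three-term structure can be sketched roughly as below. This summary does not give the exact functional forms, so every formula here is an assumption chosen only to match the verbal description: the mapping of $L_w$, $L_l$, and $L_{\text{dist}}$ to the anchor, stabilization, and margin terms, the log-probability `floor`, the margins `hard` and `soft`, and the gate sharpness `beta` are all hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean(values):
    return sum(values) / len(values)

def anchor_term(chosen_lp):
    """Assumed L_w: negative mean token log-likelihood of the chosen response."""
    return -mean(chosen_lp)

def stabilization_term(rejected_lp, floor=-6.0, p=2.5):
    """Assumed L_l: near zero for healthy tokens, steep once a rejected
    token's log-prob is pushed below the (hypothetical) floor."""
    return mean([softplus(floor - lp) ** p for lp in rejected_lp])

def dual_margin_term(chosen_lp, rejected_lp, hard=1.0, soft=0.5, beta=4.0):
    """Assumed L_dist: a hinge that is exactly zero past the hard margin,
    scaled by a sigmoid gate centred on the soft margin."""
    delta = mean(chosen_lp) - mean(rejected_lp)
    hinge = max(0.0, hard - delta)
    gate = sigmoid(beta * (soft - delta))
    return hinge * gate

def slime_style_loss(chosen_lp, rejected_lp):
    parts = {
        "L_w": anchor_term(chosen_lp),
        "L_l": stabilization_term(rejected_lp),
        "L_dist": dual_margin_term(chosen_lp, rejected_lp),
    }
    return sum(parts.values()), parts

total, parts = slime_style_loss([-0.5, -0.7], [-3.0, -2.5])
print(total, parts)
```

In this sketch the margin term switches off once the gap exceeds `hard`, so any further training signal comes only from the anchor and the stabilization penalty. That is the sense in which preference enforcement is decoupled from likelihood maximization.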

Empirical Evaluation

Extensive experiments are conducted using standardized evaluation protocols and three distinct model architectures: Llama3.2-3B, Qwen3-4B, and Gemma3-4B. Training employs a two-stage pipeline: initial supervised fine-tuning on a subset of UltraFeedback data, followed by preference alignment using DPO, SimPO, or SLIME. All fine-tuning leverages a high-capacity LoRA adaptation configuration, and models are evaluated on MT-Bench (general instruction-following) and Arena-Hard (robustness/adversarial) benchmarks.

Core Findings

Superior Instruction-Following and Robustness: Across architectures, SLIME surpasses both DPO and SimPO, yielding higher MT-Bench and Arena-Hard scores. For Gemma3-4B, SLIME achieves an MT-Bench score of 6.15 (over 30% above SFT) and an Arena-Hard score of 13.1, the highest among the compared methods on both benchmarks.
Mitigation of Unlearning: SimPO underperforms SFT on Llama3.2-3B for MT-Bench (4.22 vs. 4.56) and exhibits severe distribution collapse on Gemma3-4B for Arena-Hard. In contrast, SLIME demonstrates consistent improvements and avoids performance collapse, evidencing the effectiveness of the anchoring and stabilization terms.
Contribution of Individual Loss Components: Ablation studies confirm the necessity of each SLIME term. Removal of either the chosen sequence anchor or the token stabilization leads to significant performance drops, validating the theoretical analysis. The dual-margin approach is also shown to be crucial, as omitting either the hard or soft margin reduces scores relative to the full objective.
Hyperparameter Sensitivity: The exponent of the stabilization penalty ($p$) is shown to be important for balancing stability against flexibility, with the best results at an intermediate value ($p = 2.5$).
Generalization Across Models: Even on Qwen3-4B, where pretraining already instills strong instruction-following ability, SLIME delivers improvements on robustness benchmarks, indicating its compatibility with differing baseline capabilities.
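The role of the exponent $p$ noted under Hyperparameter Sensitivity can be illustrated with a small sweep. The penalty form `softplus(floor - logprob) ** p` and the `floor` value are assumptions made for illustration, since this summary does not state the paper's exact expression.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def token_penalty(logprob, p, floor=-6.0):
    """Hypothetical per-token stabilization penalty with exponent p."""
    return softplus(floor - logprob) ** p

# Healthy token, token at the floor, and a heavily suppressed token.
logprobs = (-2.0, -6.0, -16.0)
for p in (1.0, 2.5, 4.0):
    row = [token_penalty(lp, p) for lp in logprobs]
    print(f"p={p}: " + ", ".join(f"{v:.4g}" for v in row))
```

Under this assumed form, a larger $p$ shrinks the penalty where the softplus value is below 1 (tokens at or above the floor, leaving the model free to down-weight rejected tokens) and inflates it for tokens far below the floor. An intermediate exponent trades off those two effects.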

Practical and Theoretical Implications

The SLIME framework exposes a critical flaw in current offline preference optimization practice: alignment losses that target only margin maximization can actively degrade language modeling capacity. By decoupling preference-driven gradient flow from guided likelihood maximization and regularized coverage, SLIME advocates for a more granular, componentized loss design paradigm.

Practically, SLIME’s reference-free and parameter-efficient characteristics facilitate integration into modern training pipelines without requiring an explicit reference model or extensive reconfiguration of the optimization stack. The explicit token-level stabilizing penalty can be interpreted as a regularization strategy that prevents distributional drift, which has direct implications for safe deployment and robust incremental alignment.

Theoretically, SLIME motivates future exploration into more sophisticated decoupled objective formulations that robustly preserve pretraining capabilities while accommodating arbitrary downstream preference signals. Its dual-margin mechanism suggests a fruitful axis along which optimization boundaries can be tuned for task-specific trade-offs. Extensions to sequence-level rewards and online settings, including integration with critic-free RLHF variants, are directly motivated by the presented findings.

Conclusion

SLIME introduces an empirically validated objective for preference-based LLM alignment that remedies the objective mismatch in margin-based losses. By augmenting preference optimization with likelihood anchoring and targeted stabilization, it prevents the loss of pretraining-acquired abilities and improves both instruction-following and robustness across strong open base models. The decomposed loss perspective advanced by SLIME offers a template for reference-free alignment methodology. Future work should address generalization to larger models, alternative preference datasets, online policy learning, and multilingual settings to further validate and extend SLIME’s applicability.
