PLOT: Enhancing Preference Learning via Optimal Transport

Published 2 Apr 2026 in cs.CL | (2604.01837v1)

Abstract: Preference learning in LLMs has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global token-level relationships. We introduce PLOT, which enhances Preference Learning in fine-tuning-based alignment through a token-level loss derived from Optimal Transport. By formulating preference learning as an Optimal Transport Problem, PLOT aligns model outputs with human preferences while preserving the original distribution of LLMs, ensuring stability and robustness. Furthermore, PLOT leverages token embeddings to capture semantic relationships, enabling globally informed optimization. Experiments across two preference categories - Human Values and Logic & Problem Solving - spanning seven subpreferences demonstrate that PLOT consistently improves alignment performance while maintaining fluency and coherence. These results substantiate optimal transport as a principled methodology for preference learning, establishing a theoretically grounded framework that provides new insights for preference learning of LLMs.

Authors (9)

Summary

The paper proposes an OT-based loss that globally aligns token distributions with human preferences via semantic embeddings.
It demonstrates significant reductions in attack success rates and improvements in benchmarks for human values and logical problem solving.
The method integrates seamlessly with various fine-tuning strategies, preserving general model capabilities while enhancing robustness.

PLOT: Enhancing Fine-Tuning-Based Preference Learning via Optimal Transport

Motivation and Problem Formulation

Existing preference alignment strategies for LLMs, including RLHF paradigms and direct fine-tuning-based methods (SFT, DPO, PRO, AOT), are constrained by high computational demands, local rather than global optimization at the token level, hyperparameter sensitivity, and insufficient modeling of semantic dependencies. Many approaches apply divergence measures (e.g., KL, JS) or token-level losses without considering global semantic structure and, critically, do not offer a principled framework for aligning the entire output distribution with human-preferred behaviors. As a result, aligned models often remain vulnerable to adversarial and jailbreak attacks and can exhibit suboptimal robustness and generalization under preference constraints.

The paper proposes Preference Learning via Optimal Transport (PLOT), which reframes preference learning as a structured distributional alignment problem. By leveraging Optimal Transport (OT), PLOT computes the minimal transportation cost—incorporating semantic token embeddings—between the model output distribution and the human preference distribution. This enables fine-tuning objectives that penalize distributional misalignment in a way that is holistically informed, theoretically principled, and operationally robust.

Method: Optimal Transport-Based Preference Alignment

Central to PLOT is the integration of a semantic-aware OT-based loss $\mathcal{L}_{\text{PLOT}}$ into any standard fine-tuning objective:

$\mathcal{L} = \mathcal{L}_{\text{vanilla}} + \alpha \mathcal{L}_{\text{PLOT}}$

where $\alpha$ regulates the strength of the global alignment penalty.
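As a minimal sketch of how the two terms combine, the snippet below treats both losses as plain floats. The function name `combined_loss` and the example values are illustrative assumptions, not taken from the paper.

```python
def combined_loss(vanilla_loss: float, plot_loss: float, alpha: float) -> float:
    """Return L = L_vanilla + alpha * L_PLOT for scalar loss values."""
    if alpha < 0:
        # Assumption: the scaling factor is non-negative.
        raise ValueError("alpha is assumed to be non-negative")
    return vanilla_loss + alpha * plot_loss


# Hypothetical loss values for one training step.
total = combined_loss(vanilla_loss=2.0, plot_loss=0.5, alpha=0.5)
baseline = combined_loss(vanilla_loss=2.0, plot_loss=0.5, alpha=0.0)
```

Setting `alpha` to 0 recovers the unmodified fine-tuning objective, which is what lets the term be added to an existing pipeline without changing its base loss.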

Token-level preference distributions are constructed from positive/negative preference datasets. The preference distribution $\mathcal{P}_t$ is derived from the difference between the normalized token distributions:

$\mathcal{Q}_{\text{diff}} = \frac{\mathcal{Q}_+}{\sum \mathcal{Q}_+} - \frac{\mathcal{Q}_-}{\sum \mathcal{Q}_-}$

where $\mathcal{Q}_+$ and $\mathcal{Q}_-$ are frequency-based token distributions from the preferred and rejected outputs of each pair. A normalization and a non-negative transformation are then applied to $\mathcal{Q}_{\text{diff}}$ to obtain $\mathcal{P}_t$.
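The construction above might look roughly like the following sketch. It assumes that the non-negative transformation is a clip at zero and that normalization is a division by the remaining mass; the summary does not specify either choice or their order. The function names and the toy vocabulary are hypothetical.

```python
from collections import Counter


def token_frequencies(sequences, vocab):
    """Count how often each vocabulary token appears across all sequences."""
    counts = Counter(tok for seq in sequences for tok in seq)
    return [counts[tok] for tok in vocab]


def preference_distribution(preferred, rejected, vocab):
    """Build a target distribution from preferred and rejected token counts."""
    q_pos = token_frequencies(preferred, vocab)
    q_neg = token_frequencies(rejected, vocab)
    pos_total, neg_total = sum(q_pos), sum(q_neg)
    if pos_total == 0 or neg_total == 0:
        raise ValueError("both datasets must contain vocabulary tokens")
    # Q_diff: difference of the two normalized frequency distributions.
    q_diff = [p / pos_total - n / neg_total for p, n in zip(q_pos, q_neg)]
    # Assumption: non-negative transformation = clip negatives to zero.
    clipped = [max(d, 0.0) for d in q_diff]
    mass = sum(clipped)
    if mass == 0:
        raise ValueError("preferred and rejected distributions are identical")
    # Assumption: normalization = rescale so the entries sum to one.
    return [c / mass for c in clipped]


vocab = ["sorry", "help", "sure", "hack"]
preferred = [["sorry", "help"], ["help", "sure"]]
rejected = [["sure", "hack"], ["hack", "hack"]]
target = preference_distribution(preferred, rejected, vocab)
```

In this toy case the tokens that appear more often in rejected outputs (`"hack"`) or equally often in both (`"sure"`) receive zero target mass, while `"sorry"` and `"help"` share the mass in a 1:2 ratio.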

To encode semantic relationships, PLOT computes a cost matrix $C$ based on $l$-norms (typically Euclidean, $l=2$) of the differences between token embeddings $e_i$ and $e_j$:

$C_{ij} = \lVert e_i - e_j \rVert_l$

allowing the transportation cost to reflect semantic similarity in embedding space, unlike default position-agnostic or 0-1 cost matrices. For computational tractability, the resulting OT problem is then reduced to the one-dimensional Wasserstein-1 distance (Earth Mover's Distance) between cumulative distributions projected via embeddings.
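An embedding-based cost matrix of the kind described above can be sketched in a few lines. This assumes the cost between two tokens is the $l$-norm of the difference of their embeddings; the function name `cost_matrix` and the toy two-dimensional embeddings are illustrative only.

```python
def cost_matrix(embeddings, l=2):
    """Pairwise l-norm distances between token embeddings."""
    def distance(u, v):
        # l-norm of the difference vector u - v.
        return sum(abs(a - b) ** l for a, b in zip(u, v)) ** (1.0 / l)

    return [[distance(u, v) for v in embeddings] for u in embeddings]


# Hypothetical 2-D embeddings for a three-token vocabulary.
embeddings = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
C = cost_matrix(embeddings)          # Euclidean costs (l = 2)
C_l1 = cost_matrix(embeddings, l=1)  # Manhattan costs (l = 1)
```

Tokens that sit close together in embedding space get a low transport cost, so moving probability mass between near-synonyms is penalized less than moving it between unrelated tokens. A 0-1 cost matrix would treat every pair of distinct tokens as equally far apart.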

The final $\mathcal{L}_{\text{PLOT}}$ term penalizes the aggregated semantic transport cost between the current model output distribution and the preference-aligned target. This global, embedding-aware distributional loss is universally composable with existing preference learning frameworks.
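The one-dimensional Wasserstein-1 reduction can be illustrated with a short sketch. In one dimension the distance equals the area between the two cumulative distribution functions. How the paper projects embeddings onto a line is not stated in this summary, so the dot product with a fixed direction vector used in `project` is purely an assumption, as are the function names and toy values.

```python
from itertools import accumulate


def project(embeddings, direction):
    """Project each embedding onto a direction vector (dot product)."""
    return [sum(e * d for e, d in zip(emb, direction)) for emb in embeddings]


def wasserstein_1d(p, q, positions):
    """Wasserstein-1 distance between two distributions on the same 1-D points."""
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    xs = [positions[i] for i in order]
    cdf_p = list(accumulate(p[i] for i in order))
    cdf_q = list(accumulate(q[i] for i in order))
    # Sum |F_p - F_q| over each gap between consecutive sorted positions.
    return sum(abs(fp - fq) * (x1 - x0)
               for fp, fq, x0, x1 in zip(cdf_p, cdf_q, xs, xs[1:]))


# Hypothetical embeddings, projected onto the first coordinate axis.
embeddings = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
positions = project(embeddings, direction=[1.0, 0.0])
model_dist = [0.5, 0.5, 0.0]   # current model output distribution
target_dist = [0.0, 0.5, 0.5]  # preference-aligned target
loss = wasserstein_1d(model_dist, target_dist, positions)
```

Here half of the mass has to travel from position 0 to position 3 (or, equivalently, shift one token at a time), so `loss` evaluates to 1.5. The sort-and-accumulate form avoids solving a full transport plan over the vocabulary.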

Empirical Evaluation

Datasets and Baselines

Experiments target two principal preference axes: (1) Human Values (Harmlessness, Helpfulness, Humanity), and (2) Logic & Problem Solving (Mathematics, Reasoning, Coding, STEM), with standard datasets (e.g., HH-RLHF, MT-Bench, GSM8K, MATH) and advanced adversarial evaluation protocols (e.g., HarmBench, multiple red teaming attack methods such as Zero-Shot, PEZ, GBDA, UAT, SFS, GCG).

Baselines include SFT, DPO, PRO, and AOT, with PLOT integrated as a plug-and-play loss term.

Model families evaluated include Llama3.2-3B, Llama3.1-8B, and Qwen2.5-7B.

Main Quantitative Results

PLOT improves consistently over strong baselines on multiple axes:

Attack Success Rate (ASR): Across all red teaming attacks and models, PLOT yields significant ASR reductions versus the SFT, DPO, and AOT baselines. For example, for Llama3.2-3B with DPO alignment, PLOT reduces SFS attack ASR from 25.75% to 16.92%, and GCG attack ASR from 30.08% to 26.83%. Similar improvements hold across alignment methods and model sizes.
Human Value Metrics: On HH-RLHF and MT-Bench, PLOT outperforms counterparts in Helpfulness, Humanity, and reward-model-based evaluation, e.g., Helpfulness (Reward): 72.14 (PLOT) vs. 70.63 (DPO).
Logical Problem Solving: PLOT achieves higher accuracy on GSM8K, MATH, and better reasoning/coding/STEM scores than all direct preference optimization (DPO, PRO) counterparts.
General Capability Preservation: On AlpacaEval 2.0 (LC Win Rate), PLOT mitigates degradation of out-of-domain generalization seen with DPO, demonstrating enhanced alignment without sacrificing underlying model expressivity.
Figure 1: Comparison of the LC Win Rate shows that $\mathcal{L}_{\text{PLOT}}$ preserves the general capabilities of the model under the original fine-tuning method.
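To put the reported ASR figures for Llama3.2-3B with DPO in perspective, the short calculation below converts them into absolute and relative reductions. Only the four percentages come from the text; the helper name `asr_reduction` is an illustrative assumption.

```python
def asr_reduction(baseline_asr: float, plot_asr: float):
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = baseline_asr - plot_asr
    relative = absolute / baseline_asr * 100.0
    return absolute, relative


# ASR values (in %) quoted above for Llama3.2-3B with DPO alignment.
sfs_abs, sfs_rel = asr_reduction(25.75, 16.92)
gcg_abs, gcg_rel = asr_reduction(30.08, 26.83)
```

The SFS reduction works out to about 8.8 percentage points (roughly a 34% relative drop), while the GCG reduction is 3.25 points (roughly 11% relative), so the size of the gain varies considerably by attack type.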

Robustness and Hyperparameter Sensitivity

Red teaming defense stability: As the number of red teaming cases or update steps increases, PLOT’s ASR gains remain robust and stable.

Figure 2: The ASR curves of three models under different case counts for the Zero-Shot method (Left) and varying update steps of GCG (Right). PLOT consistently demonstrates superior defense capabilities and stability compared to DPO.

Hyperparameter stability: Error/ASR metrics with PLOT remain essentially constant across a wide range of scaling factors for $\mathcal{L}_{\text{PLOT}}$; in contrast, DEFT and related methods show strong sensitivity.
Semantic embedding utility: Ablation studies show removing semantic embedding information from $\mathcal{L}_{\text{PLOT}}$ yields substantial ASR degradation, validating the importance of semantic transport for robust defense.
Divergence alternatives: Attempts to replace OT with KL/JS divergences in the distributional loss lead to infinite losses or instability due to vocabulary sparsity. OT yields both numerical and empirical stability.
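The instability of KL divergence on sparse vocabularies can be seen in a toy example. The sketch below compares KL with a one-dimensional Wasserstein-1 distance on two distributions with disjoint support. Placing tokens on a unit-spaced line by index is an illustrative simplification, not the embedding-based cost used by PLOT, and both function names are hypothetical.

```python
import math
from itertools import accumulate


def kl_divergence(p, q):
    """KL(p || q); infinite when q has zero mass where p does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def wasserstein_unit_grid(p, q):
    """Wasserstein-1 distance for tokens placed at positions 0, 1, 2, ..."""
    # With unit spacing, W1 is the sum of absolute CDF differences.
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


# Two sparse distributions over a four-token vocabulary with no overlap.
p = [1.0, 0.0, 0.0, 0.0]
q = [0.0, 0.0, 0.0, 1.0]
kl = kl_divergence(p, q)
w1 = wasserstein_unit_grid(p, q)
```

KL is infinite here because `q` assigns zero probability to the only token `p` uses, whereas the Wasserstein distance stays finite (3.0, the distance the mass must travel) and keeps shrinking smoothly as the two distributions move closer.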

Theoretical and Practical Implications

PLOT defines a theoretically motivated, domain-agnostic framework for preference alignment that, by design, optimizes the global token distribution with explicit semantics rather than locally per-token. This not only advances practical robustness under adversarial attacks but generalizes to preference categories beyond “harmlessness” (e.g., logical reasoning or problem-solving).

The method is practically efficient (∼2.7% increase in training time over DPO), generalizes across architectures, and is immediately integrable into existing fine-tuning paradigms. The OT-based construction is particularly compelling for LLMs since it avoids known limitations of KL/JS divergence in high-dimensional sparse vocabularies and enables task-adaptive semantic transport via embedding-informed cost matrices.

Future Directions

Advancing PLOT along several dimensions could yield further gains:

Scaling to larger LLMs with more granular vocabulary and embeddings;
Exploring richer or architecture-specific embedding spaces for transport cost construction;
Extending the framework to multi-turn conversational or multimodal preference alignment;
Applying OT-based preference fine-tuning in highly data-limited or non-standard alignment domains.

Conclusion

PLOT presents a theoretically sound, universally applicable approach to preference alignment in LLMs by leveraging Optimal Transport with semantic embedding-informed loss. Empirical findings validate substantial improvements in robustness and preference alignment as well as preservation of general language capabilities, outperforming state-of-the-art DPO- and SFT-based methods with minimal computational overhead. The framework directly addresses critical limitations in existing sequence- and token-level losses, advancing both the methodological and empirical status quo in human-aligned LLM optimization.
