Model Merging: Foundations and Algorithms
Abstract: Modern deep learning usually treats models as separate artifacts: trained independently, specialized for particular purposes, and replaced when improved versions appear. This thesis studies model merging as an alternative paradigm: combining independently trained neural networks directly in weight space, with little or no optimization and without requiring access to the original training data. The thesis considers two main regimes. In the single-task setting, where models share an objective but differ in initialization, we introduce C$^2$M$^3$, a cycle-consistent merging algorithm based on Frank-Wolfe optimization. C$^2$M$^3$ aligns multiple networks into a shared, reference-free parameter space, making weight averaging meaningful without privileging any individual model. In the multi-task setting, where models are fine-tuned for different downstream tasks from a common pretrained initialization, we first develop a theoretical account of task vectors as approximate gradients. This explains both the effectiveness and the limitations of task arithmetic. Building on this view, we show that task vectors inherit the low-rank structure of gradients and introduce Task Singular Vectors (TSV), a decomposition that enables compression and interference reduction through TSV-Merge. We then present MASS, an input-adaptive routing method that uses TSV geometry to select task-relevant subspaces at inference time. Finally, we introduce MERGE$^3$, an evolutionary merging framework that uses Item Response Theory to reduce evaluation costs by up to 50$\times$ while preserving solution quality. Together, these contributions provide theoretical and algorithmic foundations for model merging, supporting a paradigm in which learned capabilities can be composed, reused, and extended across models.
Explain it Like I'm 14
What this paper is about
This thesis looks at a new way to combine the “brains” of different AI models. Instead of training one giant model from scratch or retraining with lots of data, it shows how to merge already-trained neural networks directly by combining their internal settings (their weights). The goal is to reuse and mix learned skills quickly, cheaply, and without extra training data.
The main questions the paper asks
- Can we safely “average” or combine the weights of different models so the result still works well?
- How can we merge models that learned the same task separately (like two students who studied the same exam but started from different notes)?
- How can we merge models that learned different tasks (like a math expert and a history expert) without them getting in each other’s way?
- Can we do all this with little or no extra training and with low cost?
How the research was done (in plain language)
The thesis studies two situations and proposes tools for each:
1) Single-task merging: many models, same goal
Picture several people solving the same puzzle but each arranged the pieces differently. If you try to average their solutions piece-by-piece, it might not make sense because the pieces don’t line up.
- Problem: Different models label their inner parts (neurons) in different ways, like using different names for the same puzzle pieces. So simple averaging can fail.
- Solution: The thesis introduces an algorithm called C2M3. “Cycle-consistent” means it makes the pairwise matchings among many models agree with each other in loops (no contradictions when you go around the “cycle”). It uses a classic optimization idea called Frank–Wolfe, which is like moving step-by-step toward a good blend while staying within safe limits. The result is that it aligns all the models into a shared “coordinate system” so that averaging their weights becomes meaningful, without picking one model as the boss or anchor.
Analogy: Before averaging recipes from different chefs, you first make sure everyone is talking about the same ingredients in the same order. Then an average of “2 cups flour + 1 cup sugar” with “2.5 cups flour + 0.5 cups sugar” makes sense.
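The "line up the pieces before averaging" idea can be sketched on a toy two-layer network. This is a minimal illustration, not C2M3 itself: it aligns one model to a fixed reference by brute-force search over hidden-unit orders, whereas the thesis describes a reference-free, cycle-consistent procedure. The sizes, the `align_to` helper, and the choice to make model B an exact shuffled copy of model A are all assumptions made for the example.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, h, o = 3, 4, 2  # input, hidden, output widths (toy sizes, assumed)

def mlp(W1, W2, X):
    """Two-layer ReLU network applied to a batch X of shape (n, d)."""
    return np.maximum(X @ W1.T, 0.0) @ W2.T

# Model A, and model B = the same function with its hidden units shuffled.
W1_a, W2_a = rng.normal(size=(h, d)), rng.normal(size=(o, h))
shuffle = np.array([2, 0, 3, 1])
W1_b, W2_b = W1_a[shuffle], W2_a[:, shuffle]

X = rng.normal(size=(16, d))
same_function = np.allclose(mlp(W1_a, W2_a, X), mlp(W1_b, W2_b, X))

def align_to(W1_ref, W2_ref, W1, W2):
    """Brute-force the hidden-unit order that best matches the reference weights."""
    best = max(
        itertools.permutations(range(h)),
        key=lambda q: np.sum(W1_ref * W1[list(q)]) + np.sum(W2_ref * W2[:, list(q)]),
    )
    q = list(best)
    return W1[q], W2[:, q]

# Naive averaging mixes unrelated hidden units; aligned averaging does not.
naive = (0.5 * (W1_a + W1_b), 0.5 * (W2_a + W2_b))
W1_b_al, W2_b_al = align_to(W1_a, W2_a, W1_b, W2_b)
aligned = (0.5 * (W1_a + W1_b_al), 0.5 * (W2_a + W2_b_al))

err_naive = np.abs(mlp(*naive, X) - mlp(W1_a, W2_a, X)).max()
err_aligned = np.abs(mlp(*aligned, X) - mlp(W1_a, W2_a, X)).max()
print(same_function, err_naive, err_aligned)
```

Brute force only works for a handful of hidden units; at realistic widths the matching step is usually posed as a linear assignment problem instead.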
2) Multi-task merging: many models, different goals
Now imagine one model learned math, another learned history, another learned drawing. How do we combine their skills without them interfering?
- Task vectors: For each task, take the difference between the fine-tuned model and its original base model (before it learned that task). This “difference” is the task’s “skill vector.” It’s like noting how the base recipe was changed to make a new dish.
- Gradient view: The thesis shows that these task vectors behave like gradient steps (directions a model moves during training). This explains why “task arithmetic” (adding or subtracting skill vectors to mix abilities) sometimes works—and why it sometimes doesn’t.
- Low-rank structure: In everyday terms, even though a model has millions of numbers, the real “action” often happens in just a few important directions. The thesis formalizes this with Task Singular Vectors (TSV), a way to break a task’s changes into its most important directions (like capturing the main notes of a song instead of every tiny sound).
- TSV-Merge: Using TSV, the method compresses each task (fewer numbers) and merges tasks while reducing interference—so skills don’t overwrite each other.
- MASS (an adaptive router): At inference time (when the model is answering a question), MASS looks at the input and decides which small set of task-directions (subspaces) to use. Think of a traffic router that sends cars onto the right lanes, avoiding jams and speeding things up.
- MERGE³ (an evolutionary framework with Item Response Theory): To search for good merges, it uses an “evolutionary” process (try variations, keep the best, repeat). To save time, it borrows a trick from educational testing called Item Response Theory (IRT), which quickly estimates how good a model is by asking only the most informative “questions.” This cuts evaluation cost by up to 50× while keeping quality.
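The task-vector and gradient-view bullets above can be made concrete on a toy problem. The sketch below assumes linear regression tasks and plain gradient descent (neither comes from the thesis). After a single step the task vector equals the negative gradient scaled by the learning rate, so adding two task vectors to the base weights is the same as one gradient step on the summed losses. After many steps this equality only holds approximately, which is one intuition for why task arithmetic can degrade.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n, lr = 5, 50, 0.1  # toy sizes and learning rate (assumed)

def make_task():
    """A synthetic regression task with its own ground-truth weights."""
    X = rng.normal(size=(n, dim))
    y = X @ rng.normal(size=dim)
    return X, y

def grad(theta, X, y):
    """Gradient of the loss 0.5 * mean((X @ theta - y) ** 2)."""
    return X.T @ (X @ theta - y) / len(y)

def fine_tune(theta0, X, y, steps):
    """Plain gradient descent starting from the shared base weights."""
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * grad(theta, X, y)
    return theta

theta0 = rng.normal(size=dim)  # shared "pretrained" weights
tasks = [make_task(), make_task()]

# Task vector = fine-tuned weights minus base weights.
task_vectors = [fine_tune(theta0, X, y, steps=1) - theta0 for X, y in tasks]

# One-step task vectors are exactly scaled negative gradients at theta0.
one_step_is_gradient = all(
    np.allclose(tv, -lr * grad(theta0, X, y))
    for tv, (X, y) in zip(task_vectors, tasks)
)

# Task arithmetic: add the task vectors to the base model.
merged = theta0 + sum(task_vectors)
joint_step = theta0 - lr * sum(grad(theta0, X, y) for X, y in tasks)
print(one_step_is_gradient, np.allclose(merged, joint_step))
```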
The key findings and why they matter
- Single-task: C2M3 aligns multiple models into a shared space so that simple weight averaging works well without choosing a single reference model. This makes combining many independently trained models safer and more reliable.
- Theory of task vectors: Task vectors are closely tied to training gradients. This connection clarifies when adding/subtracting task vectors (task arithmetic) should work, and when it may fail.
- Low-rank structure is real and useful: Because task changes mostly live in a few key directions, TSV can:
- Compress models (fewer numbers, same skill),
- Reduce interference when merging different tasks (skills don’t “fight” as much),
- Enable smarter, input-aware routing (MASS) so the model only uses the relevant skills.
- Practical speed-ups: MERGE³ uses IRT to evaluate candidate merges far faster (up to 50× cheaper) without sacrificing performance, making large-scale merging practical.
These results matter because they let us mix and match AI skills without retraining on tons of data, which saves compute, time, and energy.
What this could change in the real world
- Faster AI development: Teams can combine the strengths of many specialized models into one, without starting over.
- Lower cost and greener AI: Less training and less evaluation mean less energy use and lower bills.
- Privacy-friendly reuse: Merging in “weight space” can avoid using the original training data, which is helpful when data is private or not available.
- Flexible AI systems: We can build libraries of reusable “skills” (task vectors/TSVs) and plug them together for new applications, from language understanding to vision and beyond.
- Better on-device or edge AI: Compression and routing help run powerful, multi-skill models on smaller devices.
In short, the thesis builds both the theory and the tools for turning many separate trained models into a single, capable model—safely, efficiently, and with little or no extra training data.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a consolidated list of concrete gaps and open questions left unresolved by the paper. Each point is phrased to guide actionable future work.
Overall foundations and scope
- Formal conditions for when weight-space merging is preferable to ensembling or distillation, including performance–compute trade-offs and scenarios where merging provably cannot outperform alternatives.
- Scaling behavior of merging algorithms when applied to foundation-scale models (e.g., 7B–70B parameters), including memory footprint, distributed execution, and communication overheads.
- Systematic evaluation across modalities (vision, NLP, speech) and architectures (CNNs, ViTs, LLMs, diffusion, MoE), with ablations isolating architecture-specific failure modes.
- Robustness and safety: how merging interacts with calibration, uncertainty estimation, biases, and the propagation or attenuation of backdoors/toxic behaviors from constituent models.
- Legal/ethical implications and provenance tracking when composing models with different licenses or training data mixtures; mechanisms to audit lineage and capability inheritance.
Single-task model merging (C2M3)
- Convergence guarantees of the Frank–Wolfe-based cycle-consistent alignment in non-convex, permutation-symmetric landscapes; characterization of stationary points and conditions for global optimality.
- Sensitivity analysis to non-permutation symmetries (e.g., scaling, orthogonal transforms, residual connections, LayerNorm statistics), and extensions that jointly account for these symmetries during alignment.
- Quantitative bounds linking alignment error (e.g., fraction of mismatched channels/heads) to post-merge loss increases; diagnostic metrics and stopping criteria for reliable merging.
- Complexity analysis and empirical scaling with number of models, layers, and width (e.g., quadratic vs. linear in model count for synchronization); pruning or sketching methods to keep alignment tractable at scale.
- Handling normalization layers (BatchNorm/LayerNorm) and attention-specific components during merging (e.g., Q/K/V head permutations, rotary embeddings), including principled rescaling or re-centering procedures.
- Applicability to training heterogeneity: can checkpoints trained with different optimizers, learning-rate schedules, or data augmentations be merged without an additional re-basin (re-alignment) step?
- Function-space equivalence: criteria ensuring the shared parameter space is not only numerically aligned but functionally equivalent on input distributions; techniques to detect function misalignment pre-merge.
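Several of the questions above depend on what cycle consistency means for permutations. The sketch below is an illustration under assumptions, not the C2M3 algorithm: each model gets a hypothetical map into a shared "universe" ordering, and pairwise maps are built by passing through it, which makes every loop compose to the identity. Overwriting a single pairwise map independently breaks that property. The names `to_universe` and `cycle_error` are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_models = 5, 3  # toy sizes (assumed)

def perm_matrix(order):
    """Permutation matrix P with P[i, order[i]] = 1."""
    P = np.zeros((len(order), len(order)), dtype=int)
    P[np.arange(len(order)), order] = 1
    return P

def cycle_error(P, i, j, k):
    """Number of entries where the loop i <- j <- k <- i differs from the identity."""
    loop = P[i][j] @ P[j][k] @ P[k][i]
    return int(np.abs(loop - np.eye(n_units, dtype=int)).sum())

# Hypothetical maps from each model's unit order into a shared universe order.
to_universe = [perm_matrix(rng.permutation(n_units)) for _ in range(n_models)]

# Pairwise maps routed through the universe: P[i][j] = U_i^T U_j.
P = [[to_universe[i].T @ to_universe[j] for j in range(n_models)]
     for i in range(n_models)]
consistent_error = cycle_error(P, 0, 1, 2)

# Replace one pairwise map with an independently chosen matching
# (here: the original with its first two rows swapped).
P_bad = [row[:] for row in P]
P_bad[0][1] = P[0][1][[1, 0, 2, 3, 4]]
inconsistent_error = cycle_error(P_bad, 0, 1, 2)
print(consistent_error, inconsistent_error)
```

Storing one map per model rather than one per pair also bears on the complexity question above: the number of maps grows linearly in the model count instead of quadratically.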
Task vectors and low-rank structure (TSV, TSV-Merge)
- Assumptions underlying the gradient-based interpretation of task vectors (e.g., small learning rates, local quadraticity) and their validity under large-step fine-tuning, extensive LoRA updates, or sharp curvature.
- Empirical and theoretical characterization of when gradients/task vectors are “sufficiently low-rank,” including task- and layer-wise rank distributions and failure cases where rank inflation occurs.
- Automatic layer-wise rank selection and model selection criteria (e.g., information criteria, stability-based selection) with guarantees on interference–performance trade-offs.
- Generalization bounds linking TSV rank and interference measures (e.g., STI) to post-merge task performance and OOD robustness.
- Transportability of task vectors across different pretrains/architectures (e.g., LLaMA→Mistral, ViT→ConvNet) and mechanisms to align or remap subspaces for cross-base merging.
- Compositionality limits: scaling laws for interference as the number of tasks grows; strategies for subspace packing, clustering, or sparsification that maintain accuracy with dozens/hundreds of tasks.
- Interactions with parameter-efficient fine-tuning (LoRA, adapters): how to best derive and compose TSVs when only low-rank adapters or sparse masks are available.
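The rank-related questions above can be explored with a small experiment. The sketch below uses synthetic task matrices (a low-rank signal plus small noise, which is an assumption rather than real fine-tuning data), truncates each with an SVD, and reports the relative reconstruction error and the storage ratio. The `overlap` value is a simple subspace-overlap proxy invented for this example, not the thesis's interference measure.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, true_rank, k = 64, 48, 4, 4  # layer shape and ranks (assumed)

def synthetic_task_matrix():
    """Rank-`true_rank` update plus small dense noise (synthetic, not real data)."""
    low_rank = rng.normal(size=(m, true_rank)) @ rng.normal(size=(true_rank, n))
    return low_rank + 0.01 * rng.normal(size=(m, n))

def truncate(delta, k):
    """Keep the top-k singular triplets of a per-layer task matrix."""
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k]

delta_a, delta_b = synthetic_task_matrix(), synthetic_task_matrix()
U_a, s_a, Vt_a = truncate(delta_a, k)
U_b, s_b, Vt_b = truncate(delta_b, k)

# Reconstruction quality and storage cost of the rank-k approximation.
approx_a = (U_a * s_a) @ Vt_a
rel_error = np.linalg.norm(delta_a - approx_a) / np.linalg.norm(delta_a)
compression = k * (m + n + 1) / (m * n)

# Hypothetical overlap proxy in [0, 1]: 0 means orthogonal left subspaces,
# 1 means identical ones.
overlap = np.linalg.norm(U_a.T @ U_b) ** 2 / k
print(rel_error, compression, overlap)
```

Sweeping `k` per layer and plotting `rel_error` against `compression` is one simple way to probe the rank-selection question raised above.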
Input-adaptive routing (MASS)
- Theoretical guarantees for routing consistency and error bounds: conditions under which TSV-geometry-based gating is Bayes-consistent or achieves low regret.
- Robustness of routing under distribution shift, noisy inputs, and adversarial manipulation; detection and fallback strategies when routing confidence is low.
- Latency and memory overheads from per-task subspace projections; methods to compress or amortize routing costs (e.g., learned proxies, hashing, or shared subspace hierarchies).
- Applicability beyond classification (e.g., generative LLMs, diffusion models, multi-turn dialogue): how to define and exploit TSV geometry for sequence generation and structured outputs.
- Online/continual routing: dynamic incorporation of new tasks and subspaces without reprocessing prior tasks; criteria to trigger subspace updates or merges.
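To ground the routing questions above, here is a minimal sketch of residual-based gating. It assumes each task is represented by an orthonormal basis and scores an input feature by the Euclidean distance to its projection onto each subspace, as in the glossary's "Euclidean residual" entry. The random bases, the softmax temperature, and the `route` helper are illustrative assumptions, not the MASS procedure as specified in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k, n_tasks = 32, 3, 4  # feature size, subspace rank, task count (assumed)

# Hypothetical per-task subspaces: orthonormal bases from QR of random matrices.
bases = [np.linalg.qr(rng.normal(size=(dim, k)))[0] for _ in range(n_tasks)]

def residuals(x, bases):
    """Distance from x to its orthogonal projection onto each task subspace."""
    return np.array([np.linalg.norm(x - U @ (U.T @ x)) for U in bases])

def route(x, bases, temperature=1.0):
    """Return the task with the smallest residual, plus soft gates over all tasks."""
    r = residuals(x, bases)
    logits = -r / temperature
    gates = np.exp(logits - logits.max())  # numerically stable softmax
    gates /= gates.sum()
    return int(np.argmin(r)), gates

# An input that lies mostly in task 2's subspace, plus a little noise.
true_task = 2
x = bases[true_task] @ np.array([1.0, -2.0, 1.5]) + 0.05 * rng.normal(size=dim)
chosen, gates = route(x, bases)
print(chosen, gates.round(3))
```

A low maximum gate value could serve as the low-confidence signal for the fallback strategies mentioned above, and the per-task projections in `residuals` are the overhead that the latency question refers to.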
Evolutionary merging with IRT (MERGE³)
- Identifiability and calibration of IRT parameters when “respondents” are models rather than humans, including the appropriate number of ability dimensions and priors (e.g., 1PL/2PL/3PL/MD-MIRT choices).
- Extension of IRT-based evaluation beyond binary correctness to continuous or structured metrics (e.g., BLEU, ROUGE, exact match vs. partial credit), requiring graded or nominal response models.
- Sample efficiency guarantees: bounds showing how many items are needed to preserve model ranking and guide evolutionary search within a target error tolerance.
- Selection bias and representativeness of the item pool: procedures to curate or adaptively update items so that estimated fitness correlates with full-benchmark performance under domain shift.
- Overfitting risks to the IRT-derived evaluation set during evolutionary search; mechanisms for cross-validation, holdouts, or exploration bonuses to maintain generalization.
- Integration with multi-objective optimization (e.g., accuracy, interference, calibration, fairness): designing IRT-like latent traits that faithfully reflect multi-criteria performance.
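The IRT questions above are easier to reason about with a concrete model in hand. The sketch below assumes a two-parameter logistic (2PL) model with already-calibrated item parameters, selects the items with the highest Fisher information at a prior ability guess, and estimates ability by grid-search maximum likelihood. All parameters are synthetic, and the estimator is a generic textbook choice rather than the one used in MERGE³; the 10× item reduction here is a property of the toy setup only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, budget = 500, 50  # full benchmark size and evaluation budget (assumed)

# Hypothetical, already-calibrated 2PL item parameters.
discrimination = rng.uniform(0.5, 2.0, size=n_items)  # a_j
difficulty = rng.normal(0.0, 1.0, size=n_items)       # b_j

def p_correct(theta, a, b):
    """2PL probability that a respondent with ability theta answers correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of each item at ability theta under the 2PL model."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1.0 - p)

def estimate_ability(responses, a, b, grid=np.linspace(-4, 4, 801)):
    """Maximum-likelihood ability estimate over a fixed grid."""
    p = p_correct(grid[:, None], a[None, :], b[None, :])
    log_lik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    return float(grid[np.argmax(log_lik)])

# Simulate one model's right/wrong answers on the full benchmark.
true_ability = 0.7
responses = (
    rng.random(n_items) < p_correct(true_ability, discrimination, difficulty)
).astype(float)

# Evaluate only the most informative items around a prior ability guess.
prior_guess = 0.0
subset = np.argsort(-item_information(prior_guess, discrimination, difficulty))[:budget]

theta_full = estimate_ability(responses, discrimination, difficulty)
theta_subset = estimate_ability(
    responses[subset], discrimination[subset], difficulty[subset]
)
print(theta_full, theta_subset)
```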
Evaluation methodology and reproducibility
- Standardized benchmarks and protocols for multi-task merging that jointly measure accuracy, interference, calibration, and compute/latency, including unified datasets spanning modalities.
- Ablations disentangling contributions of alignment, low-rank truncation, and routing; sensitivity to hyperparameters (e.g., merge coefficients, ranks, gating thresholds).
- Reproducibility at scale: reference implementations with memory- and compute-aware defaults, and guidelines for merging extremely large checkpoints (e.g., sharded weights, mixed precision, quantization-aware merging).
Security, fairness, and governance
- Detection and mitigation of undesirable capability transfer (e.g., backdoors, prompt injection behaviors) during merging; certification tests and repair strategies in weight space.
- Fairness impacts when merging tasks trained on demographically skewed datasets; metrics and constraints to prevent amplification of biases through composition.
- Provenance-preserving metadata and “merge manifests” to track sources, licenses, and known limitations, enabling responsible redistribution of merged models.
Practical Applications
Summary
Based on the thesis “Model Merging: Foundations and Algorithms,” the following applications derive from its core contributions: (1) cycle-consistent, anchor-free single-task merging via C2M3; (2) a gradient-based theory of task vectors and their low-rank structure; (3) Task Singular Vectors (TSV) for compression and interference reduction (TSV-Merge); (4) MASS, an input-adaptive router leveraging TSV geometry; and (5) MERGE³, an evolutionary merging framework using Item Response Theory (IRT) to cut evaluation costs.
Below are actionable use cases grouped by deployment horizon. Each item notes sectors, likely tools/workflows, and key assumptions/dependencies.
Immediate Applications
- Model soups without data via alignment-first averaging (C2M3)
- Sectors: software/AI platforms, MLOps, cloud.
- What: Merge independently trained models (same architecture, differing only in random seed or initialization) by aligning permutation symmetries and averaging, yielding ensemble-like gains without added inference cost.
- Tools/workflows: CI/CD step that merges checkpoints from multiple training runs; checkpoint registry plugin to auto-align-and-merge top-k runs; A/B “merge vs best” promotion gates.
- Assumptions/dependencies: identical architectures and compatible layer shapes; models trained on the same task/objective; alignment solver scales to model size; license compatibility across checkpoints.
- Cross-silo model aggregation in federated or privacy-restricted settings (C2M3)
- Sectors: healthcare, finance, IoT/edge, public sector.
- What: Server-side weight aggregation across client models without collecting raw data, reducing communication and privacy risk versus centralized training.
- Tools/workflows: FL server plugin for permutation alignment + Frank–Wolfe-based merging; periodic aggregation with audit logs.
- Assumptions/dependencies: clients share a common objective and architecture; distribution shift remains bounded; robust alignment under heterogeneous client training.
- Shipping multi-skill models as “skill packs” (task vectors + TSV-Merge)
- Sectors: LLM platforms, creative AI, enterprise AI.
- What: Distribute and compose capabilities as lightweight deltas decomposed into low-rank TSVs to reduce interference when adding skills (e.g., code, math, safety).
- Tools/workflows: “SkillStore” of TSV packs; model hub metadata for base-model compatibility; CLI to compose packs and run interference checks.
- Assumptions/dependencies: skills fine-tuned from the same pretrained base; low-rank approximation preserves salient behavior; legal/IP compatibility; red-teaming for safety retention.
- Memory- and latency-efficient multi-task deployment on edge (TSV-based compression)
- Sectors: mobile, embedded/IoT, robotics.
- What: Compress per-task deltas into low-rank subspaces, ship a single base model + compact TSVs, and selectively activate relevant components at inference.
- Tools/workflows: build-time TSV extraction; on-device conditional execution; runtime toggles for task subspaces.
- Assumptions/dependencies: hardware/framework support for conditional compute and low-rank kernels; stable task performance under compression; battery/latency constraints met.
- Safer multi-skill integration via interference-aware merging (TSV-Merge)
- Sectors: consumer AI, enterprise compliance.
- What: Merge new capabilities with minimal degradation to safety/alignment tasks by minimizing cross-task interference using TSV geometry.
- Tools/workflows: pre-merge interference diagnostics (e.g., STI metrics); automatic rank selection and regularization; post-merge safety tests.
- Assumptions/dependencies: safety behavior is represented in the task vectors; interference metrics correlate with risk; comprehensive evaluation sets.
- Data-less capability fusion for diffusion and vision models
- Sectors: media/entertainment, design tools, retail visualization.
- What: Combine specialized diffusion/vision experts (e.g., style, concept, segmentation) into a single deployable model using alignment + task vector arithmetic.
- Tools/workflows: “concept pack” TSVs; style/control sliders influence rank/scale per concept; export to on-prem tools.
- Assumptions/dependencies: same base checkpoint; compatible training scales/time; quality checks for artifacting/style bleed.
- Evaluation cost reduction in AutoML and model selection (MERGE³ + IRT)
- Sectors: MLOps, benchmarking services, academic labs.
- What: Use a small set of IRT-calibrated, highly informative items to estimate performance, reducing evaluation budget by up to ~50× while maintaining ranking fidelity.
- Tools/workflows: evaluator service that maintains IRT-calibrated item banks; integration into hyperparameter search and evolutionary merging loops.
- Assumptions/dependencies: sufficient historical model responses to calibrate item difficulties/discriminations; domain transferability of item parameters; careful handling of non-binary metrics.
- Lightweight routing among a few experts (MASS-lite)
- Sectors: SaaS AI, customer support workflows.
- What: For a small catalogue of capabilities, route inputs to the most relevant subspace using TSV geometry for improved accuracy/efficiency over static ensembles.
- Tools/workflows: router module emitting per-task gates; monitoring misroute rates; fallback to generalist path.
- Assumptions/dependencies: well-separated task subspaces; stable feature extraction for gating; bounded catalog size to keep overhead low.
- Academic prototyping for task arithmetic and representation studies
- Sectors: academia/research.
- What: Faster experiments on composition, emergent abilities, and representation similarity by merging fine-tuned checkpoints and analyzing TSVs.
- Tools/workflows: notebooks for TSV extraction, rank sweeps, interference plots; reproducible compose/evaluate pipelines.
- Assumptions/dependencies: availability of open checkpoints from common bases; consistent evaluation suites.
Long-Term Applications
- Large-scale expert marketplaces and dynamic routing (MASS at scale)
- Sectors: AI platforms/marketplaces, cloud serving.
- What: Serve thousands of “skill packs” and dynamically route per-input through relevant subspaces, paying only for activated compute.
- Tools/workflows: router training with online feedback; caching/popularity-based preloading; SLA-aware conditional compute orchestration.
- Assumptions/dependencies: scalable router accuracy; latency budgets with conditional execution; robust isolation of unsafe interactions among experts; governance for third-party deltas.
- Cross-organization skill exchange with provenance and compliance
- Sectors: enterprise software, regulated industries.
- What: Standardize packaging and verification of task vectors/TSVs with cryptographic provenance, EULAs, and compliance checks.
- Tools/workflows: registry with SBOM-like manifests for weights; reproducible merge recipes; automated license and IP scanners.
- Assumptions/dependencies: community standards for delta formats and metadata; legal frameworks for weight sharing; secure enclaves for sensitive merges.
- Privacy-preserving healthcare and finance model composition
- Sectors: healthcare, insurance, banking.
- What: Merge institution-specific deltas into common bases without data sharing, enabling pooled performance while preserving privacy.
- Tools/workflows: hospital/branch sites export TSVs; central authority performs compliant merges; validation on audited IRT-calibrated test banks.
- Assumptions/dependencies: rigorous clinical/financial validation; regulatory acceptance of merging as a development pathway; drift monitoring; robust anonymization of any auxiliary signals.
- Continual learning via periodic merging and interference control
- Sectors: autonomy, robotics, cybersecurity.
- What: Accumulate new skills over time by adding low-rank deltas and rebalancing interference, reducing catastrophic forgetting without full retraining.
- Tools/workflows: scheduled merge cycles; “skill health” dashboards; automatic rank reallocation based on usage.
- Assumptions/dependencies: predictable interference under accumulation; conflict detection/resolution across many tasks; stability under distribution shifts.
- RL/robotics policy composition (policy TSVs and MASS routing)
- Sectors: industrial automation, home robotics, logistics.
- What: Combine task-specialized policies (e.g., grasp, navigation) into unified controllers using low-rank deltas and routed execution.
- Tools/workflows: sim2real pipelines exporting deltas; safety envelopes for composite policies; scenario-based IRT for task difficulty profiling.
- Assumptions/dependencies: stability of weight-space merges in non-stationary RL; safe routing under changing dynamics; extensive safety validation.
- Energy-efficient AI on heterogeneous hardware via conditional subspaces
- Sectors: mobile, AR/VR, edge computing.
- What: Exploit conditional computation to activate minimal subspaces per input, reducing FLOPs and energy.
- Tools/workflows: compiler support for dynamic sparsity/low-rank kernels; per-silicon tuning; on-device power-aware routing policies.
- Assumptions/dependencies: hardware/runtime support for fine-grained activation; robust latency under branching; model accuracy preserved under aggressive sparsity.
- Public-sector benchmarking and procurement using IRT
- Sectors: government, standards bodies.
- What: Adopt IRT-based test construction for fair, compute-efficient evaluation in tenders and audits (e.g., multilingual IR, safety).
- Tools/workflows: transparent item banks with published invariance checks; anchoring procedures across years; audit trails.
- Assumptions/dependencies: stakeholder acceptance; safeguards against gaming; periodic recalibration to prevent overfitting to “hard” items.
- Automated remediation for merged-model safety and bias
- Sectors: consumer AI, HR tech, edtech.
- What: Detect and mitigate bias/safety regressions introduced by merges using interference analyses and targeted counter-deltas.
- Tools/workflows: pre-merge risk prediction using TSV overlap; post-merge bias probes; automated generation of corrective low-rank updates.
- Assumptions/dependencies: reliable correlation between TSV overlap and risk; availability of high-quality bias/safety benchmarks; governance for automated patches.
- Cross-base task vector transport and interoperability
- Sectors: open-source AI ecosystems, vendors.
- What: Transport task vectors across different base models to enable broader reuse (e.g., from LLaMA to Mistral families).
- Tools/workflows: learned transport maps; gradient-sign masking or alignment bridges; validation harnesses for fidelity.
- Assumptions/dependencies: theoretical and empirical guarantees for transport fidelity; risk of semantic drift; license compatibility across bases.
- Formal guarantees and standards for merge safety
- Sectors: regulators, certification bodies.
- What: Develop certifiable bounds on performance degradation, safety preservation, and interference when merging.
- Tools/workflows: conformance tests; formal verification for restricted architectures; certification programs for “merge-ready” models.
- Assumptions/dependencies: tractable verification techniques for deep models; sector-specific acceptance criteria; scalability to large architectures.
Notes on Cross-Cutting Dependencies
- Common pretrained base: Most task-vector/TSV methods assume skills originate from the same base model; otherwise, require transport/alignment.
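- Architectural compatibility: C2M3 and TSVs assume identical layer shapes; C2M3 resolves differing neuron permutations through alignment, whereas TSVs rely on the neuron ordering inherited from a shared base. Adapters and LoRA-like layers reduce these constraints but still need consistent injection points.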
- Low-rank validity: Performance relies on gradients and task deltas being approximately low-rank; rank selection and normalization are critical.
- Routing correctness: MASS requires reliable per-input signals to avoid misrouting; conservative fallbacks and monitoring reduce risk.
- Evaluation calibration: IRT needs enough diverse model responses to calibrate item parameters; periodic recalibration prevents overfitting.
- Legal and ethical constraints: Weight mixing may violate licenses or introduce unsafe behaviors; thorough legal review and safety audits are necessary.
Glossary
- cycle-consistent: A property where mappings composed around a closed loop return to the starting point (e.g., mapping model A to B, B to C, and C back to A yields the identity); used to ensure consistent merges among multiple models. "a cycle-consistent merging algorithm"
- evolutionary merging framework: A model-merging approach that uses evolutionary algorithms to search for high-performing combinations. "an evolutionary merging framework"
- Euclidean residual: The Euclidean distance between a vector and its projection onto a subspace, used as a routing or relevance score. "Euclidean residual"
- Frank-Wolfe optimization: A first-order algorithm for constrained optimization that iteratively solves linear approximations of the objective. "Frank-Wolfe optimization"
- Frobenius inner product: The sum of element-wise products between two matrices $A$ and $B$; equivalent to $\operatorname{tr}(A^\top B)$. "Frobenius inner product"
- Frobenius norm: A matrix norm equal to the square root of the sum of squares of all entries. "Frobenius norm"
- gating function: A function that selects or weights tasks/experts based on the input. "Per-task gating function"
- gradient-based interpretation: Understanding parameter differences or structures as arising from gradients of a loss function. "a gradient-based interpretation"
- Hessian: The matrix of second derivatives of a scalar function, indicating local curvature of the loss landscape. "Hessian of empirical loss"
- input-adaptive routing mechanism: A system that directs each input through task-relevant components based on input-specific criteria. "an input-adaptive routing mechanism"
- Item Response Theory: A probabilistic framework modeling item difficulty, discrimination, and respondent ability, used here to reduce evaluation cost. "Item Response Theory"
- Linear Assignment Problem: The problem of assigning items to agents to minimize total cost; often solved to align model components. "Linear Assignment Problem solver"
- low-rank structure: When a matrix can be well-approximated by a small number of singular components. "low-rank structure"
- orthogonal projection: Projection of a vector onto a subspace that minimizes Euclidean distance. "Orthogonal projection"
- permutation alignment: Aligning neurons/filters across networks by permuting units so structures match. "Permutation alignment"
- permutation matrix: A binary square matrix representing a permutation of coordinates. "Permutation matrix"
- reference-free aggregation point: A shared parameter space used for aggregation that does not privilege any single model as a reference. "a reference-free aggregation point"
- singular value decomposition (SVD): Factorization of a matrix $M$ into $U \Sigma V^\top$, revealing its singular values and vectors. "SVD of task matrix"
- Singular Task Interference: A measure of task interference computed in the singular-vector space of task matrices. "Singular Task Interference measure"
- spectral norm: The largest singular value of a matrix; the operator 2-norm. "Spectral norm"
- subspace: A linear subset of a vector space closed under addition and scalar multiplication. "Subspace spanned by the columns"
- task arithmetic: The heuristic of composing capabilities by adding or subtracting task vectors. "task arithmetic"
- Task Singular Vectors (TSV): Singular vectors derived from task matrices capturing dominant task directions for compression and deconfliction. "Task Singular Vectors (TSV), a decomposition that supports both model compression and interference reduction in TSV-Merge."
- task vector: The parameter difference between a fine-tuned model and its pre-trained initialization. "task vectors, the parameter differences between a fine-tuned model and its pretrained initialization"
- vectorization: Converting a matrix into a vector by stacking its columns. "Vectorization of matrix (stacking columns)"
- weight averaging: Averaging parameters across multiple models to combine their capabilities. "making weight averaging meaningful"
- weight space: The space of all parameter values of a model. "directly in weight space"
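As a companion to the Frank-Wolfe entry above, the following sketch runs the method on a deliberately simple problem: projecting a point onto the probability simplex. The objective, the constraint set, and the standard 2/(t+2) step size are choices made for this example. In an alignment setting the constraint set would involve permutations instead, where the linear subproblem would typically be a linear assignment problem (see the glossary entry), but that variant is not shown here.

```python
import numpy as np

def frank_wolfe_simplex(grad_fn, x0, n_iters=2000):
    """Frank-Wolfe over the probability simplex with the classic 2/(t+2) step."""
    x = x0.copy()
    for t in range(n_iters):
        g = grad_fn(x)
        # Linear minimization oracle: the best simplex vertex for the
        # linearized objective is the coordinate with the smallest gradient.
        vertex = np.zeros_like(x)
        vertex[np.argmin(g)] = 1.0
        step = 2.0 / (t + 2.0)
        # A convex combination of feasible points stays feasible,
        # so no projection step is needed.
        x = (1.0 - step) * x + step * vertex
    return x

# Minimize 0.5 * ||x - target||^2 over the simplex; the gradient is x - target.
target = np.array([0.1, 0.2, 0.3, 0.4])
x0 = np.array([1.0, 0.0, 0.0, 0.0])
x = frank_wolfe_simplex(lambda v: v - target, x0)
print(x.round(3))
```

Because every iterate is a convex combination of feasible points, the method never leaves the constraint set, which matches the plain-language description of "moving step-by-step toward a good blend while staying within safe limits".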