A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation
Abstract: Multilingual modelling can improve machine translation for low-resource languages, partly through shared subword representations. This paper studies the role of subword segmentation in cross-lingual transfer. We systematically compare the efficacy of several subword methods in promoting synergy and preventing interference across different linguistic typologies. Our findings show that subword regularisation boosts synergy in multilingual modelling, whereas BPE more effectively facilitates transfer during cross-lingual fine-tuning. Notably, our results suggest that differences in orthographic word boundary conventions (the morphological granularity of written words) may impede cross-lingual transfer more significantly than linguistic unrelatedness. Our study confirms that decisions around subword modelling can be key to optimising the benefits of multilingual modelling.
Explain it Like I'm 14
What is this paper about?
This paper looks at how breaking words into smaller pieces (called “subwords”) affects machine translation when many languages are trained together. The goal is to help low‑resource languages (languages with little training data) by sharing helpful patterns from higher‑resource languages, without causing the languages to get in each other’s way.
What questions were the researchers asking?
The authors focused on three simple questions:
- Which ways of splitting words into subwords make different languages help each other more (synergy) and harm each other less (interference) when trained together?
- Which subword methods help the most when you first train on one language and then fine‑tune (adapt) the model to a new language (cross‑lingual transfer)?
- Do language similarities matter more for transfer than how words are written? In particular, does the way languages put spaces between parts of words (their writing rules) help or hurt transfer?
How did they study this?
Think of words as LEGO models and subwords as the bricks. Different “brick sets” can make it easier or harder for languages to share what they’ve learned.
The researchers compared five common subword methods:
- BPE (Byte Pair Encoding): A consistent, step‑by‑step way to build frequent letter chunks into subwords. Like always snapping the most common LEGO bricks together first.
- ULM (Unigram Language Model with subword regularization): A method that sometimes uses different valid ways to split the same word during training. Like practicing with several slightly different LEGO brick layouts to become more flexible.
- SSMT (Subword Segmental MT): The model learns how to split words while learning to translate, aiming for what works best for translation.
- OBPE (Overlap‑BPE): A version of BPE that tries to create subwords shared across languages, so they have more common “bricks.”
- XBPE (Extended BPE): When adding a new language later, this method extends the existing “brick set” to include the new language’s pieces.
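To make the "snap the most common bricks together first" idea concrete, here is a minimal sketch of BPE merge learning on a toy word list. It is an illustration only, not the authors' setup: the word list, the number of merges, the tie-breaking rule, and the function names `learn_bpe`, `apply_merge`, and `segment` are all assumptions made for this example.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a toy list of words."""
    # Each word starts as a sequence of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # (an arbitrary choice made here so the result is reproducible).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(apply_merge(symbols, best))] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

# Hypothetical toy data, chosen only for illustration.
toy_words = ["low", "low", "lower", "lowest"]
toy_merges = learn_bpe(toy_words, num_merges=2)
print(toy_merges)
print(segment("lowly", toy_merges))
```

Because the merges are replayed in a fixed order, the same word always receives the same split, which is the "consistent" behaviour the BPE bullet describes.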
They ran two kinds of experiments:
- Multilingual training: They trained trilingual models that translated from English to two African languages at once. This let them see if adding a second language helped (synergy) or hurt (interference) the first.
- Cross‑lingual fine‑tuning: They trained a model on English→Language A, then fine‑tuned it on English→Language B to see how well knowledge carried over.
They tested on four South African languages with different properties:
- Siswati (very low‑resource; “conjunctive” writing where parts of words are written together)
- isiXhosa (closely related to Siswati; conjunctive)
- Setswana (somewhat related, but “disjunctive” writing where parts are separated by spaces)
- Afrikaans (not related to Siswati; less disjunctive than Setswana)
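The difference between conjunctive and disjunctive writing can be seen with a plain whitespace split, which is a common first step before subword segmentation. The two phrases below are illustrative examples chosen here, not taken from the paper: a Siswati phrase and a Setswana phrase that are both commonly glossed as "I love you". The helper names are hypothetical.

```python
def pretokenise(text):
    """Split text on whitespace, giving the orthographic words."""
    return text.split()

def mean_word_length(text):
    """Average number of characters per orthographic word."""
    words = pretokenise(text)
    return sum(len(w) for w in words) / len(words)

# Illustrative phrases (assumptions for this sketch, not data from the paper).
conjunctive = "ngiyakutsandza"  # Siswati-style: written as one word
disjunctive = "ke a go rata"    # Setswana-style: written as several short words

for label, text in [("conjunctive", conjunctive), ("disjunctive", disjunctive)]:
    print(label, pretokenise(text), mean_word_length(text))
```

If the subword method only merges characters within these whitespace-delimited words, the disjunctive text offers much shorter units to work with, so the resulting pieces are less likely to line up with those of a conjunctively written language.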
To measure translation quality, they used chrF++, a score that compares how close the machine’s translation is to a human reference at both word and letter levels (you can think of it like a similarity score that rewards getting both words and spellings right).
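As a rough illustration of a score that looks at both letters and words, the sketch below computes a simplified character-plus-word n-gram F-score in the spirit of chrF++. It is not the official metric: the whitespace handling, the averaging, and the function name `simple_chrf_pp` are simplifying assumptions, and a real evaluation would use an established implementation such as the one in sacreBLEU.

```python
from collections import Counter

def ngrams(items, n):
    """Count all n-grams of length n in a sequence."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))

def simple_chrf_pp(hypothesis, reference, char_order=6, word_order=2, beta=2.0):
    """Simplified chrF++-style score on a 0-100 scale (illustrative only)."""
    precisions, recalls = [], []
    units = [
        # Character n-grams, with spaces removed (a simplification).
        (list(hypothesis.replace(" ", "")), list(reference.replace(" ", "")), char_order),
        # Word n-grams over whitespace-separated tokens.
        (hypothesis.split(), reference.split(), word_order),
    ]
    for hyp, ref, max_n in units:
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            if not h or not r:
                continue  # skip orders that one side is too short for
            overlap = sum((h & r).values())
            precisions.append(overlap / sum(h.values()))
            recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta=2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(simple_chrf_pp("the cat sat", "the cat sat"))
print(simple_chrf_pp("the cat sits", "the cat sat"))
```

Because the score is computed on the final text rather than on the model's internal pieces, it can be compared across systems that split words differently.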
What did they find, and why does it matter?
Here are the main results and what they mean:
- Subword choice matters a lot.
  - ULM (with subword regularization) gave the best synergy during multilingual training. In other words, when training on multiple languages at the same time, ULM helped low‑resource languages like Siswati the most, with only small downsides for the higher‑resource partners.
  - BPE worked best for cross‑lingual fine‑tuning. When a model trained on one language was adapted to another, BPE’s stable, consistent splits made it easier to transfer what the model had learned.
- Writing rules can matter more than language family ties.
  - Even though Siswati and Setswana are related, they use very different spacing rules (conjunctive vs. disjunctive). This difference made it harder for the model to transfer knowledge between them than between Siswati and Afrikaans (which isn’t related but has more similar spacing).
  - This suggests that how words are written (where spaces go) can block transfer more than how related the languages are.
- Which partner helps Siswati the most?
  - isiXhosa helped Siswati the most (they are closely related and written similarly).
  - Afrikaans helped less (it is unrelated), but sometimes still helped more than Setswana because its spacing is less disruptive.
  - Setswana helped the least, likely because its highly disjunctive writing breaks words into many tiny pieces that don’t line up well with Siswati’s.
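The contrast in the first finding, between one fixed split per word and several varying splits, can be sketched with a toy unigram language model. Everything here is an assumption made for illustration: the vocabulary, its unnormalised probabilities, the smoothing exponent `alpha`, and the function names. Real toolkits use more efficient algorithms than enumerating every split.

```python
import math
import random

def all_segmentations(word, vocab):
    """Yield every way to split `word` into pieces that are all in `vocab`."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in all_segmentations(word[end:], vocab):
                yield [piece] + rest

def score(segmentation, vocab):
    """Unigram score: the product of the individual piece probabilities."""
    return math.prod(vocab[piece] for piece in segmentation)

def best_segmentation(word, vocab):
    """Deterministic use: always return the highest-scoring split."""
    return max(all_segmentations(word, vocab), key=lambda s: score(s, vocab))

def sample_segmentation(word, vocab, alpha=0.5, rng=random):
    """Regularised use: sample a split with probability proportional to score**alpha."""
    candidates = list(all_segmentations(word, vocab))
    weights = [score(s, vocab) ** alpha for s in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical toy vocabulary with made-up, unnormalised probabilities.
toy_vocab = {"l": 0.1, "o": 0.1, "w": 0.1, "e": 0.1, "r": 0.1,
             "low": 0.2, "er": 0.2, "lower": 0.03}

print(best_segmentation("lower", toy_vocab))
rng = random.Random(0)
for _ in range(5):
    print(sample_segmentation("lower", toy_vocab, rng=rng))
```

During training with subword regularisation the model sees the sampled variants, while a deterministic segmenter would only ever show it the single best split.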
Why this matters:
- If you want to boost a low‑resource language in a multilingual model, using ULM during joint training can give bigger gains.
- If you plan to train on one language and then adapt to another, BPE can make that adaptation smoother.
- When choosing which languages to train together, don’t just look at language families—also consider how the languages write words and use spaces.
What’s the bigger impact?
- Better help for low‑resource languages: Picking the right subword strategy can make a real difference in translation quality for languages with little data, which helps more people access information in their own language.
- Smarter multilingual systems: Designers of translation systems should match the subword method to the training plan:
  - Use ULM for joint multilingual training to get more cross‑language synergy.
  - Use BPE for cross‑lingual fine‑tuning to transfer knowledge more reliably.
- New research directions: Since writing conventions can block transfer even between related languages, future methods should try to “see past” different spacing rules—so models can recognize shared patterns that are hidden by how words are written.
Overall, the paper shows that small choices in how we split words into pieces can have big effects on how well languages help each other in translation.
Glossary
- Agglutinative: A morphological type where words are formed by stringing together morphemes, each carrying distinct meaning. "Siswati is a low-resource agglutinative language, so effective subword modelling is critical for dealing with the inevitably high proportion of out-of-vocabulary words in held-out datasets."
- Analytic morphology: A morphological type with low morpheme-to-word ratio, relying more on separate words than affixes to express grammatical relations. "Afrikaans is linguistically unrelated to Siswati and also disjunctive, but because of its analytic morphology (lower morpheme-to-word ratio) its written words are sometimes more aligned to those of Siswati"
- BPE: Byte Pair Encoding; a deterministic subword segmentation algorithm that iteratively merges frequent symbol pairs to build a vocabulary. "BPE more effectively facilitates transfer during cross-lingual fine-tuning."
- chrF++: A reference-based MT evaluation metric combining character and word n-grams, robust to segmentation differences. "Instead, we use test set chrF++ to measure performance."
- Cross-entropy loss: A loss function measuring the difference between predicted and true probability distributions, commonly used to evaluate MT models. "Shaham et al. (2023) use test set cross-entropy loss to measure MT performance, but this cannot be reliably used to compare across different subword segmentations."
- Cross-lingual finetuning: Adapting a model trained on one language pair to another, leveraging learned representations for transfer. "In the cross-lingual finetuning experiments we finetune pretrained bilingual MT models on new languages."
- Cross-lingual subword overlap: The extent to which subword units are shared across languages in a multilingual vocabulary, affecting transfer. "However, multilingual vocabularies are known to affect cross-lingual transfer through factors such as cross-lingual subword overlap"
- Cross-lingual transfer: The phenomenon where knowledge learned from one language improves performance in another. "This paper studies the role of subword segmentation in cross-lingual transfer."
- Deterministic segmenter: A subword tokenizer that produces a single, fixed segmentation for any input text. "We use it as a deterministic segmenter."
- Disjunctive orthography: An orthographic convention where morphemes are written as separate space-delimited tokens, increasing token granularity. "Disjunctive orthographies write a single linguistic word as multiple orthographic words (e.g. in Setswana prefixal morphemes are space-separated from verbal roots)."
- FLORES: A benchmark dataset for evaluating low-resource and multilingual machine translation. "validate and test on FLORES"
- Interference: Negative cross-lingual interaction where multilingual training degrades performance on a language. "There is a tradeoff between maximising positive cross-lingual transfer (also known as synergy) while minimising negative cross-lingual interaction (also known as interference)."
- Language sampling temperature: A hyperparameter controlling how training data from multiple languages are sampled to balance exposure. "We use a language sampling temperature of to balance exposure to low-resource and high-resource languages."
- Low-resource language: A language with limited parallel data for training MT systems. "Low-resource languages stand to benefit most from multilingual modelling."
- Morphological typology: The classification of languages based on how they form words from morphemes (e.g., analytic, agglutinative). "with particular focus on factors related to subword structure like morphological typology and orthographic word boundary conventions"
- Multilingual modelling: Training a single model to handle multiple languages, sharing parameters and vocabularies. "Multilingual modelling can improve machine translation for low-resource languages, partly through shared subword representations."
- OBPE: Overlap-Based BPE; a variant of BPE that increases shared subword units across languages to improve transfer. "OBPE modifies BPE to boost subword overlap among languages in multilingual vocabularies."
- Orthographic word boundary conventions: Language-specific rules about how morphemes and words are separated in writing, impacting pre-tokenization and subwording. "Notably, our results suggest that differences in orthographic word boundary conventions (the morphological granularity of written words) may impede cross-lingual transfer"
- Pre-tokenisation: The initial splitting of text into tokens before applying subword segmentation. "Orthographic word boundaries determine the pre-tokenisation of text before subword segmenters are applied"
- Probabilistic segmenter: A tokenizer that samples from multiple valid segmentations according to a learned distribution, enabling regularization. "ULM learns segmentation to optimise a unigram language model and can be used as a probabilistic segmenter"
- SentencePiece: A language-independent subword tokenizer commonly used to build shared multilingual vocabularies. "using the same sentencepiece vocabulary across all models."
- Shared parameter space: The common set of model weights used across multiple languages in a multilingual system. "However, increasing multilinguality in a limited shared parameter space can lead to suboptimal performance for high-resource languages"
- SSMT: Subword Segmental Machine Translation; a model that jointly learns segmentation and target generation to optimise MT performance. "SSMT is a subword segmental MT model which learns subword segmentation jointly during MT training, with the goal of learning subwords that optimise MT performance."
- Subword regularisation: Training-time stochastic variation of segmentations to improve robustness and generalisation. "Our findings show that subword regularisation boosts synergy in multilingual modelling"
- Subword segmentation: The process of breaking words into smaller units (subwords) to handle rare or unseen words in MT. "One aspect their study failed to consider is subword segmentation."
- Synergy: Positive cross-lingual transfer where multilingual training improves performance on a language. "There is a tradeoff between maximising positive cross-lingual transfer (also known as synergy) while minimising negative cross-lingual interaction (also known as interference)."
- Transformer-base: A standard configuration of the Transformer architecture (as implemented in fairseq) used as a baseline. "We use the model size and training hyperparameters of the fairseq transformer-base architecture."
- ULM: Unigram Language Model tokenizer; a subword method that learns a probabilistic segmentation optimising a unigram LM objective. "ULM learns segmentation to optimise a unigram language model"
- Unigram language model: A language model assuming independence between tokens, used to score and sample subword segmentations. "ULM learns segmentation to optimise a unigram language model"
- WMT22: A shared task/dataset from the Workshop on Machine Translation 2022 used for training. "We train models on WMT22 data"
- XBPE: Extended BPE; a technique to expand a pretrained model’s BPE vocabulary for new languages to improve transfer. "XBPE extends the BPE vocabulary of a pretrained model to include BPE subwords of a new translation direction."
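The "language sampling temperature" entry above can be illustrated with the commonly used formulation in which each language is sampled with probability proportional to its data share raised to the power 1/T. This is a sketch under that assumption: the temperature value is not given in the text above, so the values and corpus sizes below are hypothetical, and the function name `sampling_probs` is made up for this example.

```python
def sampling_probs(sizes, temperature):
    """Sampling probability per language: (share of data) ** (1 / temperature), renormalised."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical corpus sizes (sentence pairs), not figures from the paper.
toy_sizes = {"high_resource": 900_000, "low_resource": 100_000}

# T = 1 samples in proportion to data size; a larger T flattens the
# distribution so the low-resource language is seen more often.
for t in (1.0, 5.0):  # illustrative temperatures only
    print(t, sampling_probs(toy_sizes, t))
```

Raising the temperature trades some exposure to the high-resource language for more exposure to the low-resource one.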