A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation

Published 29 Mar 2024 in cs.CL | (2403.20157v1)

Abstract: Multilingual modelling can improve machine translation for low-resource languages, partly through shared subword representations. This paper studies the role of subword segmentation in cross-lingual transfer. We systematically compare the efficacy of several subword methods in promoting synergy and preventing interference across different linguistic typologies. Our findings show that subword regularisation boosts synergy in multilingual modelling, whereas BPE more effectively facilitates transfer during cross-lingual fine-tuning. Notably, our results suggest that differences in orthographic word boundary conventions (the morphological granularity of written words) may impede cross-lingual transfer more significantly than linguistic unrelatedness. Our study confirms that decisions around subword modelling can be key to optimising the benefits of multilingual modelling.

Authors (2)

Summary

  • The paper examines how varying subword segmentation approaches affect cross-lingual transfer, demonstrating BPE’s strength in finetuning.
  • It shows that probabilistic methods like ULM enhance multilingual synergy while reducing interference from dominant high-resource languages.
  • Findings suggest that differences in orthographic word boundary conventions may hinder cross-lingual transfer in low-resource settings more than linguistic unrelatedness does.

Overview

The paper "A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation" (2403.20157) investigates the effect of subword segmentation methods on cross-lingual transfer in multilingual machine translation (MT). The study specifically explores how different subword techniques promote synergy and minimize interference among linguistically diverse languages, with a focus on translation tasks involving low-resource languages.

Introduction

Multilingual MT models leverage shared subword vocabularies to facilitate positive transfer from high-resource languages to low-resource ones. However, maximizing synergy while minimizing interference remains a challenge, especially given the constrained shared parameter space. This paper systematically compares subword segmentation techniques in terms of their ability to induce synergy, mitigate interference, and support cross-lingual finetuning. The authors highlight that orthographic word boundary conventions—more than linguistic unrelatedness—pose significant obstacles to cross-lingual transfer.

Previous research has documented phenomena such as synergy and interference in multilingual MT. While factors such as model size, data size, and language relatedness have been studied, the role of subword segmentation remains underexplored. Multilingual models benefit from overlapping subword representations, but must contend with under-representation issues for low-resource languages. This paper extends the inquiry into subword segmentation techniques, underscoring their impact on cross-lingual transfer.

Methodology

The research involves experiments on bilingual and trilingual models using various South African languages to test cross-lingual interactions. Siswati, a low-resource language, serves as the focal point for these investigations. The study evaluates the performance of different subword methods across linguistic contexts, considering how morphological typology and orthographic conventions affect multilingual MT performance. Models are trained on WMT22 data and validated and tested on FLORES.

Experimental Setup

Five subword segmentation methods are compared:

  1. BPE: A deterministic segmenter based on byte-pair encoding.
  2. ULM: A segmenter that learns subwords by optimizing a unigram language model; it can sample segmentations probabilistically, which enables subword regularization.
  3. SSMT: A model that learns subword segmentation during MT training.
  4. OBPE: Modifies BPE to increase subword overlap among languages.
  5. XBPE: An extended BPE vocabulary for facilitating cross-lingual transfer during finetuning.

These methods are applied to test and evaluate the cross-lingual transfer potential among languages that vary in linguistic and orthographic features.
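
To make the "deterministic" label for BPE concrete, here is a minimal sketch of BPE training and segmentation on a toy word list. It is an illustration under simplifying assumptions (character-level start symbols, no end-of-word marker, ties broken by tuple ordering), not the implementation used in the paper; the function names `learn_bpe`, `merge_pair`, and `segment` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a toy corpus given as a list of words."""
    # Each distinct word starts as a tuple of characters with its corpus frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by tuple ordering (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = Counter({tuple(merge_pair(s, best)): f for s, f in vocab.items()})
    return merges


def segment(word, merges):
    """Apply the learned merges in order; the same word always gets the same split."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(segment("lowly", merges))
```

Because the merges are applied in a fixed order, a given word is always split the same way, which is the property the results below associate with more reliable transfer during finetuning.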

Results and Discussion

The experiments yield several key insights:

  • Synergy and Interference: ULM effectively promotes synergy in multilingual settings, demonstrating resilience against interference from higher-resource languages. This indicates that subword regularization enhances robustness in shared multilingual vocabularies.
  • Cross-Lingual Transfer: BPE facilitates greater cross-lingual transfer during finetuning than ULM. One likely reason is that ULM's probabilistic segmentations are less consistently aligned between the pretraining and finetuning languages.
  • Impact of Linguistic Typology: Notably, differences in orthographic conventions among languages can hinder cross-lingual transfer more significantly than linguistic unrelatedness, as seen in the interactions between Siswati and Setswana.

Conclusion

The paper establishes that subword segmentation crucially influences cross-lingual dynamics in multilingual MT. Deterministic segmentation methods like BPE enhance cross-lingual transfer during finetuning, while subword regularization techniques like ULM offer benefits in multilingual modeling. The findings suggest that orthographic conventions are critical in shaping cross-lingual transfer outcomes, warranting further exploration into methods that address these surface-level differences in MT models. Future work should focus on enabling models to transcend orthographic boundaries to leverage deeper linguistic similarities.

Explain it Like I'm 14

What is this paper about?

This paper looks at how breaking words into smaller pieces (called “subwords”) affects machine translation when many languages are trained together. The goal is to help low‑resource languages (languages with little training data) by sharing helpful patterns from higher‑resource languages, without causing the languages to get in each other’s way.

What questions were the researchers asking?

The authors focused on three simple questions:

  • Which ways of splitting words into subwords make different languages help each other more (synergy) and harm each other less (interference) when trained together?
  • Which subword methods help the most when you first train on one language and then fine‑tune (adapt) the model to a new language (cross‑lingual transfer)?
  • Do language similarities matter more for transfer than how words are written? In particular, does the way languages put spaces between parts of words (their writing rules) help or hurt transfer?

How did they study this?

Think of words as LEGO models and subwords as the bricks. Different “brick sets” can make it easier or harder for languages to share what they’ve learned.

The researchers compared five common subword methods:

  • BPE (Byte Pair Encoding): A consistent, step‑by‑step way to build frequent letter chunks into subwords. Like always snapping the most common LEGO bricks together first.
  • ULM (Unigram language model with subword regularization): A method that sometimes uses different valid ways to split the same word during training. Like practicing with several slightly different LEGO brick layouts to become more flexible.
  • SSMT (Subword Segmental MT): The model learns how to split words while learning to translate, aiming for what works best for translation.
  • OBPE (Overlap‑BPE): A version of BPE that tries to create subwords shared across languages, so they have more common “bricks.”
  • XBPE (Extended BPE): When adding a new language later, this method extends the existing “brick set” to include the new language’s pieces.
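
To show what "different valid ways to split the same word" can mean for ULM, here is a small sketch of sampling a segmentation from a unigram language model. The vocabulary and its probabilities are made up for illustration, the smoothing exponent `alpha` is an illustrative parameter, and the sketch enumerates every segmentation by brute force, which only works for short words; practical implementations use dynamic programming instead.

```python
import math
import random

# Hypothetical subword probabilities, chosen only for this illustration.
TOY_VOCAB = {
    "unlock": 0.05, "un": 0.1, "lock": 0.1, "ed": 0.1,
    "u": 0.02, "n": 0.02, "l": 0.02, "o": 0.02,
    "c": 0.02, "k": 0.02, "e": 0.02, "d": 0.02,
}


def all_segmentations(word, vocab):
    """Enumerate every way to split `word` into pieces that are in `vocab`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in all_segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results


def segmentation_probs(word, vocab, alpha=1.0):
    """Score each segmentation as a product of piece probabilities, then normalise.

    Assumes `word` has at least one segmentation under `vocab`.
    """
    segs = all_segmentations(word, vocab)
    # Unigram assumption: pieces are independent. alpha < 1 flattens the distribution.
    weights = [math.prod(vocab[p] for p in seg) ** alpha for seg in segs]
    total = sum(weights)
    return segs, [w / total for w in weights]


def best_segmentation(word, vocab):
    """Deterministic use: always return the most probable segmentation."""
    segs, probs = segmentation_probs(word, vocab)
    return segs[probs.index(max(probs))]


def sample_segmentation(word, vocab, rng, alpha=1.0):
    """Probabilistic use: draw one segmentation in proportion to its probability."""
    segs, probs = segmentation_probs(word, vocab, alpha)
    return rng.choices(segs, weights=probs, k=1)[0]


rng = random.Random(0)
print(best_segmentation("unlocked", TOY_VOCAB))
print([sample_segmentation("unlocked", TOY_VOCAB, rng, alpha=0.5) for _ in range(3)])
```

During training with subword regularization, the model would see sampled splits like these rather than one fixed split per word.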

They ran two kinds of experiments:

  • Multilingual training: They trained trilingual models that translated from English to two African languages at once. This let them see if adding a second language helped (synergy) or hurt (interference) the first.
  • Cross‑lingual fine‑tuning: They trained a model on English→Language A, then fine‑tuned it on English→Language B to see how well knowledge carried over.
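
For the fine-tuning setting, extending a vocabulary in the spirit of XBPE might look like the sketch below. It only illustrates the idea of adding a new language's subwords to an existing "brick set" while keeping the old pieces' ids unchanged; the function name `extend_vocab` and the toy tokens are hypothetical, and the paper's actual procedure may differ.

```python
def extend_vocab(pretrained_vocab, new_subwords):
    """Return a copy of the vocabulary with unseen subwords appended at new ids.

    Existing token ids are kept as they are, so embeddings already learned
    during pretraining still line up with their tokens.
    """
    extended = dict(pretrained_vocab)
    next_id = max(extended.values(), default=-1) + 1
    for piece in new_subwords:
        if piece not in extended:
            extended[piece] = next_id
            next_id += 1
    return extended


# Toy example: "w" already exists, "ku" and "la" are new pieces.
old_vocab = {"<unk>": 0, "lo": 1, "w": 2}
print(extend_vocab(old_vocab, ["w", "ku", "la"]))
```

In a real model, the embedding table would also need new rows for the added ids, which this sketch does not cover.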

They tested on four South African languages with different properties:

  • Siswati (very low‑resource; “conjunctive” writing where parts of words are written together)
  • isiXhosa (closely related to Siswati; conjunctive)
  • Setswana (somewhat related, but “disjunctive” writing where parts are separated by spaces)
  • Afrikaans (not related to Siswati; less disjunctive than Setswana)

To measure translation quality, they used chrF++, a score that compares how close the machine’s translation is to a human reference at both the word and character levels (you can think of it as a similarity score that rewards getting both words and spellings right).
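
The metric in question is chrF++ (see the Glossary). As a rough illustration of how a score can combine character-level and word-level overlap, here is a simplified sketch. It assumes character n-grams up to length 6, word n-grams up to length 2, and a recall-weighted F-score with beta = 2; it is not the official implementation and will not reproduce official chrF++ numbers. The name `chrf_like` is hypothetical.

```python
from collections import Counter


def ngrams(seq, n):
    """Count the n-grams of a sequence (empty if the sequence is shorter than n)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def chrf_like(hypothesis, reference, char_order=6, word_order=2, beta=2.0):
    """Simplified chrF++-style score in [0, 1]; higher means closer to the reference."""
    precisions, recalls = [], []
    # Character n-grams ignore spaces; word n-grams use whitespace-separated tokens.
    units = [
        (list(hypothesis.replace(" ", "")), list(reference.replace(" ", "")), char_order),
        (hypothesis.split(), reference.split(), word_order),
    ]
    for hyp, ref, max_n in units:
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            if not hyp_counts or not ref_counts:
                continue  # this n-gram order does not exist for one of the strings
            overlap = sum((hyp_counts & ref_counts).values())
            precisions.append(overlap / sum(hyp_counts.values()))
            recalls.append(overlap / sum(ref_counts.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


print(chrf_like("the cat sat", "the cat sits"))
```

Because the score works on characters and words of the final text rather than on subword tokens, it can be compared across models that split words differently.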

What did they find, and why does it matter?

Here are the main results and what they mean:

  • Subword choice matters a lot.
    • ULM (with subword regularization) gave the best synergy during multilingual training. In other words, when training on multiple languages at the same time, ULM helped low‑resource languages like Siswati the most, with only small downsides for the higher‑resource partners.
    • BPE worked best for cross‑lingual fine‑tuning. When a model trained on one language was adapted to another, BPE’s stable, consistent splits made it easier to transfer what the model had learned.
  • Writing rules can matter more than language family ties.
    • Even though Siswati and Setswana are related, they use very different spacing rules (conjunctive vs. disjunctive). This difference made it harder for the model to transfer knowledge between them than between Siswati and Afrikaans (which isn’t related but has more similar spacing).
    • This suggests that how words are written (where spaces go) can block transfer more than how related the languages are.
  • Which partner helps Siswati the most?
    • isiXhosa helped Siswati the most (they are closely related and written similarly).
    • Afrikaans helped less (it is unrelated), but was still sometimes a better partner than Setswana because its spacing is less disruptive.
    • Setswana helped the least, likely because its highly disjunctive writing breaks words into many tiny pieces that don’t line up well with Siswati’s.

Why this matters:

  • If you want to boost a low‑resource language in a multilingual model, using ULM during joint training can give bigger gains.
  • If you plan to train on one language and then adapt to another, BPE can make that adaptation smoother.
  • When choosing which languages to train together, don’t just look at language families—also consider how the languages write words and use spaces.

What’s the bigger impact?

  • Better help for low‑resource languages: Picking the right subword strategy can make a real difference in translation quality for languages with little data, which helps more people access information in their own language.
  • Smarter multilingual systems: Designers of translation systems should match the subword method to the training plan:
    • Use ULM for joint multilingual training to get more cross‑language synergy.
    • Use BPE for cross‑lingual fine‑tuning to transfer knowledge more reliably.
  • New research directions: Since writing conventions can block transfer even between related languages, future methods should try to “see past” different spacing rules—so models can recognize shared patterns that are hidden by how words are written.

Overall, the paper shows that small choices in how we split words into pieces can have big effects on how well languages help each other in translation.

Glossary

  • Agglutinative: A morphological type where words are formed by stringing together morphemes, each carrying distinct meaning. "Siswati is a low-resource agglutinative language, so effective subword modelling is critical for dealing with the inevitably high proportion of out-of-vocabulary words in held-out datasets."
  • Analytic morphology: A morphological type with low morpheme-to-word ratio, relying more on separate words than affixes to express grammatical relations. "Afrikaans is linguistically unrelated to Siswati and also disjunctive, but because of its analytic morphology (lower morpheme-to-word ratio) its written words are sometimes more aligned to those of Siswati"
  • BPE: Byte Pair Encoding; a deterministic subword segmentation algorithm that iteratively merges frequent symbol pairs to build a vocabulary. "BPE more effectively facilitates transfer during cross-lingual fine-tuning."
  • chrF++: A reference-based MT evaluation metric combining character and word n-grams, robust to segmentation differences. "Instead, we use test set chrF++ to measure performance."
  • Cross-entropy loss: A loss function measuring the difference between predicted and true probability distributions, commonly used to evaluate MT models. "Shaham et al. (2023) use test set cross-entropy loss to measure MT performance, but this cannot be reliably used to compare across different subword segmentations."
  • Cross-lingual finetuning: Adapting a model trained on one language pair to another, leveraging learned representations for transfer. "In the cross-lingual finetuning experiments we finetune pretrained bilingual MT models on new languages."
  • Cross-lingual subword overlap: The extent to which subword units are shared across languages in a multilingual vocabulary, affecting transfer. "However, multilingual vocabularies are known to affect cross-lingual transfer through factors such as cross-lingual subword overlap"
  • Cross-lingual transfer: The phenomenon where knowledge learned from one language improves performance in another. "This paper studies the role of subword segmentation in cross-lingual transfer."
  • Deterministic segmenter: A subword tokenizer that produces a single, fixed segmentation for any input text. "We use it as a deterministic segmenter."
  • Disjunctive orthography: An orthographic convention where morphemes are written as separate space-delimited tokens, increasing token granularity. "Disjunctive orthographies write a single linguistic word as multiple orthographic words (e.g. in Setswana prefixal morphemes are space-separated from verbal roots)."
  • FLORES: A benchmark dataset for evaluating low-resource and multilingual machine translation. "validate and test on FLORES"
  • Interference: Negative cross-lingual interaction where multilingual training degrades performance on a language. "There is a tradeoff between maximising positive cross-lingual transfer (also known as synergy) while minimising negative cross-lingual interaction (also known as interference)."
  • Language sampling temperature: A hyperparameter controlling how training data from multiple languages are sampled to balance exposure. "We use a language sampling temperature of T=1.5 to balance exposure to low-resource and high-resource languages."
  • Low-resource language: A language with limited parallel data for training MT systems. "Low-resource languages stand to benefit most from multilingual modelling."
  • Morphological typology: The classification of languages based on how they form words from morphemes (e.g., analytic, agglutinative). "with particular focus on factors related to subword structure like morphological typology and orthographic word boundary conventions"
  • Multilingual modelling: Training a single model to handle multiple languages, sharing parameters and vocabularies. "Multilingual modelling can improve machine translation for low-resource languages, partly through shared subword representations."
  • OBPE: Overlap-Based BPE; a variant of BPE that increases shared subword units across languages to improve transfer. "OBPE modifies BPE to boost subword overlap among languages in multilingual vocabularies."
  • Orthographic word boundary conventions: Language-specific rules about how morphemes and words are separated in writing, impacting pre-tokenization and subwording. "Notably, our results suggest that differences in orthographic word boundary conventions (the morphological granularity of written words) may impede cross-lingual transfer"
  • Pre-tokenisation: The initial splitting of text into tokens before applying subword segmentation. "Orthographic word boundaries determine the pre-tokenisation of text before subword segmenters are applied"
  • Probabilistic segmenter: A tokenizer that samples from multiple valid segmentations according to a learned distribution, enabling regularization. "ULM learns segmentation to optimise a unigram language model and can be used as a probabilistic segmenter"
  • SentencePiece: A language-independent subword tokenizer commonly used to build shared multilingual vocabularies. "using the same sentencepiece vocabulary across all models."
  • Shared parameter space: The common set of model weights used across multiple languages in a multilingual system. "However, increasing multilinguality in a limited shared parameter space can lead to suboptimal performance for high-resource languages"
  • SSMT: Subword Segmental Machine Translation; a model that jointly learns segmentation and target generation to optimise MT performance. "SSMT is a subword segmental MT model which learns subword segmentation jointly during MT training, with the goal of learning subwords that optimise MT performance."
  • Subword regularisation: Training-time stochastic variation of segmentations to improve robustness and generalisation. "Our findings show that subword regularisation boosts synergy in multilingual modelling"
  • Subword segmentation: The process of breaking words into smaller units (subwords) to handle rare or unseen words in MT. "One aspect their study failed to consider is subword segmentation."
  • Synergy: Positive cross-lingual transfer where multilingual training improves performance on a language. "There is a tradeoff between maximising positive cross-lingual transfer (also known as synergy) while minimising negative cross-lingual interaction (also known as interference)."
  • Transformer-base: A standard configuration of the Transformer architecture (as implemented in fairseq) used as a baseline. "We use the model size and training hyperparameters of the fairseq transformer-base architecture."
  • ULM: Unigram language model tokenizer; a subword method that learns a probabilistic segmentation optimising a unigram LM objective. "ULM learns segmentation to optimise a unigram language model"
  • Unigram language model: A model assuming independence between tokens, used to score and sample subword segmentations. "ULM learns segmentation to optimise a unigram language model"
  • WMT22: A shared task/dataset from the Workshop on Machine Translation 2022 used for training. "We train models on WMT22 data"
  • XBPE: Extended BPE; a technique to expand a pretrained model’s BPE vocabulary for new languages to improve transfer. "XBPE extends the BPE vocabulary of a pretrained model to include BPE subwords of a new translation direction."
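
The language sampling temperature entry above can be illustrated with a short sketch. It assumes the common convention in which a language is sampled with probability proportional to its share of the data raised to the power 1/T; the source quotes T=1.5 but does not spell out the formula, so treat the formula as an assumption. The dataset sizes are hypothetical.

```python
def sampling_probs(sizes, temperature=1.5):
    """Map dataset sizes to sampling probabilities, p_i proportional to (n_i / N) ** (1 / T).

    T = 1 samples in proportion to data size; larger T moves towards uniform sampling.
    """
    total = sum(sizes.values())
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


# Hypothetical sentence-pair counts for a high-resource and a low-resource language.
sizes = {"high": 900, "low": 100}
print(sampling_probs(sizes, temperature=1.0))
print(sampling_probs(sizes, temperature=1.5))
```

With T above 1, the low-resource language is sampled more often than its raw share of the data, which is the balancing effect the glossary entry describes.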
