Efficient Domain Adaptation for Text Line Recognition via Decoupled Language Models

Published 30 Mar 2026 in cs.CV and cs.LG | (2603.28028v1)

Abstract: Optical character recognition remains critical infrastructure for document digitization, yet state-of-the-art performance is often restricted to well-resourced institutions by prohibitive computational barriers. End-to-end transformer architectures achieve strong accuracy but demand hundreds of GPU hours for domain adaptation, limiting accessibility for practitioners and digital humanities scholars. We present a modular detection-and-correction framework that achieves near-SOTA accuracy with single-GPU training. Our approach decouples lightweight visual character detection (domain-agnostic) from domain-specific linguistic correction using pretrained sequence models including T5, ByT5, and BART. By training the correctors entirely on synthetic noise, we enable annotation-free domain adaptation without requiring labeled target images. Evaluating across modern clean handwriting, cursive script, and historical documents, we identify a critical "Pareto frontier" in architecture selection: T5-Base excels on modern text with standard vocabulary, whereas ByT5-Base dominates on historical documents by reconstructing archaic spellings at the byte level. Our results demonstrate that this decoupled paradigm matches end-to-end transformer accuracy while reducing compute by approximately 95%, establishing a viable, resource-efficient alternative to monolithic OCR architectures.


Authors (2)

Summary

The paper presents a decoupled text line recognition framework that separates visual detection from linguistic correction, significantly reducing compute costs for domain adaptation.
It employs transformer-based detectors with fine-tuned language models (T5, ByT5, BART) tailored to modern, cursive, and historical handwriting, ensuring robust performance across diverse domains.
Experimental results demonstrate competitive character error rates with up to 95% computational savings compared to end-to-end transformer architectures.

Efficient Domain Adaptation for Text Line Recognition via Decoupled Language Models

Motivation and Context

Text line recognition (TLR) in optical character recognition (OCR) is dominated by end-to-end transformer-based models, which perform well but are computationally prohibitive to adapt to new domains. This barrier restricts access for researchers and practitioners, especially in the digital humanities and archival work, where adapting to new data sources, often characterized by specialized or archaic orthography, is essential. The approach presented in "Efficient Domain Adaptation for Text Line Recognition via Decoupled Language Models" (2603.28028) addresses this bottleneck with a fully decoupled TLR architecture: a lightweight, reusable visual character detector paired with a domain-adaptable, pretrained linguistic corrector, supporting efficient adaptation with minimal supervision and annotation.

Figure 1: CVL (Modern): Clean, standard vocabulary.

Decoupled Detection and Correction Framework

Visual Character Detection

The visual front-end leverages a transformer-based detector, DINO-DETR, which localizes and classifies individual characters in parallel using a bipartite matching mechanism similar to DETR-style object detectors. The detector is pretrained using synthetically generated Latin-script text lines, incorporating font variability, color perturbation, structured noise, and occlusion augmentations. Domain adaptation is enacted through lightweight fine-tuning using Connectionist Temporal Classification (CTC) loss, requiring only line-level, not character-level, annotations. The key architectural feature here is that the detector is trained once, and subsequent adaptation to new domains does not require further visual retraining—addressing a critical inefficiency in end-to-end TLR transformers.
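Because the detector predicts characters as an unordered set of boxes, some step has to turn those detections into a text line before any correction can happen. The following is a minimal sketch of one plausible way to do that, assuming each detection carries a character label, a horizontal centre, and a confidence score. The class `CharDetection`, its field names, and the `min_score` threshold are illustrative assumptions, not details taken from the paper, and word spacing is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharDetection:
    """One detected character; all field names here are hypothetical."""
    char: str        # predicted character class
    x_center: float  # horizontal box centre, in pixels
    score: float     # detector confidence in [0, 1]


def detections_to_text(detections, min_score=0.5):
    """Keep confident detections and read them left to right."""
    kept = [d for d in detections if d.score >= min_score]
    kept.sort(key=lambda d: d.x_center)
    return "".join(d.char for d in kept)


# The detector emits an unordered set, so the input order is arbitrary.
raw = [
    CharDetection("t", 31.0, 0.97),
    CharDetection("c", 10.0, 0.97),
    CharDetection("a", 20.5, 0.91),
    CharDetection("o", 21.0, 0.12),  # low-confidence duplicate, dropped
]
print(detections_to_text(raw))  # cat
```

Lowering `min_score` lets the spurious "o" through, which is the kind of residual error the linguistic corrector is meant to repair.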

Linguistic Correction via Pretrained Language Models

Recognition errors that remain after visual detection, arising from visual ambiguities, function-word omissions, or missing linguistic context, are corrected by a pretrained language model. The framework evaluates three correctors:

T5-Base (Token-Level): Exploits SentencePiece tokenization, leveraging a comprehensive pretrained dictionary for modern linguistic domains.
ByT5-Base (Byte-Level): Operates directly on UTF-8 bytes, avoiding tokenization failures on out-of-vocabulary (OOV) words and non-standard orthography, which is essential for historical documents.
BART-Base (Denoising): Functions as a lightweight baseline for denoising and error correction tasks.
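The byte-level point can be made concrete with a small sketch of how a byte-level model sees an unusual spelling. The function `byte_level_ids` is a hypothetical helper, and the offset of 3 assumes three reserved special ids precede the 256 byte values, as in the public ByT5 vocabulary; treat both as assumptions for illustration. The long s (U+017F) is used only as an example of a non-modern character.

```python
def byte_level_ids(text, offset=3):
    """Map text to ids the way a byte-level model sees it.

    offset=3 assumes three reserved special ids come before the 256
    byte values; this is an assumption about the vocabulary layout.
    """
    return [b + offset for b in text.encode("utf-8")]


modern = "Congress"
archaic = "Congreſs"  # long s (U+017F), an example of a non-modern spelling

print(len(byte_level_ids(modern)))   # 8
print(len(byte_level_ids(archaic)))  # 9, because U+017F takes two UTF-8 bytes
print(max(byte_level_ids(archaic)) < 256 + 3)  # True: every id is in range
```

Any string maps to ids inside the same fixed range, so no spelling is ever out of vocabulary. The price is longer input sequences than a subword tokenizer would produce.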

Language correctors are exclusively fine-tuned on synthetic noise without requiring real annotated images, allowing for computationally cheap, annotation-free adaptation. The framework also develops domain-specific noise patterns, such as "Cursive-Collapse" for cursive scripts, to replicate script-specific ambiguities.
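Synthetic-noise training data of this kind can be sketched as below, assuming a plain-text corpus from the target domain is available. This sketch applies uniform random substitutions, deletions, and insertions. It does not reproduce the paper's noise model or the "Cursive-Collapse" pattern, whose details are not given here, and the names `add_ocr_noise` and `make_training_pairs` are hypothetical.

```python
import random
import string


def add_ocr_noise(text, rate=0.1, seed=None):
    """Corrupt text with random character substitutions, deletions, insertions."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase
    out = []
    for ch in text:
        if rng.random() >= rate:
            out.append(ch)          # leave this character untouched
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            out.append(rng.choice(alphabet))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(alphabet))
        # "delete": append nothing
    return "".join(out)


def make_training_pairs(clean_lines, rate=0.1, seed=0):
    """Build (noisy input, clean target) pairs for a seq2seq corrector."""
    return [
        (add_ocr_noise(line, rate, seed + i), line)
        for i, line in enumerate(clean_lines)
    ]


pairs = make_training_pairs(["the quick brown fox", "jumps over the lazy dog"])
for noisy, clean in pairs:
    print(repr(noisy), "->", repr(clean))
```

No image is involved at any point: the corrector only ever sees text, which is what makes the adaptation annotation-free. A realistic noise model would weight edits by the confusions the detector actually makes.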

Experimental Results

Datasets and Domain Spectrum

The evaluation encompasses three canonical TLR benchmarks:

CVL: Modern clean handwriting, standardized vocabulary.
IAM: Modern cursive handwriting, high visual ambiguity.
George Washington (GW): Historical manuscripts with significant orthographic drift and physical degradation.

This difficulty gradient enables the analysis of both the structural and linguistic adaptability of the framework.

Quantitative Outcomes and Pareto Frontiers

CVL (Modern): T5-Base achieves competitive accuracy (1.90% CER), matching end-to-end transformer models with an approximately 95% reduction in compute.
GW (Historical): ByT5-Base outperforms token-based models (5.35% vs. 5.86% CER) due to its resilience against OOV and non-modern orthography, leveraging its byte-level operation.
IAM (Cursive): BART attains the best performance (5.18% CER), while ByT5 is sensitive to the synthetic noise regime; it corrects effectively only when the training noise is tailored to cursive-specific errors.

The experimental results lead to explicit architectural guidance: T5-Base is optimal for modern, standard-vocabulary domains, ByT5-Base is indispensable for historical texts with non-standard orthography, and BART provides a robust lightweight alternative for cursive script.
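The figures above are character error rates (CER). For reference, CER is conventionally the Levenshtein edit distance between the reference and the hypothesis, divided by the reference length. The sketch below shows that standard definition for a single line; whether the paper averages per line or pools edits over the whole test set is not stated here, so that choice is left open.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("kitten", "sitting"))  # 0.5
```

On this scale, a CER of 1.90% corresponds to roughly one character error in every 53 reference characters.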

Efficiency

The total domain adaptation cost with this method is approximately 22–26 GPU hours, in contrast to the 200–600 GPU hours required by SOTA monolithic transformer architectures. This reduction of more than $10\times$ makes high-accuracy OCR accessible to a much broader audience.
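As a quick check on these figures, the arithmetic below derives the range of speedups and savings implied by the two GPU-hour ranges. Which baseline cost is meant to be paired with which adaptation cost is not stated here, so the sketch simply brackets the least and most favourable pairings.

```python
ours = (22, 26)        # GPU hours, low and high, from the text above
baseline = (200, 600)  # GPU hours, low and high, from the text above

# Least favourable pairing: our high cost against the cheapest baseline.
worst_speedup = baseline[0] / ours[1]
# Most favourable pairing: our low cost against the costliest baseline.
best_speedup = baseline[1] / ours[0]

worst_saving = 1 - ours[1] / baseline[0]
best_saving = 1 - ours[0] / baseline[1]

print(f"speedup: {worst_speedup:.1f}x to {best_speedup:.1f}x")  # 7.7x to 27.3x
print(f"saving:  {worst_saving:.0%} to {best_saving:.0%}")      # 87% to 96%
```

A 95% reduction is the same as a 20x speedup, which sits inside this bracket.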

Qualitative Analysis and Error Characterization

There is a non-trivial semantic bias in token-level correctors, particularly T5, which sometimes "modernizes" spelling or leverages prior knowledge to hallucinate named entities or plausible but inaccurate sequences. This is less pronounced in ByT5, which remains more orthographically faithful—a critical distinction for archive-quality transcription.
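One simple way to surface this kind of modernization is to diff the detector output against the corrector output at the word level and review what changed. The sketch below uses Python's standard `difflib`; the function name `changed_words` and the example sentence are hypothetical and are not drawn from the paper's data.

```python
import difflib


def changed_words(detector_text, corrected_text):
    """Return (before, after) word spans that the corrector rewrote."""
    before = detector_text.split()
    after = corrected_text.split()
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(before[i1:i2]), " ".join(after[j1:j2])))
    return changes


# Hypothetical example: the detector read the archaic spellings correctly,
# and a corrector replaced them with the modern forms.
print(changed_words("he will shew the publick papers",
                    "he will show the public papers"))
# [('shew', 'show'), ('publick', 'public')]
```

A diff like this cannot tell a legitimate fix from an unwanted modernization by itself, but it narrows the review to the words the corrector touched.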

Implications and Theoretical Impact

This modular pipeline provides a new efficiency–accuracy operating point for OCR, bringing near-SOTA performance within reach of end-users who lack enterprise-scale hardware. More fundamentally, the work highlights the cost that monolithic architectures impose on domain adaptation in sequence recognition. The flexibility to swap in byte-level or token-level correctors, each fine-tuned on its own synthetic noise strategy, points to a general principle: decoupling the visual and linguistic learning stages not only improves resource efficiency but also adds robustness and broadens applicability to low-data or highly specialized domains.
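The swap-in property can be illustrated with a minimal sketch in which both stages are plain callables, so that changing the corrector never touches the detector. Every name below (`recognize`, `toy_detector`, `modern_corrector`, `identity_corrector`) is a hypothetical stand-in, not the paper's interface, and the toy functions replace real models.

```python
from typing import Callable

# Both stages are plain callables, so either can be replaced independently.
Detector = Callable[[bytes], str]   # line image bytes -> raw character string
Corrector = Callable[[str], str]    # raw string -> corrected string


def recognize(image: bytes, detector: Detector, corrector: Corrector) -> str:
    """Run the shared visual stage, then the domain-specific text stage."""
    return corrector(detector(image))


# Stand-ins for real models; every name below is hypothetical.
def toy_detector(image: bytes) -> str:
    return image.decode("ascii")    # pretend the image already is noisy text


def modern_corrector(text: str) -> str:
    return text.replace("teh", "the")


def identity_corrector(text: str) -> str:
    return text                     # placeholder for a different domain's model


line = b"teh cat"
print(recognize(line, toy_detector, modern_corrector))    # the cat
print(recognize(line, toy_detector, identity_corrector))  # teh cat
```

Under this design, adapting to a new domain means training and passing in a different `corrector`, while the `detector` argument stays the same.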

Pragmatically, the framework supports faster model deployment, annotation-free adaptation, and extensibility to new scripts or domain classes. Theoretically, it advocates for architectural modularity as a driver of OCR progress, especially as document analysis expands to non-standard scripts and languages.

Limitations and Future Directions

The method's efficacy depends on the quality of the synthetic noise regime, particularly for byte-level correctors. When visual input is severely degraded (e.g., below 20% detector character accuracy), the corrector cannot reliably reconstruct the original text. Some accuracy gap remains relative to end-to-end architectures, especially when the goal is maximum absolute accuracy or when the domain is dominated by OOV vocabulary. These limitations suggest that future work should focus on joint optimization of the detector and corrector to reduce error propagation, distillation into lighter models, multilingual and cross-script adaptation (using mBART or similar), and automatic, detection-specific noise generation pipelines.

Conclusion

Decoupling visual and linguistic modeling for TLR provides an empirically validated path toward resource-efficient, high-accuracy OCR systems. This modular architecture achieves near-SOTA performance on both modern and historical handwriting, with drastic reductions in compute and annotation costs, while enabling architectural specialization that is not possible in monolithic models. This paradigm is likely to become central in practical document analysis tasks, especially as attention intensifies on specialized, cross-linguistic, or low-resource digitization efforts.
