L-ReLF: A Framework for Lexical Dataset Creation

Published 31 Mar 2026 in cs.CL | (2603.29346v1)

Abstract: This paper introduces the L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity in platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to create new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, utilizing Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as Machine Translation and morphological analysis.


Authors (3)

Summary

The paper presents L-ReLF, an expert-guided methodology to transform non-digital sources into structured lexical datasets.
It employs a dual-path strategy combining OCR and manual transcription, followed by human verification, with a reported transcription accuracy of 100% on verified entries.
The framework standardizes lexical data for seamless Wikidata integration, empowering robust NLP and language resource development for low-resource languages.

L-ReLF: A Methodology for High-Quality Lexical Dataset Creation in Low-Resource Languages

Context and Motivation

The significant scarcity of high-quality, structured lexical resources for low-resource languages (LRLs) presents a persistent challenge for NLP research, resource construction, and linguistic standardization. Moroccan Darija, a prominent dialect in North Africa, exemplifies these challenges due to its lack of standardization, insufficient computational resources, and systemic infrastructural gaps. L-ReLF (Low-Resource Lexical Framework) provides a reproducible, expert-informed pipeline to create lexical datasets suitable for critical downstream tasks and open knowledge repositories such as Wikidata Lexemes.

Positioning in Existing Literature

Conventional LRL processing workflows have favored corpus-based, quantitative approaches that aggregate unstructured text, often valuing token count over structured lexical quality. Recent initiatives in Moroccan Darija focus primarily on text corpora for the statistical training of MT models and LLMs, as seen in works constructing open-access chat corpora and sentiment benchmarks. However, these corpora lack the explicit grammatical annotation and standardized lexical fields that enable lexicon generation and effective morphological analysis. The transition from text corpora to structured lexical knowledge graphs is largely missing from the literature, as LRL research rarely addresses the gap between source material digitization and semantic structuring. L-ReLF fills this gap by detailing a technical methodology for extracting, processing, and structuring lexemes from non-digital sources, ultimately making them interoperable with Wikidata’s open lexicon infrastructure.

Technical Contributions

The L-ReLF methodology systematically converts fragmented physical textual sources into a high-precision, Wikidata-compatible lexical dataset. Its pipeline is organized in distinct technical stages:

Source Identification

Prioritizing script authenticity and technical relevance, sources were exclusively selected from specialized, academic print dictionaries in Arabic script, eschewing unstructured and Latin-script web data. This decision is substantiated by the superior linguistic richness and structural regularity of academic resources compared to noisy, nonstandard digital texts.

Digitization and Human-in-the-Loop Extraction

Because most Darija resources are not digital, L-ReLF employs a dual-path strategy: Google Drive OCR for high-quality scans and full manual transcription for low-quality physical sources. This design acknowledges and explicitly quantifies the limitations of commercial OCR systems, especially their systematic bias toward Modern Standard Arabic (MSA): the reported character error rate is 17%, and 59% of the errors are attributable to MSA over-correction. Consequently, L-ReLF mandates human-in-the-loop (HITL) intervention, shifting responsibility for dataset quality from noisy automated pipelines to deterministic, expert-driven validation.
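This summary does not say how the character error rate was computed. A minimal sketch, assuming the common definition (Levenshtein edit distance between the OCR output and a human reference transcription, divided by the reference length), could look like the following. The function names are illustrative and are not taken from the paper's scripts.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `hypothesis` into `reference`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # reference character missing from hypothesis
                current[j - 1] + 1,      # extra character in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed CER definition).

    Lengths are counted in Unicode code points, so Arabic diacritics
    encoded as combining marks count as separate characters.
    """
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition, one wrong character in a four-character reference gives a rate of 0.25. Attributing errors to MSA over-correction, as the paper reports, would additionally require a manual or rule-based classification of each error, which this sketch does not attempt.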

Standardization, Structuring, and Verification

A unified, spreadsheet-based data model encodes lexemes, grammatical features (category, gender), etymology, and semantic/morphological links. Standardization procedures include rigorous cleansing, duplicate consolidation, and missing field completion based on native linguistic expertise. A dual-pass verification routine ensures that all errors from OCR and manual transcription are eliminated, with reported final transcription accuracy of 100% on verified entries. Semantic and morphological relationships are explicitly structured to facilitate downstream tasks such as pattern extraction and systematic term generation.
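The data model and the duplicate-consolidation step can be illustrated with a short sketch. The field names, the normalization rules (Unicode NFC, tatweel removal, whitespace collapsing), and the merge policy (first non-empty value wins) are assumptions chosen for illustration; the paper's actual spreadsheet columns and cleaning rules may differ.

```python
import unicodedata
from dataclasses import dataclass, field

TATWEEL = "\u0640"  # Arabic elongation character, purely typographic


@dataclass
class LexicalEntry:
    """One spreadsheet row (hypothetical column names)."""
    lemma: str
    category: str
    gender: str = ""
    etymology: str = ""
    related: list[str] = field(default_factory=list)


def normalize_lemma(text: str) -> str:
    """Apply NFC, drop tatweel, and collapse surrounding/inner whitespace."""
    text = unicodedata.normalize("NFC", text).replace(TATWEEL, "")
    return " ".join(text.split())


def consolidate(entries: list[LexicalEntry]) -> list[LexicalEntry]:
    """Merge rows sharing a normalized lemma and category.

    Empty fields of the first occurrence are filled from later duplicates;
    links are unioned in order of appearance.
    """
    merged: dict[tuple[str, str], LexicalEntry] = {}
    for entry in entries:
        key = (normalize_lemma(entry.lemma), entry.category.strip().lower())
        if key not in merged:
            merged[key] = LexicalEntry(
                key[0], key[1], entry.gender, entry.etymology, list(entry.related)
            )
            continue
        kept = merged[key]
        kept.gender = kept.gender or entry.gender
        kept.etymology = kept.etymology or entry.etymology
        for link in entry.related:
            if link not in kept.related:
                kept.related.append(link)
    return list(merged.values())
```

An automatic merge like this only proposes consolidated rows. In the workflow described above, conflicting values and missing fields are still resolved by a reviewer with native linguistic expertise.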

Wikidata Integration

The dataset is explicitly tailored for mapping into Wikidata Lexemes, using established tools (QuickStatements, OpenRefine) and Python scripts for column-property assignment. This supports long-term reproducibility, data sustainability, and direct impact on collaborative digital dictionary projects (e.g., Wiktionary) and language technology pipelines.
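As a rough illustration of column-to-lexeme mapping, the sketch below turns one spreadsheet row into the JSON structure that Wikibase uses for a new lexeme (lemma, language item, lexical category item). It is not the paper's script: the column names are hypothetical, the category QIDs should be checked against Wikidata before use, and the language item ID is left for the caller to supply.

```python
import json

# Lexical category items on Wikidata; verify each QID before any upload.
CATEGORY_QIDS = {
    "noun": "Q1084",
    "verb": "Q24905",
    "adjective": "Q34698",
}


def build_lexeme_payload(row: dict, language_code: str, language_qid: str) -> dict:
    """Map a row with 'lemma' and 'category' columns to a lexeme payload.

    `language_code` is the lemma's language code (for example "ary" for
    Moroccan Arabic) and `language_qid` is the Wikidata item for the language.
    """
    category = row["category"].strip().lower()
    if category not in CATEGORY_QIDS:
        raise ValueError(f"no lexical category mapping for {category!r}")
    return {
        "type": "lexeme",
        "lemmas": {
            language_code: {"language": language_code, "value": row["lemma"].strip()}
        },
        "language": language_qid,
        "lexicalCategory": CATEGORY_QIDS[category],
    }


def to_json(payload: dict) -> str:
    """Serialize without escaping Arabic-script characters."""
    return json.dumps(payload, ensure_ascii=False)
```

Rows with unmapped categories raise an error instead of being silently skipped, which fits the expert-verified character of the pipeline. Further columns such as gender or etymology would be added as statements on the lexeme, a step this sketch omits.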

Critical Results and Claims

Error Quantification and Reliability: The framework provides a rare quantitative characterization of OCR limitations in dialectal Arabic, motivating the adoption of semi-automated, HITL pipelines.
Structural, Verified Dataset: L-ReLF delivers a deterministic, fully human-verified dataset with 100% transcription fidelity on validated entries, overcoming the probabilistic noise associated with previous approaches.
Generalizability: The methodology is designed to be language-agnostic and adaptable, provided suitable print resources exist, with all scripts and data models released under open licenses for effortless adaptation by other LRL communities.

Implications and Future Prospects

L-ReLF’s structured approach contributes foundational infrastructure for LRL NLP, with direct implications for:

MT and Morphological Tools: The standardized dataset supports supervised and rule-based methods by providing reliable morphosyntactic features and derivational patterns, a prerequisite for high-fidelity generative systems in LRLs.
Open Knowledge Integration: Plug-and-play compatibility with Wikidata Lexemes fosters sustainable, decentralized community curation and facilitates integration into digital dictionaries and global knowledge graphs.
Reproducibility and Community Empowerment: By giving volunteer editors who lack expert linguistic training a documented, repeatable workflow, L-ReLF helps democratize linguistic resource creation and enables scalable, participatory lexicon development.

Persistent challenges include the time-intensive expert manual verification, which is a bottleneck for scaling to very large vocabularies. Overcoming it will require either advances in dialect-specific OCR or further semi-automated, domain-adaptive data cleaning protocols. As the immediate next step, the framework moves toward automating pattern extraction from the structured data to systematize new term generation, with future validation through participatory community methodologies.

Conclusion

L-ReLF delivers a reproducible, expert-informed methodology to construct high-quality lexical datasets for low-resource languages, addressing a critical infrastructural gap obstructing language technology, resource construction, and linguistic standardization. Its HITL pipeline, explicit quantification of technical limitations, and enforceable data model provide a reference implementation that is generalizable and immediately impactful for related language communities. Future developments will emphasize the automation of quality assurance, methodological formalization for neologism creation, and empirical validation through community-driven evaluation.
