how do i train translation models on low-resource languages?

a guide to training translation models on low-resource languages

short blog here. i’ll just discuss how i set up my pipeline to make models better at low-resource languages. most of this comes from things i learnt during random experiments.

problem statement

let’s consider nepali. i dub it a low-resource language: it is written in the devanagari script, the same script as hindi, so models often confuse the two. while hindi has complex gendered nouns and foreign influences, nepali has a simpler, generally non-gendered system, different pronunciation, and unique grammatical structures.

some examples below —

which base model to choose

start from a very good multilingual model that can’t do nepali yet, but is trained on a considerable amount of devanagari text.

for example, use models like sarvamai/sarvam-translate or ai4bharat/indictrans2-indic-indic-1B (both on Hugging Face)

lora or full finetune?

usually you will be constrained on gpu resources (i guess), so please use lora finetuning.

my suggestion would be to keep the lora rank at half of the lora alpha, i.e. alpha should be (at least) twice the rank. the logic i use for this is below:

paper for the above excerpt - arxiv.org
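
for reference, here is the standard lora formulation (this is general background, not the excerpt above). lora freezes the pretrained weight W_0 and learns a low-rank update that is scaled by alpha / r, where r is the rank:

\[ W = W_0 + \frac{\alpha}{r} B A, \quad B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times k} \]

with alpha = 2r the scale is exactly 2 whatever rank you pick, so you can change r without also changing how strongly the adapter update is applied. for example r = 16, alpha = 32 (illustrative values only).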

where to get the data?

case 1 : the internet already has digitized data

go to huggingface and search by language

case 2 : image data exists but is not digitized

this is devanagari text at the end of the day, so we can easily ocr it!

pull up gemini flash and split your image data into chunks of 10-15 pages [ i found it worked best at this chunk size ]

tell it to output the result as json, like [nepali : {}, english:{}]

and you get the final output after a bit of cleaning
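
the "bit of cleaning" could look roughly like the sketch below. it assumes the model returns a json list of objects with "nepali" and "english" keys (my reading of the format above), possibly wrapped in a markdown fence. page_chunks and parse_ocr_response are hypothetical helper names, and the gemini call itself is left out.

```python
import json
import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")  # Devanagari Unicode block
FENCE = "`" * 3  # markdown code fence that models often wrap JSON in


def page_chunks(num_pages, chunk_size=12):
    """Return (start, end) page ranges, end exclusive, of at most chunk_size pages."""
    return [(s, min(s + chunk_size, num_pages)) for s in range(0, num_pages, chunk_size)]


def parse_ocr_response(raw):
    """Turn one raw model response into a list of (nepali, english) pairs."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the surrounding fence and an optional "json" language tag.
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]
    pairs = []
    for item in json.loads(text):
        if not isinstance(item, dict):
            continue
        ne = str(item.get("nepali", "")).strip()
        en = str(item.get("english", "")).strip()
        # Keep a pair only if both sides are non-empty and the nepali side
        # really contains Devanagari characters.
        if ne and en and DEVANAGARI.search(ne):
            pairs.append((ne, en))
    return pairs
```

a script check like this only catches empty or wrong-script rows. it cannot tell nepali from hindi, so a manual spot check is still worth doing.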

extension for case 2 : where do I get data for my ocr

we should think in 3 levels : word level, sentence level and document level

document level is very tough to solve even for high-resource languages, so let’s avoid that

for word level
the best source of data is english-nepali dictionaries. you can scrape them and get accurate translations for nepali at word level

for sentence level
two ways to go about this:

train the model on the word-level data, then split english sentences into words and translate them into nepali word by word. the problem is that these sentences won’t retain context well, and many times they don’t make much sense. use this method only for very, very low-resource languages.
court judgements are a good source, as many countries publish translations of them. many folklore books in nepali have english translations, so collect both versions in parallel. there are many, many sources, you just have to find them.

how do i finetune?

best way is to use unsloth for finetuning: see Unsloth Notebooks in the Unsloth Documentation
hop on any of their notebooks and finetune the model

the hyperparams i find best for MT are as follows —

from trl import SFTConfig, SFTTrainer

# model, tokenizer and dataset are the objects created earlier in the unsloth notebook
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 16,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 1000,
        learning_rate = 3e-5,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
        report_to = "wandb", # Use TrackIO/WandB etc
    ),
)
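
a quick sanity check on what these numbers mean in practice. this is plain arithmetic on the config above; num_gpus = 1 and the dataset size in the usage note are my assumptions.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
max_steps = 1000
num_gpus = 1  # assumption: single-GPU training

# Sequences that contribute to each optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Total sequences the model sees over the whole run.
examples_seen = effective_batch * max_steps


def epochs_covered(dataset_size):
    """How many passes over a dataset of this size the run amounts to."""
    return examples_seen / dataset_size
```

so the run uses an effective batch of 64 and sees 64,000 sequences in total. with a hypothetical 32,000 sentence pairs, that is 2 epochs.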

any instruction/system prompt?

the system prompt that has given me the best results day after day is the translate gemma (arxiv.org) system prompt

You are a professional {source_lang} ({src_lang_code}) to {target_lang}
({tgt_lang_code}) translator. Your goal is to accurately convey the meaning and
nuances of the original {source_lang} text while adhering to {target_lang} grammar,
vocabulary, and cultural sensitivities. Produce only the {target_lang}
translation, without any additional explanations or commentary. Please translate
the following {source_lang} text into {target_lang}:\n\n\n{text}

in some cases, just using this gave me a boost of 5-6 chrf.
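
a minimal sketch of filling in the template for english → nepali. build_prompt is a hypothetical helper, the default language names and codes are my choice, and i am assuming the \n sequences in the template are real newlines.

```python
TEMPLATE = (
    "You are a professional {source_lang} ({src_lang_code}) to {target_lang} "
    "({tgt_lang_code}) translator. Your goal is to accurately convey the meaning and "
    "nuances of the original {source_lang} text while adhering to {target_lang} grammar, "
    "vocabulary, and cultural sensitivities. Produce only the {target_lang} "
    "translation, without any additional explanations or commentary. Please translate "
    "the following {source_lang} text into {target_lang}:\n\n\n{text}"
)


def build_prompt(text, source_lang="English", src_lang_code="en",
                 target_lang="Nepali", tgt_lang_code="ne"):
    """Fill the translation template for a single source sentence."""
    return TEMPLATE.format(
        source_lang=source_lang,
        src_lang_code=src_lang_code,
        target_lang=target_lang,
        tgt_lang_code=tgt_lang_code,
        text=text,
    )
```

use the exact same prompt at training time and at inference time, otherwise the comparison between runs is not apples to apples.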

how do i evaluate?

which metric do i evaluate on? i found chrf to be the best metric for low-resource languages. this paper (Taking MT Evaluation Metrics to Extremes: Beyond Correlation with Human Judgments) validated that as well.
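
to make the metric concrete, here is a simplified sentence-level sketch of chrf: character n-gram precision and recall averaged over n = 1..6, combined into an f-score with beta = 2. it is meant to show the idea, not to reproduce any library’s exact numbers. for reported scores use an established implementation such as sacrebleu.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF on a 0-100 scale for one hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

because it works on characters rather than words, a translation with the right stem but a slightly wrong suffix still gets partial credit, which matters a lot for morphologically rich languages.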

let’s talk about some innovative techniques

iterative backtranslation

paper to read — Iterative Back-Translation for Neural Machine Translation

the logic is simple

train a weak en ↔ nepali model.
translate a large monolingual nepali corpus → synthetic english.
train english → nepali on the synthetic pairs (synthetic english, real nepali).
re-translate the monolingual corpus with the improved model.
repeat.

each round improves quality.

why does it work? better MT means better synthetic data, which in turn means better MT.
this bootstrapping loop is really good!
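
the loop can be sketched as below. train and translate are hypothetical callables you supply (for example thin wrappers around your finetuning and generation code). the english monolingual corpus mono_en is my addition: it follows the usual two-direction formulation, so that the nepali → english model also improves between rounds.

```python
def iterative_back_translation(parallel, mono_ne, mono_en, train, translate, rounds=3):
    """
    parallel:  list of real (english, nepali) pairs
    mono_ne:   monolingual nepali sentences
    mono_en:   monolingual english sentences
    train(pairs) -> model, where pairs are (source, target)
    translate(model, sentences) -> list of translated sentences
    Returns the final (en_to_ne, ne_to_en) models.
    """
    real_en_ne = list(parallel)
    real_ne_en = [(ne, en) for en, ne in parallel]

    # Round 0: weak models trained on the real parallel data only.
    ne_to_en = train(real_ne_en)
    en_to_ne = train(real_en_ne)

    for _ in range(rounds):
        # Back-translate nepali; the real nepali sentence is the target side.
        synthetic_en = translate(ne_to_en, mono_ne)
        en_to_ne = train(real_en_ne + list(zip(synthetic_en, mono_ne)))

        # Back-translate english; the real english sentence is the target side.
        synthetic_ne = translate(en_to_ne, mono_en)
        ne_to_en = train(real_ne_en + list(zip(synthetic_ne, mono_en)))

    return en_to_ne, ne_to_en
```

the key detail is that the synthetic text always sits on the source side and the human-written text on the target side, so the model never learns to produce noisy output.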

variation for iterative backtranslation

instead of blindly using synthetic pairs, score translations using:

lm perplexity
round-trip consistency
language id confidence

this prevents noise explosion in early rounds.
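
a rough sketch of such a filter, using stand-ins that need no extra models: difflib similarity for round-trip consistency, and the share of devanagari characters as a crude proxy for language id (it cannot separate nepali from hindi, so swap in a real language-id model if you have one). the row keys, thresholds and function names are all hypothetical, and perplexity is assumed to be precomputed by whatever lm you use.

```python
from difflib import SequenceMatcher


def devanagari_ratio(text):
    """Share of non-space characters that fall in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0900" <= c <= "\u097f" for c in chars) / len(chars)


def round_trip_score(original_ne, round_trip_ne):
    """Similarity in [0, 1] between a sentence and its ne -> en -> ne round trip."""
    return SequenceMatcher(None, original_ne, round_trip_ne).ratio()


def filter_synthetic(rows, min_round_trip=0.6, min_script=0.8, max_perplexity=None):
    """
    rows: dicts with keys "ne", "synthetic_en", "round_trip_ne" and,
          optionally, a precomputed "perplexity".
    Returns the (synthetic_en, ne) pairs that pass every check.
    """
    kept = []
    for row in rows:
        if round_trip_score(row["ne"], row["round_trip_ne"]) < min_round_trip:
            continue
        if devanagari_ratio(row["ne"]) < min_script:
            continue
        if max_perplexity is not None and row.get("perplexity", 0.0) > max_perplexity:
            continue
        kept.append((row["synthetic_en"], row["ne"]))
    return kept
```

start with strict thresholds in the first rounds and relax them as the models get better.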

how do i get better synthetic data?

paper to read — openaccess.thecvf.com

idea

train two separate models: model a and model b
generate synthetic data from both.

keep only examples where:

\[ \mathrm{KL}(p_a \,\|\, p_b) < \epsilon \]

this ensures synthetic data stability.
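
one way to apply the criterion, as a sketch. i am assuming both models score the same candidate translation and give one probability distribution per token position over a shared vocabulary, and that the per-token KL values are averaged. the "p_a" / "p_b" keys and the epsilon value are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_token_kl(dists_a, dists_b):
    """Average KL(p_a || p_b) over aligned token positions."""
    if not dists_a:
        return float("inf")  # nothing to compare, treat as unstable
    return sum(kl_divergence(p, q) for p, q in zip(dists_a, dists_b)) / len(dists_a)


def keep_stable(examples, epsilon=0.1):
    """Keep only the examples on which model a and model b roughly agree."""
    return [ex for ex in examples if mean_token_kl(ex["p_a"], ex["p_b"]) < epsilon]
```

KL is not symmetric, so KL(p_a || p_b) and KL(p_b || p_a) can differ. if that bothers you, average the two directions.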

cross lingual consistency regularization for synthetic data

paper to read — Unsupervised Data Augmentation for Consistency Training

idea

for each sentence:

original lr sentence
backtranslated lr sentence
paraphrased lr sentence

force embedding similarity:

\[ \mathcal{L}_{\text{consistency}} = \lVert f(x) - f(\tilde{x}) \rVert^2 \]

you make the model invariant to synthetic noise. it forces the embeddings of semantically equivalent sentences to be close.
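
in code the loss is just a squared distance between two embedding vectors. a tiny framework-free sketch is below. here the embeddings are plain lists of floats, while in a real training loop f would be your encoder and the term would be added to the translation loss with some weight (both of those are my assumptions).

```python
def consistency_loss(emb_x, emb_x_tilde):
    """Squared L2 distance between f(x) and f(x~)."""
    return sum((a - b) ** 2 for a, b in zip(emb_x, emb_x_tilde))


def batch_consistency_loss(pairs):
    """Mean consistency loss over (f(x), f(x~)) embedding pairs."""
    return sum(consistency_loss(x, x_tilde) for x, x_tilde in pairs) / len(pairs)
```

here x is the original sentence and x~ is its backtranslated or paraphrased version.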

curriculum learning

good survey paper — Curriculum Learning: A Survey

idea

rank samples by:

sentence length
perplexity
morphological complexity
noise score

train in stages: easy sentences then medium then hard

this reduces early instability.
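
a minimal sketch of the staging step, using sentence length as the only difficulty signal. length_difficulty and curriculum_stages are hypothetical names, and you can pass any scoring function (perplexity, noise score, a weighted mix) as the difficulty argument.

```python
def length_difficulty(pair):
    """Difficulty of a (source, target) pair = word count of its longer side."""
    source, target = pair
    return max(len(source.split()), len(target.split()))


def curriculum_stages(examples, difficulty=length_difficulty, num_stages=3):
    """Sort examples from easy to hard and split them into at most num_stages buckets."""
    ranked = sorted(examples, key=difficulty)
    if not ranked:
        return []
    size = -(-len(ranked) // num_stages)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```

train on the first bucket, then the second, then the third. a common variation is to keep the earlier buckets in the mix when moving to a harder stage, so the model does not forget the easy cases.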

cycledistill

paper to read — CycleDistill: Bootstrapping Machine Translation using LLMs with...

idea

start with a large language model (llm) capable of few-shot translation.
use it to generate synthetic parallel corpora from monolingual text (e.g., lr language → english).
fine-tune the model on that synthetic parallel data.
repeat the cycle: generate more synthetic data using the updated model, then refine again.
optionally leverage softmax activations (soft target distributions) during distillation to improve performance.
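
the last step, sketched as a generic soft-target distillation loss for a single token position. this is standard knowledge distillation as i understand it, not necessarily the paper’s exact recipe, and the temperature value is a hypothetical default.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one token's logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def soft_target_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened distribution."""
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))
```

compared with training on the teacher’s single best translation, the soft targets also tell the student which alternative tokens the teacher considered plausible.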

final words

innovate. all the ideas above are just some examples. you must look at the problem and innovate solutions! sometimes even use 2-3 methods together. you don’t know what would work :)