A Two-Stage Noisy Pre-training and Fine-tuning Pipeline for Low-Resource Named Entity Recognition in Shahmukhi Punjabi

Basir, Nazish; Qabulio, Mumtaz; Memon, Muhammad Suleman; Arain, Danish Nazir; Nizamani, Sehrish Basir

research article

oai:ojs.pkp.sfu.ca:article/2371

Authors: Nazish Basir, Mumtaz Qabulio, Muhammad Suleman Memon, Danish Nazir Arain, Sehrish Basir Nizamani
Publication date: 10 April 2026
Publisher: VFAST Research Platform

Abstract

Named Entity Recognition (NER) for low-resource languages remains a critical challenge in natural language processing, particularly for scripts with limited annotated corpora. This paper addresses this challenge for Punjabi written in Shahmukhi, an underrepresented Perso-Arabic script used by millions in Pakistan. We propose a two-stage training pipeline that leverages a large-scale machine-labeled corpus, generated by a Bagging-CRF ensemble, to warm-start multilingual transformer models before fine-tuning on a small, gold-standard human-annotated dataset. We evaluate five state-of-the-art multilingual transformers (mBERT, XLM-R, mmBERT, RemBERT, and mDeBERTa-V3) under two experimental settings: (A) direct supervised fine-tuning on the human-labeled dataset, and (B) the proposed two-stage pipeline. The human-labeled dataset comprises 979 sentences and 25,221 tokens, while the larger machine-labeled corpus has 16,586 sentences and 336,502 tokens; both cover 13 entity types. Experimental results demonstrate consistent improvements across all five models, with mmBERT and RemBERT achieving the highest weighted F1 scores of 0.85 and 0.86, respectively. The most striking gains are observed for mDeBERTa-V3 (+0.21 F1, 39.6% relative) and XLM-R (+0.20 F1, 33.3% relative), indicating that the two-stage pipeline provides the greatest benefit to models with limited baseline performance on low-resource scripts. These results validate the effectiveness of noisy domain adaptation as a data augmentation strategy for low-resource NER in morphologically rich, right-to-left scripts.
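The figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: it uses nothing but the numbers stated above, and the helper names (`tokens_per_sentence`, `implied_scores`) are hypothetical, not taken from the paper. The baseline and two-stage F1 values it derives are implied by the reported absolute and relative gains; they are not reported in the abstract and are only approximate, because the inputs are rounded.

```python
# Figures quoted in the abstract (not re-measured here).
HUMAN = {"sentences": 979, "tokens": 25_221}
MACHINE = {"sentences": 16_586, "tokens": 336_502}

# (absolute F1 gain, relative gain) for the two largest reported improvements.
GAINS = {"mDeBERTa-V3": (0.21, 0.396), "XLM-R": (0.20, 0.333)}


def tokens_per_sentence(corpus):
    """Average sentence length in tokens for a corpus summary dict."""
    return corpus["tokens"] / corpus["sentences"]


def implied_scores(abs_gain, rel_gain):
    """Back out the baseline and two-stage F1 implied by the reported gains.

    Relative gain = absolute gain / baseline, so baseline = absolute / relative.
    Both inputs are rounded in the abstract, so the outputs are approximate
    and should not be read as values reported by the authors.
    """
    baseline = abs_gain / rel_gain
    return baseline, baseline + abs_gain


if __name__ == "__main__":
    ratio = MACHINE["sentences"] / HUMAN["sentences"]
    print(f"machine/human sentence ratio: {ratio:.1f}x")
    print(f"human tokens per sentence:   {tokens_per_sentence(HUMAN):.1f}")
    print(f"machine tokens per sentence: {tokens_per_sentence(MACHINE):.1f}")
    for model, (abs_gain, rel_gain) in GAINS.items():
        baseline, two_stage = implied_scores(abs_gain, rel_gain)
        print(f"{model}: implied F1 {baseline:.2f} -> {two_stage:.2f}")
```

Under these assumptions, the machine-labeled corpus is roughly 17 times larger than the human-labeled one by sentence count, and the reported gains imply a move from about 0.53 to 0.74 weighted F1 for mDeBERTa-V3 and from about 0.60 to 0.80 for XLM-R.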

Last time updated on 16/05/2026

This paper was published in VFAST - Virtual Foundation for Advancement of Science and Technology (Pakistan).
