Towards Constructing Corpus of Punjabi N-grams Written in Gurmukhi Script

Abstract

The availability of a robust corpus is crucial for developing linguistic resources. For the Punjabi language, written in the Gurmukhi script, the scarcity of such a resource hinders the validation of various natural language processing (NLP) techniques. This paper addresses this gap by presenting the creation of a comprehensive corpus for Punjabi in Gurmukhi. The corpus, with approximately 23 million words drawn from diverse published materials, serves as a valuable foundation for NLP research. Additionally, the paper describes a dedicated corpus processing tool designed specifically for Punjabi. This tool employs a novel method for constructing word, bigram, and trigram levels of the corpus, applicable for building such resources for any script. As a demonstration, we showcase a generated dataset composed of approximately 15.5 million Punjabi words and 50 million character

    Similar works