Towards Constructing Corpus of Punjabi N-grams Written in Gurmukhi Script

Kawaljeet Singh, Charanjiv Singh Saroa

Authors: Charanjiv Singh Saroa, Kawaljeet Singh
Publication date: 31 December 2023
Publisher: Auricle Global Society of Education and Research

Abstract

The availability of a robust corpus is crucial for developing linguistic resources. For the Punjabi language, written in the Gurmukhi script, the scarcity of such a resource hinders the validation of various natural language processing (NLP) techniques. This paper addresses this gap by presenting the creation of a comprehensive corpus for Punjabi in Gurmukhi. The corpus, with approximately 23 million words drawn from diverse published materials, serves as a valuable foundation for NLP research. Additionally, the paper describes a dedicated corpus processing tool designed specifically for Punjabi. This tool employs a novel method for constructing word, bigram, and trigram levels of the corpus, applicable for building such resources for any script. As a demonstration, we showcase a generated dataset composed of approximately 15.5 million Punjabi words and 50 million character
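The word, bigram, and trigram levels mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' tool or their method, which the abstract does not detail; it is generic n-gram counting under stated assumptions. The names `GURMUKHI_TOKEN`, `SENTENCE_END`, `sentences`, and `build_counts` are hypothetical, a token is assumed to be a run of characters from the Unicode Gurmukhi block (U+0A00 to U+0A7F), and the sample text is made up for illustration.

```python
import re
from collections import Counter

# Assumption: a token is a maximal run of code points in the Unicode
# Gurmukhi block (U+0A00..U+0A7F). Gurmukhi digits fall in this block too.
GURMUKHI_TOKEN = re.compile(r"[\u0A00-\u0A7F]+")

# Assumption: sentences end at a danda (U+0964), double danda (U+0965),
# Latin sentence punctuation, or a newline.
SENTENCE_END = re.compile(r"[\u0964\u0965.?!\n]+")


def sentences(text):
    """Yield each non-empty sentence as a list of Gurmukhi tokens."""
    for chunk in SENTENCE_END.split(text):
        tokens = GURMUKHI_TOKEN.findall(chunk)
        if tokens:
            yield tokens


def build_counts(text, max_n=3):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n.

    N-grams are counted within a sentence only, so no n-gram spans a
    sentence boundary.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences(text):
        for n in counts:
            # zip over n shifted views of the token list yields n-tuples
            counts[n].update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Made-up sample: two short Punjabi sentences ending in a danda.
sample = "ਪੰਜਾਬੀ ਭਾਸ਼ਾ ਬਹੁਤ ਸੋਹਣੀ ਹੈ। ਪੰਜਾਬੀ ਭਾਸ਼ਾ ਮੇਰੀ ਮਾਂ ਬੋਲੀ ਹੈ।"
counts = build_counts(sample)

for n in sorted(counts):
    print(n, "gram tokens:", sum(counts[n].values()),
          "distinct:", len(counts[n]))
```

Counting within sentences is a design choice made here for the sketch; a corpus builder could equally count across sentence boundaries or add boundary markers, and the abstract does not say which the authors chose.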

International Journal on Recent and Innovation Trends in Computing and Communication

oai:ojs2.ijritcc.com:article/1...

Last time updated on 10/05/2024