Vacaspati: A Diverse Corpus of Bangla Literature
Bangla (or Bengali) is the fifth most spoken language globally; yet the
state of the art in Bangla NLP lags even for simple tasks such as
lemmatization and POS tagging. This is partly due to the lack of a varied, high-quality
corpus. To address this need, we build Vacaspati, a diverse corpus of Bangla
literature. The literary works are collected from various websites; only those
works that are publicly available without copyright violations or restrictions
are collected. We believe that published literature captures the features of a
language much better than newspapers, blogs, or social media posts, which tend to
follow only a certain literary pattern and, therefore, miss out on language
variety. Vacaspati is varied along multiple dimensions, including type
of composition, topic, author, time, and space. It contains more than 11
million sentences and 115 million words. We also build a word embedding model,
Vac-FT, by training FastText on Vacaspati, and train an Electra model,
Vac-BERT, on the same corpus. Vac-BERT has far fewer parameters and requires only
a fraction of the resources of other state-of-the-art transformer models,
yet performs better than or comparably to them on various downstream tasks. On
multiple downstream tasks, Vac-FT outperforms other FastText-based models. We
also demonstrate the efficacy of Vacaspati as a corpus by showing that similar
models built from other corpora are not as effective. The models are available
at https://bangla.iitk.ac.in/
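
As an illustration of how corpus-size figures such as the sentence and word counts reported above could be computed, the following is a minimal Python sketch. It is not the authors' procedure: the helper name `corpus_stats`, the choice of sentence terminators (the danda "।", "?" and "!"), and whitespace tokenization are all assumptions made for this example.

```python
import re

# Assumption: a sentence ends at a danda (U+0964), a question mark, or an
# exclamation mark. Real Bangla text may need a richer rule set.
SENTENCE_END = re.compile(r"[\u0964?!]+")


def corpus_stats(lines):
    """Return (sentence_count, word_count) for an iterable of text lines.

    Hypothetical helper: words are whitespace-separated tokens, and
    empty fragments between terminators are not counted as sentences.
    """
    sentences = 0
    words = 0
    for line in lines:
        for chunk in SENTENCE_END.split(line):
            tokens = chunk.split()
            if tokens:
                sentences += 1
                words += len(tokens)
    return sentences, words


# Tiny made-up sample, not taken from Vacaspati.
sample = ["আমি বই পড়ি। তুমি কি গান গাও?", "সে আজ আসবে না!"]
print(corpus_stats(sample))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so a large corpus can be processed as a stream without loading it into memory.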