222 research outputs found
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
We present Samanantar, the largest publicly available parallel corpora
collection for Indic languages. The collection contains a total of 49.7 million
sentence pairs between English and 11 Indic languages (from two language
families). Specifically, we compile 12.4 million sentence pairs from existing,
publicly-available parallel corpora, and additionally mine 37.4 million
sentence pairs from the web, resulting in a 4x increase. We mine the parallel
sentences from the web by combining many corpora, tools, and methods: (a)
web-crawled monolingual corpora, (b) document OCR for extracting sentences from
scanned documents, (c) multilingual representation models for aligning
sentences, and (d) approximate nearest neighbor search for searching in a large
collection of sentences. Human evaluation of samples from the newly mined
corpora validate the high quality of the parallel sentences across 11
languages. Further, we extract 83.4 million sentence pairs between all 55 Indic
language pairs from the English-centric parallel corpus using English as the
pivot language. We trained multilingual NMT models spanning all these languages
on Samanantar, which outperform existing models and baselines on publicly
available benchmarks, such as FLORES, establishing the utility of Samanantar.
Our data and models are available publicly at
https://indicnlp.ai4bharat.org/samanantar/ and we hope they will help advance
research in NMT and multilingual NLP for Indic languages.Comment: Accepted to the Transactions of the Association for Computational
Linguistics (TACL
MILDSum: A Novel Benchmark Dataset for Multilingual Summarization of Indian Legal Case Judgments
Automatic summarization of legal case judgments is a practically important
problem that has attracted substantial research efforts in many countries. In
the context of the Indian judiciary, there is an additional complexity --
Indian legal case judgments are mostly written in complex English, but a
significant portion of India's population lacks command of the English
language. Hence, it is crucial to summarize the legal documents in Indian
languages to ensure equitable access to justice. While prior research primarily
focuses on summarizing legal case judgments in their source languages, this
study presents a pioneering effort toward cross-lingual summarization of
English legal documents into Hindi, the most frequently spoken Indian language.
We construct the first high-quality legal corpus comprising of 3,122 case
judgments from prominent Indian courts in English, along with their summaries
in both English and Hindi, drafted by legal practitioners. We benchmark the
performance of several diverse summarization approaches on our corpus and
demonstrate the need for further research in cross-lingual summarization in the
legal domain.Comment: Accepted at EMNLP 2023 (Main Conference
IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages
India has a rich linguistic landscape with languages from 4 major language
families spoken by over a billion people. 22 of these languages are listed in
the Constitution of India (referred to as scheduled languages) are the focus of
this work. Given the linguistic diversity, high-quality and accessible Machine
Translation (MT) systems are essential in a country like India. Prior to this
work, there was (i) no parallel training data spanning all the 22 languages,
(ii) no robust benchmarks covering all these languages and containing content
relevant to India, and (iii) no existing translation models which support all
the 22 scheduled languages of India. In this work, we aim to address this gap
by focusing on the missing pieces required for enabling wide, easy, and open
access to good machine translation systems for all 22 scheduled Indian
languages. We identify four key areas of improvement: curating and creating
larger training datasets, creating diverse and high-quality benchmarks,
training multilingual models, and releasing models with open access. Our first
contribution is the release of the Bharat Parallel Corpus Collection (BPCC),
the largest publicly available parallel corpora for Indic languages. BPCC
contains a total of 230M bitext pairs, of which a total of 126M were newly
added, including 644K manually translated sentence pairs created as part of
this work. Our second contribution is the release of the first n-way parallel
benchmark covering all 22 Indian languages, featuring diverse domains,
Indian-origin content, and source-original test sets. Next, we present
IndicTrans2, the first model to support all 22 languages, surpassing existing
models on multiple existing and new benchmarks created as a part of this work.
Lastly, to promote accessibility and collaboration, we release our models and
associated data with permissive licenses at
https://github.com/ai4bharat/IndicTrans2
Development of Domain Specific Cluster : An Integrated Framework for College Libraries under the University of Burdwan
This paper discusses the development of six domain specific cluster software in the college libraries under the University of Burdwan. Library is the heart of educational institutions. So, as to select the open source relevant with comprehensive software and global parameters on the basis of global recommendations like IFLA-Working Group, Integrated Library System for Discovery Interface (ILS-DI), Request for Proposals (RFP), Request for Comments (RFC), Service Oriented Architectre (SOA) and Open Library Environment Projects (OLE) including the areas like integrated library system cluster, digital media archiving cluster, content management system cluster, learning content management system cluster, federated search system cluster and college communication interaction cluster for designing and developing the college libraries under the University of Burdwan. Also develop the single window based interface in six domain specific cluster for the college librarians and the users to access their necessary resources through open source software and open standards. These six domain specific cluster softwares are to be selected for easily managed the digital and library resources in the college libraries affiliated to the University of Burdwan. This integrated framework can easily managed the housekeeping operations and information retrieval systems like acquisition, cataloguing, circulation, member generation, authority control, report generation and online public access catalogue for the users as well as library professionals also
Survey on Publicly Available Sinhala Natural Language Processing Tools and Research
Sinhala is the native language of the Sinhalese people who make up the
largest ethnic group of Sri Lanka. The language belongs to the globe-spanning
language tree, Indo-European. However, due to poverty in both linguistic and
economic capital, Sinhala, in the perspective of Natural Language Processing
tools and research, remains a resource-poor language which has neither the
economic drive its cousin English has nor the sheer push of the law of numbers
a language such as Chinese has. A number of research groups from Sri Lanka have
noticed this dearth and the resultant dire need for proper tools and research
for Sinhala natural language processing. However, due to various reasons, these
attempts seem to lack coordination and awareness of each other. The objective
of this paper is to fill that gap of a comprehensive literature survey of the
publicly available Sinhala natural language tools and research so that the
researchers working in this field can better utilize contributions of their
peers. As such, we shall be uploading this paper to arXiv and perpetually
update it periodically to reflect the advances made in the field
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
We present NusaCrowd, a collaborative initiative to collect and unify
existing resources for Indonesian languages, including opening access to
previously non-public resources. Through this initiative, we have brought
together 137 datasets and 118 standardized data loaders. The quality of the
datasets has been assessed manually and automatically, and their value is
demonstrated through multiple experiments. NusaCrowd's data collection enables
the creation of the first zero-shot benchmarks for natural language
understanding and generation in Indonesian and the local languages of
Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual
automatic speech recognition benchmark in Indonesian and the local languages of
Indonesia. Our work strives to advance natural language processing (NLP)
research for languages that are under-represented despite being widely spoken
Social media mining under the COVID-19 context: Progress, challenges, and opportunities
Social media platforms allow users worldwide to create and share information, forging vast sensing networks that
allow information on certain topics to be collected, stored, mined, and analyzed in a rapid manner. During the
COVID-19 pandemic, extensive social media mining efforts have been undertaken to tackle COVID-19 challenges
from various perspectives. This review summarizes the progress of social media data mining studies in the
COVID-19 contexts and categorizes them into six major domains, including early warning and detection, human
mobility monitoring, communication and information conveying, public attitudes and emotions, infodemic and
misinformation, and hatred and violence. We further document essential features of publicly available COVID-19
related social media data archives that will benefit research communities in conducting replicable and repro�ducible studies. In addition, we discuss seven challenges in social media analytics associated with their potential
impacts on derived COVID-19 findings, followed by our visions for the possible paths forward in regard to social
media-based COVID-19 investigations. This review serves as a valuable reference that recaps social media mining
efforts in COVID-19 related studies and provides future directions along which the information harnessed from
social media can be used to address public health emergencies
- …