Linguistically-Informed Neural Architectures for Lexical, Syntactic and Semantic Tasks in Sanskrit
The primary focus of this thesis is to make Sanskrit manuscripts more
accessible to end-users through natural language technologies. The
morphological richness, compounding, free word order, and low-resource
nature of Sanskrit pose significant challenges for developing deep learning
solutions. We identify four fundamental tasks that are crucial for developing
robust NLP technology for Sanskrit: word segmentation, dependency parsing,
compound type identification, and poetry analysis. The first task, Sanskrit
Word Segmentation (SWS), is a foundational text processing step for other
downstream applications. However, it is challenging due to the sandhi
phenomenon that modifies characters at word boundaries. Similarly, the existing
dependency parsing approaches struggle with morphologically rich and
low-resource languages like Sanskrit. Compound type identification is also
challenging for Sanskrit due to the context-sensitive semantic relation between
components. All these challenges result in sub-optimal performance in NLP
applications like question answering and machine translation. Finally, Sanskrit
poetry has not been extensively studied in computational linguistics.
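The sandhi phenomenon mentioned above can be made concrete with a small sketch. The rule table and helper below are illustrative simplifications invented for this example (real sandhi resolution involves many more rules and lexicon or model scoring); they are not the thesis's actual segmenter.

```python
# Toy illustration of why sandhi complicates word segmentation.
# When two Sanskrit words meet, sounds at the boundary can merge:
# e.g. "na" + "iti" surfaces as "neti" (a + i -> e), so a segmenter
# cannot simply split on surface substrings; it must hypothesize the
# original boundary characters and then score the candidates.

# Hypothetical, highly simplified rules: fused surface character -> possible sources.
SANDHI_RULES = {
    "e": [("a", "i"), ("a", "ī")],   # vowel coalescence (illustrative subset)
    "o": [("a", "u")],               # vowel coalescence (illustrative subset)
}

def candidate_splits(surface: str) -> list[tuple[str, str]]:
    """Enumerate possible (word1, word2) analyses of a fused surface form."""
    candidates = []
    for i, ch in enumerate(surface):
        for left, right in SANDHI_RULES.get(ch, []):
            # Undo the merge: replace the fused vowel with its two sources.
            candidates.append((surface[:i] + left, right + surface[i + 1:]))
    return candidates

print(candidate_splits("neti"))
# → [('na', 'iti'), ('na', 'īti')]
```

Each candidate split would then have to be validated against a lexicon or scored by a statistical or neural model, which is where the learning-based approaches discussed in the thesis come in.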
While addressing these challenges, this thesis makes various contributions:
(1) The thesis proposes linguistically-informed neural architectures for these
tasks. (2) We showcase the interpretability and multilingual extension of the
proposed systems. (3) Our proposed systems report state-of-the-art performance.
(4) Finally, we present a neural toolkit named SanskritShala, a web-based
application that provides real-time analysis of input for various NLP tasks.
Overall, this thesis contributes to making Sanskrit manuscripts more accessible
by developing robust NLP technology and releasing various resources, datasets,
and a web-based toolkit. Comment: Ph.D. dissertation
SanskritShala: A Neural Sanskrit NLP Toolkit with Web-Based Interface for Pedagogical and Annotation Purposes
We present a neural Sanskrit Natural Language Processing (NLP) toolkit named
SanskritShala (a school of Sanskrit) to facilitate computational linguistic
analyses for several tasks such as word segmentation, morphological tagging,
dependency parsing, and compound type identification. Our systems currently
report state-of-the-art performance on available benchmark datasets for all
tasks. SanskritShala is deployed as a web-based application, which allows a
user to get real-time analysis for the given input. It is built with
easy-to-use interactive data annotation features that allow annotators to
correct the system predictions when it makes mistakes. We publicly release the
source code of the 4 modules included in the toolkit, 7 word embedding models
trained on publicly available Sanskrit corpora, and multiple annotated datasets
such as word similarity, relatedness, categorization, and analogy prediction to
assess intrinsic properties of word embeddings. To the best of our knowledge,
this is the first neural-based Sanskrit NLP toolkit with a web-based interface
and a number of NLP modules. We expect that those working with Sanskrit will
find it useful for pedagogical and annotation purposes. SanskritShala is
available at:
https://cnerg.iitkgp.ac.in/sanskritshala. The demo video of our platform can be
accessed at: https://youtu.be/x0X31Y9k0mw4. Comment: 7 pages, accepted at ACL 2023 (Demo track), held at Toronto, Canada
Evaluation of Computational Grammar Formalisms for Indian Languages
Natural language parsing has been a prominent research area since the genesis of Natural Language Processing. Probabilistic parsers are being developed to make parser development easier, more accurate, and faster. In the Indian context, which computational grammar formalism to use is still an open question. In this paper, we focus on this problem and analyze different formalisms for Indian languages.
PersoNER: Persian named-entity recognition
© 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To bridge this gap, in this paper we target the Persian language, which is spoken by a population of over a hundred million people worldwide. We first present and provide ArmanPersoNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNLL scores while outperforming two alternatives based on a CRF and a recurrent neural network.
Introduction to the special issue on annotated corpora
Annotated corpora are increasingly important for linguistic scholarship, science, and technology. This special issue briefly surveys the development of the field and points to challenges within the current framework of annotation using analytical categories, as well as challenges to the framework itself. It presents three articles: one concerning the evaluation of the quality of annotation, and two concerning French treebanks, one dealing with the oldest treebank project for French, the French Treebank, and the second concerning the conversion of French corpora into the cross-lingual framework of Universal Dependencies, thus offering an illustration of the history of treebank development worldwide.