40 research outputs found
Linguistically-Informed Neural Architectures for Lexical, Syntactic and Semantic Tasks in Sanskrit
The primary focus of this thesis is to make Sanskrit manuscripts more
accessible to the end-users through natural language technologies. The
morphological richness, compounding, free word orderliness, and low-resource
nature of Sanskrit pose significant challenges for developing deep learning
solutions. We identify four fundamental tasks, which are crucial for developing
a robust NLP technology for Sanskrit: word segmentation, dependency parsing,
compound type identification, and poetry analysis. The first task, Sanskrit
Word Segmentation (SWS), is a fundamental text processing task for any other
downstream applications. However, it is challenging due to the sandhi
phenomenon that modifies characters at word boundaries. Similarly, the existing
dependency parsing approaches struggle with morphologically rich and
low-resource languages like Sanskrit. Compound type identification is also
challenging for Sanskrit due to the context-sensitive semantic relation between
components. All these challenges result in sub-optimal performance in NLP
applications like question answering and machine translation. Finally, Sanskrit
poetry has not been extensively studied in computational linguistics.
While addressing these challenges, this thesis makes various contributions:
(1) The thesis proposes linguistically-informed neural architectures for these
tasks. (2) We showcase the interpretability and multilingual extension of the
proposed systems. (3) Our proposed systems report state-of-the-art performance.
(4) Finally, we present a neural toolkit named SanskritShala, a web-based
application that provides real-time analysis of input for various NLP tasks.
Overall, this thesis contributes to making Sanskrit manuscripts more accessible
by developing robust NLP technology and releasing various resources, datasets,
and web-based toolkit.Comment: Ph.D. dissertatio
Location Reference Recognition from Texts: A Survey and Comparison
A vast amount of location information exists in unstructured texts, such as social media posts, news stories, scientific articles, web pages, travel blogs, and historical archives. Geoparsing refers to recognizing location references from texts and identifying their geospatial representations. While geoparsing can benefit many domains, a summary of its specific applications is still missing. Further, there is a lack of a comprehensive review and comparison of existing approaches for location reference recognition, which is the first and core step of geoparsing. To fill these research gaps, this review first summarizes seven typical application domains of geoparsing: geographic information retrieval, disaster management, disease surveillance, traffic management, spatial humanities, tourism management, and crime management. We then review existing approaches for location reference recognition by categorizing these approaches into four groups based on their underlying functional principle: rule-based, gazetteer matching–based, statistical learning-–based, and hybrid approaches. Next, we thoroughly evaluate the correctness and computational efficiency of the 27 most widely used approaches for location reference recognition based on 26 public datasets with different types of texts (e.g., social media posts and news stories) containing 39,736 location references worldwide. Results from this thorough evaluation can help inform future methodological developments and can help guide the selection of proper approaches based on application needs
Investigating multilingual approaches for parsing universal dependencies
Multilingual dependency parsing encapsulates any attempt to parse multiple languages. It can involve parsing multiple languages in isolation (poly-monolingual), leveraging training data from multiple languages to process any of the included languages (polyglot), or training on one or multiple languages to process a low-resource language with no training data (zero-shot). In this thesis, we explore multilingual dependency parsing across all three paradigms, first analysing whether polyglot training on a number of source languages is beneficial for processing a target language in a zero-shot cross-lingual dependency parsing experiment using annotation projection. The results of this experiment show that polyglot training produces an overall trend of better results on the target language but a highly-related single source language can still be better for transfer.
We then look at the role of pretrained language models in processing a moderately low-resource language in Irish. Here, we develop our own monolingual Irish BERT model gaBERT from scratch and compare it to a number of multilingual baselines, showing that developing a monolingual language model for Irish is worthwhile. We then turn to the topic of parsing Enhanced Universal Dependencies (EUD) Graphs, which are an extension of basic Universal Dependencies trees, where we describe the DCU-EPFL submission to the 2021 IWPT shared task on EUD parsing. Here, we developed a multitask model to jointly learn the tasks of basic dependency parsing and EUD graph parsing, showing improvements over a single-task basic dependency parser. Lastly, we revisit the topic of polyglot parsing and investigate whether multiview learning can be applied to the problem of multilingual dependency parsing. Here, we learn different views based on the dataset source. We show that multiview learning can be used to train parsers with multiple datasets, showing a general improvement over single-view baselines
Location reference recognition from texts: A survey and comparison
A vast amount of location information exists in unstructured texts, such as social media posts, news stories, scientific articles, web pages, travel blogs, and historical archives. Geoparsing refers to the process of recognizing location references from texts and identifying their geospatial representations. While geoparsing can benefit many domains, a summary of the specific applications is still missing. Further, there lacks a comprehensive review and comparison of existing approaches for location reference recognition, which is the first and a core step of geoparsing. To fill these research gaps, this review first summarizes seven typical application domains of geoparsing: geographic information retrieval, disaster management, disease surveillance, traffic management, spatial humanities, tourism management, and crime management. We then review existing approaches for location reference recognition by categorizing these approaches into four groups based on their underlying functional principle: rule-based, gazetteer matching-based, statistical learning-based, and hybrid approaches. Next, we thoroughly evaluate the correctness and computational efficiency of the 27 most widely used approaches for location reference recognition based on 26 public datasets with different types of texts (e.g., social media posts and news stories) containing 39,736 location references across the world. Results from this thorough evaluation can help inform future methodological developments for location reference recognition, and can help guide the selection of proper approaches based on application needs
A Kind Introduction to Lexical and Grammatical Aspect, with a Survey of Computational Approaches
Aspectual meaning refers to how the internal temporal structure of situations
is presented. This includes whether a situation is described as a state or as
an event, whether the situation is finished or ongoing, and whether it is
viewed as a whole or with a focus on a particular phase. This survey gives an
overview of computational approaches to modeling lexical and grammatical aspect
along with intuitive explanations of the necessary linguistic concepts and
terminology. In particular, we describe the concepts of stativity, telicity,
habituality, perfective and imperfective, as well as influential inventories of
eventuality and situation types. We argue that because aspect is a crucial
component of semantics, especially when it comes to reporting the temporal
structure of situations in a precise way, future NLP approaches need to be able
to handle and evaluate it systematically in order to achieve human-level
language understanding.Comment: Accepted at EACL 2023, camera ready versio