Identifying and Modeling Code-Switched Language
Code-switching is the phenomenon by which bilingual speakers switch between multiple languages during written or spoken communication. Given the large populations that routinely code-switch, the importance of developing language technologies able to process code-switched language is immense. Current NLP and speech models break down when used on code-switched data, interrupting the language processing pipeline in back-end systems and forcing users to communicate in ways that are unnatural for them.
There are four main challenges in building code-switched models: the lack of code-switched data on which to train generative language models; the lack of multilingual annotations on code-switched examples, which are needed to train supervised models; limited understanding of how to leverage monolingual and parallel resources to build better code-switched models; and, finally, the question of how to use these models to learn why and when code-switching happens across language pairs. In this thesis, I look into different aspects of these four challenges.
The first part of this thesis focuses on how to obtain reliable corpora of code-switched language. We collected a large corpus of code-switched language from social media using a combination of sentence-level language taggers and sets of anchor words that exist in only one language. The new corpus contains more, and more varied, bilingualism than corpora collected with other strategies, and it helps train better language tagging models. We also proposed a new annotation scheme for obtaining part-of-speech tags for code-switched English-Spanish language. The scheme is composed of three subtasks: automatic labeling, word-specific question labeling, and question-tree word labeling. The part-of-speech labels obtained for the Miami Bangor corpus of English-Spanish conversational speech show very high agreement and accuracy.
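The anchor-word filtering idea can be sketched as follows. This is a minimal illustration, assuming tiny hand-picked word lists; the thesis's actual anchor lexicons and tagger-based filtering are not reproduced here:

```python
# Minimal sketch of anchor-word code-switch detection (illustrative word
# lists, not the lexicons used in the thesis).
EN_ANCHORS = {"the", "because", "really", "tomorrow"}
ES_ANCHORS = {"pero", "porque", "mañana", "siempre"}

def is_code_switch_candidate(sentence: str) -> bool:
    """Flag a sentence if it contains anchor words from both languages."""
    tokens = {t.lower().strip(".,!?") for t in sentence.split()}
    return bool(tokens & EN_ANCHORS) and bool(tokens & ES_ANCHORS)

print(is_code_switch_candidate("I can't wait porque mañana is the day"))  # True
print(is_code_switch_candidate("See you tomorrow at the office"))         # False
```

Because anchor words are unambiguous by construction, a sentence containing anchors from both languages is a strong candidate; a sentence-level language tagger can then confirm or reject it.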
The second part of this thesis focuses on the tasks of part-of-speech tagging and language modeling. For the first task, we proposed a state-of-the-art approach to part-of-speech tagging of code-switched English-Spanish data based on recurrent neural networks. Our models were tested on the Miami Bangor corpus on the task of POS tagging alone, for which we achieved 96.34% accuracy, and on joint part-of-speech and language ID tagging, which achieved similar POS tagging accuracy (96.39%) and very high language ID accuracy (98.78%).
For the task of language modeling, we first conducted an exhaustive analysis of the relationship between cognate words and code-switching. We then proposed a set of cognate-based features that improved language modeling performance by 12% relative. Furthermore, we showed that these features can be used across language pairs and still yield performance improvements.
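The thesis's cognate features are not specified in this abstract; a common proxy for cognate status is orthographic similarity via normalized edit distance, sketched here with invented example word pairs:

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cognate_score(en: str, es: str) -> float:
    """Orthographic similarity in [0, 1]; higher suggests a likely cognate."""
    return 1.0 - edit_distance(en, es) / max(len(en), len(es))

print(cognate_score("hospital", "hospital"))        # 1.0
print(round(cognate_score("family", "familia"), 2))  # 0.71
```

A score like this can be thresholded or binned to produce per-word features for a language model.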
Finally, we tackled the question of how to use monolingual resources for code-switching models by pre-training state-of-the-art cross-lingual language models on large monolingual corpora and fine-tuning them on the tasks of language modeling and word-level language tagging on code-switched data. We obtained state-of-the-art results on both tasks.
Combining ontologies and neural networks for analyzing historical language varieties: a case study in Middle Low German
In this paper, we describe experiments on the morphosyntactic annotation of historical language varieties, using the example of Middle Low German (MLG), the official language of the German Hanse during the Middle Ages and a dominant language around the Baltic Sea at the time. To the best of our knowledge, this is the first experiment in automatically producing morphosyntactic annotations for Middle Low German, and accordingly, no part-of-speech (POS) tagset is currently agreed upon. In our experiment, we illustrate how ontology-based specifications of projected annotations can be employed to circumvent this issue: instead of training and evaluating against a given tagset, we decompose it into independent features, each predicted independently by a neural network. The predicted feature probabilities are then decoded into a sound ontological representation using consistency constraints (axioms) from the ontology. From these representations, we can finally bootstrap a POS tagset that captures only those morphosyntactic features which could be reliably predicted. In this way, our approach is capable of optimizing the precision and recall of morphosyntactic annotations simultaneously with bootstrapping a tagset, rather than performing iterative cycles.
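The decoding step can be illustrated with a toy example: independent per-feature probabilities are combined, and the highest-scoring assignment that satisfies the consistency axioms is kept. The feature inventory and axioms below are invented for illustration and are not the paper's actual MLG ontology:

```python
from itertools import product

# Toy feature inventory and axioms (illustrative, not the paper's ontology).
FEATURES = {
    "pos":   ["NOUN", "VERB"],
    "case":  ["NOM", "ACC", "NONE"],
    "tense": ["PRES", "PAST", "NONE"],
}

def consistent(a):
    # Axioms: nouns carry case but no tense; verbs carry tense but no case.
    if a["pos"] == "NOUN":
        return a["case"] != "NONE" and a["tense"] == "NONE"
    return a["case"] == "NONE" and a["tense"] != "NONE"

def decode(probs):
    """Pick the consistent assignment with the highest product of feature probs."""
    best, best_p = None, -1.0
    for values in product(*FEATURES.values()):
        a = dict(zip(FEATURES, values))
        p = 1.0
        for f, v in a.items():
            p *= probs[f][v]
        if consistent(a) and p > best_p:
            best, best_p = a, p
    return best

# Network output: 'VERB' is likely, but 'ACC' slightly beats 'NONE' for case;
# the axioms force the case decision to 'NONE'.
probs = {
    "pos":   {"NOUN": 0.3, "VERB": 0.7},
    "case":  {"NOM": 0.1, "ACC": 0.5, "NONE": 0.4},
    "tense": {"PRES": 0.6, "PAST": 0.3, "NONE": 0.1},
}
print(decode(probs))  # {'pos': 'VERB', 'case': 'NONE', 'tense': 'PRES'}
```

Exhaustive search over assignments is only feasible for small feature inventories; the point here is the interaction between independent predictions and the consistency axioms.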
Domain-Specific Knowledge Acquisition for Conceptual Sentence Analysis
The availability of on-line corpora is rapidly changing the field of natural language processing (NLP) from one dominated by theoretical models of often very specific linguistic phenomena to one guided by computational models that simultaneously account for a wide variety of phenomena that occur in real-world text. Thus far, among the best-performing and most robust systems for reading and summarizing large amounts of real-world text are knowledge-based natural language systems. These systems rely heavily on domain-specific, handcrafted knowledge to handle the myriad syntactic, semantic, and pragmatic ambiguities that pervade virtually all aspects of sentence analysis. Not surprisingly, however, generating this knowledge for new domains is time-consuming, difficult, and error-prone, and requires the expertise of computational linguists familiar with the underlying NLP system. This thesis presents Kenmore, a general framework for domain-specific knowledge acquisition for conceptual sentence analysis. To ease the acquisition of knowledge in new domains, Kenmore exploits an on-line corpus using symbolic machine learning techniques and robust sentence analysis while requiring only minimal human intervention. Unlike most approaches to knowledge acquisition for natural language systems, the framework uniformly addresses a range of subproblems in sentence analysis, each of which has traditionally required a separate computational mechanism. The thesis presents the results of using Kenmore with corpora from two real-world domains: (1) to perform part-of-speech tagging, semantic feature tagging, and concept tagging of all open-class words in the corpus; (2) to acquire heuristics for part-of-speech disambiguation, semantic feature disambiguation, and concept activation; and (3) to find the antecedents of relative pronouns.
Combining Language Independent Part-of-Speech Tagging Tools
Part-of-speech tagging is a fundamental task of natural language processing. For languages with a very rich agglutinating morphology, generic PoS tagging algorithms do not yield very high accuracy due to data sparseness issues. Though integrating a morphological analyzer can efficiently solve this problem, it is a resource-intensive solution. In this paper we show a method of combining language-independent statistical solutions -- including a statistical machine translation tool -- for PoS tagging to effectively boost tagging accuracy. Our experiments show that, using the same training set, our combination of language-independent tools yields an accuracy that approaches that of a language-dependent system with an integrated morphological analyzer.
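The paper's specific combination scheme is not detailed in this abstract; the simplest form of system combination is a per-token (optionally weighted) majority vote over the taggers' outputs, sketched here with hypothetical tag sequences:

```python
from collections import Counter

def combine(tag_sequences, weights=None):
    """Per-token weighted majority vote over several taggers' outputs."""
    weights = weights or [1.0] * len(tag_sequences)
    combined = []
    for token_tags in zip(*tag_sequences):
        votes = Counter()
        for tag, w in zip(token_tags, weights):
            votes[tag] += w
        combined.append(votes.most_common(1)[0][0])
    return combined

# Three hypothetical taggers disagreeing on the second token.
t1 = ["DET", "NOUN", "VERB"]
t2 = ["DET", "ADJ",  "VERB"]
t3 = ["DET", "NOUN", "VERB"]
print(combine([t1, t2, t3]))  # ['DET', 'NOUN', 'VERB']
```

Weights are typically set from each tagger's held-out accuracy, so that a strong component can override two weak ones.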
Memory-Based Shallow Parsing
We present memory-based learning approaches to shallow parsing and apply
these to five tasks: base noun phrase identification, arbitrary base phrase
recognition, clause detection, noun phrase parsing and full parsing. We use
feature selection techniques and system combination methods for improving the
performance of the memory-based learner. Our approach is evaluated on standard
data sets and the results are compared with that of other systems. This reveals
that our approach works well for base phrase identification while its
application towards recognizing embedded structures leaves some room for
improvement
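Memory-based learning is essentially k-nearest-neighbour classification: training instances are stored verbatim, and a new instance is tagged by its most similar stored neighbours. A minimal sketch, with invented feature vectors and chunk labels:

```python
from collections import Counter

def knn_tag(instance, memory, k=3):
    """Memory-based classification: overlap similarity to stored instances."""
    def overlap(a, b):
        return sum(x == y for x, y in zip(a, b))
    nearest = sorted(memory, key=lambda ex: -overlap(instance, ex[0]))[:k]
    return Counter(tag for _, tag in nearest).most_common(1)[0][0]

# Feature vectors: (prev POS, word POS, next POS); labels: IOB chunk tags.
memory = [
    (("DET", "NOUN", "VERB"), "I-NP"),
    (("NONE", "DET", "NOUN"), "B-NP"),
    (("NOUN", "VERB", "DET"), "O"),
    (("DET", "ADJ", "NOUN"), "I-NP"),
]
print(knn_tag(("DET", "NOUN", "PREP"), memory, k=1))  # I-NP
```

Real memory-based learners weight features by informativeness (e.g. gain ratio) rather than using raw overlap, which is where the feature selection mentioned above comes in.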
Statistical model of human lexical category disambiguation
Research in Sentence Processing is concerned with discovering the mechanism by
which linguistic utterances are mapped onto meaningful representations within the
human mind. Models of the Human Sentence Processing Mechanism (HSPM) can
be divided into those in which such mapping is performed by a number of limited
modular processes and those in which there is a single interactive process. A further,
and increasingly important, distinction is between models which rely on innate
preferences to guide decision processes and those which make use of
experience-based statistics.
In this context, the aims of the current thesis are two-fold:
• To argue that the correct architecture of the HSPM is both modular and
statistical - the Modular Statistical Hypothesis (MSH).
• To propose and provide empirical support for a position in which human
lexical category disambiguation occurs within a modular process, distinct
from syntactic parsing and guided by a statistical decision process.
Arguments are given for why a modular statistical architecture should be preferred
on both methodological and rational grounds. We then turn to the (often ignored)
problem of lexical category disambiguation and propose the existence of a presyntactic
Statistical Lexical Category Module (SLCM). A number of variants of the
SLCM are introduced. By empirically investigating this particular architecture we
also hope to provide support for the more general hypothesis - the MSH.
The SLCM has some interesting behavioural properties; the remainder of the thesis
empirically investigates whether these behaviours are observable in human sentence
processing. We first consider whether the results of existing studies might be
attributable to SLCM behaviour. Such evaluation provides support for an HSPM
architecture that includes this SLCM and allows us to determine which SLCM
variant is empirically most plausible. Predictions are made, using this variant, to
determine SLCM behaviour in the face of novel utterances; these predictions are then
tested using a self-paced reading paradigm. The results of this experimentation fully
support the inclusion of the SLCM in a model of the HSPM and are not compatible
with other existing models.
As the SLCM is a modular and statistical process, empirical evidence for the SLCM
also directly supports an HSPM architecture which is modular and statistical. We
therefore conclude that our results strongly support both the SLCM and the MSH.
However, more work is needed, both to produce further evidence and to define the
model further.
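The core of a statistical, pre-syntactic decision process of the kind the SLCM posits can be illustrated with a toy model that picks each word's lexical category from lexical and category-bigram frequencies, before any parsing takes place. The counts below are invented, not corpus estimates:

```python
# Toy frequency tables (illustrative counts, not corpus estimates).
word_cat = {
    "the":  {"DET": 100},
    "fans": {"NOUN": 80, "VERB": 20},
    "race": {"NOUN": 50, "VERB": 50},
}
cat_bigram = {("DET", "NOUN"): 90, ("DET", "VERB"): 10,
              ("NOUN", "NOUN"): 20, ("NOUN", "VERB"): 60}

def disambiguate(sentence):
    """Greedy left-to-right category choice from lexical and bigram counts."""
    prev, out = "START", []
    for w in sentence:
        cats = word_cat[w]
        best = max(cats, key=lambda c: cats[c] * cat_bigram.get((prev, c), 1))
        out.append(best)
        prev = best
    return out

print(disambiguate(["the", "fans", "race"]))  # ['DET', 'NOUN', 'VERB']
```

The point of the sketch is modularity: the category decision uses only lexical and category-sequence statistics, with no reference to syntactic structure, which is exactly what distinguishes the SLCM from parser-internal disambiguation.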