Minimally supervised induction of morphology through bitexts
A knowledge of morphology can be useful for many natural language processing systems. Thus, much effort has been expended in developing accurate computational tools for morphology that lemmatize, segment and generate new forms. The most powerful and accurate of these have been manually encoded, such endeavors being without exception expensive and time-consuming. Consequently, there have been many attempts to reduce the cost of building morphological systems through unsupervised or minimally supervised algorithms and learning methods for the acquisition of morphology. These efforts have yet to produce a tool that approaches the performance of manually encoded systems.
Here, I present a strategy for dealing with morphological clustering and segmentation in a minimally supervised manner, but one that will be more linguistically informed than previous unsupervised approaches. That is, this study will attempt to induce clusters of words from an unannotated text that are inflectional variants of each other. Then a set of inflectional suffixes by part-of-speech will be induced from these clusters. This level of detail is made possible by a method known as alignment and transfer (AT), among other names, an approach that uses aligned bitexts to transfer linguistic resources developed for one language (the source language) to another language (the target). This approach has a further advantage in that it allows a reduction in the amount of training data without a significant degradation in performance, making it useful in applications targeted at data collected from endangered languages. In the current study, however, I use English as the source and German as the target for ease of evaluation and for certain typological properties of German. The two main tasks, clustering and segmentation, are approached sequentially, with the clustering informing the segmentation to allow for greater accuracy in morphological analysis.
While the performance of these methods does not exceed the current roster of unsupervised or minimally supervised approaches to morphology acquisition, the study attempts to integrate more learning methods than previous studies. Furthermore, it attempts to learn inflectional morphology as opposed to derivational morphology, which is a crucial distinction in linguistics.
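The alignment-and-transfer idea described in this abstract can be illustrated with a minimal sketch. It is not the author's method: it assumes word-aligned English–German token pairs are already available and that an English lemma table exists, and it simply groups German words by the lemma of the English word they align to. The names `EN_LEMMAS`, `project_clusters` and the toy data are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy English lemma table (the "source language resource").
EN_LEMMAS = {
    "house": "house", "houses": "house",
    "go": "go", "goes": "go", "went": "go",
}

def project_clusters(aligned_pairs, en_lemmas):
    """Group target-language words by the lemma of their aligned source word."""
    clusters = defaultdict(set)
    for en_token, de_token in aligned_pairs:
        lemma = en_lemmas.get(en_token.lower())
        if lemma is not None:  # skip source words with no known lemma
            clusters[lemma].add(de_token.lower())
    return {lemma: sorted(words) for lemma, words in clusters.items()}

# Hypothetical word-aligned (English, German) token pairs from a bitext.
pairs = [
    ("house", "Haus"), ("houses", "Häuser"),
    ("goes", "geht"), ("went", "ging"), ("go", "gehen"),
]
clusters = project_clusters(pairs, EN_LEMMAS)
for lemma, words in clusters.items():
    print(lemma, words)
```

Each resulting cluster is a candidate set of inflectional variants in the target language. Real alignments are noisy, so a practical system would need filtering before any suffixes are induced from such clusters.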
Embeddings for word sense disambiguation: an evaluation study
Recent years have seen a dramatic growth in the popularity of word embeddings, mainly owing to their ability to capture semantic information from massive amounts of textual content. As a result, many tasks in Natural Language Processing have tried to take advantage of the potential of these distributional models. In this work, we study how word embeddings can be used in Word Sense Disambiguation, one of the oldest tasks in Natural Language Processing and Artificial Intelligence. We propose different methods through which word embeddings can be leveraged in a state-of-the-art supervised WSD system architecture, and perform a deep analysis of how different parameters affect performance. We show how a WSD system that makes use of word embeddings alone, if designed properly, can provide significant performance improvement over a state-of-the-art WSD system that incorporates several standard WSD features.
Minimal supervision for language learning: bootstrapping global patterns from local knowledge
A fundamental step in sentence comprehension involves assigning semantic roles
to sentence constituents. To accomplish this, the listener
must parse the sentence, find constituents that are candidate arguments, and
assign semantic roles to those constituents. Each step depends on prior lexical
and syntactic knowledge. Where do children begin in solving this problem when
learning their first languages? To experiment with different representations
that children may use to begin understanding language, we have built a
computational model for this early point in language acquisition. This system,
BabySRL, learns from transcriptions of natural child-directed speech and makes
use of psycholinguistically plausible background knowledge and realistically
noisy semantic feedback to begin to classify sentences at the level of "who
does what to whom."
Starting with simple, psycholinguistically-motivated representations of
sentence structure, the BabySRL is able to learn from full semantic feedback,
as well as a supervision signal derived from partial semantic background
knowledge. In addition we combine the BabySRL with an unsupervised Hidden
Markov Model part-of-speech tagger, linking clusters with syntactic categories
using background noun knowledge so that they can be used to parse input for the
SRL system. The results show that proposed shallow representations of sentence
structure are robust to reductions in parsing accuracy, and that the
contribution of alternative representations of sentence structure to successful
semantic role labeling varies with the integrity of the parsing and
argument-identification stages. Finally, we enable the BabySRL to improve both
an intermediate syntactic representation and its final semantic role
classification. Using this system we show that it is possible for a simple
learner in a plausible (noisy) setup to begin comprehending simple semantics
when initialized with a small amount of concrete noun knowledge and some simple
syntax-semantics mapping biases, before acquiring any specific verb knowledge.
Knowledge-based Biomedical Data Science 2019
Knowledge-based biomedical data science (KBDS) involves the design and
implementation of computer systems that act as if they knew about biomedicine.
Such systems depend on formally represented knowledge in computer systems,
often in the form of knowledge graphs. Here we survey the progress in the last
year in systems that use formally represented knowledge to address data science
problems in both clinical and biological domains, as well as on approaches for
creating knowledge graphs. Major themes include the relationships between
knowledge graphs and machine learning, the use of natural language processing,
and the expansion of knowledge-based approaches to novel domains, such as
Chinese Traditional Medicine and biodiversity.
Comment: Manuscript 43 pages with 3 tables; Supplemental material 43 pages with 3 tables
Automatic Discovery and Ranking of Synonyms for Search Keywords in the Web
Search engines are an indispensable part of a web user's life. A vast majority of these web users experience difficulties caused by the keyword-based search engines such as inaccurate results for queries and irrelevant URLs even though the given keyword is present in them. Also, relevant URLs may be lost as they may have the synonym of the keyword and not the original one. This condition is known as the polysemy problem. To alleviate these problems, we propose an algorithm called automatic discovery and ranking of synonyms for search keywords in the web (ADRS). The proposed method generates a list of candidate synonyms for individual keywords by employing the relevance factor of the URLs associated with the synonyms. Then, ranking of these candidate synonyms is done using co-occurrence frequencies and various page count-based measures. One of the major advantages of our algorithm is that it is highly scalable which makes it applicable to online data on the dynamic, domain-independent and unstructured World Wide Web. The experimental results show that the best results are obtained using the proposed algorithm with WebJaccard
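As a rough illustration of the page-count-based ranking mentioned in this abstract, the sketch below computes WebJaccard in its commonly cited form: the page count for the conjunctive query divided by the sum of the individual page counts minus the conjunctive count, with the score set to zero when the co-occurrence count is below a small threshold. The page counts, the threshold value and the function names are assumptions made for the example. This is not the ADRS algorithm itself, which also uses URL relevance and co-occurrence frequencies.

```python
def web_jaccard(count_p, count_q, count_pq, min_cooccurrence=5):
    """WebJaccard from page counts: H(P and Q) / (H(P) + H(Q) - H(P and Q)).

    Scores are zeroed when the co-occurrence count is at or below
    min_cooccurrence (an assumed threshold) to damp accidental co-occurrences.
    """
    if count_pq <= min_cooccurrence:
        return 0.0
    return count_pq / (count_p + count_q - count_pq)

def rank_candidates(keyword_count, candidates):
    """Sort candidate synonyms by WebJaccard score, highest first.

    candidates maps each candidate to (its page count, page count with keyword).
    """
    scored = [
        (name, web_jaccard(keyword_count, count, cooc))
        for name, (count, cooc) in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical page counts for the keyword "car" and three candidates.
ranking = rank_candidates(
    1000,
    {"automobile": (500, 250), "vehicle": (2000, 300), "banana": (800, 3)},
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In practice the counts would come from search engine hit counts, which are approximate and change over time, so scores are best compared only within a single batch of queries.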
Effective distant supervision for end-to-end knowledge base population systems
The growing amounts of textual data require automatic methods for structuring relevant information so that it can be further processed by computers and systematically accessed by humans.
The scenario dealt with in this dissertation is known as Knowledge Base Population (KBP), where relational information about entities is retrieved from a large text collection and stored in a database, structured according to a pre-specified schema.
Most of the research in this dissertation is placed in the context of the KBP benchmark of the Text Analysis Conference (TAC KBP), which provides a test-bed to examine all steps in a complex end-to-end relation extraction setting.
In this dissertation a new state of the art for the TAC KBP benchmark was achieved by focussing on the following research problems:
(1) The KBP task was broken down into a modular pipeline of sub-problems, and the most pressing issues were identified and quantified at all steps.
(2) The quality of semi-automatically generated training data was increased by developing noise-reduction methods, decreasing the influence of false-positive training examples.
(3) A focus was laid on fine-grained entity type modelling, entity expansion, entity matching and tagging, to maintain as much recall as possible on the relational argument level.
(4) A new set of effective methods for generating training data, encoding features and training relational classifiers was developed and compared with previous state-of-the-art methods.
The growing volume of text data calls for methods that structure relevant information so that it can be further processed by computers and systematically accessed by humans. The scenario addressed in this dissertation is known as Knowledge Base Population (KBP). Here, relational information about entities is automatically gathered from large text collections and structured according to a predefined schema.
A large part of the research in this dissertation is situated in the context of the TAC KBP benchmark. This provides a test environment for examining all steps of a query-based relation extraction system.
The methods developed in this dissertation set a new standard for TAC KBP. This was achieved by focusing on the following research questions:
First, the most important sub-problems of KBP were identified and their respective effects precisely quantified.
Second, the quality of semi-automatically generated training data was increased by methods that reduce the influence of false-positive training examples.
Third, emphasis was placed on fine-grained type modelling, the expansion of entity names and the detection of entities, in order to achieve the greatest possible coverage of relational arguments.
Fourth, a set of new high-performing methods was developed and examined for generating training data, encoding classification features and training relational classifiers.
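The distant supervision setting named in this dissertation's title can be sketched in its simplest textbook form: any sentence that mentions both arguments of a known knowledge-base fact is labelled with that fact's relation. The sketch below is an assumption-laden illustration, not the dissertation's pipeline. The function name, the string-matching heuristic and the example fact are hypothetical.

```python
def label_sentences(sentences, kb_facts):
    """Label each sentence that mentions both arguments of a KB fact
    with that fact's relation (naive substring matching)."""
    examples = []
    for sentence in sentences:
        for subj, relation, obj in kb_facts:
            if subj in sentence and obj in sentence:
                examples.append((sentence, subj, obj, relation))
    return examples

# Hypothetical knowledge-base fact and text collection.
kb_facts = [("Barack Obama", "born_in", "Hawaii")]
sentences = [
    "Barack Obama was born in Hawaii.",
    "Barack Obama visited Hawaii last week.",  # matches, but does not express born_in
    "Hawaii is an island state.",
]
examples = label_sentences(sentences, kb_facts)
for sentence, subj, obj, relation in examples:
    print(relation, "|", sentence)
```

The second sentence receives the `born_in` label although it does not express that relation. Such false-positive training examples are the kind of noise that the noise-reduction methods mentioned in the abstract aim to suppress.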