136 research outputs found

    Aspects of Coherence for Entity Analysis

    Natural language understanding is an important topic in natural language processing. Given a text, a computer program should, at the very least, be able to understand what the text is about, and ideally also situate it in its extra-textual context and understand what purpose it serves. What exactly it means to understand what a text is about is an open question, but it is generally accepted that, at a minimum, understanding involves being able to answer questions like “Who did what to whom? Where? When? How? And why?”. Entity analysis, the computational analysis of entities mentioned in a text, aims to support answering the questions “Who?” and “Whom?” by identifying entities mentioned in a text. If the answers to “Where?” and “When?” are specific, named locations and events, entity analysis can also provide these answers. Entity analysis aims to answer these questions by performing entity linking, that is, linking mentions of entities to their corresponding entries in a knowledge base; coreference resolution, that is, identifying all mentions in a text that refer to the same entity; and entity typing, that is, assigning a label such as Person to mentions of entities. In this thesis, we study how different aspects of coherence can be exploited to improve entity analysis. Our main contribution is a method that allows exploiting knowledge-rich, specific aspects of coherence, namely geographic, temporal, and entity type coherence. Geographic coherence expresses the intuition that entities mentioned in a text tend to be geographically close. Similarly, temporal coherence captures the intuition that entities mentioned in a text tend to be close in the temporal dimension. Entity type coherence is based on the observation that in a text about a certain topic, such as sports, the entities mentioned in it tend to have the same or related entity types, such as sports team or athlete. We show how to integrate features modeling these aspects of coherence into entity linking systems and establish their utility in extensive experiments covering different datasets and systems. Since entity linking often requires computationally expensive joint, global optimization, we propose a simple but effective rule-based approach that enjoys some of the benefits of joint, global approaches while avoiding some of their drawbacks. To enable convenient error analysis for system developers, we introduce a tool for visual analysis of entity linking system output. Investigating another aspect of coherence, namely the coherence between a predicate and its arguments, we devise a distributed model of selectional preferences and assess its impact on a neural coreference resolution system. Our final contribution examines how multilingual entity typing can be improved by incorporating subword information. We train and make publicly available subword embeddings in 275 languages and show their utility in a multilingual entity typing task.
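    To make the geographic coherence intuition concrete, here is a minimal sketch (not the thesis's actual feature) of how such a signal could be computed from candidate entity coordinates; the function names, coordinates, and scoring choices are illustrative assumptions.

```python
# Hypothetical sketch of a geographic coherence feature: score a candidate
# entity by its average great-circle distance to the other entities already
# linked in the document (closer = more coherent). Names and data are
# illustrative, not taken from the thesis.
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def geographic_coherence(candidate_coord, linked_coords):
    """Negative mean distance to already-linked entities; higher is more coherent."""
    if not linked_coords:
        return 0.0
    return -sum(haversine_km(candidate_coord, c) for c in linked_coords) / len(linked_coords)

# Toy example: a document already mentions Heidelberg and Mannheim, so
# "Frankfurt" (Germany) should score as more coherent than "Frankfort" (Kentucky).
linked = [(49.41, 8.69), (49.49, 8.47)]
print(geographic_coherence((50.11, 8.68), linked))   # Frankfurt am Main
print(geographic_coherence((38.20, -84.87), linked)) # Frankfort, KY
```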

    Automatically acquiring a semantic network of related concepts

    We describe the automatic construction of a semantic network in which over 3,000 of the most frequently occurring monosemous nouns in Wikipedia (each appearing between 1,500 and 100,000 times) are linked to their semantically related concepts in the WordNet noun ontology. Relatedness between nouns is discovered automatically from co-occurrence in Wikipedia texts using an information-theoretically inspired measure. Our algorithm then capitalizes on salient sense clustering among related nouns to automatically disambiguate them to their appropriate senses (i.e., concepts). Through the act of disambiguation, we begin to accumulate relatedness data for concepts denoted by polysemous nouns, as well. The resultant concept-to-concept associations, covering 17,543 nouns with 27,312 distinct senses among them, constitute a large-scale semantic network of related concepts that can be conceived of as augmenting the WordNet noun ontology with related-to links.
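    As an illustration of a co-occurrence-based relatedness score, the sketch below uses pointwise mutual information over toy document counts; the paper's actual information-theoretic measure is not reproduced here, so PMI is only a stand-in under that assumption.

```python
# Minimal stand-in for an information-theoretic relatedness measure:
# pointwise mutual information (PMI) over noun co-occurrence in documents.
# The toy corpus and counts are invented; only the general idea is shown.
from math import log2
from collections import Counter
from itertools import combinations

docs = [
    ["guitar", "band", "album"],
    ["guitar", "album", "studio"],
    ["court", "judge", "trial"],
]

word_count = Counter(w for d in docs for w in set(d))
pair_count = Counter(frozenset(p) for d in docs for p in combinations(set(d), 2))
n_docs = len(docs)

def pmi(w1, w2):
    """PMI of two nouns based on document-level co-occurrence frequencies."""
    joint = pair_count[frozenset((w1, w2))] / n_docs
    if joint == 0:
        return float("-inf")
    return log2(joint / ((word_count[w1] / n_docs) * (word_count[w2] / n_docs)))

print(pmi("guitar", "album"))  # related nouns co-occur -> positive PMI
print(pmi("guitar", "judge"))  # unrelated nouns never co-occur -> -inf
```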

    A distributional investigation of German verbs

    This dissertation provides an empirical investigation of German verbs conducted on the basis of statistical descriptions acquired from a large corpus of German text. In a brief overview of the linguistic theory pertaining to the lexical semantics of verbs, I outline the idea that verb meaning is composed of argument structure (the number and types of arguments that co-occur with a verb) and aspectual structure (properties describing the temporal progression of the event denoted by the verb). I then produce statistical descriptions of verbs according to these two distinct facets of meaning: in particular, I examine verbal subcategorisation, selectional preferences, and aspectual type. All three of these modelling strategies are evaluated on a common task, automatic verb classification. I demonstrate that automatically acquired features capturing verbal lexical aspect are beneficial for an application that concerns argument structure, namely semantic role labelling. Furthermore, I demonstrate that features capturing verbal argument structure perform well on the task of classifying a verb for its aspectual type. These findings suggest that these two facets of verb meaning are related in an underlying way.
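    A minimal sketch of the kind of distributional description mentioned above, assuming subcategorisation-frame counts are available from a parsed corpus: verbs are represented by frame-frequency vectors and compared by cosine similarity. The frame labels and counts are invented, and this is not the dissertation's actual feature set.

```python
# Illustrative sketch: represent each verb by the relative frequencies of its
# subcategorisation frames and compare verbs by cosine similarity, the kind of
# distributional description that could feed a verb classifier.
from math import sqrt

# Toy frame counts per verb (invented numbers): frame label -> corpus count.
frames = {
    "geben":    {"NP-NP-NP": 120, "NP-NP": 30, "NP": 5},   # ditransitive-like
    "schenken": {"NP-NP-NP": 90, "NP-NP": 40, "NP": 10},
    "schlafen": {"NP": 200, "NP-PP": 15},                  # intransitive-like
}

def to_vector(counts, dims):
    """Normalise raw frame counts into a relative-frequency vector."""
    total = sum(counts.values())
    return [counts.get(d, 0) / total for d in dims]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

dims = sorted({f for c in frames.values() for f in c})
vecs = {v: to_vector(c, dims) for v, c in frames.items()}
print(cosine(vecs["geben"], vecs["schenken"]))  # high: similar argument structure
print(cosine(vecs["geben"], vecs["schlafen"]))  # low: different argument structure
```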

    Entitate izendunen desanbiguazioa ezagutza-base erraldoien arabera

    130 p. Nowadays, search engines are almost indispensable for navigating the internet, and the best known of them is Google. Search engines owe a large part of their current success to the exploitation of knowledge bases. Indeed, through semantic search they are able to enrich plain queries with information from knowledge bases. For example, when searching for information about a music group, they offer additional links to its discography or its members; when searching for information about the president of a country, they offer links to former presidents or additional information about that territory. However, there is a problem that threatens the success of today's popular semantic search: ambiguous terms determine how appropriate the information retrieved from knowledge bases will be, and the biggest problems are caused by mentions of proper names, that is, named entities. The main goal of this thesis is to study named entity disambiguation (NED) and to propose new techniques for performing it. NED systems disambiguate name mentions in texts and link them to entities in knowledge bases. Because name mentions are ambiguous, they can denote several entities; moreover, the same entity can be referred to by several different names, so disambiguating these mentions correctly is the key problem addressed in the thesis. To this end, we first analyse the two disambiguation models that underlie the state of the art: on the one hand, the global model, which exploits the structure of knowledge bases, and on the other hand, the local model, which exploits information from the words in the context of a mention. The two sources of information are then combined in a complementary way; the combination outperforms the state of the art on several datasets and obtains comparable results on the rest. Second, with the aim of improving any disambiguation system, novel ideas are proposed, analysed, and evaluated. On the one hand, the behaviour of entities is studied at the discourse, collection, and co-occurrence level, confirming that entities follow a particular pattern; based on this pattern, the results of the global model, the local model, and a further NED system are significantly improved. On the other hand, the local model is enriched with knowledge acquired from external corpora; this contribution evaluates the quality of the external knowledge, justifying its contribution to the system, and also improves the results of the local model, again reaching state-of-the-art values. The thesis is presented as a collection of articles: after the introduction and the state of the art, the four English-language articles on which the thesis is based are included, followed by general conclusions that bring together the topics addressed in the four articles.
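    The local/global combination described above could look roughly like the following sketch, with invented scores and a simple linear interpolation; it illustrates the general idea, not the thesis's actual scoring functions.

```python
# Hypothetical sketch of combining local and global evidence for named entity
# disambiguation: the local score is context-word overlap with a candidate's
# knowledge-base description, the global score is coherence with the other
# entities in the document, and the two are interpolated. All data is invented.
def local_score(context_words, candidate_description):
    """Fraction of context words that appear in the candidate's KB description."""
    overlap = set(context_words) & set(candidate_description)
    return len(overlap) / max(len(set(context_words)), 1)

def combined_score(local, global_, alpha=0.5):
    """Linear interpolation of local and global evidence (alpha is illustrative)."""
    return alpha * local + (1 - alpha) * global_

context = ["bank", "river", "fishing"]
candidates = {
    # entity -> (KB description words, precomputed global coherence score)
    "Bank_(institution)": (["financial", "money", "bank"], 0.2),
    "Bank_(geography)":   (["river", "bank", "shore"], 0.7),
}
for entity, (description, global_coherence) in candidates.items():
    print(entity, combined_score(local_score(context, description), global_coherence))
```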

    Automatically Acquiring A Semantic Network Of Related Concepts

    We describe the automatic acquisition of a semantic network in which over 7,500 of the most frequently occurring nouns in the English language are linked to their semantically related concepts in the WordNet noun ontology. Relatedness between nouns is discovered automatically from lexical co-occurrence in Wikipedia texts using a novel adaptation of an information-theoretically inspired measure. Our algorithm then capitalizes on salient sense clustering among these semantic associates to automatically disambiguate them to their corresponding WordNet noun senses (i.e., concepts). The resultant concept-to-concept associations, stemming from 7,593 target nouns, with 17,104 distinct senses among them, constitute a large-scale semantic network with 208,832 undirected edges between related concepts. Our work can thus be conceived of as augmenting the WordNet noun ontology with RelatedTo links. The network, which we refer to as the Szumlanski-Gomez Network (SGN), has been subjected to a variety of evaluative measures, including manual inspection by human judges and quantitative comparison to gold standard data for semantic relatedness measurements. We have also evaluated the network's performance in an applied setting on a word sense disambiguation (WSD) task in which the network served as a knowledge source for established graph-based spreading activation algorithms, and have shown: a) the network is competitive with WordNet when used as a stand-alone knowledge source for WSD, b) combining our network with WordNet achieves disambiguation results that exceed the performance of either resource individually, and c) our network outperforms a similar resource, WordNet++ (Ponzetto & Navigli, 2010), which has been automatically derived from annotations in the Wikipedia corpus. Finally, we present a study on human perceptions of relatedness. In our study, we elicited quantitative evaluations of semantic relatedness from human subjects using a variation of the classical methodology that Rubenstein and Goodenough (1965) employed to investigate human perceptions of semantic similarity. Judgments from individual subjects in our study exhibit high average correlation to the elicited relatedness means using leave-one-out sampling (r = 0.77, σ = 0.09, N = 73), although not as high as average human correlation in previous studies of similarity judgments, for which Resnik (1995) established an upper bound of r = 0.90 (σ = 0.07, N = 10). These results suggest that human perceptions of relatedness are less strictly constrained than evaluations of similarity, and establish a clearer expectation for what constitutes human-like performance by a computational measure of semantic relatedness. We also contrast the performance of a variety of similarity and relatedness measures on our dataset to their performance on similarity norms and introduce our own dataset as a supplementary evaluative standard for relatedness measures.
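    The leave-one-out evaluation mentioned above (correlating each subject's ratings with the mean ratings of the remaining subjects) can be sketched as follows; the ratings are invented and only the procedure is illustrated.

```python
# Sketch of a leave-one-out agreement evaluation for human relatedness ratings:
# each subject is correlated against the mean of all other subjects, and the
# average of these correlations summarises inter-subject agreement.
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length rating lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Rows = subjects, columns = word pairs (toy relatedness ratings on a 0-4 scale).
ratings = [
    [4.0, 3.5, 1.0, 0.5],
    [3.8, 3.0, 1.5, 0.0],
    [3.5, 3.9, 0.5, 1.0],
]

loo_correlations = []
for i, subject in enumerate(ratings):
    others = [r for j, r in enumerate(ratings) if j != i]
    other_means = [mean(col) for col in zip(*others)]
    loo_correlations.append(pearson(subject, other_means))

print(mean(loo_correlations))  # average leave-one-out correlation
```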

    D6.2 Integrated Final Version of the Components for Lexical Acquisition

    The PANACEA project has addressed one of the most critical bottlenecks that threaten the development of technologies to support multilingualism in Europe, and to process the huge quantity of multilingual data produced annually. Any attempt at automated language processing, particularly Machine Translation (MT), depends on the availability of language-specific resources. Such Language Resources (LR) contain information about the language's lexicon, i.e. the words of the language and the characteristics of their use. In Natural Language Processing (NLP), LRs contribute information about the syntactic and semantic behaviour of words, i.e. their grammar and their meaning, which informs downstream applications such as MT. To date, many LRs have been generated by hand, requiring significant manual labour from linguistic experts. However, proceeding manually, it is impossible to supply LRs for every possible pair of European languages, textual domain, and genre, which are needed by MT developers. Moreover, an LR for a given language can never be considered complete nor final because of the characteristics of natural language, which continually undergoes changes, especially spurred on by the emergence of new knowledge domains and new technologies. PANACEA has addressed this challenge by building a factory of LRs that progressively automates the stages involved in the acquisition, production, updating and maintenance of LRs required by MT systems. The existence of such a factory will significantly cut down the cost, time and human effort required to build LRs. WP6 has addressed the lexical acquisition component of the LR factory, that is, the techniques for automated extraction of key lexical information from texts, and the automatic collation of lexical information into LRs in a standardized format. The goal of WP6 has been to take existing techniques capable of acquiring syntactic and semantic information from corpus data, improve upon them, adapt and apply them to multiple languages, and turn them into powerful and flexible techniques capable of supporting massive applications. One focus for improving the scalability and portability of lexical acquisition techniques has been to extend existing techniques with more powerful, less "supervised" methods. In NLP, the amount of supervision refers to the amount of manual annotation which must be applied to a text corpus before machine learning or other techniques are applied to the data to compile a lexicon. More manual annotation means more accurate training data, and thus a more accurate LR. However, given that it is impractical from a cost and time perspective to manually annotate the vast amounts of data required for multilingual MT across domains, it is important to develop techniques which can learn from corpora with less supervision. Less supervised methods are capable of supporting both large-scale acquisition and efficient domain adaptation, even in domains where data is scarce. Another focus of lexical acquisition in PANACEA has been the need of LR users to tune the accuracy level of LRs. Some applications may require increased precision, or accuracy, where the application requires a high degree of confidence in the lexical information used. At other times a greater level of coverage may be required, with information about more words at the expense of some degree of accuracy. Lexical acquisition in PANACEA has therefore investigated confidence thresholds for lexical acquisition to ensure that the ultimate users of LRs can generate lexical data from the PANACEA factory at the desired level of accuracy.
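    The confidence-threshold idea can be illustrated with a small sketch: automatically acquired lexical entries carry a confidence score, and raising the threshold trades coverage for precision. The lemmas, frames, and scores below are invented, not PANACEA output.

```python
# Hypothetical sketch of confidence filtering for automatically acquired
# lexical entries: a higher threshold keeps fewer, more reliable entries
# (higher precision), a lower threshold keeps more entries (higher coverage).
acquired_entries = [
    ("absorb", "transitive", 0.95),
    ("absorb", "intransitive", 0.40),
    ("glimpse", "transitive", 0.85),
    ("glimpse", "NP-PP", 0.55),
]

def filter_by_confidence(entries, threshold):
    """Keep only entries whose acquisition confidence meets the threshold."""
    return [(lemma, frame) for lemma, frame, confidence in entries if confidence >= threshold]

print(filter_by_confidence(acquired_entries, 0.5))  # broader coverage
print(filter_by_confidence(acquired_entries, 0.9))  # higher precision, fewer entries
```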

    Bilingual distributed word representations from document-aligned comparable data

    We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following the recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs which heavily relied on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, the article reveals that BWEs may be learned solely on the basis of document-aligned comparable data without any additional lexical resources or syntactic information. We present a comparison of our approach with previous state-of-the-art models for learning bilingual word representations from comparable data that rely on the framework of multilingual probabilistic topic modeling (MuPTM), as well as with distributional local context-counting models. We demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon extraction, and (2) suggesting word translations in context for polysemous words. Our simple yet effective BWE-based models significantly outperform the MuPTM-based and context-counting representation models from comparable data as well as prior BWE-based models, and achieve the best reported results on both tasks for all three tested language pairs. This work was done while Ivan Vulić was a postdoctoral researcher at the Department of Computer Science, KU Leuven, supported by the PDM Kort fellowship (PDMK/14/117). The work was also supported by the SCATE project (IWT-SBO 130041) and the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (648909).
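    A minimal sketch of BWE-based bilingual lexicon extraction, assuming source- and target-language words already share one embedding space: the translation of a source word is taken to be its nearest cross-lingual neighbour by cosine similarity. The tiny embeddings are invented and do not come from the article's model.

```python
# Hypothetical bilingual lexicon extraction with shared bilingual embeddings:
# rank target-language words by cosine similarity to a source-language word.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Invented 3-dimensional embeddings in one shared space (English -> Dutch).
source = {"dog": [0.9, 0.1, 0.0], "house": [0.1, 0.8, 0.3]}
target = {"hond": [0.85, 0.15, 0.05], "huis": [0.05, 0.9, 0.2], "boom": [0.3, 0.2, 0.9]}

def translate(word):
    """Return the target word whose embedding is closest to the source word's."""
    return max(target, key=lambda t: cosine(source[word], target[t]))

print(translate("dog"))    # expected: "hond"
print(translate("house"))  # expected: "huis"
```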
