99 research outputs found

    Supervised and unsupervised methods for learning representations of linguistic units

    Word representations, also called word embeddings, are generic representations, often high-dimensional vectors. They map the discrete space of words into a continuous vector space, which allows us to handle rare or even unseen events, e.g. by considering the nearest neighbors. Many Natural Language Processing tasks can be improved by word representations if we extend the task-specific training data with the general knowledge incorporated in the word representations. The first publication investigates a supervised, graph-based method to create word representations. This method leads to a graph-theoretic similarity measure, CoSimRank, with equivalent formalizations that show CoSimRank’s close relationship to Personalized PageRank and SimRank. The new formalization is efficient because it can use the graph-based word representation to compute a single node similarity without having to compute the similarities of the entire graph. We also show how we can take advantage of fast matrix multiplication algorithms. In the second publication, we use existing unsupervised methods for word representation learning and combine these with semantic resources by learning representations for non-word objects like synsets and entities. We also investigate improved word representations which incorporate the semantic information from the resource. The method is flexible in that it can take any word representations as input and does not need an additional training corpus. A sparse tensor formalization guarantees efficiency and parallelizability. In the third publication, we introduce a method that learns an orthogonal transformation of the word representation space that focuses the information relevant for a task in an ultradense subspace whose dimensionality is smaller than that of the original space by a factor of 100. We use ultradense representations for a Lexicon Creation task in which words are annotated with three types of lexical information: sentiment, concreteness and frequency. The final publication introduces a new calculus for the interpretable ultradense subspaces, including polarity, concreteness, frequency and part-of-speech (POS). The calculus supports operations like “−1 × hate = love” and “give me a neutral word for greasy” (i.e., oleaginous) and extends existing analogy computations like “king − man + woman = queen”.
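    The vector calculus mentioned in this abstract can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration and are not taken from the thesis; the helper names `cosine` and `nearest` are likewise hypothetical. The sketch assumes that the last coordinate plays the role of a one-dimensional polarity subspace, so that flipping its sign stands in for the “−1 × hate = love” operation.

```python
import math

# Toy vectors, invented for illustration only (not from the thesis).
# Coordinate 0 ~ royalty, coordinate 1 ~ gender, coordinate 2 ~ polarity.
VECTORS = {
    "king":  (1.0,  1.0,  0.0),
    "queen": (1.0, -1.0,  0.0),
    "man":   (0.0,  1.0,  0.0),
    "woman": (0.0, -1.0,  0.0),
    "love":  (0.0,  0.0,  1.0),
    "hate":  (0.0,  0.0, -1.0),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, exclude=()):
    """Return the vocabulary word most similar to the query vector."""
    candidates = [word for word in VECTORS if word not in exclude]
    return max(candidates, key=lambda word: cosine(query, VECTORS[word]))


# Analogy: king - man + woman, searched among the remaining words.
analogy_query = tuple(
    k - m + w
    for k, m, w in zip(VECTORS["king"], VECTORS["man"], VECTORS["woman"])
)
analogy = nearest(analogy_query, exclude={"king", "man", "woman"})

# Negation: flip only the (assumed) polarity coordinate of "hate".
hate = VECTORS["hate"]
negation_query = (hate[0], hate[1], -hate[2])
negation = nearest(negation_query, exclude={"hate"})

print(analogy, negation)
```

    With real embeddings the polarity information would first have to be concentrated in a small subspace, which is what the orthogonal transformation described in the abstract is for; the toy vectors simply assume that this has already happened.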

    Proceedings of the Conference on Natural Language Processing 2010

    This book contains state-of-the-art contributions to the 10th conference on Natural Language Processing, KONVENS 2010 (Konferenz zur Verarbeitung natürlicher Sprache), with a focus on semantic processing. KONVENS in general aims to offer a broad perspective on current research and developments within the interdisciplinary field of natural language processing. The central theme draws specific attention to linguistic aspects of meaning, covering deep as well as shallow approaches to semantic processing. The contributions address both knowledge-based and data-driven methods for modelling and acquiring semantic information, and discuss the role of semantic information in applications of language technology. The articles demonstrate the importance of semantic processing, and present novel and creative approaches to natural language processing in general. Some contributions focus on developing and improving NLP systems for tasks like Named Entity Recognition or Word Sense Disambiguation, on semantic knowledge acquisition and exploitation with respect to collaboratively built resources, or on harvesting semantic information in virtual games. Others are set within the context of real-world applications, such as Authoring Aids, Text Summarisation and Information Retrieval. The collection highlights the importance of semantic processing for different areas and applications in Natural Language Processing, and provides the reader with an overview of current research in this field.

    Generating and applying textual entailment graphs for relation extraction and email categorization

    Recognizing that the meaning of one text expression is semantically related to the meaning of another can be of help in many natural language processing applications. One semantic relationship between two text expressions is captured by the textual entailment paradigm, which is defined as a relation between exactly two text expressions. Entailment relations holding among a set of more than two text expressions can be captured in the form of a hierarchical knowledge structure referred to as an entailment graph. Although several researchers have worked on building entailment graphs for different types of textual expressions, little research has been carried out on the applicability of such entailment graphs in NLP applications. This thesis fills this research gap by investigating how entailment graphs can be generated and used for addressing two specific NLP tasks: first, the task of validating automatically derived relation extraction patterns and, second, the task of automatically categorizing German customer emails. After laying a theoretical foundation, the research problem is approached in an empirical way, i.e., by drawing conclusions from analyzing, processing, and experimenting with specific task-related datasets. The experimental results show that both tasks can benefit from the integration of semantic knowledge, as expressed by entailment graphs.

    Adjusting Sense Representations for Word Sense Disambiguation and Automatic Pun Interpretation

    Word sense disambiguation (WSD)—the task of determining which meaning a word carries in a particular context—is a core research problem in computational linguistics. Though it has long been recognized that supervised (machine learning–based) approaches to WSD can yield impressive results, they require an amount of manually annotated training data that is often too expensive or impractical to obtain. This is a particular problem for under-resourced languages and domains, and is also a hurdle in well-resourced languages when processing the sort of lexical-semantic anomalies employed for deliberate effect in humour and wordplay. In contrast to supervised systems are knowledge-based techniques, which rely only on pre-existing lexical-semantic resources (LSRs). These techniques are of more general applicability but tend to suffer from lower performance due to the informational gap between the target word's context and the sense descriptions provided by the LSR. This dissertation is concerned with extending the efficacy and applicability of knowledge-based word sense disambiguation. First, we investigate two approaches for bridging the information gap and thereby improving the performance of knowledge-based WSD. In the first approach we supplement the word's context and the LSR's sense descriptions with entries from a distributional thesaurus. The second approach enriches an LSR's sense information by aligning it to other, complementary LSRs. Our next main contribution is to adapt techniques from word sense disambiguation to a novel task: the interpretation of puns. Traditional NLP applications, including WSD, usually treat the source text as carrying a single meaning, and therefore cannot cope with the intentionally ambiguous constructions found in humour and wordplay. 
We describe how algorithms and evaluation methodologies from traditional word sense disambiguation can be adapted for the "disambiguation" of puns, or rather for the identification of their double meanings. Finally, we cover the design and construction of technological and linguistic resources aimed at supporting the research and application of word sense disambiguation. The development and comparison of WSD systems have long been hampered by a lack of standardized data formats, language resources, software components, and workflows. To address this issue, we designed and implemented a modular, extensible framework for WSD. It implements, encapsulates, and aggregates reusable, interoperable components using UIMA, an industry-standard information processing architecture. We have also produced two large sense-annotated data sets for under-resourced languages or domains: one of these targets German-language text, and the other English-language puns.

    Extraction of ontology schema components from financial news

    In this thesis we describe an incremental multi-layer rule-based methodology for the extraction of ontology schema components from German financial newspaper text. By extraction of ontology schema components we mean the detection of new concepts and relations between these concepts for ontology building. The process of detecting concepts and relations between these concepts corresponds to the intensional part of an ontology and is often referred to as ontology learning. We present the process of rule generation for the extraction of ontology schema components as well as the application of the generated rules.

    Unsupervised German predicate entailment using the distributional inclusion hypothesis

    Recognizing textual entailment is an important prerequisite to many tasks in NLP, e.g. question answering and semantic parsing. Knowing that, for example, buying a thing entails subsequently owning it is a relation that humans learn by interacting with the world, while machines need other ways to acquire this knowledge. Previous approaches to learning predicate entailment relations from text have focused only on English. In this thesis we present the adaptation of the unsupervised entailment graph building algorithm of Hosseini et al. to German, which can be seen as a study of challenges in language adaptation for this task in general. We create a variety of German tools necessary for this approach and give a detailed account of the challenges faced and the insights gained from them. First, we create a German relation extraction system and compare it against the English system presented by Hosseini et al. Finding that the typing of German entities constitutes a bottleneck, we create a German fine-grained typing system for named and general entities. In doing so we examine the methods of annotation projection and zero-shot cross-lingual transfer, finding that for German fine-grained named entity typing zero-shot cross-lingual transfer performs best. We then move on to creating a German system that types general entities (e.g. "ex-president") as well as named entities (e.g. "Obama"), by augmenting our training data with data automatically generated from a German WordNet. We find that in this way an improvement of up to 10 percentage points in general entity typing performance can be reached, while affecting named entity typing performance only slightly, by 1 percentage point. We use these components in the pipeline to construct German entailment graphs. We also present a method that uses German and English entailment graphs to generate training data for a supervised predicate entailment detection system, and show that this method outperforms current approaches on this task.
In this way we create a multilingual predicate entailment detection system that outperforms both the monolingual German system and the zero-shot cross-lingual system on German test data, and also performs better than a monolingual English system on English test data.