183 research outputs found

    Universal Schema for Knowledge Representation from Text and Structured Data

    Get PDF
    In data integration we transform information from a source into a target schema. A general problem in this task is loss of fidelity and coverage: the source expresses more knowledge than that can be fit into the target schema, or knowledge that is hard to fit into any schema at all. This problem is taken to an extreme in information extraction (IE) where the source is natural language---one of the most expressive forms of knowledge representation. To address this issue, one can either automatically learn a latent schema emergent in text (a brittle and ill-defined task), or manually define schemas. We propose instead to store data in a probabilistic representation of universal schema. This schema is simply the union of all source schemas, and we learn how to predict the cells of each source relation in this union. For example, we could store Freebase relations and relations that are expressed by natural language surface patterns. To populate such a database of universal schema, we present matrix factorization models that learn latent embedding vectors for entity tuples and relations. We show that such latent models achieve substantially higher accuracy than a traditional classification approach on New York Times and Freebase data. Besides binary relations, we use universal schema for unary relations, i.e., entity types. We explore various facets of universal schema matrix factorization models on a large-scale web corpus, including implicature among the relations. We evaluate our approach on the task of question answering using features obtained from universal schema, achieving state-of-the-art accuracy on a benchmark dataset

    The interaction of knowledge sources in word sense disambiguation

    Get PDF
    Word sense disambiguation (WSD) is a computational linguistics task likely to benefit from the tradition of combining different knowledge sources in artificial in telligence research. An important step in the exploration of this hypothesis is to determine which linguistic knowledge sources are most useful and whether their combination leads to improved results. We present a sense tagger which uses several knowledge sources. Tested accuracy exceeds 94% on our evaluation corpus.Our system attempts to disambiguate all content words in running text rather than limiting itself to treating a restricted vocabulary of words. It is argued that this approach is more likely to assist the creation of practical systems

    The role of syntactic dependencies in compositional distributional semantics

    Get PDF
    This article provides a preliminary semantic framework for Dependency Grammar in which lexical words are semantically defined as contextual distributions (sets of contexts) while syntactic dependencies are compositional operations on word distributions. More precisely, any syntactic dependency uses the contextual distribution of the dependent word to restrict the distribution of the head, and makes use of the contextual distribution of the head to restrict that of the dependent word. The interpretation of composite expressions and sentences, which are analyzed as a tree of binary dependencies, is performed by restricting the contexts of words dependency by dependency in a left-to-right incremental way. Consequently, the meaning of the whole composite expression or sentence is not a single representation, but a list of contextualized senses, namely the restricted distributions of its constituent (lexical) words. We report the results of two large-scale corpus-based experiments on two different natural language processing applications: paraphrasing and compositional translationThis work is funded by Project TELPARES, Ministry of Economy and Competitiveness (FFI2014-51978-C2-1-R), and the program “Ayuda Fundación BBVA a Investigadores y Creadores Culturales 2016”S

    Verb similarity: comparing corpus and psycholinguistic data

    Get PDF
    Similarity, which plays a key role in fields like cognitive science, psycholinguistics and natural language processing, is a broad and multifaceted concept. In this work we analyse how two approaches that belong to different perspectives, the corpus view and the psycholinguistic view, articulate similarity between verb senses in Spanish. Specifically, we compare the similarity between verb senses based on their argument structure, which is captured through semantic roles, with their similarity defined by word associations. We address the question of whether verb argument structure, which reflects the expression of the events, and word associations, which are related to the speakers' organization of the mental lexicon, shape similarity between verbs in a congruent manner, a topic which has not been explored previously. While we find significant correlations between verb sense similarities obtained from these two approaches, our findings also highlight some discrepancies between them and the importance of the degree of abstraction of the corpus annotation and psycholinguistic representations.La similitud, que desempeña un papel clave en campos como la ciencia cognitiva, la psicolingüística y el procesamiento del lenguaje natural, es un concepto amplio y multifacético. En este trabajo analizamos cómo dos enfoques que pertenecen a diferentes perspectivas, la visión del corpus y la visión psicolingüística, articulan la semejanza entre los sentidos verbales en español. Específicamente, comparamos la similitud entre los sentidos verbales basados en su estructura argumental, que se capta a través de roles semánticos, con su similitud definida por las asociaciones de palabras. Abordamos la cuestión de si la estructura del argumento verbal, que refleja la expresión de los acontecimientos, y las asociaciones de palabras, que están relacionadas con la organización de los hablantes del léxico mental, forman similitud entre los verbos de una manera congruente, un tema que no ha sido explorado previamente. Mientras que encontramos correlaciones significativas entre las similitudes de los sentidos verbales obtenidas de estos dos enfoques, nuestros hallazgos también resaltan algunas discrepancias entre ellos y la importancia del grado de abstracción de la anotación del corpus y las representaciones psicolingüísticas.La similitud, que exerceix un paper clau en camps com la ciència cognitiva, la psicolingüística i el processament del llenguatge natural, és un concepte ampli i multifacètic. En aquest treball analitzem com dos enfocaments que pertanyen a diferents perspectives, la visió del corpus i la visió psicolingüística, articulen la semblança entre els sentits verbals en espanyol. Específicament, comparem la similitud entre els sentits verbals basats en la seva estructura argumental, que es capta a través de rols semàntics, amb la seva similitud definida per les associacions de paraules. Abordem la qüestió de si l'estructura de l'argument verbal, que reflecteix l'expressió dels esdeveniments, i les associacions de paraules, que estan relacionades amb l'organització dels parlants del lèxic mental, formen similitud entre els verbs d'una manera congruent, un tema que no ha estat explorat prèviament. Mentre que trobem correlacions significatives entre les similituds dels sentits verbals obtingudes d'aquests dos enfocaments, les nostres troballes també ressalten algunes discrepàncies entre ells i la importància del grau d'abstracció de l'anotació del corpus i les representacions psicolingüístiques

    Universal schema for entity type prediction

    Full text link
    Categorizing entities by their types is useful in many applications, including knowledge base construction, relation extraction and query intent prediction. Fine-grained entity type ontologies are especially valuable, but typically difficult to design because of unavoidable quandaries about level of detail and boundary cases. Automatically classifying entities by type is challenging as well, usually involving hand-labeling data and training a supervised predictor. This paper presents a universal schema approach to fine-grained entity type prediction. The set of types is taken as the union of textual surface patterns (e.g. appositives) and pre-defined types from available databases (e.g. Freebase) - yielding not tens or hundreds of types, but more than ten thousands of entity types, such as financier, criminologist, and musical trio. We robustly learn mutual implication among this large union by learning latent vector embeddings from probabilistic matrix factorization, thus avoiding the need for hand-labeled data. Experimental results demonstrate more than 30% reduction in error versus a traditional classification approach on predicting fine-grained entities types. © 2013 ACM

    Carving verb classes from corpora

    Get PDF
    In this paper, I discuss some methodological problems arising from the use of corpus data for semantic verb classification. In particular, I present a computational framework to describe the distributional properties of Italian verbs using linguistic data automatically extracted from a large corpus. This information is used to build a distribution-based classification of a set of Italian verbs. Its small scale notwithstanding, this case study will provide evidence for the complex interplay between syntactic and semantic verb features

    D7.1. Criteria for evaluation of resources, technology and integration.

    Get PDF
    This deliverable defines how evaluation is carried out at each integration cycle in the PANACEA project. As PANACEA aims at producing large scale resources, evaluation becomes a critical and challenging issue. Critical because it is important to assess the quality of the results that should be delivered to users. Challenging because we prospect rather new areas, and through a technical platform: some new methodologies will have to be explored or old ones to be adapted

    Evaluating large-scale knowledge resources across languages

    Get PDF
    This paper presents an empirical evaluation in a multilingual scenario of the semantic knowledge present on publicly available large-scale knowledge resources. The study covers a wide range of manually and automatically derived large-scale knowledge resources for English and Spanish. In order to establish a fair and neutral comparison, the knowledge resources are evaluated using the same method on two Word Sense Disambiguation tasks (Senseval-3 English and Spanish Lexical Sample Tasks). First, this study empirically demonstrates that the combination of the knowledge contained in these resources surpass the most frequent sense classi er for English. Second, we also show that this large-scale topical knowledge acquired from one language can be successfully ported to other languages.Peer ReviewedPostprint (author’s final draft
    corecore