
    Web knowledge bases

    Knowledge is key to natural language understanding. References to specific people, places and things in text are crucial to resolving ambiguity and extracting meaning. Knowledge Bases (KBs) codify this information for automated systems — enabling applications such as entity-based search and question answering. This thesis explores the idea that sites on the web may act as a KB, even if that is not their primary intent. Dedicated KBs like Wikipedia are a rich source of entity information, but are built and maintained at an ongoing cost in human effort. As a result, they are generally limited in terms of the breadth and depth of knowledge they index about entities. Web knowledge bases offer a distributed solution to the problem of aggregating entity knowledge. Social networks aggregate content about people, news sites describe events with tags for organizations and locations, and a diverse assortment of web directories aggregate statistics and summaries for long-tail entities notable within niche movie, musical and sporting domains. We aim to develop the potential of these resources for both web-centric entity Information Extraction (IE) and structured KB population. We first investigate the problem of Named Entity Linking (NEL), where systems must resolve ambiguous mentions of entities in text to their corresponding node in a structured KB. We demonstrate that entity disambiguation models derived from inbound web links to Wikipedia are able to complement and in some cases completely replace the role of resources typically derived from the KB. Building on this work, we observe that any page on the web which reliably disambiguates inbound web links may act as an aggregation point for entity knowledge. To uncover these resources, we formalize the task of Web Knowledge Base Discovery (KBD) and develop a system to automatically infer the existence of KB-like endpoints on the web. While extending our framework to multiple KBs increases the breadth of available entity knowledge, we must still consolidate references to the same entity across different web KBs. We investigate this task of Cross-KB Coreference Resolution (KB-Coref) and develop models for efficiently clustering coreferent endpoints across web-scale document collections. Finally, assessing the gap between unstructured web knowledge resources and those of a typical KB, we develop a neural machine translation approach which transforms entity knowledge between unstructured textual mentions and traditional KB structures. The web has great potential as a source of entity knowledge. In this thesis we aim to first discover, distill and finally transform this knowledge into forms which will ultimately be useful in downstream language understanding tasks.
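
    As an illustration of the kind of link-derived disambiguation signal this abstract describes, the sketch below builds an anchor-text prior from (anchor text, target page) pairs and resolves a mention to its most frequently linked target, returning None for NIL. This is a minimal, hypothetical Python sketch, not the thesis's actual models; the input format and function names are assumptions.

        from collections import Counter, defaultdict

        def build_anchor_prior(links):
            # links: iterable of (anchor_text, target_page) pairs harvested from
            # inbound web links to Wikipedia (assumed input format).
            prior = defaultdict(Counter)
            for anchor, target in links:
                prior[anchor.lower()][target] += 1
            return prior

        def resolve(mention, prior):
            # Return the most frequently linked target for this surface form,
            # or None to signal NIL (no matching KB entry).
            candidates = prior.get(mention.lower())
            if not candidates:
                return None
            return candidates.most_common(1)[0][0]

        # Toy usage with made-up link data.
        links = [("jaguar", "Jaguar_Cars"), ("jaguar", "Jaguar"), ("jaguar", "Jaguar_Cars")]
        prior = build_anchor_prior(links)
        print(resolve("Jaguar", prior))  # -> Jaguar_Cars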

    Linking named entities to Wikipedia

    Natural language is fraught with problems of ambiguity, including name reference. A name in text can refer to multiple entities, just as an entity can be known by different names. This thesis examines how a mention in text can be linked to an external knowledge base (KB), in our case, Wikipedia. The named entity linking (NEL) task requires systems to identify the KB entry, or Wikipedia article, that a mention refers to; or, if the KB does not contain the correct entry, return NIL. Entity linking systems can be complex, so we present a framework for analysing their different components. We use this framework to analyse three seminal systems evaluated on a common dataset, and we show the importance of precise search for linking. The Text Analysis Conference (TAC) is a major venue for NEL research. We report on our submissions to the entity linking shared task in 2010, 2011 and 2012. The information required to disambiguate entities is often found in the text, close to the mention. We explore apposition, a common way for authors to provide information about entities. We model syntactic and semantic restrictions with a joint model that achieves state-of-the-art apposition extraction performance. We generalise from apposition to examine local descriptions specified close to the mention. We add local description to our state-of-the-art linker by using patterns to extract the descriptions and matching against this restricted context. Not only does this make for a more precise match, it also lets us model failure to match. Local descriptions help disambiguate entities, further improving our state-of-the-art linker. The work in this thesis seeks to link textual entity mentions to knowledge bases. Linking is important for any task where external world knowledge is used, and resolving ambiguity is fundamental to advancing research into these problems.
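
    Purely as an illustration of how a local description, such as an apposition, can sharpen candidate selection, the sketch below re-ranks candidate KB entries by word overlap between the description and each candidate's KB text, and treats a failure to match as evidence for NIL. The candidate data, scoring, and function names are assumptions, not the linker described in the abstract.

        def rerank_with_description(candidates, description):
            # candidates: dict mapping candidate titles to short KB descriptions
            # (assumed format); description: text found near the mention, e.g. an
            # apposition such as "the former governor of Texas".
            desc_words = set(description.lower().split())
            best, best_score = None, 0.0
            for title, kb_text in candidates.items():
                overlap = desc_words & set(kb_text.lower().split())
                score = len(overlap) / max(len(desc_words), 1)
                if score > best_score:
                    best, best_score = title, score
            # A score of 0.0 (no overlap) can itself be modelled as a NIL signal.
            return best, best_score

        candidates = {
            "George_W._Bush": "43rd president of the United States, former governor of Texas",
            "George_H._W._Bush": "41st president of the United States",
        }
        print(rerank_with_description(candidates, "the former governor of Texas"))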

    Enriching open-world knowledge graphs with expressive negative statements

    Machine knowledge about entities and their relationships has been a long-standing goal for AI researchers. Over the last 15 years, thousands of public knowledge graphs have been automatically constructed from various web sources. They are crucial for use cases such as search engines. Yet existing web-scale knowledge graphs focus on collecting positive statements, and store very few or no negatives. Due to their incompleteness, the truth of absent information remains unknown, which compromises the usability of the knowledge graph. In this dissertation, I first make the case for the selective materialization of salient negative statements in open-world knowledge graphs. Second, I present our methods to automatically infer them from encyclopedic and commonsense knowledge graphs, by locally inferring closed-world topics from comparable reference entities. I then discuss our evaluation findings on metrics such as correctness and salience. Finally, I conclude with open challenges and future opportunities.

    Knowledge graphs about entities and their attributes are an important component of many AI applications. Web-scale knowledge graphs store almost exclusively positive statements and overlook negative ones. Because open-world knowledge graphs are incomplete, missing statements are treated as unknown rather than false. This dissertation argues for enriching knowledge graphs with informative statements that do not hold, thereby improving their value for applications such as question answering and entity summarization. With potentially billions of candidate negative statements, we tackle four main challenges. 1. Correctness (or plausibility) of negative statements: under the open-world assumption (OWA), it is not enough to check that a negative candidate is not explicitly stated as positive in the knowledge graph, since it may simply be a missing statement; methods are needed to check large sets of candidates and to remove false positives. 2. Salience of negative statements: the set of correct negative statements is very large but full of trivial or nonsensical ones, e.g. "A cat cannot store data."; methods are needed to quantify how informative a negation is. 3. Topic coverage: depending on the data source and the candidate retrieval methods, some topics or entities in the knowledge graph may receive no negative candidates at all; methods must be able to discover negatives for almost any existing entity. 4. Complex negative statements: in some cases, expressing a negation requires more than one knowledge graph triple. For example, "Einstein received no education" is an incorrect negation, while "Einstein received no education at a US university" is correct; methods for generating complex negations are needed.

    This dissertation addresses these challenges as follows. 1. We first argue for the selective materialization of negative statements about entities in encyclopedic (well-canonicalized) open-world knowledge graphs, and formally define three kinds of negative statements: grounded, universally absent, and conditional negative statements. We introduce the peer-based negation inference method for producing lists of salient negations about entities. The method computes relevant peers for a given input entity and uses their positive properties to set expectations for the input entity. An expectation that is not met becomes an immediate negation candidate and is then scored with frequency, importance, and unexpectedness metrics. 2. We propose the pattern-based query log extraction method to extract salient negations from large text sources. This method extracts salient negations about an entity by mining large corpora, e.g. search engine query logs, with a few handcrafted patterns containing negation keywords. 3. We introduce the UnCommonsense method to generate salient negative phrases about everyday concepts in less canonicalized commonsense KGs. This method is designed for the inference, checking, and ranking of short natural-language negated phrases. It computes comparable concepts for a given target concept, derives negations by comparing their positive candidates, and checks these candidates against the knowledge graph itself as well as against language models (LMs) as an external knowledge source. Finally, the candidates are ranked using semantic-similarity and frequency measures. 4. To ease the exploration of our methods and their results, we implement two prototype systems. Wikinegata presents the peer-based method and lets users explore negative statements about 500K entities from 11 classes and adjust various parameters of the peer-based inference method; users can also query the knowledge graph through a search form with negated predicates. In the UnCommonsense system, users can inspect exactly what the method produces at each step and browse negations for 8K everyday concepts. In addition, using the peer-based negation inference method, we build the first large-scale dataset on demographics and outliers in communities of interest and show its usefulness in use cases such as identifying underrepresented groups. 5. We release all datasets and source code created in these projects at https://www.mpi-inf.mpg.de/negation-in-kbs and https://www.mpi-inf.mpg.de/Uncommonsense
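
    The peer-based negation inference described above lends itself to a compact sketch: collect the positive properties of an entity's peers, treat peer properties the entity lacks as negation candidates, and rank them by how common they are among the peers. The entity and property data below are illustrative, and the real method also scores candidates by importance and unexpectedness, which this sketch omits.

        from collections import Counter

        def peer_based_negations(entity_props, peer_props):
            # entity_props: set of (predicate, object) pairs known for the entity.
            # peer_props: dict mapping each peer to its own set of such pairs.
            expectations = Counter()
            for props in peer_props.values():
                expectations.update(props)
            # Expectations the entity does not meet become negation candidates,
            # ranked here by peer frequency alone.
            candidates = {p: c for p, c in expectations.items() if p not in entity_props}
            return sorted(candidates.items(), key=lambda item: -item[1])

        peers = {
            "physicist_A": {("award", "Nobel Prize in Physics"), ("occupation", "physicist")},
            "physicist_B": {("award", "Nobel Prize in Physics"), ("occupation", "physicist")},
        }
        target = {("occupation", "physicist")}
        print(peer_based_negations(target, peers))  # the missing award ranks first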

    Biographical information extraction: A language-agnostic methodology for datasets and models

    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy. Information extraction (IE) refers to the task of detecting and linking information contained in written texts. While it includes various subtasks, relation extraction (RE) is used to link two entities in a text via a common relation. RE can therefore be used to build linked databases of knowledge across a wide range of topics. Today, the task of RE is treated as a supervised machine learning (ML) task, where a model is trained using a specific architecture and a specific annotated dataset. These datasets typically aim to represent common patterns that the model is to learn, albeit at the cost of manual annotation, which can be costly and time-consuming. In addition, due to the nature of the training process, the models can be sensitive to a specific genre or topic, and are generally monolingual. It therefore stands to reason that certain genres and topics have better models, as they are treated with higher priority, for instance because of financial interests. This in turn leads to RE models not being available to every area of research, leaving linked databases of knowledge incomplete. For instance, if the birthplace of a person is not correctly extracted, the place and the person cannot be linked correctly, leaving the linked database incomplete. To address this problem, this thesis explores aspects of RE that could be adapted in ways which require little human effort, thereby making RE models more widely available. The first aspect is the annotated data. During the course of this thesis, Wikipedia and its subsidiaries are used as sources to automatically annotate sentences for RE. The dataset, which is aimed at digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), information is matched with relatively high precision in order to compile annotated relation pairs for ten relations that are important in the DH domain: birthdate, birthplace, deathdate, deathplace, occupation, parent, educated, child, sibling and other (all other relations). Furthermore, the effectiveness of the dataset is demonstrated by training a state-of-the-art neural model to classify relation pairs. For its evaluation, a manually annotated gold standard set is used. An investigation of the necessary adaptations to recreate the automatic process in a multilingual setting is also undertaken, looking specifically at English and German, for which similar neural models are trained and evaluated on a gold standard dataset. While the process is aimed here at training neural models for RE within the domain of digital humanities and history, it may be transferable to other domains.
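
    A minimal sketch of the alignment idea in this abstract: a sentence from the subject's Wikipedia article is labelled with a relation when it mentions both the subject and the object of a matching structured fact from a source such as Wikidata or Pantheon. String containment stands in for the NER and matching steps, and the fact format and relation names are simplified assumptions rather than the thesis's pipeline.

        def align_sentences(sentences, subject, facts):
            # sentences: sentences from the subject's Wikipedia article.
            # facts: dict mapping relation names to object strings from structured
            # sources (simplified stand-in for Wikidata/Pantheon records).
            annotated = []
            for sent in sentences:
                for relation, obj in facts.items():
                    if subject in sent and obj in sent:
                        annotated.append((sent, subject, obj, relation))
            # Sentences matching no fact would fall into the "other" class.
            return annotated

        facts = {"birthplace": "Ulm", "occupation": "physicist"}
        sentences = [
            "Albert Einstein was born in Ulm in 1879.",
            "Albert Einstein worked as a physicist in Bern.",
        ]
        print(align_sentences(sentences, "Albert Einstein", facts))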

    Unmet goals of tracking: within-track heterogeneity of students' expectations for

    Educational systems are often characterized by some form(s) of ability grouping, like tracking. Although substantial variation in the implementation of these practices exists, the aim is always to improve teaching efficiency by creating groups of students that are homogeneous in terms of capabilities and performances as well as expected pathways. If students' expected pathways (university, graduate school, or working) are in line with the goals of tracking, one might presume that these expectations are rather homogeneous within tracks and heterogeneous between tracks. In Flanders (the northern region of Belgium), the educational system consists of four tracks. Many students start out in the most prestigious, academic track. If they fail to gain the necessary credentials, they move to the less esteemed technical and vocational tracks. The educational system has therefore been called a 'cascade system'. We presume that this cascade system creates homogeneous expectations in the academic track, but heterogeneous expectations in the technical and vocational tracks. We use data from the International Study of City Youth (ISCY), gathered during the 2013-2014 school year from 2354 pupils of the tenth grade across 30 secondary schools in the city of Ghent, Flanders. Preliminary results suggest that the technical and vocational tracks show more heterogeneity in students' expectations than the academic track. If tracking does not fulfill the desired goals in some tracks, tracking practices should be questioned, as tracking occurs along social and ethnic lines, causing social inequality.

    Leveraging Longitudinal Data for Personalized Prediction and Word Representations

    This thesis focuses on personalization, word representations, and longitudinal dialog. We first look at users' expressions of individual preferences. In this targeted sentiment task, we find that we can improve entity extraction and sentiment classification using domain lexicons and linear term weighting. This task is important to personalization and dialog systems, as targets need to be identified in conversation and personal preferences affect how the system should react. Then we examine individuals with large amounts of personal conversational data in order to better predict what people will say. We consider extra-linguistic features that can be used to predict behavior and to predict the relationship between interlocutors. We show that these features improve over just using message content and that training on personal data leads to much better performance than training on a sample from all other users. We look not just at using personal data for these end tasks, but also at constructing personalized word representations. When we have a lot of data for an individual, we create personalized word embeddings that improve performance on language modeling and authorship attribution. When we have limited data but know user demographics, we can instead construct demographic word embeddings. We show that these representations improve language modeling and word association performance. When we do not have demographic information, we show that, using a small amount of data from an individual, we can calculate similarity to existing users and interpolate or leverage data from these users to improve language modeling performance. Using these types of personalized word representations, we are able to provide insight into which words vary more across users and demographics. The kinds of personalized representations that we introduce in this work allow for applications such as predictive typing, style transfer, and dialog systems. Importantly, they also have the potential to enable more equitable language models, with improved performance for those demographic groups that have little representation in the data.

    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/167971/1/cfwelch_1.pd
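
    One way to read the interpolation idea in this abstract: estimate a simple word-frequency model for each existing user, measure how similar a new user's small text sample is to each of them, and mix the models with similarity weights. The unigram simplification, cosine similarity, and toy data are assumptions for illustration, not the thesis's actual models.

        from collections import Counter
        import math

        def unigram_model(text):
            # Relative word frequencies for one user's text.
            counts = Counter(text.lower().split())
            total = sum(counts.values())
            return {w: c / total for w, c in counts.items()}

        def cosine(p, q):
            # Cosine similarity between two sparse word-frequency vectors.
            num = sum(p[w] * q[w] for w in set(p) & set(q))
            denom = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
            return num / denom if denom else 0.0

        def interpolated_model(sample, user_texts):
            # Mix other users' unigram models, weighted by similarity to the sample.
            sample_model = unigram_model(sample)
            weights = {u: cosine(sample_model, unigram_model(t)) for u, t in user_texts.items()}
            total = sum(weights.values()) or 1.0
            mixed = Counter()
            for user, text in user_texts.items():
                for word, prob in unigram_model(text).items():
                    mixed[word] += (weights[user] / total) * prob
            return dict(mixed)

        users = {"alice": "going to the gym then coffee", "bob": "reviewing the pull request tonight"}
        print(interpolated_model("coffee after the gym", users))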

    Ubiquity

    From its invention to the internet age, photography has been considered universal, pervasive, and omnipresent. This anthology of essays posits how the question of when photography came to be everywhere shapes our understanding of all manner of photographic media. Whether looking at a portrait image on the polished silver surface of the daguerreotype, or a viral image on the reflective glass of the smartphone, the experience of looking at photographs and thinking with photography is inseparable from the idea of ubiquity—that is, the apparent ability to be everywhere at once. While photography’s distribution across cultures today is undeniable, the insidious logics and pervasive myths that have governed its spread demand our critical attention, now more than ever.

    Towards a paracontextual practice* : (*with footnotes to Parallel of life and art)

    PhD Thesis (images have been removed from the e-version for copyright; request the print copy to see images). This thesis concerns the question of the site for spatial practice. Drawing on Carol Burns' and Andrea Kahn's notions of 'cleared', 'constructed' and 'overlooked' sites within architecture, it proposes that a site is a construct of an array of contextual traces beyond perceptible boundaries, opening up to other sites, and asks how phenomena not immediately present might be acknowledged in order to develop a practice for analysing the 'empty' site. The thesis turns toward forms of spatial writing as developed by Jane Rendell and others, and to GĂ©rard Genette's literary theory of paratext — which explores the marginal elements of a literary composition, including footnotes — to develop a new practice that is paracontextual. Whilst artists and writers have acknowledged and interrogated these phenomena within their own works, this thesis asks: what potential is offered by an interdisciplinary translation of these methods to spatial practice (practices between art and architecture)? Paratextuality is explored here as a spatial phenomenon in relation to the Independent Group's exhibition Parallel of Life and Art (ICA, London, 1953). The exhibition's 'Editors' (including photographer Nigel Henderson and architects Alison and Peter Smithson) gathered figures from numerous publications (including National Geographic Magazine, Journal of Iron and Steel Industry, and Life Magazine) as a spatialisation of sources, but the images were mounted without wall labels — each source credited only within a supplementary (paratextual) catalogue. It was in the process of studying the installation photographs that I discovered two figures had disappeared from the gallery walls. By coincidence, these images were both of sites, and of voids: the excavation site for a skyscraper, and a meteor crater. The thesis is structured in two parts. A detailed study builds on the work of critics, writers and artists such as Robert Smithson, Sophie Calle, Emma Cocker, and Marlene Creates to propose possible paracontextual practices that extend beyond the literary limitations of Genette's paratextual phenomena. A paracontextual practice is developed in response to the empty sites of the missing figures of the Parallel of Life and Art exhibition. The missing images provide an 'empty' site from which a fictional exhibition, Craters, and an accompanying catalogue are represented through a series of textual–spatial explorations, which extend from these images to the bomb–sites of post–war London beyond the original Parallel of Life and Art gallery, and to the Smithsons' own theories in relation to holes within the city. On the one hand, the thesis presents a new paratextual interpretation of the Parallel of Life and Art exhibition; on the other, as paracontextual practice, the textual–spatial explorations of the Craters exhibition and catalogue are offered as a model that could be developed to account for the para– phenomena — the supplements, the sources, the craters — of other 'empty' sites.
    • 

    corecore