353 research outputs found

    MAG: A Multilingual, Knowledge-base Agnostic and Deterministic Entity Linking Approach

    Entity linking has recently been the subject of a significant body of research. Currently, the best performing approaches rely on trained monolingual models. Porting these approaches to other languages is consequently a difficult endeavor, as it requires corresponding training data and retraining of the models. We address this drawback by presenting a novel multilingual, knowledge-base agnostic and deterministic approach to entity linking, dubbed MAG. MAG is based on a combination of context-based retrieval on structured knowledge bases and graph algorithms. We evaluate MAG on 23 data sets and in 7 languages. Our results show that the best approach trained on English datasets (PBOH) achieves a micro F-measure that is up to 4 times worse on datasets in other languages. MAG, on the other hand, achieves state-of-the-art performance on English datasets and reaches a micro F-measure that is up to 0.6 higher than that of PBOH on non-English languages.
    Comment: Accepted in K-CAP 2017: Knowledge Capture Conference
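
The micro F-measure reported above pools true positives, false positives, and false negatives over all documents before computing precision and recall. A minimal Python sketch, assuming annotations are (start, end, entity URI) tuples compared by exact match; the function name and data layout are illustrative and not taken from the MAG evaluation setup:

```python
from typing import Dict, Set, Tuple

# Hypothetical annotation layout: (start offset, end offset, entity URI).
Annotation = Tuple[int, int, str]


def micro_f1(gold: Dict[str, Set[Annotation]],
             predicted: Dict[str, Set[Annotation]]) -> float:
    """Micro-averaged F1: counts are pooled over all documents first."""
    tp = fp = fn = 0
    for doc_id in set(gold) | set(predicted):
        g = gold.get(doc_id, set())
        p = predicted.get(doc_id, set())
        tp += len(g & p)   # annotations found and correct
        fp += len(p - g)   # annotations found but wrong
        fn += len(g - p)   # gold annotations that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the counts are pooled, documents with many mentions weigh more than short ones, which is the usual difference from a macro-averaged score.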

    Finding Streams in Knowledge Graphs to Support Fact Checking

    The volume and velocity of information generated online prevent current journalistic practices from fact-checking claims at the same rate. Computational approaches to fact checking may be the key to mitigating the risks of massive misinformation spread. Such approaches can be designed not only to be scalable and effective at assessing the veracity of dubious claims, but also to boost a human fact checker's productivity by surfacing relevant facts and patterns to aid their analysis. To this end, we present a novel, unsupervised network-flow based approach to determine the truthfulness of a statement of fact expressed in the form of a (subject, predicate, object) triple. We view a knowledge graph of background information about real-world entities as a flow network, and knowledge as a fluid, abstract commodity. We show that computational fact checking of such a triple then amounts to finding a "knowledge stream" that emanates from the subject node and flows toward the object node through paths connecting them. Evaluation on a range of real-world and hand-crafted datasets of facts related to entertainment, business, sports, geography and more reveals that this network-flow model can be very effective in discerning true statements from false ones, outperforming existing algorithms on many test cases. Moreover, the model is expressive in its ability to automatically discover several useful path patterns and surface relevant facts that may help a human fact checker corroborate or refute a claim.
    Comment: Extended version of the paper in proceedings of ICDM 201
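
The flow-network view described above can be illustrated with a plain maximum-flow computation between the subject and object nodes. The sketch below is a generic Edmonds-Karp implementation in Python, not the authors' algorithm: the edge capacities, the undirected treatment of edges, and the toy graph are all assumptions made for illustration.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Optional, Tuple


def max_flow(edges: Iterable[Tuple[str, str, float]],
             source: str, sink: str) -> float:
    """Edmonds-Karp maximum flow; edges are treated as undirected."""
    if source == sink:
        raise ValueError("source and sink must differ")
    residual: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for u, v, cap in edges:
        residual[u][v] += cap
        residual[v][u] += cap
    total = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent: Dict[str, Optional[str]] = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the flow and update residual capacities.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Toy knowledge graph with made-up capacities (assumption, for illustration).
toy_edges = [
    ("Barack Obama", "Honolulu", 1.0),
    ("Honolulu", "Hawaii", 1.0),
    ("Hawaii", "United States", 1.0),
    ("Barack Obama", "Columbia University", 0.5),
    ("Columbia University", "United States", 0.5),
]
support = max_flow(toy_edges, "Barack Obama", "United States")
```

In this toy setting a larger flow value means more, or stronger, connecting paths between subject and object; how such a value is turned into a truth score is left open here.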

    FICLONE: Improving DBpedia Spotlight Using Named Entity Recognition and Collective Disambiguation

    In this paper we present FICLONE, which aims to improve the performance of DBpedia Spotlight, not only for the task of semantic annotation (SA), but also for the sub-task of named entity disambiguation (NED). To achieve this aim, we first enhance the spotting phase by combining a named entity recognition system (Stanford NER) with the results of DBpedia Spotlight. Second, we improve the disambiguation phase by using coreference resolution and exploiting a lexicon that associates surface forms with a list of potential Wikipedia entities. Finally, to select the correct entity among the candidates found for one mention, FICLONE relies on collective disambiguation, an approach that has proved successful in many other annotators and that takes the other mentions in the text into consideration. Our experiments show that FICLONE not only substantially improves the performance of DBpedia Spotlight for the NED sub-task but also generally outperforms other state-of-the-art systems. For the SA sub-task, FICLONE also outperforms DBpedia Spotlight on the dataset provided by the DBpedia Spotlight team.

    Linked Data Supported Information Retrieval

    Search engines have become indispensable for finding content on the World Wide Web. Semantic Web and Linked Data technologies enable a more detailed and unambiguous structuring of content and allow entirely new approaches to solving information retrieval problems. This thesis examines how information retrieval applications can benefit from incorporating Linked Data. New methods for computer-assisted semantic text analysis, semantic search, and information prioritization and visualization are presented and comprehensively evaluated. Linked Data resources and their relationships are integrated into these methods in order to increase their effectiveness or their usability. First, an introduction to the foundations of information retrieval and Linked Data is given. Then, new manual and automated methods for semantically annotating documents by linking them to Linked Data resources (entity linking) are presented. A comprehensive evaluation of these methods is carried out, and the underlying evaluation system is extensively improved. Building on the annotation methods, two new retrieval models for semantic search are presented and evaluated. These models are based on the generalized vector space model and incorporate semantic similarity, derived from taxonomy-based relationships among the Linked Data resources in documents and queries, into the computation of the search result ranking. To further refine the computation of semantic similarity, a method for prioritizing Linked Data resources is presented and evaluated. Building on this, visualization techniques are presented that aim to improve the explorability and navigability of a semantically annotated document corpus. Two applications are presented for this purpose: a Linked Data based exploratory extension that complements a traditional keyword-based search engine, and a Linked Data based recommender system.
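
The generalized vector space model named in this abstract replaces the plain dot product of the classic vector space model with a bilinear form over a term-to-term similarity, so that related but non-identical entities still contribute to the score. A minimal Python sketch under stated assumptions: the toy taxonomy, the similarity values (1.0 for identical entities, 0.5 for entities in the same category), and all function names are hypothetical and are not the thesis's retrieval models.

```python
import math
from typing import Callable, Dict

Vector = Dict[str, float]  # entity URI -> weight in a document or query

# Hypothetical toy taxonomy: each entity belongs to one category.
CATEGORY = {
    "dbr:Berlin": "City",
    "dbr:Hamburg": "City",
    "dbr:Python": "Language",
}


def taxonomy_sim(a: str, b: str) -> float:
    """Assumed similarity: 1.0 if identical, 0.5 if same category, else 0."""
    if a == b:
        return 1.0
    if a in CATEGORY and CATEGORY.get(a) == CATEGORY.get(b):
        return 0.5
    return 0.0


def bilinear(x: Vector, y: Vector, sim: Callable[[str, str], float]) -> float:
    """Sum of x_i * y_j * sim(i, j) over all entity pairs."""
    return sum(wx * wy * sim(ex, ey)
               for ex, wx in x.items() for ey, wy in y.items())


def gvsm_score(doc: Vector, query: Vector,
               sim: Callable[[str, str], float] = taxonomy_sim) -> float:
    """Cosine-style score normalized with the same bilinear form."""
    norm_sq = bilinear(doc, doc, sim) * bilinear(query, query, sim)
    if norm_sq <= 0:
        return 0.0
    return bilinear(doc, query, sim) / math.sqrt(norm_sq)
```

With these assumed values, a document annotated only with `dbr:Berlin` scores 0.5 against a query for `dbr:Hamburg`, whereas a plain cosine over entity URIs would score 0.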

    Semantic web and semantic technologies to enhance innovation and technology watch processes

    Innovation is a key process for Small and Medium Enterprises in order to survive and evolve in a competitive environment. Ideas and idea management are considered the basis for innovation. Gathering data on how current technologies and competitors evolve is another key factor for companies' innovation. Therefore, this thesis focuses on the application of Information and Communication Technologies, and more specifically Semantic Web and Semantic Technologies, to Idea Management Systems and Technology Watch Systems. Innovation and Technology Watch platform managers usually face many problems related to the data they collect and manage. Those managers have to deal with a large amount of information distributed across different platforms that are not always interoperable. It is vital to share data between platforms so that it can be converted into knowledge. Many of the tasks they perform are unproductive, and too much time and effort is spent on them. Moreover, innovation process managers have difficulties in identifying why an idea contest has been successful. Our proposal is to analyze different Information and Communication Technologies that can assist companies with their Innovation and Technology Watch processes. Thus, we studied several Semantic and Web technologies, built some conceptual models, and tested them in different case studies to see the results achieved in real scenarios. The outcome of this thesis has been the creation of a solution architecture to enable interoperability among platforms and to ease the work of the process managers. In this framework, and to complement the architecture, two ontologies have been developed: (1) Gi2Mo Wave and (2) Mentions Ontology. On the one hand, Gi2Mo Wave focuses on annotating the background of idea contests, assisting in the analysis of the contests and easing their replication. On the other hand, Mentions Ontology focuses on annotating the elements mentioned in plain text content, such as ideas or news items. 
That way, Mentions Ontology creates a way to link related content, enabling interoperability among content from different platforms. In order to test the architecture, a new web Idea Management System and a Technology Watch system have also been developed. The platforms incorporate semantic ontologies and tools to enable interoperability. We also demonstrate how Semantic Technologies reduce human workload by contributing to the automatic classification of content in the Technology Watch process. Finally, conclusions have been drawn from the results of testing the technologies used, identifying those with the best results.

    Entity-Oriented Search

    This open access book covers all facets of entity-oriented search—where “search” can be interpreted in the broadest sense of information access—from a unified point of view, and provides a coherent and comprehensive overview of the state of the art. It represents the first synthesis of research in this broad and rapidly developing area. Selected topics are discussed in-depth, the goal being to establish fundamental techniques and methods as a basis for future research and development. Additional topics are treated at a survey level only, containing numerous pointers to the relevant literature. A roadmap for future research, based on open issues and challenges identified along the way, rounds out the book. The book is divided into three main parts, sandwiched between introductory and concluding chapters. The first two chapters introduce readers to the basic concepts, provide an overview of entity-oriented search tasks, and present the various types and sources of data that will be used throughout the book. Part I deals with the core task of entity ranking: given a textual query, possibly enriched with additional elements or structural hints, return a ranked list of entities. This core task is examined in a number of different variants, using both structured and unstructured data collections, and numerous query formulations. In turn, Part II is devoted to the role of entities in bridging unstructured and structured data. Part III explores how entities can enable search engines to understand the concepts, meaning, and intent behind the query that the user enters into the search box, and how they can provide rich and focused responses (as opposed to merely a list of documents)—a process known as semantic search. The final chapter concludes the book by discussing the limitations of current approaches, and suggesting directions for future research. Researchers and graduate students are the primary target audience of this book. 
A general background in information retrieval is sufficient to follow the material, including an understanding of basic probability and statistics concepts as well as a basic knowledge of machine learning concepts and supervised learning algorithms.

    Knowledge-Based Techniques for Scholarly Data Access: Towards Automatic Curation

    Accessing up-to-date, high-quality scientific literature is a critical preliminary step in any research activity. Identifying relevant scholarly literature for a given task or application is, however, a complex and time-consuming activity. Despite the large number of tools developed over the years to support scholars in their literature surveying activity, such as Google Scholar, Microsoft Academic Search, and others, the best way to access quality papers remains asking a domain expert who is actively involved in the field and knows research trends and directions. State-of-the-art systems, in fact, either do not allow exploratory search activity, such as identifying the active research directions within a given topic, or do not offer proactive features, such as content recommendation, both of which are critical to researchers. To overcome these limitations, we strongly advocate a paradigm shift in the development of scholarly data access tools: moving from traditional information retrieval and filtering tools towards automated agents able to make sense of the textual content of published papers and therefore monitor the state of the art. Building such a system is, however, a complex task that implies tackling non-trivial problems in the fields of Natural Language Processing, Big Data Analysis, User Modelling, and Information Filtering. In this work, we introduce the concept of Automatic Curator System and present its fundamental components.
    PhD programme in Computer Science (Dottorato di ricerca in Informatica); De Nart, Dari

    Mining Meaning from Wikipedia

    Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced.
    Comment: An extensive survey of re-using information in Wikipedia in natural language processing, information retrieval and extraction and ontology building. Accepted for publication in International Journal of Human-Computer Studies