
    Content Enrichment of Digital Libraries: Methods, Technologies and Implementations

    Parallel to the establishment of the concept of a "digital library", there have been rapid developments in the fields of semantic technologies, information retrieval and artificial intelligence. The idea is to make use of these three fields to crosslink bibliographic data, i.e., library content, and to enrich it "intelligently" with additional, especially non-library, information. By linking the contents of a library, it becomes possible to offer users access to semantically similar content in different digital libraries. For instance, starting from a given publication, a list of semantically similar publications from completely different subject areas and from different digital libraries can be made accessible. In addition, users can view a richer author profile, enriched with information such as biographical details, name variants, images, job titles, institutional affiliations, etc. This information comes from a wide variety of sources, most of which are not library sources. In order to make such scenarios a reality, this dissertation follows two approaches. The first approach is about crosslinking digital library content in order to offer semantically similar publications based on additional information about a publication. Hence, this approach uses publication-related metadata as a basis. Terms aligned between linked open data repositories/thesauri serve as an important starting point, taking narrower, broader and related concepts into account through semantic data models such as SKOS. Information retrieval methods are applied to identify publications with high semantic similarity. For this purpose, vector space models and word embeddings are applied and analyzed comparatively. The analyses are performed in digital libraries with different thematic focuses (e.g. economics and agriculture). Using machine learning techniques, the metadata is enriched, e.g. with synonyms for subject keywords, in order to further improve the similarity calculations. To ensure quality, the proposed approaches are analyzed comparatively with different metadata sets and assessed by experts. Combining different information retrieval methods can further improve the quality of the results, especially when user interactions offer possibilities for adjusting the search properties. In the second approach pursued by this dissertation, author-related data are harvested in order to generate a comprehensive author profile for a digital library. For this purpose, non-library sources, such as linked data repositories (e.g. WIKIDATA), and library sources, such as authority data, are used. When such different sources are combined, author names must be disambiguated using already existing persistent identifiers. To this end, we offer an algorithmic approach to author disambiguation that reuses authority data such as the Virtual International Authority File (VIAF). With respect to computer science, the methodological value of this dissertation lies in combining semantic technologies with methods of information retrieval and artificial intelligence to increase interoperability between digital libraries and between libraries and non-library sources. By positioning this dissertation as an application-oriented contribution to improving interoperability, two major contributions are made in the context of digital libraries: (1) information from different digital libraries can be retrieved via a single access point; (2) existing information about authors is collected from different sources and aggregated into one author profile.
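    The core of the first approach (expanding publication keywords with related SKOS concepts, then ranking publications by vector similarity) can be illustrated with a small sketch. The Python snippet below is not the dissertation's implementation: the SKOS expansion table and the sample records are invented, and plain TF-IDF with cosine similarity stands in for the vector space and word-embedding methods that the thesis compares.

        # Hedged sketch: ranking publications by keyword similarity after a
        # SKOS-style concept expansion. All data below is hypothetical.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Hypothetical expansion table: keyword -> narrower/broader/related terms
        skos_expansion = {
            "agriculture": ["farming", "agronomy"],
            "economy": ["economics", "trade"],
        }

        def expand(keywords):
            """Enrich a keyword list with related concepts before vectorisation."""
            terms = list(keywords)
            for kw in keywords:
                terms.extend(skos_expansion.get(kw, []))
            return " ".join(terms)

        publications = {
            "pub1": ["agriculture", "soil"],
            "pub2": ["economy", "market"],
            "pub3": ["farming", "soil"],
        }

        docs = [expand(kws) for kws in publications.values()]
        tfidf = TfidfVectorizer().fit_transform(docs)      # vector space model
        sims = cosine_similarity(tfidf[0], tfidf).ravel()  # similarity to pub1
        print(dict(zip(publications, sims)))

    In a real setting, the expansion table would be queried from a SKOS thesaurus and the TF-IDF vectors could be replaced by word embeddings, but the ranking step stays the same.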

    Generating knowledge graphs by employing Natural Language Processing and Machine Learning techniques within the scholarly domain

    The continuous growth of scientific literature brings innovations and, at the same time, raises new challenges. One of them is that its analysis has become difficult due to the high volume of published papers, which require manual effort to annotate and manage. Novel technological infrastructures are needed to help researchers, research policy makers, and companies to time-efficiently browse, analyse, and forecast scientific research. Knowledge graphs, i.e., large networks of entities and relationships, have proved to be an effective solution in this space. Scientific knowledge graphs focus on the scholarly domain and typically contain metadata describing research publications such as authors, venues, organizations, research topics, and citations. However, the current generation of knowledge graphs lacks an explicit representation of the knowledge presented in the research papers. As such, in this paper, we present a new architecture that takes advantage of Natural Language Processing and Machine Learning methods for extracting entities and relationships from research publications and integrates them into a large-scale knowledge graph. Within this research work, we i) tackle the challenge of knowledge extraction by employing several state-of-the-art Natural Language Processing and Text Mining tools, ii) describe an approach for integrating entities and relationships generated by these tools, iii) show the advantage of such a hybrid system over alternative approaches, and iv) as a chosen use case, generate a scientific knowledge graph including 109,105 triples, extracted from 26,827 abstracts of papers within the Semantic Web domain. As our approach is general and can be applied to any domain, we expect that it can facilitate the management, analysis, dissemination, and processing of scientific knowledge.
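    To give a concrete impression of what extracting entities and relationships from publications and integrating them into a graph can look like, here is a minimal, hedged sketch. It is not the architecture described above: spaCy named-entity recognition and a naive co-occurrence relation stand in for the state-of-the-art extraction tools the paper integrates, and the example abstract is invented.

        # Hedged sketch: entity extraction plus naive co-occurrence triples.
        # The relation label "co_occurs_with" is a placeholder assumption.
        import itertools
        import spacy
        import networkx as nx

        nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

        abstract = ("Knowledge graphs are large networks of entities and relationships. "
                    "SciBERT and DBpedia Spotlight are often used for entity extraction.")

        doc = nlp(abstract)
        entities = {ent.text for ent in doc.ents}

        graph = nx.DiGraph()
        for subj, obj in itertools.combinations(sorted(entities), 2):
            graph.add_edge(subj, obj, relation="co_occurs_with")  # placeholder relation

        print(graph.number_of_nodes(), "entities,", graph.number_of_edges(), "triples")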

    Journalistic Knowledge Platforms: from Idea to Realisation

    Journalistic Knowledge Platforms (JKPs) are a type of intelligent information system designed to augment news creation processes by combining big data, artificial intelligence (AI) and knowledge bases to support journalists. Despite their potential to revolutionise the field of journalism, the adoption of JKPs has been slow, with scholars and large news outlets involved in their research and development. The slow adoption can be attributed to the technical complexity of JKPs, which has led news organisations to rely on multiple independent and task-specific production systems. This situation can increase the resource and coordination footprint and costs, while at the same time posing the threat of losing control over data and facing vendor lock-in scenarios. The technical complexities remain a major obstacle, as there is no existing well-designed system architecture that would facilitate the realisation and integration of JKPs in a coherent manner over time. This PhD Thesis contributes to the theory and practice of knowledge-graph-based JKPs by studying and designing a software reference architecture to facilitate the instantiation of concrete solutions and the adoption of JKPs. The first contribution of this PhD Thesis provides a thorough and comprehensible analysis of the idea of JKPs, from their origins to their current state. This analysis provides the first-ever study of the factors that have contributed to the slow adoption, including the complexity of their social and technical aspects, and identifies the major challenges and future directions of JKPs. The second contribution presents the software reference architecture, which provides a generic blueprint for designing and developing concrete JKPs. The proposed reference architecture also defines two novel types of components intended to maintain and evolve AI models and knowledge representations. The third contribution presents an instantiation example of the software reference architecture and details a process for improving the efficiency of information extraction pipelines. This framework facilitates a flexible, parallel and concurrent integration of natural language processing techniques and AI tools. Additionally, this Thesis discusses the implications of recent AI advances for JKPs and diverse ethical aspects of using JKPs. Overall, this PhD Thesis provides a comprehensive and in-depth analysis of JKPs, from the theory to the design of their technical aspects. This research aims to facilitate the adoption of JKPs and advance research in this field.
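    The "flexible, parallel and concurrent integration" of extraction components mentioned above can be sketched in a few lines. The snippet below is an illustration only, not the thesis's reference architecture; the three component functions are hypothetical stand-ins for real NLP and AI tools.

        # Hedged sketch: running independent extraction components concurrently
        # and merging their outputs, as a JKP-style pipeline might do.
        from concurrent.futures import ThreadPoolExecutor

        def extract_entities(text):
            return {"entities": ["Oslo", "NTB"]}   # placeholder output

        def classify_topic(text):
            return {"topic": "politics"}           # placeholder output

        def detect_language(text):
            return {"language": "no"}              # placeholder output

        components = [extract_entities, classify_topic, detect_language]

        def run_pipeline(text):
            """Run all extraction components in parallel and merge their results."""
            merged = {}
            with ThreadPoolExecutor(max_workers=len(components)) as pool:
                for result in pool.map(lambda fn: fn(text), components):
                    merged.update(result)
            return merged

        print(run_pipeline("NTB melder at regjeringen møtes i Oslo."))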

    When linguistics meets web technologies. Recent advances in modelling linguistic linked data

    This article provides an up-to-date and comprehensive survey of models (including vocabularies, taxonomies and ontologies) used for representing linguistic linked data (LLD). It focuses on the latest developments in the area and both builds upon and complements previous works covering similar territory. The article begins with an overview of recent trends which have had an impact on linked data models and vocabularies, such as the growing influence of the FAIR guidelines, the funding of several major projects in which LLD is a key component, and the increasing importance of the relationship of the digital humanities with LLD. Next, we give an overview of some of the best-known vocabularies and models in LLD. After this, we look at some of the latest developments in community standards and initiatives, such as OntoLex-Lemon, as well as recent work carried out on corpora, annotation and LLD, including a discussion of the LLD metadata vocabularies META-SHARE and lime, and of language identifiers. In the following part of the paper, we look at work that has been carried out in a number of recent projects and which has had a significant impact on LLD vocabularies and models.
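    As a concrete impression of what such a model looks like in use, the sketch below builds a minimal OntoLex-Lemon lexical entry with rdflib. The entry, its URI and its written form are invented; only the OntoLex-Lemon namespace and its core classes are taken from the published model.

        # Hedged sketch: a single OntoLex-Lemon lexical entry, serialised as Turtle.
        from rdflib import Graph, Literal, Namespace, URIRef
        from rdflib.namespace import RDF

        ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")

        g = Graph()
        g.bind("ontolex", ONTOLEX)

        entry = URIRef("http://example.org/lexicon/cat")       # hypothetical entry URI
        form = URIRef("http://example.org/lexicon/cat#form")   # hypothetical form URI

        g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
        g.add((entry, ONTOLEX.canonicalForm, form))
        g.add((form, RDF.type, ONTOLEX.Form))
        g.add((form, ONTOLEX.writtenRep, Literal("cat", lang="en")))

        print(g.serialize(format="turtle"))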

    B!SON: A Tool for Open Access Journal Recommendation

    Finding a suitable open access journal to publish scientific work in is a complex task: researchers have to navigate a constantly growing number of journals, institutional agreements with publishers, funders’ conditions and the risk of predatory publishers. To help with these challenges, we introduce a web-based journal recommendation system called B!SON. It is developed based on a systematic requirements analysis, built on open data, gives publisher-independent recommendations and works across domains. It suggests open access journals based on the title, abstract and references provided by the user. The recommendation quality has been evaluated using a large test set of 10,000 articles. Development by two German scientific libraries ensures the longevity of the project.
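    The general idea behind text-based journal recommendation, though not necessarily B!SON's actual implementation, can be sketched as comparing a manuscript's title and abstract against aggregated journal profiles. In the snippet below the journal profiles, the manuscript text and the TF-IDF/cosine setup are illustrative assumptions.

        # Hedged sketch: rank journals by textual similarity to a manuscript.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        journal_profiles = {
            "Journal of Open Data": "open data repositories metadata licensing reuse",
            "Semantic Web Journal": "ontologies linked data knowledge graphs reasoning",
        }
        manuscript = "A knowledge graph approach to linking open research metadata"

        texts = list(journal_profiles.values()) + [manuscript]
        matrix = TfidfVectorizer().fit_transform(texts)

        n = len(journal_profiles)
        scores = cosine_similarity(matrix[n], matrix[:n]).ravel()
        ranking = sorted(zip(journal_profiles, scores), key=lambda pair: -pair[1])
        print(ranking)  # journals ordered by similarity to the manuscript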

    Thinking outside the graph: scholarly knowledge graph construction leveraging natural language processing

    Despite improved digital access to scholarly knowledge in recent decades, scholarly communication remains exclusively document-based. The document-oriented workflows in science publishing have reached the limits of adequacy, as highlighted by recent discussions on the increasing proliferation of scientific literature, the deficiencies of peer review and the reproducibility crisis. In this form, scientific knowledge remains locked in representations that are inadequate for machine processing. As long as scholarly communication remains in this form, we cannot take advantage of all the advancements taking place in machine learning and natural language processing techniques. Such techniques would facilitate the transformation of purely text-based representations into (semi-)structured semantic descriptions that are interlinked in a collection of big federated graphs. We are in dire need of a new age of semantically enabled infrastructure adept at storing, manipulating, and querying scholarly knowledge. Equally important is a suite of machine assistance tools designed to populate, curate, and explore the resulting scholarly knowledge graph. In this thesis, we address the issue of constructing a scholarly knowledge graph using natural language processing techniques. First, we tackle the issue of developing a scholarly knowledge graph for structured scholarly communication that can be populated and constructed automatically. We co-design and co-implement the Open Research Knowledge Graph (ORKG), an infrastructure capable of modeling, storing, and automatically curating scholarly communications. Then, we propose a method to automatically extract information into knowledge graphs. With Plumber, we create a framework to dynamically compose open information extraction pipelines based on the input text. Such pipelines are composed from community-created information extraction components in an effort to consolidate individual research contributions under one umbrella. We further present MORTY, a more targeted approach that leverages automatic text summarization to create structured summaries, containing all required information, from the scholarly article's text. In contrast to the pipeline approach, MORTY extracts only the information it is instructed to, making it a more valuable tool for various curation and contribution use cases. Moreover, we study the problem of knowledge graph completion: exBERT is able to perform knowledge graph completion tasks, such as relation and entity prediction, on scholarly knowledge graphs by means of textual triple classification. Lastly, we use the structured descriptions collected from manual and automated sources alike with a question answering approach that builds on the machine-actionable descriptions in the ORKG. We propose JarvisQA, a question answering interface operating on tabular views of scholarly knowledge graphs, i.e., ORKG comparisons. JarvisQA is able to answer a variety of natural language questions and retrieve complex answers on pre-selected sub-graphs. These contributions are key to the broader agenda of studying the feasibility of natural language processing methods on scholarly knowledge graphs, and they lay the foundation for determining which methods can be used in which cases. Our work indicates the challenges and issues of automatically constructing scholarly knowledge graphs, and opens up future research directions.
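    To illustrate the kind of interaction JarvisQA enables, the sketch below answers a question over a small tabular view resembling an ORKG comparison. It is not JarvisQA's implementation: the comparison table is invented and a simple pandas lookup stands in for the neural question answering model.

        # Hedged sketch: answering "which papers use method X?" over a tabular
        # view of a scholarly knowledge graph. The table content is invented.
        import pandas as pd

        comparison = pd.DataFrame({
            "paper": ["Paper A", "Paper B", "Paper C"],
            "method": ["BERT", "TF-IDF", "BERT"],
            "dataset": ["DBLP", "arXiv", "PubMed"],
        })

        def which_papers_use(method):
            """Toy lookup standing in for a table question-answering model."""
            return comparison.loc[comparison["method"] == method, "paper"].tolist()

        print(which_papers_use("BERT"))  # -> ['Paper A', 'Paper C']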