
    Building and Using Digital Libraries for ETDs

    Despite the high value of electronic theses and dissertations (ETDs), the global collection has seen limited use. To extend such use, a new approach to building digital libraries (DLs) is needed. Fortunately, recent decades have seen a vast amount of “gray literature” become available through a diverse set of institutional repositories as well as regional and national libraries and archives. Most of those collections include ETDs, which are often freely available in keeping with the open-access movement, but such access is limited by the services of the supporting information systems. As explained through a set of scenarios, ETDs can better meet the needs of diverse stakeholders if customer discovery methods are used to identify personas and user roles as well as their goals and tasks. Hence, DLs, with a rich collection of services as well as newer, more advanced ones, can be organized so that those services, and expanded workflows built on them, can be adapted to meet personalized goals as well as traditional ones such as discovery and exploration.

    Characterising the IIIF and Linked Art communities: survey report

    This report presents the findings and analysis of a survey conducted between 24 March and 7 May 2023, exploring the socio-technical characteristics of two prevalent community-driven initiatives in Digital Humanities, namely the International Image Interoperability Framework (IIIF) and Linked Art. With 79 participants, the survey investigates the practices and activities of individuals involved in these initiatives, which focus on developing and maintaining shared application programming interfaces (APIs) for enhanced interoperability and access to cultural heritage resources. It also seeks to situate these initiatives within a broader discourse of scholarly movements and principles. Additionally, it serves as a preliminary means of exploring the prospective impact of Linked Open Usable Data (LOUD) and its underlying design principles in the cultural heritage field.

    SciKGTeX -- A LaTeX Package to Semantically Annotate Contributions in Scientific Publications

    Scientific knowledge graphs have been proposed as a solution to structure the content of research publications in a machine-actionable way and enable more efficient, computer-assisted workflows for many research activities. Crowd-sourcing approaches are used frequently to build and maintain such scientific knowledge graphs. To contribute to scientific knowledge graphs, researchers need simple and easy-to-use solutions to generate new knowledge graph elements and establish the practice of semantic representations in scientific communication. In this paper, we present a workflow for authors of scientific documents to specify their contributions with a LaTeX package, called SciKGTeX, and upload them to a scientific knowledge graph. The SciKGTeX package allows authors of scientific publications to mark the main contributions of their work directly in LaTeX source files. The package embeds marked contributions as metadata into the generated PDF document, from where they can be extracted automatically and imported into a scientific knowledge graph, such as the ORKG. This workflow is simpler and faster than current approaches, which make use of external web interfaces for data entry. Our user evaluation shows that SciKGTeX is easy to use, with a score of 79 out of 100 on the System Usability Scale, as participants of the study needed only 7 minutes on average to annotate the main contributions on a sample abstract of a published paper. Further testing shows that the embedded contributions can be successfully uploaded to ORKG within ten seconds. SciKGTeX simplifies the process of manual semantic annotation of research contributions in scientific articles. Our workflow demonstrates how a scientific knowledge graph can automatically ingest research contributions from document metadata. Comment: Accepted for publication at the ACM/IEEE Joint Conference on Digital Libraries 2023 (JCDL 2023).
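
    The extraction step of this workflow can be pictured with a short, hypothetical Python sketch. It assumes, purely for illustration, that each marked contribution ends up as a custom key/value entry (here prefixed "/scikg") in the PDF's document information dictionary and that pypdf is used to read it; SciKGTeX's actual field names and storage mechanism may differ.

    # Hypothetical sketch: pull annotated contributions out of a PDF's metadata.
    # The "/scikg" prefix and the use of the Info dictionary are assumptions made
    # for illustration; SciKGTeX's real storage mechanism may differ.
    from pypdf import PdfReader

    def extract_contributions(pdf_path, prefix="/scikg"):
        reader = PdfReader(pdf_path)
        info = reader.metadata or {}
        # Keep only the custom entries that look like contribution annotations.
        return {key.lstrip("/"): str(value)
                for key, value in info.items()
                if key.lower().startswith(prefix)}

    if __name__ == "__main__":
        for field, text in extract_contributions("paper.pdf").items():
            print(f"{field}: {text}")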

    Research on Digital Preservation: An empirical analysis

    Digital preservation has been an evolving area of research for libraries, archives, and museums across the globe over the last two decades. Owing to the growing recognition of the need to address the various issues associated with digital preservation, this field of study has generated a wide range of scholarly communications on several aspects. The present paper critically examines the extant literature on digital preservation and libraries for the period from 2001 to 2019 and assesses its evolving trajectory and trends. Out of a total of 1292 records extracted from the Scopus database, 710 articles are considered for the study after the exclusion of non-relevant articles. Employing bibliometric indicators, the study primarily assessed the publication pattern, document types, the most prolific authors, the most contributing institutions, and the focus areas of study, as well as the geographical distribution of publications. Along with this, the VOSviewer software is used for co-author network analysis. The findings reveal that the highest number of papers was published in the source journal Lecture Notes in Computer Science, that the U.S.A. occupies the top spot among countries, and that the author Nelson, M. L. from the U.S.A. has published the maximum number of research papers. The study also provides information on the various forms of publication on digital preservation and on the most impactful papers. Though there are studies assessing digital libraries and digital repositories, a bibliometric assessment of the literature on digital preservation is a novel attempt. As a metric study, it reflects the relative position of a country, an institution, and a researcher.
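
    The counting behind such bibliometric indicators and co-author network analysis can be sketched in a few lines of Python. The record structure below (year, authors) is invented for illustration and does not reproduce the paper's Scopus export or the VOSviewer workflow.

    # Illustrative sketch of basic bibliometric counting over publication records.
    # The two sample records are made up; a real study would load the Scopus export.
    from collections import Counter
    from itertools import combinations

    records = [
        {"year": 2015, "authors": ["Author A", "Author B"]},
        {"year": 2018, "authors": ["Author A"]},
    ]

    papers_per_year = Counter(r["year"] for r in records)
    papers_per_author = Counter(a for r in records for a in r["authors"])
    # Co-authorship pairs: the raw edge list behind a co-author network analysis.
    coauthor_edges = Counter(
        tuple(sorted(pair))
        for r in records
        for pair in combinations(r["authors"], 2)
    )

    print(papers_per_year.most_common())
    print(papers_per_author.most_common(5))
    print(coauthor_edges.most_common(5))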

    BIP! NDR (NoDoiRefs): A Dataset of Citations From Papers Without DOIs in Computer Science Conferences and Workshops

    In the field of Computer Science, conference and workshop papers serve as important contributions and, compared to other disciplines, carry substantial weight in research assessment processes. However, a considerable number of these papers are not assigned a Digital Object Identifier (DOI); hence their citations are not reported in widely used citation datasets like OpenCitations and Crossref, which limits citation analysis. While the Microsoft Academic Graph (MAG) previously addressed this issue by providing substantial coverage, its discontinuation has created a void in available data. BIP! NDR aims to alleviate this issue and enhance research assessment processes within the field of Computer Science. To accomplish this, it leverages a workflow that identifies and retrieves Open Science papers lacking DOIs from the DBLP Corpus and, by performing text analysis, extracts citation information directly from their full text. The current version of the dataset contains more than 510K citations made by approximately 60K open access Computer Science conference or workshop papers that, according to DBLP, do not have a DOI.
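
    The first step of that workflow, identifying papers without DOIs, can be illustrated with a simplified Python sketch over a tiny, made-up fragment shaped like the DBLP XML dump; it is not the actual BIP! NDR pipeline.

    # Simplified sketch: flag conference papers whose electronic-edition links
    # contain no DOI, making them candidates for full-text citation extraction.
    # The XML fragment is invented; a real run would parse the DBLP dump.
    import xml.etree.ElementTree as ET

    dblp_snippet = """
    <dblp>
      <inproceedings key="conf/jcdl/Example23">
        <title>An Example Paper.</title>
        <ee>https://example.org/fulltext.pdf</ee>
      </inproceedings>
      <inproceedings key="conf/jcdl/Other23">
        <title>Another Paper.</title>
        <ee>https://doi.org/10.1000/xyz</ee>
      </inproceedings>
    </dblp>
    """

    root = ET.fromstring(dblp_snippet)
    for paper in root.iter("inproceedings"):
        links = [ee.text or "" for ee in paper.findall("ee")]
        if not any("doi.org" in link for link in links):
            print(paper.get("key"), paper.findtext("title"))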

    On the Impact of Cross-Domain Data on German Language Models

    Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to 4.45% over the previous state-of-the-art. The models are available at https://huggingface.co/ikim-uk-essen. Comment: 13 pages, 1 figure, accepted at Findings of the Association for Computational Linguistics: EMNLP 2023.
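
    As a minimal, hedged sketch of how one might find and load the released checkpoints: the organisation name comes from the URL above, but the concrete model ids and architectures are not given in the abstract, so the generic AutoModel class is used.

    # Minimal sketch: list the checkpoints published under the organisation named
    # in the abstract and load one generically. Requires network access; the
    # concrete model ids and architectures are not specified in the abstract.
    from huggingface_hub import list_models
    from transformers import AutoModel, AutoTokenizer

    model_ids = [m.id for m in list_models(author="ikim-uk-essen")]
    print(model_ids)

    if model_ids:
        tokenizer = AutoTokenizer.from_pretrained(model_ids[0])
        model = AutoModel.from_pretrained(model_ids[0])
        print(model.config.architectures)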

    MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing

    With the proliferation of public web archives, it is becoming more important to better profile their contents, both to understand their immense holdings and to support routing of requests in Memento aggregators. A memento is a past version of a web page, and a Memento aggregator is a tool or service that aggregates mementos from many different web archives. To save resources, the Memento aggregator should poll only the archives that are likely to have a copy of the requested Uniform Resource Identifier (URI). Using the Crawler Index (CDX), we generate profiles of the archives that summarize their holdings and use them to inform routing of the Memento aggregator’s URI requests. Additionally, we use full-text search (when available) or sample URI lookups to build an understanding of an archive’s holdings. Previous work in profiling ranged from using full URIs (no false positives, but large profiles) to using only top-level domains (TLDs) (smaller profiles, but many false positives). This work explores strategies between these two extremes.

    For evaluation, we used CDX files from Archive-It, the UK Web Archive, the Stanford Web Archive Portal, and Arquivo.pt. Moreover, we used web server access log files from the Internet Archive’s Wayback Machine, the UK Web Archive, Arquivo.pt, LANL’s Memento Proxy, and ODU’s MemGator Server. In addition, we utilized a historical dataset of URIs from DMOZ. In early experiments with various URI-based static profiling policies, we successfully identified about 78% of the URIs that were not present in the archive with less than 1% relative cost as compared to the complete knowledge profile, and 94% of such URIs with less than 10% relative cost, without any false negatives. In another experiment we found that we can correctly route 80% of the requests while maintaining about 0.9 recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile.

    We created MementoMap, a framework that allows web archives and third parties to express the holdings and/or voids of an archive of any size, with varying levels of detail, to fulfil various application needs. Our archive profiling framework enables tools and services to predict and rank archives where mementos of a requested URI are likely to be present. In static profiling policies we predefined, for each policy, the maximum depth of host and path segments of URIs used as URI keys. This gave us a good baseline for evaluation, but was not suitable for merging profiles with different policies. Later, we introduced a more flexible means of representing URI keys that uses wildcard characters to indicate whether a URI key was truncated. Moreover, we developed an algorithm to roll up URI keys dynamically at arbitrary depths when sufficient archiving activity is detected under certain URI prefixes. In an experiment with dynamic profiling of archival holdings, we found that a MementoMap of less than 1.5% relative cost can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive without any false negatives (i.e., 100% recall). In addition, we separately evaluated archival voids based on the most frequently accessed resources in the access logs and found that we could have avoided more than 8% of the false positives without introducing any false negatives.

    We defined a routing score that can be used for Memento routing. Using a cut-off threshold on the routing score, we achieved over 96% accuracy when accepting about 89% recall, and for a recall of 99% we achieved about 68% accuracy, which translates to about 72% savings in wasted lookup requests in our Memento aggregator. Moreover, when routing to the top-k archives ranked by the routing score, choosing only the topmost archive missed only about 8% of the sample URIs that are present in at least one archive, while selecting the top two archives missed less than 2% of these URIs. We also evaluated a machine-learning-based routing approach, which resulted in overall better accuracy but poorer recall, due to the low prevalence of the sample lookup URI dataset in the different web archives.

    We contributed various algorithms, such as a space- and time-efficient approach to ingesting large lists of URIs to generate MementoMaps and a Random Searcher Model to discover samples of the holdings of web archives. We contributed numerous tools to support various aspects of web archiving and replay, such as MemGator (a Memento aggregator), InterPlanetary Wayback (a novel archival replay system), Reconstructive (a client-side request rerouting ServiceWorker), and AccessLog Parser. Moreover, this work yielded a file format specification draft called Unified Key Value Store (UKVS) that we use for the serialization and dissemination of MementoMaps. It is a flexible and extensible file format that allows easy interaction with Unix text processing tools and can be used in many applications beyond MementoMaps.
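
    The idea of URI keys with bounded host and path depths and a wildcard marking truncation can be illustrated with a small Python sketch. This is a simplified rendering of the concept under assumed depth parameters; it is not the MementoMap implementation, its routing score, or the UKVS serialization format.

    # Simplified illustration of URI keys: truncate a URI to a fixed number of
    # host and path segments and mark truncation with a "*". This sketches the
    # concept only; it is not the MementoMap/UKVS implementation.
    from urllib.parse import urlsplit

    def uri_key(uri, max_host=2, max_path=1):
        parts = urlsplit(uri if "://" in uri else "http://" + uri)
        host = parts.hostname.split(".") if parts.hostname else []
        path = [s for s in parts.path.split("/") if s]
        # Reverse the host segments (SURT-style) so related hosts sort together.
        key_host = list(reversed(host))[:max_host]
        truncated = len(host) > max_host or len(path) > max_path
        key = ",".join(key_host) + ")/" + "/".join(path[:max_path])
        return key + ("*" if truncated else "")

    # A toy archive profile: the URI keys an archive claims to hold.
    profile = {uri_key(u) for u in [
        "https://example.com/blog/post-1",
        "https://odu.edu/compsci/index.html",
    ]}

    def likely_held(uri):
        """Poll this archive only if the lookup URI's key appears in the profile."""
        key = uri_key(uri)
        return key in profile or key.rstrip("*") + "*" in profile

    print(sorted(profile))
    print(likely_held("https://example.com/blog/post-2"))  # same key -> True
    print(likely_held("https://unknown.org/page"))         # not profiled -> False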