Software Citation Implementation Challenges
The main output of the FORCE11 Software Citation working group
(https://www.force11.org/group/software-citation-working-group) was a paper on
software citation principles (https://doi.org/10.7717/peerj-cs.86) published in
September 2016. This paper laid out a set of six high-level principles for
software citation (importance, credit and attribution, unique identification,
persistence, accessibility, and specificity) and discussed how they could be
used to implement software citation in the scholarly community. In a series of
talks and other activities, we have promoted software citation using these
increasingly accepted principles. At the time the initial paper was published,
we also provided guidance and examples on how to make software citable, though
we now realize there are unresolved problems with that guidance. The purpose of
this document is to provide an explanation of current issues impacting
scholarly attribution of research software, organize updated implementation
guidance, and identify where best practices and solutions are still needed.
Crowdsourcing Linked Data on listening experiences through reuse and enhancement of library data
Research has approached the practice of musical reception in a multitude of ways, such as the analysis of professional critique, sales figures and psychological processes activated by the act of listening. Studies in the Humanities, on the other hand, have been hindered by the lack of structured evidence of actual experiences of listening as reported by the listeners themselves, a concern that has been voiced since the early Web era. It was however assumed that such evidence existed, albeit in purely textual form, but could not be leveraged until it was digitised and aggregated. The Listening Experience Database (LED) responds to this research need by providing a centralised hub for evidence of listening in the literature. Not only does LED support search and reuse across nearly 10,000 records, but it also provides machine-readable structured data of the knowledge around the contexts of listening. To take advantage of the mass of formal knowledge that already exists on the Web concerning these contexts, the entire framework adopts Linked Data principles and technologies. This also allows LED to directly reuse open data from the British Library for the source documentation that is already published. Reused data are re-published as open data with enhancements obtained by expanding over the model of the original data, such as the partitioning of published books and collections into individual stand-alone documents. The database was populated through crowdsourcing and seamlessly incorporates data reuse from the very early data entry phases. As the sources of the evidence often contain vague, fragmentary, or uncertain information, facilities were put in place to generate structured data out of such fuzziness.
Alongside elaborating on these functionalities, this article provides insights into the most recent features of the latest instalment of the dataset and portal, such as the interlinking with the MusicBrainz database, the relaxation of geographical input constraints through text mining, and the plotting of key locations in an interactive geographical browser.
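The enrichment the LED abstract describes, reusing a library record and partitioning it into stand-alone documents linked to listening evidence, can be sketched as plain triples. This is a minimal illustration with hypothetical namespaces and property names (the actual LED ontology and British Library identifiers differ):

```python
LED = "http://example.org/led/"   # hypothetical namespace, not the real LED vocabulary
BL  = "http://example.org/bl/"    # stands in for reused British Library identifiers

def nt(subject, predicate, obj):
    """Serialize one triple in N-Triples syntax; non-URI objects become literals."""
    o = f"<{obj}>" if obj.startswith("http") else f'"{obj}"'
    return f"<{subject}> <{predicate}> {o} ."

# A reused BL book record, partitioned into a stand-alone chapter document
# (the kind of enhancement over the original model the article describes),
# with a crowdsourced listening experience attached to it.
triples = [
    nt(BL + "book/123", LED + "hasPart", LED + "document/123-ch2"),
    nt(LED + "document/123-ch2", LED + "title", "Chapter 2"),
    nt(LED + "experience/42", LED + "evidencedIn", LED + "document/123-ch2"),
    nt(LED + "experience/42", LED + "date", "1854"),  # evidence may be vague or uncertain
]
print("\n".join(triples))
```

Publishing the partitioned documents as first-class resources is what lets a listening experience point at a chapter rather than at a whole book record.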
The LIFE2 final project report
Executive summary: The first phase of LIFE (Lifecycle Information For E-Literature) made a major contribution to
understanding the long-term costs of digital preservation; an essential step in helping
institutions plan for the future. The LIFE work models the digital lifecycle and calculates the
costs of preserving digital information for future years. Organisations can apply this process
in order to understand costs and plan effectively for the preservation of their digital
collections.
The second phase of the LIFE Project, LIFE2, has refined the LIFE Model adding three new
exemplar Case Studies to further build upon LIFE1. LIFE2 is an 18-month JISC-funded
project between UCL (University College London) and The British Library (BL), supported
by the LIBER Access and Preservation Divisions. LIFE2 began in March 2007, and
completed in August 2008.
The LIFE approach has been validated by a full independent economic review and has
successfully produced an updated lifecycle costing model (LIFE Model v2) and digital
preservation costing model (GPM v1.1). The LIFE Model has been tested with three further
Case Studies including institutional repositories (SHERPA-LEAP), digital preservation
services (SHERPA DP) and a comparison of analogue and digital collections (British Library
Newspapers). These Case Studies were useful for scenario building and have fed back into
both the LIFE Model and the LIFE Methodology.
The experiences of implementing the Case Studies indicated that enhancements made to the
LIFE Methodology, Model and associated tools have simplified the costing process. Mapping
a specific lifecycle to the LIFE Model isn't always a straightforward process. The revised and
more detailed Model has reduced ambiguity. The costing templates, which were refined
throughout the process of developing the Case Studies, ensure clear articulation of both
working and cost figures, and facilitate comparative analysis between different lifecycles.
The LIFE work has been successfully disseminated throughout the digital preservation and
HE communities. Early adopters of the work include the Royal Danish Library, State
Archives and the State and University Library, Denmark as well as the LIFE2 Project partners.
Furthermore, interest in the LIFE work has not been limited to these sectors, with interest in
LIFE expressed by local government, records offices, and private industry. LIFE has also
provided input into the LC-JISC Blue Ribbon Task Force on the Economic Sustainability of
Digital Preservation.
Moving forward our ability to cost the digital preservation lifecycle will require further
investment in costing tools and models. Developments in estimative models will be needed to
support planning activities, both at a collection management level and at a later preservation
planning level once a collection has been acquired. In order to support these developments a
greater volume of raw cost data will be required to inform and test new cost models. This
volume of data cannot be supported via the Case Study approach, and the LIFE team would
suggest that a software tool would provide the volume of costing data necessary to provide a
truly accurate predictive model.
Entity-centric knowledge discovery for idiosyncratic domains
Technical and scientific knowledge is produced at an ever-accelerating pace, leading to increasing issues when trying to automatically organize or process it, e.g., when searching for relevant prior work. Knowledge can today be produced both in unstructured (plain text) and structured (metadata or linked data) forms. However, unstructured content is still the most dominant form used to represent scientific knowledge. In order to facilitate the extraction and discovery of relevant content, new automated and scalable methods for processing, structuring and organizing scientific knowledge are called for. In this context, a number of applications are emerging, ranging from Named Entity Recognition (NER) and Entity Linking tools for scientific papers to specific platforms leveraging information extraction techniques to organize scientific knowledge. In this thesis, we tackle the tasks of Entity Recognition, Disambiguation and Linking in idiosyncratic domains with an emphasis on scientific literature. Furthermore, we study the related task of co-reference resolution with a specific focus on named entities. We start by exploring Named Entity Recognition, a task that aims to identify the boundaries of named entities in textual contents. We propose a new method to generate candidate named entities based on n-gram collocation statistics and design several entity recognition features to further classify them. In addition, we show how the use of external knowledge bases (either domain-specific like DBLP or generic like DBpedia) can be leveraged to improve the effectiveness of NER for idiosyncratic domains. Subsequently, we move to Entity Disambiguation, which is typically performed after entity recognition in order to link an entity to a knowledge base. We propose novel semi-supervised methods for word disambiguation leveraging the structure of a community-based ontology of scientific concepts.
Our approach exploits the graph structure that connects different terms and their definitions to automatically identify the correct sense that was originally picked by the authors of a scientific publication. We then turn to co-reference resolution, a task aiming at identifying entities that appear using various forms throughout the text. We propose an approach to type entities leveraging an inverted index built on top of a knowledge base, and to subsequently re-assign entities based on the semantic relatedness of the introduced types. Finally, we describe an application whose goal is to help researchers discover and manage scientific publications. We focus on the problem of selecting relevant tags to organize collections of research papers in that context. We experimentally demonstrate that the use of a community-authored ontology together with information about the position of the concepts in the documents allows us to significantly increase the precision of tag selection over standard methods.
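The candidate-generation step the thesis abstract mentions, scoring n-gram collocations as potential named entities, can be illustrated with pointwise mutual information (PMI) over a toy corpus. This is only a sketch of the general idea; the thesis's actual statistics, thresholds, and classification features are richer:

```python
import math
from collections import Counter

def bigram_candidates(tokens, min_count=2, min_pmi=1.0):
    """Score adjacent token pairs by pointwise mutual information (PMI);
    frequent, high-PMI collocations become candidate named entities."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    candidates = []
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue  # ignore one-off pairs, whose PMI is unreliable
        pmi = math.log((c / (n - 1)) / ((unigrams[a] / n) * (unigrams[b] / n)))
        if pmi >= min_pmi:
            candidates.append(((a, b), round(pmi, 2)))
    return sorted(candidates, key=lambda x: -x[1])

tokens = ("entity linking maps entity mentions to a knowledge base ; "
          "entity linking uses a knowledge base").split()
print(bigram_candidates(tokens))
```

On this toy input, recurring pairs such as "entity linking" and "knowledge base" surface as candidates, which downstream features would then accept or reject.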
The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives
The Archives Unleashed project aims to improve scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building -- all proceeding concurrently in mutually-reinforcing efforts. As we near the end of our initially-conceived three-year project, we report on our progress and share lessons learned along the way. The main contribution articulated in this paper is a process model that decomposes scholarly inquiries into four main activities: filter, extract, aggregate, and visualize. Based on the insight that these activities can be disaggregated across time, space, and tools, it is possible to generate "derivative products", using our Archives Unleashed Toolkit, that serve as useful starting points for scholarly inquiry. Scholars can download these products from the Archives Unleashed Cloud and manipulate them just like any other dataset, thus providing access to web archives without requiring any specialized knowledge. Over the past few years, our platform has processed over a thousand different collections from over two hundred users, totaling around 300 terabytes of web archives. This research was supported by the Andrew W. Mellon Foundation, the Social Sciences and Humanities Research Council of Canada, as well as Start Smart Labs, Compute Canada, the University of Waterloo, and York University. We'd like to thank Jeremy Wiebe, Ryan Deschamps, and Gursimran Singh for their contributions.
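The four-activity process model (filter, extract, aggregate, visualize) can be sketched as a chain of decoupled stages over a collection of records. The field names and records below are illustrative only; the Archives Unleashed Toolkit itself operates on real web-archive (WARC) data at scale:

```python
from collections import Counter

# Toy web-archive records (illustrative fields, not the actual WARC schema).
records = [
    {"url": "https://a.ca/news/1", "domain": "a.ca",  "year": 2009, "text": "..."},
    {"url": "https://b.org/x",     "domain": "b.org", "year": 2009, "text": "..."},
    {"url": "https://a.ca/news/2", "domain": "a.ca",  "year": 2010, "text": "..."},
]

def filter_stage(recs, year):
    """Filter: narrow the collection, e.g. to one crawl year."""
    return [r for r in recs if r["year"] == year]

def extract_stage(recs):
    """Extract: pull out the field of interest, here the domain."""
    return [r["domain"] for r in recs]

def aggregate_stage(domains):
    """Aggregate: count occurrences per domain."""
    return Counter(domains)

def visualize_stage(counts):
    """Visualize: stand-in for a real chart, one text bar per domain."""
    return [f"{d} {'#' * n}" for d, n in counts.most_common()]

# Because the stages are decoupled, an intermediate result (a "derivative
# product") can be saved and handed to scholars as an ordinary dataset.
derivative = aggregate_stage(extract_stage(filter_stage(records, 2009)))
print(visualize_stage(derivative))
```

The point of the model is exactly this decoupling: the aggregate output is a small, self-contained dataset a scholar can explore without touching the underlying terabytes.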
AQUA: an ontology driven question answering system
This paper describes AQUA, our question answering system over the Web. AQUA was designed to work over heterogeneous sources: it is equipped to work in closed-domain as well as open-domain question answering. As a first instance, AQUA tries to answer a question using a knowledge base; if a query cannot be satisfied over the knowledge base/database, AQUA then tries to find an answer on web pages (i.e., it uses the Web as its corpus). Our system uses NLP (Natural Language Processing), first-order logic, and Information Extraction technologies. AQUA has been tested using an ontology which describes academic life. Keywords: Ontologies, Information Extraction, Machine Learning.
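The knowledge-base-first strategy with a Web fallback that the abstract describes can be sketched in a few lines. The function names and the toy knowledge base are hypothetical, not AQUA's actual interfaces:

```python
def answer(question, kb, web_search):
    """Try the knowledge base first; fall back to web retrieval.
    kb: dict mapping questions to answers; web_search: callable q -> answer."""
    hit = kb.get(question)
    if hit is not None:
        return hit, "knowledge-base"
    return web_search(question), "web"

# Toy academic-life knowledge base and a stand-in for web retrieval.
kb = {"Who leads the ontology group?": "Dr. Example"}
fallback = lambda q: f"(top web snippet for: {q})"

print(answer("Who leads the ontology group?", kb, fallback))
print(answer("What is AQUA?", kb, fallback))
```

Returning the provenance ("knowledge-base" vs. "web") alongside the answer mirrors the closed-domain/open-domain split: KB answers are precise, Web answers are best-effort.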
The Semantic Web MIDI Tape: An Interface for Interlinking MIDI and Context Metadata
The Linked Data paradigm has been used to publish a large number of musical datasets and ontologies on the Semantic Web, such as MusicBrainz, AcousticBrainz, and the Music Ontology. Recently, the MIDI Linked Data Cloud has been added to these datasets, representing more than 300,000 pieces in MIDI format as Linked Data, opening up the possibility for linking fine-grained symbolic music representations to existing music metadata databases. Despite the dataset making MIDI resources available in Web data standard formats such as RDF and SPARQL, the important issue of finding meaningful links between these MIDI resources and relevant contextual metadata in other datasets remains. A fundamental barrier for the provision and generation of such links is the difficulty that users have at adding new MIDI performance data and metadata to the platform. In this paper, we propose the Semantic Web MIDI Tape, a set of tools and associated interface for interacting with the MIDI Linked Data Cloud by enabling users to record, enrich, and retrieve MIDI performance data and related metadata in native Web data standards. The goal of such interactions is to find meaningful links between published MIDI resources and their relevant contextual metadata. We evaluate the Semantic Web MIDI Tape in various use cases involving user-contributed content, MIDI similarity querying, and entity recognition methods, and discuss their potential for finding links between MIDI resources and metadata.
Off-Policy Evaluation of Probabilistic Identity Data in Lookalike Modeling
We evaluate the impact of probabilistically-constructed digital identity data
collected from Sep. to Dec. 2017 (approx.), in the context of
Lookalike-targeted campaigns. The backbone of this study is a large set of
probabilistically-constructed "identities", represented as small bags of
cookies and mobile ad identifiers with associated metadata, that are likely all
owned by the same underlying user. The identity data allows us to generate
"identity-based", rather than "identifier-based", user models, giving a fuller
picture of the interests of the users underlying the identifiers. We employ
off-policy techniques to evaluate the potential of identity-powered lookalike
models without incurring the risk of allowing untested models to direct large
amounts of ad spend or the large cost of performing A/B tests. We add to
historical work on off-policy evaluation by noting a significant type of
"finite-sample bias" that occurs for studies combining modestly-sized datasets
and evaluation metrics involving rare events (e.g., conversions). We illustrate
this bias using a simulation study that later informs the handling of inverse
propensity weights in our analyses on real data. We demonstrate significant
lift in identity-powered lookalikes versus an identity-ignorant baseline: on
average ~70% lift in conversion rate. This rises to factors of ~(4-32)x for
identifiers having little data themselves, but that can be inferred to belong
to users with substantial data to aggregate across identifiers. This implies
that identity-powered user modeling is especially important in the context of
identifiers having very short lifespans (i.e., frequently churned cookies). Our
work motivates and informs the use of probabilistically-constructed identities
in marketing. It also deepens the canon of examples in which off-policy
learning has been employed to evaluate the complex systems of the internet
economy. Comment: Accepted by WSDM 201
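The off-policy evaluation technique this abstract relies on is commonly realized as inverse-propensity scoring (IPS). Below is a minimal sketch on synthetic data, with the weight clipping that is a standard mitigation for the rare-event, finite-sample bias the authors highlight; the policies, propensities, and conversion rates are invented for illustration, not taken from the paper:

```python
import random

def ips_estimate(logs, target_prob, clip=10.0):
    """Inverse-propensity-scored estimate of the target policy's conversion
    rate, from logs collected under a different (logging) policy.
    Each log entry is (action, logging_propensity, reward)."""
    total = 0.0
    for action, p_log, reward in logs:
        w = min(target_prob(action) / p_log, clip)  # clipping curbs rare-event variance
        total += w * reward
    return total / len(logs)

random.seed(0)
# Synthetic logs: the logging policy shows ad A with probability 0.5,
# and conversions are rare events, as in the paper's setting.
logs = []
for _ in range(10_000):
    action = "A" if random.random() < 0.5 else "B"
    base = 0.02 if action == "A" else 0.005  # A converts more often
    reward = 1.0 if random.random() < base else 0.0
    logs.append((action, 0.5, reward))

# A hypothetical "lookalike" target policy that strongly prefers A.
target = lambda a: 0.9 if a == "A" else 0.1
print(round(ips_estimate(logs, target), 4))
```

The estimator reweights each logged outcome by how much more (or less) often the target policy would have shown that ad, so the target policy's conversion rate can be judged without spending budget on an A/B test.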