Disappearing repositories -- taking an infrastructure perspective on the long-term availability of research data
Currently, there is limited research investigating the phenomenon of research
data repositories being shut down, and the impact this has on the long-term
availability of data. This paper takes an infrastructure perspective on the
preservation of research data by using a registry to identify 191 research data
repositories that have been closed and presenting information on the shutdown
process. The results show that 6.2 % of research data repositories indexed in
the registry were shut down. The risks resulting in repository shutdown are
varied. The median age of a repository when shutting down is 12 years.
Strategies to prevent data loss at the infrastructure level are pursued to
varying extents. 44 % of the repositories in the sample migrated data to another
repository, and 12 % maintain limited access to their data collection. However,
neither strategy is a permanent solution. Finally, the general lack of
information on repository shutdown events, as well as the effect on the
findability of data and the permanence of the scholarly record, is discussed.
Chapter 5. Overlooked and overrated data sharing: Why some scientists are confused and/or dismissive
This chapter is an excerpt from the book Curating Research Data, Volume One: Practical Strategies for Your Digital Repository, edited by Lisa R. Johnston and published by American College & Research Libraries (ACRL) in January 2017. The book is available from the American Library Association in print and as an open access e-book at www.alastore.ala.org. ISBN-13: 9780838988589
Provenance, propagation and quality of biological annotation
PhD Thesis
Biological databases have become an integral part of the life sciences, being used
to store, organise and share ever-increasing quantities and types of data. Biological
databases are typically centred around raw data, with individual entries being
assigned to a single piece of biological data, such as a DNA sequence. Although essential,
a reader can obtain little information from the raw data alone. Therefore,
many databases aim to supplement their entries with annotation, allowing the current
knowledge about the underlying data to be conveyed to a reader. Although annotations
come in many different forms, most databases provide some form of free text
annotation.
Given that annotations can form the foundations of future work, it is important that a
user is able to evaluate the quality and correctness of an annotation. However, this is
rarely straightforward. The amount of annotation, and the way in which it is curated,
varies between databases. For example, the production of an annotation in some
databases is entirely automated, without any manual intervention. Further, sections
of annotations may be reused, being propagated between entries and, potentially,
external databases. This provenance and curation information is not always apparent
to a user.
The work described within this thesis explores issues relating to biological annotation
quality. While the most valuable annotation is often contained within free text, its lack
of structure makes it hard to assess. Initially, this work describes a generic approach
that allows textual annotations to be quantitatively measured. This approach is based
upon the application of Zipf's Law to words within textual annotation, resulting in a
single value. The relationship between this value and Zipf's principle of least effort
provides an indication as to the annotation's quality, whilst also allowing annotations
to be quantitatively compared.
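
As a rough illustration of how a Zipf-style single value could be derived from annotation text, the sketch below ranks word frequencies and fits a straight line to log(frequency) against log(rank). The function name `zipf_exponent`, the tokenisation rule, and the least-squares fit are all assumptions made for this example; the abstract does not state how the thesis estimates its value, so this should not be read as the author's procedure.

```python
import math
import re
from collections import Counter


def zipf_exponent(text):
    """Estimate a Zipf-style exponent from the words of `text`.

    Assumption: ordinary least squares on log(rank) vs log(frequency).
    The thesis may estimate its value with a different method.
    """
    # Hypothetical tokenisation: lower-case runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    freqs = sorted(Counter(words).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # Zipf's law predicts frequency proportional to rank ** -exponent,
    # so the exponent is the negated slope of the fitted line.
    return -sxy / sxx
```

Computing the same value for two annotation sets gives one number per set, which is what makes a quantitative comparison possible. A least-squares fit on log-log data is only one of several ways to estimate such an exponent and is used here for brevity.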
Secondly, the thesis focuses on determining annotation provenance and tracking any
subsequent propagation. This is achieved through the development of a visualisation
framework, which exploits the reuse of sentences within annotations. Utilising this
framework, a number of propagation patterns were identified, which on analysis appear
to indicate low quality and erroneous annotation.
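
The idea of exploiting sentence reuse can be sketched as an index from each sentence to the entries that contain it. This is a minimal sketch under stated assumptions: the function name `shared_sentences`, the entry identifiers, the naive sentence splitter, and the case and whitespace normalisation are all hypothetical choices for illustration, and the thesis's visualisation framework itself is not reproduced here.

```python
import re
from collections import defaultdict


def shared_sentences(annotations):
    """Map each sentence found in more than one entry to those entries.

    `annotations` is a dict of entry id -> free-text annotation.
    Assumption: sentences end at '.', '!' or '?' followed by whitespace.
    """
    index = defaultdict(set)
    for entry_id, text in annotations.items():
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            # Normalise case and spacing so trivial variants still match.
            normalised = " ".join(sentence.lower().split())
            if normalised:
                index[normalised].add(entry_id)
    return {s: sorted(ids) for s, ids in index.items() if len(ids) > 1}


# Hypothetical entries used only to show the shape of the result.
example = {
    "entry_1": "Binds DNA. Involved in repair.",
    "entry_2": "Binds DNA. Located in the nucleus.",
}
reused = shared_sentences(example)
```

Sentences that appear in several entries are candidates for propagated annotation; tracing where such a sentence first appeared would additionally require dated versions of each entry, which this sketch does not model.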
Together, these approaches increase our understanding of the textual characteristics
of biological annotation, and suggest that this understanding can be used to increase
the overall quality of these resources.