17 research outputs found
Micropublication: incentivizing community curation and placing unpublished data into the public domain
Large volumes of data generated by research laboratories coupled with the required effort and cost of curation present a significant barrier to inclusion of these data in authoritative community databases. Further, many publicly funded experimental observations remain invisible to curation simply because they are never published: results often do not fit within the scope of a standard publication; trainee-generated data are forgotten when the experimenter (e.g. student, post-doc) leaves the lab; results are omitted from science narratives due to publication bias where certain results are considered irrelevant for the publication. While authors are in the best position to curate their own data, they face a steep learning curve to ensure that appropriate referential tags, metadata, and ontologies are applied correctly to their observations, a task sometimes considered beyond the scope of their research and other numerous responsibilities. Getting researchers to adopt a new system of data reporting and curation requires a fundamental change in behavior among all members of the research community. To solve these challenges, we have created a novel scholarly communication platform that captures data from researchers and directly delivers them to information resources via Micropublication. This platform incentivizes authors to publish their unpublished observations along with associated metadata by providing a deliberately fast and lightweight but still peer-reviewed process that results in a citable publication. Our long-term goal is to develop a data ecosystem that improves reproducibility and accountability of publicly funded research and in turn accelerates both basic and translational discovery
Recommended from our members
Translational bioinformatics in mental health: open access data sources and computational biomarker discovery
Mental illness is increasingly recognized as both a significant cost to society and a significant area of opportunity for biological breakthrough. As -omics and imaging technologies enable researchers to probe molecular and physiological underpinnings of multiple diseases, opportunities arise to explore the biological basis for behavioral health and disease. From individual investigators to large international consortia, researchers have generated rich data sets in the area of mental health, including genomic, transcriptomic, metabolomic, proteomic, clinical and imaging resources. General data repositories such as the Gene Expression Omnibus (GEO) and Database of Genotypes and Phenotypes (dbGaP) and mental health (MH)-specific initiatives, such as the Psychiatric Genomics Consortium, MH Research Network and PsychENCODE represent a wealth of information yet to be gleaned. At the same time, novel approaches to integrate and analyze data sets are enabling important discoveries in the area of mental and behavioral health. This review will discuss and catalog into an organizing framework the increasingly diverse set of MH data resources available, using schizophrenia as a focus area, and will describe novel and integrative approaches to molecular biomarker discovery that make use of mental health data.National Institutes of Health [UL1TR001117, R01LM012095, R01LM012806]Open access articleThis item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at [email protected]
User-centered semantic dataset retrieval
Finding relevant research data is an increasingly important but time-consuming task in daily research practice. Several studies report on difficulties in dataset search, e.g., scholars retrieve only partial pertinent data, and important information can not be displayed in the user interface. Overcoming these problems has motivated a number of research efforts in computer science, such as text mining and semantic search. In particular, the emergence of the Semantic Web opens a variety of novel research perspectives. Motivated by these challenges, the overall aim of this work is to analyze the current obstacles in dataset search and to propose and develop a novel semantic dataset search. The studied domain is biodiversity research, a domain that explores the diversity of life, habitats and ecosystems. This thesis has three main contributions: (1) We evaluate the current situation in dataset search in a user study, and we compare a semantic search with a classical keyword search to explore the suitability of semantic web technologies for dataset search. (2) We generate a question corpus and develop an information model to figure out on what scientific topics scholars in biodiversity research are interested in. Moreover, we also analyze the gap between current metadata and scholarly search interests, and we explore whether metadata and user interests match. (3) We propose and develop an improved dataset search based on three components: (A) a text mining pipeline, enriching metadata and queries with semantic categories and URIs, (B) a retrieval component with a semantic index over categories and URIs and (C) a user interface that enables a search within categories and a search including further hierarchical relations. Following user centered design principles, we ensure user involvement in various user studies during the development process
Dataset Search in Biodiversity Research: Do Metadata in Data Repositories Reflect Scholarly Information Needs?
Abstract
The increasing amount of publicly available research data provides the opportunity to link and integrate data in order to create and prove novel hypotheses, to repeat experiments or to compare recent data to data collected at a different time or place. However, recent studies have shown that retrieving relevant data for data reuse is a time-consuming task in daily research practice. In this study, we explore what hampers dataset retrieval in biodiversity research, a field that produces a large amount of heterogeneous data. In particular, we focus on scholarly search interests and metadata, the primary source of data in a dataset retrieval system. We show that existing metadata currently poorly reflect information needs and therefore are the biggest obstacle in retrieving relevant data. Our findings indicate that for data seekers in the biodiversity domain environments, materials and chemicals, species, biological and chemical processes, locations, data parameters and data types are important information categories. These interests are well covered in metadata elements of domain-specific standards. However, instead of utilizing these standards, large data repositories tend to use metadata standards with domain-independent metadata fields that cover search interests only to some extent. A second problem are arbitrary keywords utilized in descriptive fields such as title, description or subject. Keywords support scholars in a full text search only if the provided terms syntactically match or their semantic relationship to terms used in a user query is known
A recommender system for scientific datasets and analysis pipelines
Scientific datasets and analysis pipelines are increasingly being shared
publicly in the interest of open science.
However, mechanisms are lacking to reliably identify which pipelines
and datasets can appropriately be used together. Given the increasing number of high-quality public datasets and
pipelines, this lack of clear compatibility threatens the
findability and reusability of these resources. We investigate
the feasibility of a collaborative filtering system to recommend pipelines
and datasets based on provenance records from previous executions.
We evaluate our system using datasets and pipelines extracted from the
Canadian Open Neuroscience Platform, a national initiative for open
neuroscience. The recommendations provided by our system (AUC) are
significantly better than chance and outperform recommendations made by
domain experts using their previous knowledge as well as pipeline and dataset descriptions (AUC). In particular, domain experts often neglect
low-level technical aspects of a pipeline-dataset interaction, such as the level of pre-processing, which are
captured by a provenance-based system. We conclude that provenance-based
pipeline and dataset recommenders are feasible and beneficial to
the sharing and usage of open-science resources. Future
work will focus on the collection of more
comprehensive provenance traces, and on deploying the system in production
Dataset Search: A lightweight, community-built tool to support research data discovery
Objective: Promoting discovery of research data helps archived data realize its potential to advance knowledge. Montana State University (MSU) Dataset Search aims to support discovery and reporting for research datasets created by researchers at institutions.
Methods and Results: The Dataset Search application consists of five core features: a streamlined browse and search interface, a data model based on dataset discovery, a harvesting process for finding and vetting datasets stored in external repositories, an administrative interface for managing the creation, ingest, and maintenance of dataset records, and a dataset visualization interface to demonstrate how data is produced and used by MSU researchers.
Conclusion: The Dataset Search application is designed to be easily customized and implemented by other institutions. Indexes like Dataset Search can improve search and discovery for content archived in data repositories, therefore amplifying the impact and benefits of archived data
Understanding Data Search as a Socio-technical Practice
Open research data are heralded as having the potential to increase
effectiveness, productivity, and reproducibility in science, but little is
known about the actual practices involved in data search. The socio-technical
problem of locating data for reuse is often reduced to the technological
dimension of designing data search systems. We combine a bibliometric study of
the current academic discourse around data search with interviews with data
seekers. In this article, we explore how adopting a contextual, socio-technical
perspective can help to understand user practices and behavior and ultimately
help to improve the design of data discovery systems.Comment: 19 pages, 3 figures, 7 table
The CAMH Neuroinformatics Platform: A Hospital-Focused Brain-CODE Implementation
Investigations of mental illness have been enriched by the advent and maturation of neuroimaging technologies and the rapid pace and increased affordability of molecular sequencing techniques, however, the increased volume, variety and velocity of research data, presents a considerable technical and analytic challenge to curate, federate and interpret. Aggregation of high-dimensional datasets across brain disorders can increase sample sizes and may help identify underlying causes of brain dysfunction, however, additional barriers exist for effective data harmonization and integration for their combined use in research. To help realize the potential of multi-modal data integration for the study of mental illness, the Centre for Addiction and Mental Health (CAMH) constructed a centralized data capture, visualization and analytics environment—the CAMH Neuroinformatics Platform—based on the Ontario Brain Institute (OBI) Brain-CODE architecture, towards the curation of a standardized, consolidated psychiatric hospital-wide research dataset, directly coupled to high performance computing resources
brainlife.io: A decentralized and open source cloud platform to support neuroscience research
Neuroscience research has expanded dramatically over the past 30 years by
advancing standardization and tool development to support rigor and
transparency. Consequently, the complexity of the data pipeline has also
increased, hindering access to FAIR data analysis to portions of the worldwide
research community. brainlife.io was developed to reduce these burdens and
democratize modern neuroscience research across institutions and career levels.
Using community software and hardware infrastructure, the platform provides
open-source data standardization, management, visualization, and processing and
simplifies the data pipeline. brainlife.io automatically tracks the provenance
history of thousands of data objects, supporting simplicity, efficiency, and
transparency in neuroscience research. Here brainlife.io's technology and data
services are described and evaluated for validity, reliability,
reproducibility, replicability, and scientific utility. Using data from 4
modalities and 3,200 participants, we demonstrate that brainlife.io's services
produce outputs that adhere to best practices in modern neuroscience research