
    X-search: An Open Access Interface for Cross-Cohort Exploration of the National Sleep Research Resource

    Background: The National Sleep Research Resource (NSRR) is a large-scale, openly shared data repository of de-identified, highly curated clinical sleep data from multiple NIH-funded epidemiological studies. Although many data repositories allow users to browse their content, few support fine-grained, cross-cohort query and exploration at the study-subject level. We introduce a cross-cohort query and exploration system, called X-search, to enable researchers to query patient cohort counts across a growing number of completed, NIH-funded studies in NSRR and to explore the feasibility or likelihood of reusing the data for research studies. Methods: X-search has been designed as a general framework with two loosely coupled components: a semantically annotated data repository and a cross-cohort exploration engine. The semantically annotated data repository comprises a canonical data dictionary, data sources with their own data dictionaries, and mappings between each individual data dictionary and the canonical data dictionary. The cross-cohort exploration engine consists of five modules: query builder, graphical exploration, case-control exploration, query translation, and query execution. The canonical data dictionary serves as the unified metadata that drives the visual exploration interfaces and facilitates query translation through the mappings. Results: X-search is publicly available at https://www.x-search.net/ with nine NSRR datasets consisting of over 26,000 unique subjects. The canonical data dictionary contains over 900 common data elements across the datasets. X-search has received over 1800 cross-cohort queries from users in 16 countries. Conclusions: X-search provides a powerful cross-cohort exploration interface for querying and exploring heterogeneous datasets in the NSRR data repository, enabling researchers to evaluate the feasibility of potential research studies and generate potential hypotheses using the NSRR data.
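    The canonical data dictionary and its per-dataset mappings are what make query translation possible. As a minimal, hypothetical sketch (the element names, dataset identifiers, and mapping tables below are illustrative assumptions, not taken from the actual NSRR data dictionaries), translating a canonical cross-cohort query into dataset-specific queries could look like this:

```python
# Hypothetical sketch: translating a canonical cross-cohort query into
# per-dataset queries via data-dictionary mappings (all names are illustrative).

# Canonical data dictionary: common data elements shared across cohorts.
CANONICAL_ELEMENTS = {"age_at_study", "gender", "bmi"}

# Mappings from each canonical element to a dataset's local variable name.
MAPPINGS = {
    "dataset_a": {"age_at_study": "age_s1", "gender": "gender", "bmi": "bmi_s1"},
    "dataset_b": {"age_at_study": "age_years", "gender": "sex", "bmi": "bmi"},
}

def translate_query(canonical_query, dataset):
    """Rewrite (element, operator, value) triples expressed against the
    canonical dictionary into the dataset's local variable names."""
    mapping = MAPPINGS[dataset]
    translated = []
    for element, op, value in canonical_query:
        if element not in CANONICAL_ELEMENTS:
            raise ValueError(f"Unknown canonical element: {element}")
        translated.append((mapping[element], op, value))
    return translated

if __name__ == "__main__":
    # "Subjects older than 65 with BMI over 30" against the canonical dictionary.
    query = [("age_at_study", ">", 65), ("bmi", ">", 30)]
    for ds in MAPPINGS:
        print(ds, translate_query(query, ds))
```

    In this sketch the canonical element names act as the single vocabulary users query against, while the mapping tables absorb the heterogeneity of the individual cohorts.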

    Towards Exascale Scientific Metadata Management

    Advances in technology and computing hardware are enabling scientists from all areas of science to produce massive amounts of data using large-scale simulations or observational facilities. In this era of data deluge, effective coordination between the data production and analysis phases hinges on the availability of metadata that describe the scientific datasets. Existing workflow engines have been capturing a limited form of metadata to provide provenance information about the identity and lineage of the data. However, much of the data produced by simulations, experiments, and analyses still needs to be annotated manually, in an ad hoc manner, by domain scientists. Systematic and transparent acquisition of rich metadata becomes a crucial prerequisite to sustain and accelerate the pace of scientific innovation. Yet a ubiquitous and domain-agnostic metadata management infrastructure that can meet the demands of extreme-scale science remains conspicuously absent. To address this gap in scientific data management research and practice, we present our vision for an integrated approach that (1) automatically captures and manipulates information-rich metadata while the data is being produced or analyzed and (2) stores metadata within each dataset so that it persists through metadata-oblivious processes and can be queried through established and standardized data access interfaces. We motivate the need for the proposed integrated approach using applications from plasma physics, climate modeling, and neuroscience, and then discuss research challenges and possible solutions.
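    One possible realization of point (2) above, storing metadata within each dataset so that it persists through metadata-oblivious processes and remains queryable through standard data access interfaces, is to embed metadata as attributes in a self-describing format such as HDF5. The sketch below is only an illustration under assumed names, not the paper's proposed infrastructure; the file, dataset, and attribute names are invented:

```python
# Illustrative sketch: embedding provenance-style metadata inside a dataset
# using HDF5 attributes, so the metadata travels with the data itself.
import h5py
import numpy as np

data = np.random.rand(1000, 3)  # stand-in for simulation output

with h5py.File("plasma_run_042.h5", "w") as f:
    dset = f.create_dataset("particle_positions", data=data)
    # Metadata captured at production time and stored alongside the data
    # (attribute names and values are invented for illustration).
    dset.attrs["producer"] = "hypothetical_pic_simulation v1.2"
    dset.attrs["timestep"] = 42
    dset.attrs["units"] = "meters"
    dset.attrs["parent_inputs"] = "run_042_config.yaml"

# Any HDF5-aware tool can later read the embedded metadata through the same
# data access interface used to read the data.
with h5py.File("plasma_run_042.h5", "r") as f:
    print(dict(f["particle_positions"].attrs))
```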

    Semantically-aware data discovery and placement in collaborative computing environments

    As the size of scientific datasets and the demand for interdisciplinary collaboration grow in modern science, it becomes imperative to develop better ways of discovering and placing datasets generated across multiple disciplines in order to facilitate interdisciplinary scientific research. For discovering relevant data in large-scale interdisciplinary datasets, the development and integration of cross-domain metadata is critical, as metadata serves as the key guideline for organizing data. To develop and integrate cross-domain metadata management systems in an interdisciplinary collaborative computing environment, three key issues need to be addressed: the development of a cross-domain metadata schema; the implementation of a metadata management system based on this schema; and the integration of the metadata system into existing distributed computing infrastructure. Current research in metadata management in distributed computing environments largely focuses on relatively simple schemas that lack the descriptive power to adequately address the semantic heterogeneity often found in interdisciplinary science, and it does not take adequate account of scalability in large-scale data management. Another key issue in data management is data placement: as scientific datasets grow, the overhead incurred by transferring data among different nodes also grows into a significant factor affecting overall performance, yet few data placement strategies take into consideration semantic information concerning data content. In this dissertation, we propose a cross-domain metadata system for a collaborative distributed computing environment and identify and evaluate the key factors and processes involved in a successful cross-domain metadata system, with the goal of facilitating data discovery in collaborative environments. This will allow researchers to conduct interdisciplinary science over large-scale datasets, making interdisciplinary datasets easier to access, reducing barriers to collaboration, and reducing the cost of future development of similar systems. We also investigate data placement strategies that incorporate semantic information about the hardware and network environment as well as domain information in the form of semantic metadata, so that semantic locality can be exploited in data placement, potentially reducing the overhead of accessing large-scale interdisciplinary datasets.
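    To make the semantics-aware placement idea concrete, the following sketch (hypothetical node names, tags, and scoring function; not the dissertation's actual strategy) favors the storage node whose existing holdings share the most semantic metadata tags with the dataset being placed, so that semantically related data tends to be co-located:

```python
# Hypothetical sketch of semantics-aware data placement: prefer the node whose
# existing datasets share the most semantic metadata tags with the new dataset.

def jaccard(a, b):
    """Jaccard similarity between two tag sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def choose_node(dataset_tags, node_catalog, free_space, size):
    """Pick the node with the highest semantic affinity that can hold the data."""
    best_node, best_score = None, -1.0
    for node, hosted_tags in node_catalog.items():
        if free_space[node] < size:
            continue  # skip nodes without enough capacity
        score = jaccard(dataset_tags, hosted_tags)
        if score > best_score:
            best_node, best_score = node, score
    return best_node

if __name__ == "__main__":
    catalog = {
        "node_a": {"climate", "ocean", "netcdf"},
        "node_b": {"genomics", "fastq"},
    }
    space = {"node_a": 500, "node_b": 500}
    print(choose_node({"climate", "sea_surface_temp"}, catalog, space, size=100))
    # -> "node_a": semantic locality with the existing climate data
```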

    Systematizing FAIR research data management in biomedical research projects: a data life cycle approach

    Biomedical researchers are facing data management challenges brought about by a new generation of data arising from the advent of translational medicine research. These challenges are further complicated by the recent calls for data re-use and long-term stewardship spearheaded by the FAIR principles initiative. As a result, there is an increasingly widespread recognition that advancing biomedical science is becoming dependent on the application of data science to manage and utilize highly diverse and complex data in ways that give it context, meaning, and longevity beyond its initial purpose. However, current methods and practices in biomedical informatics continue to adopt a traditional linear view of the informatics process (collect, store and analyse), focusing primarily on the challenges of data integration and analysis, which pertain to only part of the overall life cycle of research data. The aim of this research is to facilitate the adoption and integration of data management practices into the research life cycle of biomedical projects, thus improving their capability to solve the data management challenges they face throughout the course of their research work. To achieve this aim, this thesis takes a data life cycle approach to define and develop a systematic methodology and framework for the systematization of FAIR data management in biomedical research projects. The overarching contribution of this research is the provision of a data-state life cycle model for research data management in biomedical translational research projects. This model provides insight into the dynamics between 1) the purpose of a research-driven data use case, 2) the data requirements that render data in a state fit for purpose, 3) the data management functions that prepare and act upon data, and 4) the resulting state of data that is fit to serve the use case. This insight led to the development of a FAIR data management framework, which is another contribution of this thesis. This framework provides data managers with the groundwork, including the data models, resources and capabilities, needed to build a FAIR data management environment to manage data during the operational stages of a biomedical research project. An exemplary implementation of this architecture (PlatformTM) was developed and validated with real-world research datasets produced by collaborative research programs funded by the Innovative Medicines Initiative (IMI): BioVacSafe, eTRIKS and FAIRplus.
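    As an informal illustration of the data-state life cycle idea, and only a sketch under assumed state and function names rather than the thesis's formal model, data management functions can be viewed as transitions between data states, with a use case servable only once the data reaches a state fit for its purpose:

```python
# Illustrative sketch (not the thesis's formal model): data management
# functions move a dataset between states, and a use case is servable only
# when the data reaches a state fit for its purpose.
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    state: str = "raw"
    history: list = field(default_factory=list)

# Hypothetical management functions: each maps an input state to an output state.
FUNCTIONS = {
    "curate":   ("raw", "curated"),
    "annotate": ("curated", "semantically_annotated"),
    "publish":  ("semantically_annotated", "fair_published"),
}

def apply_function(ds: Dataset, fn: str) -> None:
    """Apply a management function, checking that the data is in the state it requires."""
    required, produced = FUNCTIONS[fn]
    if ds.state != required:
        raise ValueError(f"{fn} requires state '{required}', got '{ds.state}'")
    ds.history.append((fn, ds.state, produced))
    ds.state = produced

def fit_for_purpose(ds: Dataset, required_state: str) -> bool:
    """A use case is servable when the data has reached the state it requires."""
    return ds.state == required_state

if __name__ == "__main__":
    ds = Dataset("cohort_biomarkers")
    for step in ("curate", "annotate", "publish"):
        apply_function(ds, step)
    print(ds.state, fit_for_purpose(ds, "fair_published"))
```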

    From Sensor to Observation Web with Environmental Enablers in the Future Internet

    This paper outlines the grand challenges in global sustainability research and the objectives of the FP7 Future Internet PPP program within the Digital Agenda for Europe. Large user communities are generating significant amounts of valuable environmental observations at local and regional scales using the devices and services of the Future Internet. These communities’ environmental observations represent a wealth of information which is currently hardly used, or used only in isolation, and is therefore in need of integration with other information sources. Indeed, this very integration will lead to a paradigm shift from a mere Sensor Web to an Observation Web with semantically enriched content emanating from sensors, environmental simulations and citizens. The paper also describes the research challenges in realizing the Observation Web and the associated environmental enablers for the Future Internet. Such an environmental enabler could, for instance, be an electronic sensing device, a web-service application, or even a social networking group affording or facilitating the capability of Future Internet applications to consume, produce, and use environmental observations in cross-domain applications. The term “envirofied” Future Internet is coined to describe this overall target, which forms a cornerstone of work in the Environmental Usage Area within the Future Internet PPP program. Relevant trends described in the paper are the usage of ubiquitous sensors (anywhere), the provision and generation of information by citizens, and the convergence of real and virtual realities to convey understanding of environmental observations. The paper addresses the technical challenges in the Environmental Usage Area and the need for designing a multi-style service-oriented architecture. Key topics are the mapping of requirements to capabilities, providing scalability and robustness, and implementing context-aware information retrieval. Another essential research topic is handling data fusion and model-based computation, and the related propagation of information uncertainty. Approaches to security, standardization and harmonization, all essential for sustainable solutions, are summarized from the perspective of the Environmental Usage Area. The paper concludes with an overview of emerging, high-impact applications in the environmental areas concerning land ecosystems (biodiversity), air quality (atmospheric conditions) and water ecosystems (marine asset management).
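    Among the research topics named above is data fusion and the propagation of information uncertainty. As a minimal, generic illustration (textbook inverse-variance weighting, not a method taken from the paper), fusing two independent observations of the same environmental quantity propagates their uncertainties into the fused estimate as follows:

```python
# Minimal illustration of fusing two independent observations of the same
# environmental quantity with inverse-variance weighting, propagating their
# uncertainties into the fused estimate (standard formula, not from the paper).

def fuse(x1, var1, x2, var2):
    """Return the fused value and variance of two independent measurements."""
    w1, w2 = 1.0 / var1, 1.0 / var2
    fused_value = (w1 * x1 + w2 * x2) / (w1 + w2)
    fused_var = 1.0 / (w1 + w2)
    return fused_value, fused_var

if __name__ == "__main__":
    # A citizen-sensor reading (noisier) and a reference station reading.
    value, variance = fuse(21.4, 2.0, 20.6, 0.5)
    print(f"fused temperature: {value:.2f} °C, variance: {variance:.2f}")
```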

    METADATA MANAGEMENT FOR CLINICAL DATA INTEGRATION

    Clinical data have been continuously collected and have grown rapidly with the wide adoption of electronic health records (EHR). Clinical data provide the foundation for state-of-the-art research such as artificial intelligence in medicine. At the same time, it has become a challenge to integrate, access, and explore study-level patient data across large volumes of data from heterogeneous databases. Effective, fine-grained, cross-cohort data exploration and semantically enabled approaches and systems are needed. To build semantically enabled systems, we need to leverage existing terminology systems and ontologies. Numerous ontologies have been developed recently, and they play an important role in semantically enabled applications. Because they contain valuable codified knowledge, the management of these ontologies, as metadata, also requires systematic approaches. Moreover, in most clinical settings, patient data are collected with the help of a data dictionary. Knowledge of the relationships between an ontology and a related data dictionary is important for semantic interoperability. Such relationships are represented and maintained by mappings. Mappings store how data source elements and domain ontology concepts are linked, as well as how domain ontology concepts are linked between different ontologies. While mappings are crucial to maintaining the relationships between an ontology and a related data dictionary, they are commonly captured in CSV files with limited capabilities for sharing, tracking, and visualization. The management of mappings requires an innovative, interactive, and collaborative approach. Metadata management serves to organize data that describes other data. In computer science and information science, an ontology is metadata consisting of the representation, naming, and definition of the hierarchies, properties, and relations between concepts. A structured, scalable, and computer-understandable way to manage metadata is critical to developing systems with fine-grained data exploration capabilities. This dissertation presents a systematic approach called MetaSphere that uses metadata and ontologies to support the management and integration of clinical research data through an ontology-based metadata management system for multiple domains. MetaSphere is a general framework that aims to manage domain-specific metadata, provide a fine-grained data exploration interface, and store patient data in data warehouses. Moreover, MetaSphere provides a dedicated mapping interface, the Interactive Mapping Interface (IMI), to map data dictionaries to well-recognized, standardized ontologies. MetaSphere has been applied successfully to three domains: the sleep domain (X-search), pressure ulcer injuries and deep tissue pressure (SCIPUDSphere), and cancer. Specifically, MetaSphere stores domain ontologies structurally in databases. Patient data in the corresponding domains are also stored in databases as data warehouses. MetaSphere provides a powerful query interface to enable interaction between researchers and actual patient data; the query interface is a mechanism that allows researchers to compose complex queries to pinpoint specific cohorts within large amounts of patient data. The MetaSphere framework has been instantiated in three domains with the following results. X-search is publicly available at https://www.x-search.net with nine sleep domain datasets consisting of over 26,000 unique subjects. The canonical data dictionary contains over 900 common data elements across the datasets. X-search has received over 1800 cross-cohort queries from users in 16 countries. SCIPUDSphere has integrated a total of 268,562 records containing 282 ICD-9 codes related to pressure ulcer injuries among 36,626 individuals with spinal cord injuries. IMI is publicly available at http://epi-tome.com/. Using IMI, we have successfully mapped the North American Association of Central Cancer Registries (NAACCR) data dictionary to National Cancer Institute Thesaurus (NCIt) concepts.
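    Since the abstract contrasts CSV-based mapping capture with a more shareable, trackable approach, the sketch below shows one possible way (a hypothetical schema; the table, column names, and concept identifier are invented, not IMI's actual data model) to hold data-dictionary-to-ontology mappings in a small relational table so they can be queried, shared, and versioned:

```python
# Hypothetical sketch: storing data-dictionary-to-ontology mappings in a
# relational table instead of flat CSV files (schema is illustrative only).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mapping (
        source_element  TEXT,   -- element from the local data dictionary
        target_concept  TEXT,   -- ontology concept identifier
        target_ontology TEXT,   -- e.g. an NCIt-style terminology
        mapped_by       TEXT,   -- curator who asserted the mapping
        mapped_on       TEXT    -- date of the assertion
    )
""")
conn.execute(
    "INSERT INTO mapping VALUES (?, ?, ?, ?, ?)",
    ("primary_site", "C12345", "NCIt", "curator_a", "2021-03-01"),  # fictitious IDs
)
conn.commit()

# Look up how a data-dictionary element is mapped.
for row in conn.execute(
    "SELECT target_concept, target_ontology FROM mapping WHERE source_element = ?",
    ("primary_site",),
):
    print(row)
```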

    Enhancing the FAIRness of Arctic Research Data Through Semantic Annotation

    The National Science Foundation’s Arctic Data Center is the primary data repository for NSF-funded research conducted in the Arctic. There are major challenges in discovering and interpreting resources in a repository containing data as heterogeneous and interdisciplinary as those in the Arctic Data Center. This paper reports on advances in cyberinfrastructure at the Arctic Data Center that help address these issues by leveraging semantic technologies that enhance the repository’s adherence to the FAIR data principles and improve the Findability, Accessibility, Interoperability, and Reusability of its digital resources. We describe the Arctic Data Center’s improvements, which use semantic annotation to bind metadata about Arctic data sets to concepts in web-accessible ontologies. The Arctic Data Center’s implementation of a semantic annotation mechanism is accompanied by an extended search interface that increases the findability of data by allowing users to search for specific, broader, and narrower meanings of measurement descriptions, as well as through their potential synonyms. Based on research carried out by the DataONE project, we evaluated the potential impact of this approach on the accessibility, interoperability, and reusability of measurement data. Arctic research often benefits from additional data, typically from multiple, heterogeneous sources, that complement and extend the spatial, temporal, or thematic bases for understanding Arctic phenomena. These relevant data resources must be ‘found’ and ‘harmonized’ prior to integration and analysis. The findings of a case study indicated that the semantic annotation of measurement data enhances the capabilities of researchers to accomplish these tasks.
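    The extended search behavior described above, finding data through broader, narrower, and synonymous meanings of measurement descriptions, can be illustrated with a small query-expansion sketch over a toy ontology (the concepts and relations below are invented for illustration and are not the Arctic Data Center's annotation vocabulary):

```python
# Illustrative sketch: expanding a search term to its synonyms and narrower
# concepts so semantically annotated datasets are findable under related
# meanings (toy ontology; not the repository's actual vocabulary).

ONTOLOGY = {
    # concept: (synonyms, narrower concepts)
    "snow_depth": ({"depth of snow"}, {"snow_water_equivalent"}),
    "snow_water_equivalent": ({"SWE"}, set()),
}

def expand(term):
    """Return the term plus its synonyms and (recursively) its narrower concepts."""
    results = {term}
    synonyms, narrower = ONTOLOGY.get(term, (set(), set()))
    results |= synonyms
    for child in narrower:
        results |= expand(child)
    return results

if __name__ == "__main__":
    print(expand("snow_depth"))
    # A search for "snow_depth" also retrieves datasets annotated with
    # "snow_water_equivalent" or its synonym "SWE".
```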