17 research outputs found

    Challenges of big data integration in the life sciences

    Big data has been reported to be revolutionizing many areas of life, including science. The term describes data that are unprecedentedly large, rapidly generated, heterogeneous, and hard to interpret accurately. The availability of such data has also brought new challenges: How to properly annotate data to make it searchable? What are the legal and ethical hurdles when sharing data? How to store data securely, preventing loss and corruption? The life sciences are not the only disciplines that must align themselves with big data requirements to keep up with the latest developments. The Large Hadron Collider, for instance, generates research data at a pace beyond any current biomedical research center. There are three recent major coinciding events that explain the emergence of big data in the context of research: the technological revolution for data generation, the development of tools for data analysis, and a conceptual change towards open science and data. The true potential of big data lies in pattern discovery in large datasets, as well as the formulation of new models and hypotheses. Confirmation of the existence of the Higgs boson, for instance, is one of the most recent triumphs of big data analysis in physics. Digital representations of biological systems have become more comprehensive. This, in combination with advances in machine learning, creates exciting new research possibilities. In this paper, we review the state of big data in bioanalytical research and provide an overview of the guidelines for its proper usage.

    Improving Software Engineering in Biostatistics: Challenges and Opportunities

    Programming is ubiquitous in applied biostatistics; adopting software engineering skills will help biostatisticians do a better job. To explain this, we start by highlighting key challenges for software development and application in biostatistics. Silos between different statistician roles, projects, departments, and organizations lead to the development of duplicate and suboptimal code. Building on top of open-source software requires critical appraisal and risk-based assessment of the modules used. Code that is written needs to be readable to ensure reliable software. The software needs to be easily understandable for the user, as well as developed within testing frameworks to ensure that long-term maintenance of the software is feasible. Finally, the reproducibility of research results is hindered by manual analysis workflows and uncontrolled code development. We next describe how awareness of the importance of good software engineering practices and strategies, and their application, can help address these challenges. The foundation is a better education in basic software engineering skills in schools, universities, and during the work life. Dedicated software engineering teams within academic institutions and companies can be a key factor for the establishment of good software engineering practices and catalyze improvements across research projects. Providing attractive career paths is important for the retention of talent. Readily available tools can improve the reproducibility of statistical analyses and their use can be exercised in community events. [...

    A data management infrastructure for the integration of imaging and omics data in life sciences

    BACKGROUND: As technical developments in omics and biomedical imaging increase the throughput of data generation in life sciences, the need for information systems capable of managing heterogeneous digital assets is increasing, in particular for systems supporting the findability, accessibility, interoperability, and reusability (FAIR) principles of scientific data management. RESULTS: We propose a Service Oriented Architecture approach for integrated management and analysis of multi-omics and biomedical imaging data. Our architecture introduces an image management system into a FAIR-supporting, web-based platform for omics data management. Interoperable metadata models and middleware components implement the required data management operations. The resulting architecture allows for FAIR management of omics and imaging data, facilitating metadata queries from software applications. The applicability of the proposed architecture is demonstrated using two technical proofs of concept and a use case, aimed at molecular plant biology and clinical liver cancer research, which integrate various imaging and omics modalities. CONCLUSIONS: We describe a data management architecture for integrated, FAIR-supporting management of omics and biomedical imaging data, and exemplify its applicability for basic biology research and clinical studies. We anticipate that FAIR data management systems for multi-modal data repositories will play a pivotal role in data-driven research, including studies which leverage advanced machine learning methods, as the joint analysis of omics and imaging data, in conjunction with phenotypic metadata, becomes not only desirable but necessary to derive novel insights into biological processes. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04584-3.

    dRNA-seq transcriptional profiling of the FK506 biosynthetic gene cluster in Streptomyces tsukubaensis NRRL18488 and general analysis of the transcriptome

    FK506 (tacrolimus) is a valuable immunosuppressant produced by several Streptomyces strains. In the genome of the wild-type producer Streptomyces tsukubaensis NRRL18488, FK506 biosynthesis is encoded by a gene cluster that spans 83.5 kb. Whole-transcriptome differential shotgun sequencing (dRNA-seq) of S. tsukubaensis was performed to analyze transcription at 2 different time points: before and during active FK506 production. In total, 8,914 transcription start sites were identified in either condition, which enabled precise determination of the 5′-UTR length of the corresponding transcripts as well as the identification of 2 consensus sequence motifs in the promoter regions. The transcription start sites of all gene operons within the FK506 cluster were identified, including 3 examples of leaderless RNA transcripts. These data provide detailed insight into the transcription of the FK506 biosynthetic gene cluster to support future regulatory studies, genetic manipulation, and industrial production.