    Genome sequence analysis with MonetDB - A case study on Ebola virus diversity

    Next-generation sequencing (NGS) technology has led the life sciences into the big data era. Today, sequencing a genome takes little time and money, but yields terabytes of data to be stored and analyzed. Biologists are often exposed to excessively time-consuming and error-prone data management and analysis hurdles. In this paper, we propose a database management system (DBMS) based approach to accelerate and substantially simplify genome sequence analysis. We have extended MonetDB, an open-source column-based DBMS, with a BAM module, which enables easy, flexible, and rapid management and analysis of sequence alignment data stored as Sequence Alignment/Map (SAM/BAM) files. We describe the main features of MonetDB/BAM using a case study on Ebola virus genomes.
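    The kind of aggregate query the abstract describes (e.g. counting well-mapped reads per reference via SQL over alignment tables) can be sketched in plain Python over SAM text records. The function name and the direct parsing of SAM fields are illustrative stand-ins; the paper's actual BAM-module schema and SQL interface are not reproduced here.

```python
from collections import Counter

def count_reads_per_reference(sam_lines, min_mapq=30):
    """Count mapped reads per reference sequence, mimicking a
    GROUP BY rname query over an alignment table."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):            # skip SAM header records
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
        if rname == "*" or flag & 4:        # unmapped read (FLAG bit 0x4)
            continue
        if mapq >= min_mapq:
            counts[rname] += 1
    return counts

sam = [
    "@SQ\tSN:EBOV\tLN:18959",
    "r1\t0\tEBOV\t100\t60\t50M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tEBOV\t200\t10\t50M\t*\t0\t0\tACGT\tIIII",
]
print(count_reads_per_reference(sam))  # -> Counter({'EBOV': 1})
```

    In the DBMS approach, the same filter-and-group logic would be a single SQL statement evaluated by the column store rather than a Python loop.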

    Deep Symbolic Learning Architecture for Variant Calling in NGS

    The variant detection process (variant calling) is fundamental in bioinformatics, demanding maximum precision and reliability. This study examines an innovative strategy for integrating a traditional pipeline developed in-house with an advanced intelligent system (IS). Although the original pipeline already included tools based on traditional algorithms, it had limitations, particularly in the detection of rare or unknown variants. The IS was therefore introduced to provide an additional layer of analysis, capitalizing on deep and symbolic learning techniques to improve and enhance previous detections. The main technical challenge lay in interoperability. To overcome this, Nextflow, a scripting language designed to manage complex bioinformatics workflows, was employed. Nextflow facilitated communication and efficient data transfer between the original pipeline and the IS, guaranteeing compatibility and reproducibility. After the variant calling step of the original system, the results were transmitted to the IS, where a meticulous sequence of analyses was applied, from preprocessing to data fusion. As a result, an optimized set of variants was generated and integrated with the previous results. Variants corroborated by both tools were considered highly reliable, while discrepancies indicated areas for detailed investigation. The product of this integration advanced to subsequent stages of the pipeline, usually annotation or interpretation, contextualizing the variants from biological and clinical perspectives. This adaptation not only preserved the original functionality of the pipeline but also enhanced it with the IS, establishing a new standard for the variant calling process.
    This research offers a robust and efficient model for the detection and analysis of genomic variants, highlighting the promise and applicability of hybrid learning in bioinformatics. This study has been funded by the AIR Genomics project (file number CCTT3/20/SA/0003), through the 2020 call for R&D projects oriented to excellence and competitive improvement of the CCTT, by the Institute for Business Competitiveness of Castilla y León and FEDER funds.
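    The fusion step the abstract describes (variants corroborated by both the pipeline and the IS are high confidence, discrepancies are flagged for review) amounts to set operations over the two call sets. A minimal sketch, assuming variants are represented as (chromosome, position, ref, alt) tuples; the function name and representation are illustrative, not taken from the paper:

```python
def fuse_calls(pipeline_variants, is_variants):
    """Merge variant calls from the traditional pipeline and the
    intelligent system: agreement -> high confidence,
    disagreement -> flagged for detailed investigation."""
    pipeline_variants, is_variants = set(pipeline_variants), set(is_variants)
    return {
        # intersection: corroborated by both tools
        "high_confidence": sorted(pipeline_variants & is_variants),
        # symmetric difference: called by only one of the two
        "needs_review": sorted(pipeline_variants ^ is_variants),
    }

pipeline = [("chr1", 12345, "A", "G"), ("chr2", 500, "T", "C")]
intelligent = [("chr1", 12345, "A", "G"), ("chr7", 999, "G", "T")]
result = fuse_calls(pipeline, intelligent)
```

    In the described workflow this merged set would then flow to annotation and interpretation stages of the Nextflow pipeline.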

    Framing Apache Spark in life sciences

    Advances in high-throughput and digital technologies have required the adoption of big data approaches for handling complex tasks in life sciences. However, the drift to big data has led researchers to face technical and infrastructural challenges in storing, sharing, and analysing it. Such tasks require distributed computing systems and algorithms able to ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms that adapt the computation to the data over on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a very powerful engine for large-scale data processing on clusters. Thanks in part to specialised libraries for working with structured and relational data, it also supports machine learning, graph-based computation, and stream processing. This review article aims to help life sciences researchers ascertain the features of Apache Spark and assess whether it can be successfully used in their research activities.
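    Spark's core abstraction is a dataset split into partitions, transformed element-wise and combined with a two-stage reduction (local per-partition aggregation, then a final combine). A toy sketch of that model in plain Python, deliberately using none of the real pyspark API, only to illustrate why the computation parallelises:

```python
from functools import reduce

def parallelize(data, n_partitions=4):
    """Split a dataset into partitions, as Spark distributes it across a cluster."""
    return [data[i::n_partitions] for i in range(n_partitions)]

def map_partitions(partitions, fn):
    """Apply fn to every element; each partition could run on a different node."""
    return [[fn(x) for x in part] for part in partitions]

def reduce_partitions(partitions, fn, initial):
    """Two-stage aggregation: reduce locally per partition,
    then combine the partial results, mirroring Spark's shuffle-free reduce."""
    partials = [reduce(fn, part, initial) for part in partitions]
    return reduce(fn, partials, initial)

reads = parallelize(["ACGT", "ACGTACGT", "AC"], n_partitions=2)
lengths = map_partitions(reads, len)
total = reduce_partitions(lengths, lambda a, b: a + b, 0)
```

    The same total-base-count computation in real Spark would be a one-liner over an RDD; the point here is only that map and associative reduce never require the partitions to see each other's data until the final combine.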

    Computational pan-genomics: status, promises and challenges

    Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies, and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques, and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.
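    The string-to-graph shift the abstract highlights can be illustrated with a deliberately minimal sketch: instead of one reference string, each position holds the set of bases observed across the population, so alternative alleles become branches and every path through the graph is a haplotype. This toy structure is an assumption for illustration only and is far simpler than real variation-graph tools.

```python
def build_variation_graph(reference, variants):
    """Encode a reference plus single-nucleotide variants as a sequence
    graph: each position becomes a node labelled with its allowed bases."""
    graph = [{base} for base in reference]
    for pos, alt in variants:        # variants: (0-based position, alt base)
        graph[pos].add(alt)
    return graph

def n_paths(graph):
    """Number of distinct haplotypes the graph encodes: the product of
    the number of alleles at each position."""
    n = 1
    for node in graph:
        n *= len(node)
    return n

# a 4-base reference with SNVs at positions 1 and 3
graph = build_variation_graph("ACGT", [(1, "T"), (3, "A")])
```

    Even this tiny example shows why graphs scale where strings do not: two variants already encode four haplotypes without storing four separate sequences, and reads can be aligned to the shared backbone once rather than to every sample.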