    Genome sequence analysis with MonetDB - A case study on Ebola virus diversity

    Next-generation sequencing (NGS) technology has led the life sciences into the big data era. Today, sequencing a genome takes little time and money, but yields terabytes of data to be stored and analyzed. Biologists are often exposed to excessively time-consuming and error-prone data management and analysis hurdles. In this paper, we propose a database management system (DBMS) based approach to accelerate and substantially simplify genome sequence analysis. We have extended MonetDB, an open-source column-based DBMS, with a BAM module, which enables easy, flexible, and rapid management and analysis of sequence alignment data stored as Sequence Alignment/Map (SAM/BAM) files. We describe the main features of MonetDB/BAM using a case study on Ebola virus genomes.
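    The kind of aggregate query the abstract describes (e.g. counting well-mapped reads per reference via SQL over alignment tables) can be sketched in plain Python over SAM text records. The function name and the direct parsing of SAM fields are illustrative stand-ins; the paper's actual BAM-module schema and SQL interface are not reproduced here.

```python
from collections import Counter

def count_reads_per_reference(sam_lines, min_mapq=30):
    """Count mapped reads per reference sequence, mimicking a
    GROUP BY rname query over an alignment table."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):            # skip SAM header records
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
        if rname == "*" or flag & 4:        # unmapped read (FLAG bit 0x4)
            continue
        if mapq >= min_mapq:
            counts[rname] += 1
    return counts

sam = [
    "@SQ\tSN:EBOV\tLN:18959",
    "r1\t0\tEBOV\t100\t60\t50M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tEBOV\t200\t10\t50M\t*\t0\t0\tACGT\tIIII",
]
print(count_reads_per_reference(sam))  # -> Counter({'EBOV': 1})
```

    In the DBMS approach, the same filter-and-group logic would be a single SQL statement evaluated by the column store rather than a Python loop.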

    Deep Symbolic Learning Architecture for Variant Calling in NGS

    The variant detection process (variant calling) is fundamental in bioinformatics, demanding maximum precision and reliability. This study examines an innovative strategy for integrating a traditional pipeline developed in-house with an advanced intelligent system (IS). Although the original pipeline already included tools based on traditional algorithms, it had limitations, particularly in the detection of rare or unknown variants. The IS was therefore introduced to provide an additional layer of analysis, capitalizing on deep and symbolic learning techniques to improve and enhance previous detections. The main technical challenge lay in interoperability. To overcome this, Nextflow, a scripting language designed to manage complex bioinformatics workflows, was employed. Nextflow facilitated communication and efficient data transfer between the original pipeline and the IS, guaranteeing compatibility and reproducibility. After the variant calling step of the original system, the results were transmitted to the IS, where a meticulous sequence of analyses was applied, from preprocessing to data fusion. As a result, an optimized set of variants was generated and integrated with the previous results. Variants corroborated by both tools were considered highly reliable, while discrepancies indicated areas for detailed investigation. The product of this integration advanced to subsequent stages of the pipeline, usually annotation or interpretation, contextualizing the variants from biological and clinical perspectives. This adaptation not only preserved the original functionality of the pipeline but also enhanced it with the IS, establishing a new standard for the variant calling process.
    This research offers a robust and efficient model for the detection and analysis of genomic variants, highlighting the promise and applicability of hybrid learning in bioinformatics. This study has been funded by the AIR Genomics project (file number CCTT3/20/SA/0003), through the 2020 call for R&D projects oriented to excellence and competitive improvement of the CCTT, by the Institute for Business Competitiveness of Castilla y León and FEDER funds.
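    The fusion step the abstract describes (variants corroborated by both the pipeline and the IS are high confidence, discrepancies are flagged for review) amounts to set operations over the two call sets. A minimal sketch, assuming variants are represented as (chromosome, position, ref, alt) tuples; the function name and representation are illustrative, not taken from the paper:

```python
def fuse_calls(pipeline_variants, is_variants):
    """Merge variant calls from the traditional pipeline and the
    intelligent system: agreement -> high confidence,
    disagreement -> flagged for detailed investigation."""
    pipeline_variants, is_variants = set(pipeline_variants), set(is_variants)
    return {
        # intersection: corroborated by both tools
        "high_confidence": sorted(pipeline_variants & is_variants),
        # symmetric difference: called by only one of the two
        "needs_review": sorted(pipeline_variants ^ is_variants),
    }

pipeline = [("chr1", 12345, "A", "G"), ("chr2", 500, "T", "C")]
intelligent = [("chr1", 12345, "A", "G"), ("chr7", 999, "G", "T")]
result = fuse_calls(pipeline, intelligent)
```

    In the described workflow this merged set would then flow to annotation and interpretation stages of the Nextflow pipeline.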

    Framing Apache Spark in life sciences

    Advances in high-throughput and digital technologies have required the adoption of big data approaches for handling complex tasks in life sciences. However, the drift to big data has led researchers to face technical and infrastructural challenges in storing, sharing, and analysing it. Such tasks require distributed computing systems and algorithms able to ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms that adapt the computation to the data over on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a very powerful engine for large-scale data processing on clusters. Thanks in part to specialised libraries for working with structured and relational data, it also supports machine learning, graph-based computation, and stream processing. This review article aims to help life sciences researchers ascertain the features of Apache Spark and assess whether it can be successfully used in their research activities.
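    Spark's core abstraction is a dataset split into partitions, transformed element-wise and combined with a two-stage reduction (local per-partition aggregation, then a final combine). A toy sketch of that model in plain Python, deliberately using none of the real pyspark API, only to illustrate why the computation parallelises:

```python
from functools import reduce

def parallelize(data, n_partitions=4):
    """Split a dataset into partitions, as Spark distributes it across a cluster."""
    return [data[i::n_partitions] for i in range(n_partitions)]

def map_partitions(partitions, fn):
    """Apply fn to every element; each partition could run on a different node."""
    return [[fn(x) for x in part] for part in partitions]

def reduce_partitions(partitions, fn, initial):
    """Two-stage aggregation: reduce locally per partition,
    then combine the partial results, mirroring Spark's shuffle-free reduce."""
    partials = [reduce(fn, part, initial) for part in partitions]
    return reduce(fn, partials, initial)

reads = parallelize(["ACGT", "ACGTACGT", "AC"], n_partitions=2)
lengths = map_partitions(reads, len)
total = reduce_partitions(lengths, lambda a, b: a + b, 0)
```

    The same total-base-count computation in real Spark would be a one-liner over an RDD; the point here is only that map and associative reduce never require the partitions to see each other's data until the final combine.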

    Computational pan-genomics: status, promises and challenges

    Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies, and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques, and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.
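    The string-to-graph shift the abstract highlights can be illustrated with a deliberately minimal sketch: instead of one reference string, each position holds the set of bases observed across the population, so alternative alleles become branches and every path through the graph is a haplotype. This toy structure is an assumption for illustration only and is far simpler than real variation-graph tools.

```python
def build_variation_graph(reference, variants):
    """Encode a reference plus single-nucleotide variants as a sequence
    graph: each position becomes a node labelled with its allowed bases."""
    graph = [{base} for base in reference]
    for pos, alt in variants:        # variants: (0-based position, alt base)
        graph[pos].add(alt)
    return graph

def n_paths(graph):
    """Number of distinct haplotypes the graph encodes: the product of
    the number of alleles at each position."""
    n = 1
    for node in graph:
        n *= len(node)
    return n

# a 4-base reference with SNVs at positions 1 and 3
graph = build_variation_graph("ACGT", [(1, "T"), (3, "A")])
```

    Even this tiny example shows why graphs scale where strings do not: two variants already encode four haplotypes without storing four separate sequences, and reads can be aligned to the shared backbone once rather than to every sample.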