
    Bi-Dimensional Binning for Big Genomic Datasets

    Binning the genome is used to parallelize big data operations on genomic regions. In this extended abstract, we comparatively evaluate the performance and scalability of Spark and SciDB implementations over datasets consisting of billions of genomic regions. In particular, we introduce an original method for binning the genome, i.e., partitioning it into small-size sections, and show that it outperforms the conventional binning used by SciDB and closes the gap between SciDB and a Spark-based implementation. The concept of bi-dimensional binning is new and can be extended to other systems and technologies.
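    The core idea of genome binning can be sketched in a few lines: each region is assigned to every fixed-size bin it overlaps, so that bins can be processed in parallel. This is a minimal illustrative sketch; the bin size and region coordinates are assumptions, not values from the paper.

    ```python
    BIN_SIZE = 10_000  # bin width in base pairs (illustrative)

    def bins_for_region(start, stop, bin_size=BIN_SIZE):
        """Return the indices of the bins a [start, stop) region overlaps."""
        return list(range(start // bin_size, (stop - 1) // bin_size + 1))

    # A region spanning a bin boundary is replicated into every bin it touches.
    print(bins_for_region(9_500, 10_500))  # overlaps bins 0 and 1
    ```

    Regions replicated across bin boundaries are the reason binning needs operation-specific correctness rules, as discussed in the entries below.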

    Framework for Supporting Genomic Operations

    Next Generation Sequencing (NGS) is a family of technologies for reading DNA or RNA, capable of producing whole-genome sequences at impressive speed and revolutionizing both biological research and medical practice. In this exciting scenario, while a huge number of specialized bioinformatics programs extract information from sequences, there is an increasing need for a new generation of systems and frameworks capable of integrating such information, providing holistic answers to the needs of biologists and clinicians. To respond to this need, we developed GMQL, a new query language for genomic data management that operates on heterogeneous genomic datasets. In this paper, we focus on three domain-specific GMQL operations used for the efficient processing of genomic regions and describe their efficient implementation; the paper develops a theory of binning strategies as a generic approach to the parallel execution of genomic operations, and then describes how binning is embedded into two efficient implementations of the operations using Flink and Spark, two emerging frameworks for data management on the cloud.
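    A binned overlap join can be sketched as follows: regions from both datasets are grouped by (chromosome, bin), pairs are tested within each bin, and a pair is emitted only in the bin containing the start of the overlap, so regions replicated across bins do not produce duplicate results. This is a pure-Python stand-in for the Flink/Spark grouping described above; the bin size, data, and correctness rule shown are illustrative of the general binning approach, not the paper's exact strategy.

    ```python
    from collections import defaultdict

    BIN = 100  # bin width (illustrative)

    def to_bins(region):
        chrom, start, stop = region
        for b in range(start // BIN, (stop - 1) // BIN + 1):
            yield (chrom, b), region

    def binned_join(left, right):
        # Group replicated regions by (chromosome, bin), as a cloud
        # framework would after a keyed shuffle.
        groups = defaultdict(lambda: ([], []))
        for r in left:
            for key, reg in to_bins(r):
                groups[key][0].append(reg)
        for r in right:
            for key, reg in to_bins(r):
                groups[key][1].append(reg)
        out = []
        for (chrom, b), (ls, rs) in groups.items():
            for (c1, s1, e1) in ls:
                for (c2, s2, e2) in rs:
                    lo, hi = max(s1, s2), min(e1, e2)
                    # overlap test + "overlap starts in this bin" rule,
                    # so each overlapping pair is emitted exactly once
                    if lo < hi and lo // BIN == b:
                        out.append(((c1, s1, e1), (c2, s2, e2)))
        return out

    left = [("chr1", 50, 250)]       # spans bins 0..2
    right = [("chr1", 150, 220)]     # spans bins 1..2
    print(binned_join(left, right))  # exactly one pair, despite replication
    ```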

    Scalable genomic data management system on the cloud

    Thanks to the huge amount of sequenced data that is becoming available, building scalable solutions for supporting query processing and data analysis over genomic datasets is increasingly important. This paper presents GDMS, a scalable Genomic Data Management System for querying region-based genomic datasets; the focus of the paper is on the deployment of the system on a cluster hosted by CINECA.

    Data management for heterogeneous genomic datasets

    Next Generation Sequencing (NGS), a family of technologies for reading DNA and RNA, is changing biological research, and will soon change medical practice, by quickly providing sequencing data and high-level features of numerous individual genomes in different biological and clinical conditions. The availability of millions of whole genome sequences may soon become the biggest and most important "big data" problem of mankind. In this exciting framework, we recently proposed a new paradigm to raise the level of abstraction in NGS data management, by introducing a GenoMetric Query Language (GMQL) and demonstrating its usefulness through several biological query examples. Building on that effort, here we motivate and formalize GMQL operations, especially focusing on the most characteristic and domain-specific ones. Furthermore, we address their efficient implementation, illustrate the architecture of the new software system that we have developed for their execution on big genomic data in a cloud computing environment, and provide an evaluation of its performance. The new system implementation is available for download at the GMQL website (http://www.bioinformatics.deib.polimi.it/GMQL/); GMQL can also be tested through a set of predefined queries on ENCODE and Roadmap Epigenomics data at http://www.bioinformatics.deib.polimi.it/GMQL/queries/
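    One of GMQL's characteristic domain-specific operations is MAP, which aggregates experiment regions over reference regions. Its semantics can be sketched in plain Python (ignoring binning and parallelism); the function name and data are illustrative, and GMQL itself expresses this declaratively rather than procedurally.

    ```python
    def genometric_map(references, experiments):
        """For each reference region, count the overlapping experiment regions."""
        result = []
        for (rc, rs, re) in references:
            n = sum(1 for (ec, es, ee) in experiments
                    if ec == rc and max(rs, es) < min(re, ee))
            result.append(((rc, rs, re), n))
        return result

    refs = [("chr1", 0, 100), ("chr1", 200, 300)]
    exps = [("chr1", 50, 120), ("chr1", 90, 95), ("chr1", 400, 500)]
    print(genometric_map(refs, exps))  # counts 2 and 0
    ```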

    Modeling and interoperability of heterogeneous genomic big data for integrative processing and querying

    While a huge amount of (epi)genomic data of multiple types is becoming available by using Next Generation Sequencing (NGS) technologies, the most important emerging problem is the so-called tertiary analysis, concerned with sense making, e.g., discovering how different (epi)genomic regions and their products interact and cooperate with each other. We propose a paradigm shift in tertiary analysis, based on the use of the Genomic Data Model (GDM), a simple data model which links genomic feature data to their associated experimental, biological and clinical metadata. GDM encompasses all the data formats that have been produced for feature extraction from (epi)genomic datasets. We specifically describe the mapping to GDM of the SAM (Sequence Alignment/Map), VCF (Variant Call Format), NARROWPEAK (for called peaks produced by NGS ChIP-seq or DNase-seq methods), and BED (Browser Extensible Data) formats, but GDM also supports all the formats describing experimental datasets (e.g., including copy number variations, DNA somatic mutations, or gene expressions) and annotations (e.g., regarding transcription start sites, genes, enhancers or CpG islands). We downloaded and integrated samples of all the above-mentioned data types and formats from multiple sources. The GDM is able to homogeneously describe semantically heterogeneous data and lays the ground for data interoperability, e.g., achieved through the GenoMetric Query Language (GMQL), a high-level, declarative query language for genomic big data. The combined use of the data model and the query language allows comprehensive processing of multiple heterogeneous data, and supports the development of domain-specific data-driven computations and bio-molecular knowledge discovery.
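    The core idea of GDM — coupling region data with free attribute-value metadata in each sample — can be sketched as two small record types. Field names and example values here are illustrative assumptions, not the model's formal definition.

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class Region:
        """A genomic region: coordinates plus format-specific features."""
        chrom: str
        start: int
        stop: int
        strand: str = "*"
        features: dict = field(default_factory=dict)  # e.g. peak score, p-value

    @dataclass
    class Sample:
        """A GDM-style sample: region data linked to attribute-value metadata."""
        regions: list
        metadata: dict  # experimental, biological and clinical attributes

    peak = Region("chr1", 11_873, 12_227, "+", {"score": 850.0})
    s = Sample([peak], {"assay": "ChIP-seq", "cell_line": "K562"})
    print(s.metadata["assay"], len(s.regions))
    ```

    Under this scheme, heterogeneous formats such as BED or NARROWPEAK differ only in which keys populate each region's `features` dictionary, which is what makes the model homogeneous across formats.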

    Optimal Binning for Genomics

    Genome sequencing is expected to be the most prolific source of big data in the next decade; millions of whole genome datasets will open new opportunities for biological research and personalized medicine. Genome sequences are abstracted in the form of interesting regions, describing abnormalities of the genome. The parallel execution on the cloud of complex operations for joining and mapping billions of genomic regions is increasingly important. Genome binning, i.e., partitioning of the genome into small-size segments, adapts classic data partitioning methods to genomics; region distributions to bins must reflect operation-specific correctness rules. As a consequence, determining the optimal bin size for such operations is a complex mathematical problem, whose solution requires careful modeling. The main result of this paper is the mathematical formulation and solution of the optimal binning problem for join and map operations in the context of GMQL, a query language over genomic regions; the model is validated by experiments showing its accuracy and sensitivity to the variation of operations' parameters. We also optimize sequences of operations by inheriting the binning between two consecutive operations, and we show the deployment of GMQL and the tuning of the proposed model on different cloud computing systems.
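    The tension that makes bin size an optimization problem can be illustrated with a toy cost model: smaller bins increase parallelism but replicate more regions across bin boundaries, while larger bins reduce replication but grow per-bin join work. The model below is an assumption for illustration only — it is not the paper's actual formulation, and all constants are made up.

    ```python
    def expected_replication(avg_len, bin_size):
        # A region of average length L overlaps roughly 1 + L/b bins of size b.
        return 1 + avg_len / bin_size

    def cost(n_regions, avg_len, bin_size, genome_len):
        """Toy cost: replication overhead + quadratic per-bin join work."""
        n_bins = genome_len / bin_size
        per_bin = n_regions * expected_replication(avg_len, bin_size) / n_bins
        return (n_regions * expected_replication(avg_len, bin_size)
                + n_bins * per_bin ** 2)

    # Sweep candidate bin sizes for 10^6 regions of average length 500 bp
    # on a 3 Gbp genome (all illustrative figures).
    candidates = [10 ** k for k in range(2, 7)]
    best = min(candidates, key=lambda b: cost(10 ** 6, 500, b, 3 * 10 ** 9))
    print(best)
    ```

    Even this crude model yields an interior optimum rather than "smallest bins win", which is the qualitative behavior that motivates the careful modeling described in the abstract.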

    Evaluating cloud frameworks on genomic applications

    We are developing a new, holistic data management system for genomics, which uses cloud-based computing for querying thousands of heterogeneous genomic datasets. In our project, it is essential to leverage a modern cloud computing framework, so as to encode our query expressions into high-level operations provided by the framework. After releasing our first implementation using Pig and Hadoop 1, we are currently targeting Spark and Flink, two emerging frameworks for general-purpose big data analytics. While Spark appears to have a stronger critical mass, Flink supports high-level optimization for data management operations; both systems appear suited to support our domain-specific data management operations. In this paper, we focus on a comparison of the two frameworks at work, based upon three typical genomic applications stemming from our data management requirements; we describe the coding of the genomic applications using Flink and Spark, discuss their common aspects and differences, and comparatively evaluate the performance and scalability of the implementations over datasets consisting of billions of genomic regions.

    Evaluating genomic big data operations on SciDB and spark

    We are developing a new, holistic data management system for genomics, which provides high-level abstractions for querying large genomic datasets. We designed our system so that it leverages data management engines for low-level data access. Such a design can be adapted to two different kinds of data engines: the family of scientific databases (among them, SciDB) and the broader family of generic platforms (among them, Spark). Trade-offs are not obvious; scientific databases are expected to outperform generic platforms when they use features which are embedded within their specialized design, but generic platforms are expected to outperform scientific databases on general-purpose operations. In this paper, we compare our SciDB and Spark implementations at work on genomic abstractions. We use four typical genomic operations as benchmark, stemming from the concrete requirements of our project and encoded using SciDB and Spark; we discuss their common aspects and differences, specifically discussing how genomic regions and operations can be expressed using SciDB arrays. We comparatively evaluate the performance and scalability of the two implementations over datasets consisting of billions of genomic regions.
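    The mapping of regions onto array cells can be sketched as follows: a region becomes one cell per (chromosome, bin) coordinate pair, carrying its coordinates and features as attributes, in the spirit of an array schema such as `<start, stop, score>[chrom, bin]`. This is a plain-Python illustration of the idea, not the paper's actual SciDB schema; bin size, chromosome numbering, and attributes are assumptions.

    ```python
    BIN = 10_000                       # array chunking granularity (illustrative)
    CHROMS = {"chr1": 0, "chr2": 1}    # chromosome names -> integer dimension

    def to_cells(chrom, start, stop, score):
        """Yield one array cell per (chrom, bin) coordinate the region covers."""
        c = CHROMS[chrom]
        for b in range(start // BIN, (stop - 1) // BIN + 1):
            yield (c, b), (start, stop, score)

    # A region spanning three bins occupies three cells of the sparse array.
    array = dict(to_cells("chr1", 9_000, 21_000, 0.9))
    print(sorted(array))  # cells (0, 0), (0, 1), (0, 2)
    ```

    Indexing by (chromosome, bin) is what lets an array engine restrict a region join to co-located chunks, mirroring the binning strategy used in the Spark implementation.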