
    Analysing sequencing data in Hadoop: The road to interactivity via SQL

    Analysis of high volumes of data has long been performed with distributed computing on computer clusters, but rapidly growing data volumes in fields such as DNA sequencing demand new approaches to data analysis. Warehouse-scale computing environments with up to tens of thousands of networked nodes may be necessary to solve future Big Data problems in sequencing data analysis, and utilizing such systems effectively requires specialized software. Hadoop is a collection of software built specifically for Big Data processing, with a core consisting of the Hadoop MapReduce scalable distributed computing platform and the Hadoop Distributed File System, HDFS. This work explains the principles underlying Hadoop MapReduce and HDFS as well as certain prominent higher-level interfaces to them: Pig, Hive, and HBase. An overview of the current state of Hadoop usage in bioinformatics is then provided alongside brief introductions to the Hadoop-BAM and SeqPig projects of the author and his colleagues. Data analysis tasks are often performed interactively, exploring the data sets at hand in order to become familiar with them in preparation for well-targeted long-running computations. Hadoop MapReduce is optimized for throughput rather than latency, making it a poor fit for interactive use. This thesis presents two high-level alternatives designed especially with interactive data analysis in mind: Shark and Impala, both of which are Hive-compatible SQL-based systems. Aside from the computational framework used, the format in which the data sets are stored can greatly affect analytical performance. Thus new file formats are being developed to better cope with the needs of modern and future Big Data sets. This work analyses the current state-of-the-art storage formats used in the worlds of bioinformatics and Hadoop. Finally, the thesis presents the results of experiments performed by the author with the goal of understanding how well the landscape of available frameworks and storage formats can tackle interactive sequencing data analysis tasks.
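
    To make the interactive-SQL idea concrete, the sketch below issues a simple aggregation query through the HiveServer2 JDBC interface that Hive-compatible systems such as Shark and Impala expose. The connection URL, the alignments table, and its chrom column are illustrative assumptions, not details taken from the thesis.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        // Minimal sketch: an interactive read-count query over a hypothetical
        // "alignments" table (one row per aligned read, with a "chrom" column),
        // run against a HiveServer2-compatible endpoint. Host, port, and
        // schema are assumptions; adjust for the actual deployment.
        public class ReadCountsByChromosome {
            public static void main(String[] args) throws Exception {
                Class.forName("org.apache.hive.jdbc.HiveDriver");
                try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "", "");
                     Statement stmt = conn.createStatement();
                     ResultSet rs = stmt.executeQuery(
                         "SELECT chrom, COUNT(*) AS reads "
                         + "FROM alignments GROUP BY chrom ORDER BY chrom")) {
                    while (rs.next()) {
                        System.out.printf("%s\t%d%n", rs.getString(1), rs.getLong(2));
                    }
                }
            }
        }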

    Scripting for large-scale sequencing based on Hadoop

    The large volumes of data generated by modern sequencing experiments present significant challenges for their manipulation and analysis, and traditional approaches often prove difficult to scale. We describe our ongoing work on SeqPig, a tool that facilitates the use of the Pig Latin distributed scripting language to manipulate, analyze and query sequencing data, applying the advances motivated by the “big data revolution” in data-intensive computing. SeqPig provides access to popular data formats and implements a number of custom sequencing-specific functions. Most importantly, it grants users access to the scalable Hadoop platform from a high-level scripting language.
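
    As an illustration of the kind of script SeqPig enables, the sketch below drives Pig from Java to count reads per reference sequence in a BAM file. The jar location, the BamLoader class name, its argument, and the refname field are assumptions based on SeqPig's documentation and may differ between versions.

        import java.util.Iterator;
        import org.apache.pig.PigServer;
        import org.apache.pig.data.Tuple;

        // Minimal sketch of driving a SeqPig-style Pig Latin script from Java.
        // The loader class and field names below are assumptions; check them
        // against the installed SeqPig release.
        public class SeqPigReadCounts {
            public static void main(String[] args) throws Exception {
                PigServer pig = new PigServer("mapreduce");
                pig.registerJar("seqpig.jar"); // assumed location of the SeqPig UDF jar
                // Load BAM records as tuples and count reads per reference sequence.
                pig.registerQuery("reads = LOAD 'input.bam' "
                    + "USING fi.aalto.seqpig.io.BamLoader('yes');");
                pig.registerQuery("by_ref = GROUP reads BY refname;");
                pig.registerQuery("counts = FOREACH by_ref GENERATE group, COUNT(reads);");
                Iterator<Tuple> it = pig.openIterator("counts");
                while (it.hasNext()) {
                    System.out.println(it.next());
                }
            }
        }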

    Hadoop-BAM: directly manipulating next generation sequencing data in the cloud

    Summary: Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can operate directly on BAM records. It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are expected to be easily convertible to support large-scale distributed processing. In this article we demonstrate the use of Hadoop-BAM by building a coverage-summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability, and that one should avoid moving data in and out of Hadoop between analysis steps.
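
    A minimal sketch of the map/reduce style the abstract describes: the mapper receives BAM records directly from Hadoop-BAM's input format and the reducer aggregates per-reference read counts. Package and class names follow the later org.seqdoop releases of Hadoop-BAM (the original library used the fi.tkk.ics.hadoop.bam package), so treat them as assumptions to verify against the installed version.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.seqdoop.hadoop_bam.BAMInputFormat;
        import org.seqdoop.hadoop_bam.SAMRecordWritable;

        // Read-count-per-reference job: map and reduce functions operate
        // directly on BAM records supplied by Hadoop-BAM.
        public class BamReadCounts {

            public static class ReadMapper
                    extends Mapper<LongWritable, SAMRecordWritable, Text, LongWritable> {
                private static final LongWritable ONE = new LongWritable(1);
                private final Text ref = new Text();

                @Override
                protected void map(LongWritable key, SAMRecordWritable value, Context ctx)
                        throws IOException, InterruptedException {
                    // SAMRecordWritable wraps a Picard/htsjdk SAMRecord.
                    if (value.get().getReadUnmappedFlag()) return; // skip unmapped reads
                    ref.set(value.get().getReferenceName());
                    ctx.write(ref, ONE);
                }
            }

            public static class SumReducer
                    extends Reducer<Text, LongWritable, Text, LongWritable> {
                @Override
                protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                        throws IOException, InterruptedException {
                    long sum = 0;
                    for (LongWritable v : values) sum += v.get();
                    ctx.write(key, new LongWritable(sum));
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "bam-read-counts");
                job.setJarByClass(BamReadCounts.class);
                job.setInputFormatClass(BAMInputFormat.class); // splits BAM files across workers
                job.setMapperClass(ReadMapper.class);
                job.setCombinerClass(SumReducer.class);
                job.setReducerClass(SumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(LongWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }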

    Tarmo: A Framework for Parallelized Bounded Model Checking

    This paper investigates approaches to parallelizing Bounded Model Checking (BMC) for shared-memory environments as well as for clusters of workstations. We present Tarmo, a generic framework for parallelized BMC. Our framework can be used with any incremental SAT encoding for BMC, but for the results in this paper we use only the current state-of-the-art encoding for full PLTL. Using this encoding allows us to check both safety and liveness properties, in contrast to earlier work on distributing BMC that is limited to safety properties only. Despite our focus on BMC after it has been translated to SAT, existing distributed SAT solvers are not well suited to our application. This is because solving a BMC problem does not mean solving a set of independent SAT instances, but rather involves solving multiple related SAT instances, encoded incrementally, where the satisfiability of each instance corresponds to the existence of a counterexample of a specific length. Our framework includes a generic architecture for a shared clause database that allows easy clause sharing between SAT solver threads solving various such instances. We present extensive experimental results obtained with multiple variants of our Tarmo implementation. Our shared-memory variants perform significantly better than conventional single-threaded approaches, a result that many users can benefit from as multi-core and multi-processor technology is widely available. Furthermore, we demonstrate that our framework can be deployed in a typical cluster of workstations, where several multi-core machines are connected by a network.
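
    The shared-clause-database idea can be illustrated with a small sketch: solver threads publish learned clauses to a common append-only store and periodically import clauses published by the others. This is a minimal illustration of the architecture only, not Tarmo's implementation, which in addition has to track for which of the incrementally encoded instances a shared clause remains valid.

        import java.util.List;
        import java.util.concurrent.CopyOnWriteArrayList;

        // Minimal sketch of a shared clause database for SAT solver threads.
        // CopyOnWriteArrayList keeps indexed reads safe alongside concurrent
        // appends; a production version would use a more efficient structure
        // and filter clauses by the instances they remain valid for.
        public class SharedClauseDb {
            // Clauses are literal arrays in DIMACS convention, e.g. {3, -7, 12}.
            private final List<int[]> clauses = new CopyOnWriteArrayList<>();

            // Called by a solver thread after it learns a clause.
            public void publish(int[] clause) {
                clauses.add(clause.clone());
            }

            // Called periodically by a solver thread; cursor[0] records how far
            // this thread has already read, so each clause is imported once.
            public int importNew(int[] cursor, ClauseSink sink) {
                int imported = 0;
                for (int n = clauses.size(); cursor[0] < n; cursor[0]++, imported++) {
                    sink.addClause(clauses.get(cursor[0]));
                }
                return imported;
            }

            // Callback into the thread-local SAT solver (hypothetical interface).
            public interface ClauseSink {
                void addClause(int[] literals);
            }
        }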