Metagenome assembly is the process of transforming a set of short,
overlapping, and potentially erroneous DNA segments from environmental samples
into the accurate representation of the underlying microbiomes's genomes.
State-of-the-art tools require big shared memory machines and cannot handle
contemporary metagenome datasets that exceed Terabytes in size. In this paper,
we introduce the MetaHipMer pipeline, a high-quality and high-performance
metagenome assembler that employs an iterative de Bruijn graph approach.
MetaHipMer leverages a specialized scaffolding algorithm that produces long
scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is
end-to-end parallelized using the Unified Parallel C language and therefore can
run seamlessly on shared and distributed-memory systems. Experimental results
show that MetaHipMer matches or outperforms the state-of-the-art tools in terms
of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and
is able to assemble previously intractable grand challenge metagenomes. We
demonstrate the unprecedented capability of MetaHipMer by computing the first
full assembly of the Twitchell Wetlands dataset, consisting of 7.5 billion
reads - size 2.6 TBytes.Comment: Accepted to SC1

Arndt, Bill

Buluc, Aydin

Egan, Rob

Georganas, Evangelos

Goltsman, Eugene

Hofmeyr, Steven

Oliker, Leonid

Tritt, Andrew

Yelick, Katherine

English

arXiv

Crossref

Extreme Scale De Novo Metagenome Assembly

Metagenome assembly is the process of transforming a set of short, overlapping, and potentially erroneous DNA segments from environmental samples into the accurate representation of the underlying microbiomes's genomes. State-of-the-art tools require big shared memory machines and cannot handle contemporary metagenome datasets that exceed Terabytes in size. In this paper, we introduce the MetaHipMer pipeline, a high-quality and high-performance metagenome assembler that employs an iterative de Bruijn graph approach. MetaHipMer leverages a specialized scaffolding algorithm that produces long scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is end-to-end parallelized using the Unified Parallel C language and therefore can run seamlessly on shared and distributed-memory systems. Experimental results show that MetaHipMer matches or outperforms the state-of-the-art tools in terms of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and is able to assemble previously intractable grand challenge metagenomes. We demonstrate the unprecedented capability of MetaHipMer by computing the first full assembly of the Twitchell Wetlands dataset, consisting of 7.5 billion reads – size 2.6 TBytes

Hofrneyr, Steven

Buluç, Aydin

eScholarship - University of California

Metagenome assembly is the process of transforming a set of short, overlapping, and potentially erroneous DNA segments from environmental samples into the accurate representation of the underlying microbiomes's genomes. State-of-the-art tools require big shared memory machines and cannot handle contemporary metagenome datasets that exceed Terabytes in size. In this paper, we introduce the MetaHipMer pipeline, a high-quality and high-performance metagenome assembler that employs an iterative de Bruijn graph approach. MetaHipMer leverages a specialized scaffolding algorithm that produces long scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is end-to-end parallelized using the Unified Parallel C language and therefore can run seamlessly on shared and distributed-memory systems. Experimental results show that MetaHipMer matches or outperforms the state-of-the-art tools in terms of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and is able to assemble previously intractable grand challenge metagenomes. We demonstrate the unprecedented capability of MetaHipMer by computing the first full assembly of the Twitchell Wetlands dataset, consisting of 7.5 billion reads - size 2.6 TBytes

Hofmeyr, Steven A

Yelick, Katherine A

Extreme scale de novo metagenome assembly.

From the Foreword: “The authors of the chapters in this book are the pioneers who will explore the exascale frontier. The path forward will not be easy… These authors, along with their colleagues who will produce these powerful computer systems will, with dedication and determination, overcome the scalability problem, discover the new algorithms needed to achieve exascale performance for the broad range of applications that they represent, and create the new tools needed to support the development of scalable and portable science and engineering applications. Although the focus is on exascale computers, the benefits will permeate all of science and engineering because the technologies developed for the exascale computers of tomorrow will also power the petascale servers and terascale workstations of tomorrow. These affordable computing capabilities will empower scientists and engineers everywhere.” - Thom H. Dunning, Jr., Pacific Northwest National Laboratory and University of Washington, Seattle, Washington, USA “This comprehensive summary of applications targeting Exascale at the three DoE labs is a must read.” - Rio Yokota, Tokyo Institute of Technology, Tokyo, Japan “Numerical simulation is now a need in many fields of science, technology, and industry. The complexity of the simulated systems coupled with the massive use of data makes HPC essential to move towards predictive simulations. Advances in computer architecture have so far permitted scientific advances, but at the cost of continually adapting algorithms and applications. The next technological breakthroughs force us to rethink the applications by taking energy consumption into account. These profound modifications require not only anticipation and sharing but also a paradigm shift in application design to ensure the sustainability of developments by guaranteeing a certain independence of the applications to the profound modifications of the architectures: it is the passage from optimal performance to the portability of performance. It is the challenge of this book to demonstrate by example the approach that one can adopt for the development of applications offering performance portability in spite of the profound changes of the computing architectures.” - Christophe Calvin, CEA, Fundamental Research Division, Saclay, France “Three editors, one from each of the High Performance Computer Centers at Lawrence Berkeley, Argonne, and Oak Ridge National Laboratories, have compiled a very useful set of chapters aimed at describing software developments for the next generation exa-scale computers. Such a book is needed for scientists and engineers to see where the field is going and how they will be able to exploit such architectures for their own work. The book will also benefit students as it provides insights into how to develop software for such computer architectures. Overall, this book fills an important need in showing how to design and implement algorithms for exa-scale architectures which are heterogeneous and have unique memory systems. The book discusses issues with developing user codes for these architectures and how to address these issues including actual coding examples.” - Dr. David A. Dixon, Robert Ramsay Chair, The University of Alabama, Tuscaloosa, Alabama, USA

Extreme Scale De Novo Metagenome Assembly

Abstract

Similar works

Full text

Available Versions

Crossref

eScholarship - University of California

eScholarship - University of California

eScholarship - University of California