473 research outputs found
Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions
Nanopore sequencing technology has the potential to render other sequencing
technologies obsolete with its ability to generate long reads and provide
portability. However, high error rates of the technology pose a challenge while
generating accurate genome assemblies. The tools used for nanopore sequence
analysis are of critical importance as they should overcome the high error
rates of the technology. Our goal in this work is to comprehensively analyze
current publicly available tools for nanopore sequence analysis to understand
their advantages, disadvantages, and performance bottlenecks. It is important
to understand where the current tools do not perform well to develop better
tools. To this end, we 1) analyze the multiple steps and the associated tools
in the genome assembly pipeline using nanopore sequence data, and 2) provide
guidelines for determining the appropriate tools for each step. We analyze
various combinations of different tools and expose the tradeoffs between
accuracy, performance, memory usage and scalability. We conclude that our
observations can guide researchers and practitioners in making conscious and
effective choices for each step of the genome assembly pipeline using nanopore
sequence data. Also, with the help of bottlenecks we have found, developers can
improve the current tools or build new ones that are both accurate and fast, in
order to overcome the high error rates of the nanopore sequencing technology.Comment: To appear in Briefings in Bioinformatics (BIB), 201
DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Integrated data analysis (IDA) pipelines—that combine data management (DM) and query processing, high-performance computing
(HPC), and machine learning (ML) training and scoring—become
increasingly common in practice. Interestingly, systems of these
areas share many compilation and runtime techniques, and the
used—increasingly heterogeneous—hardware infrastructure converges as well. Yet, the programming paradigms, cluster resource
management, data formats and representations, as well as execution
strategies differ substantially. DAPHNE is an open and extensible
system infrastructure for such IDA pipelines, including language abstractions, compilation and runtime techniques, multi-level scheduling, hardware (HW) accelerators, and computational storage for
increasing productivity and eliminating unnecessary overheads. In
this paper, we make a case for IDA pipelines, describe the overall
DAPHNE system architecture, its key components, and the design
of a vectorized execution engine for computational storage, HW
accelerators, as well as local and distributed operations. Preliminary experiments that compare DAPHNE with MonetDB, Pandas,
DuckDB, and TensorFlow show promising results
Recommended from our members
Theory and practice of classical matrix-matrix multiplication for hierarchical memory architectures
Matrix-matrix multiplication is perhaps the most important operation used as a
basic building block in dense linear algebra. A computer with a hierarchical memory architectures has memory that is organized in layers, with small and fast memories close to the processor, and big and slow memories further away from it.
Classical matrix-matrix multiplication is an operation particularly suited for such architectures, as it exhibits a large degree of data reuse, so expensive data movements can be amortized over a lot of computation. This dissertation advances the theory of
how to optimally reuse data during matrix-matrix multiplication on hierarchical memory architectures, and it uses this understanding to develop new practical algorithms for matrix-matrix multiplication that exhibit improved properties related to data movement.Computer Science
- …