Genetic Sequence Matching Using D4M Big Data Approaches
Recent technological advances in Next Generation Sequencing tools have led to
increasing speeds of DNA sample collection, preparation, and sequencing. One
instrument can produce over 600 Gb of genetic sequence data in a single run.
This creates new opportunities to efficiently handle the increasing workload.
We propose a new method of fast genetic sequence analysis using the Dynamic
Distributed Dimensional Data Model (D4M) - an associative array environment for
MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and
statistical properties, the method leverages big data techniques and the
implementation of an Apache Accumulo database to accelerate computations
one hundredfold over other methods. Comparisons of the D4M method with the
current gold standard for sequence analysis, BLAST, show the two are comparable
in the alignments they find. This paper will present an overview of the D4M
genetic sequence algorithm and statistical comparisons with BLAST.
Comment: 6 pages; to appear in IEEE High Performance Extreme Computing (HPEC)
201
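As a rough illustration of the associative-array idea in this abstract, the sketch below matches sequences by the short words (k-mers) they share. It is not the D4M implementation and does not use MATLAB or Accumulo: the function names, the toy sequences, and the word length `k=4` are all assumptions made for the example. Counting shared words per pair is equivalent to a sparse matrix product in which each matrix has one row per sequence and one column per k-mer.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(sequences, k):
    """Map each k-mer to the ids of the sequences that contain it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for word in kmers(seq, k):
            index[word].add(seq_id)
    return index


def shared_kmer_counts(samples, references, k):
    """Count distinct shared k-mers for every (sample, reference) pair.

    This plays the role of the sparse product A_sample * A_reference^T,
    storing only the nonzero entries.
    """
    ref_index = build_index(references, k)
    counts = defaultdict(int)
    for sample_id, seq in samples.items():
        for word in kmers(seq, k):
            for ref_id in ref_index.get(word, ()):
                counts[(sample_id, ref_id)] += 1
    return dict(counts)


# Hypothetical toy data, chosen only to exercise the functions.
references = {"ref1": "ACGTACGTGG", "ref2": "TTTTGGGGCC"}
samples = {"s1": "CGTACGTG"}
matches = shared_kmer_counts(samples, references, k=4)
```

Pairs with a high shared-word count would be candidates for a closer alignment check; pairs with no shared words never appear in the result, which is what keeps the computation sparse.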
Rapid Sequence Identification of Potential Pathogens Using Techniques from Sparse Linear Algebra
The decreasing costs and increasing speed and accuracy of DNA sample
collection, preparation, and sequencing have rapidly produced an enormous volume
of genetic data. However, fast and accurate analysis of the samples remains a
bottleneck. Here we present DRAGenS, a genetic sequence identification
algorithm that exhibits the Big Data handling and computational power of the
Dynamic Distributed Dimensional Data Model (D4M). The method leverages linear
algebra and statistical properties to increase computational performance while
retaining accuracy by subsampling the data. Two run modes, Fast and Wise, yield
speed and precision tradeoffs, with applications in biodefense and medical
diagnostics. The DRAGenS analysis algorithm is tested over several
datasets, including three utilized for the Defense Threat Reduction Agency
(DTRA) metagenomic algorithm contest
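The abstract does not say how DRAGenS subsamples its data, so the following is only a generic sketch of the trade-off it describes: checking a random subset of a sample's k-mers against a reference gives a cheaper, noisier estimate of the match fraction. The function names, the parameters `k` and `n_draws`, and the fixed seed are assumptions for illustration.

```python
import random


def kmer_list(seq, k):
    """Return every length-k substring of seq, in order, duplicates included."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def estimate_match_fraction(sample_seq, reference_seq, k, n_draws, seed=0):
    """Estimate the fraction of sample k-mers present in the reference.

    Only n_draws randomly chosen k-mers are checked, so fewer draws mean
    less work and a noisier estimate.
    """
    words = kmer_list(sample_seq, k)
    if not words:
        raise ValueError("sample sequence is shorter than k")
    reference_words = set(kmer_list(reference_seq, k))
    rng = random.Random(seed)
    draws = rng.sample(words, min(n_draws, len(words)))
    hits = sum(1 for word in draws if word in reference_words)
    return hits / len(draws)


# Hypothetical toy sequences.
same = estimate_match_fraction("ACGTACGT", "ACGTACGT", k=3, n_draws=4)
different = estimate_match_fraction("AAAAAAA", "CCCCCCC", k=3, n_draws=4)
```

Raising `n_draws` moves the estimate toward the exact fraction at a higher cost, which is the kind of speed-versus-precision choice the abstract attributes to its two run modes.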
Frustration in Biomolecules
Biomolecules are the prime information processing elements of living matter.
Most of these inanimate systems are polymers that compute their structures and
dynamics using as input seemingly random character strings of their sequence,
following which they coalesce and perform integrated cellular functions. In
large computational systems with finite interaction codes, the appearance of
conflicting goals is inevitable. Simple conflicting forces can lead to quite
complex structures and behaviors, leading to the concept of "frustration" in
condensed matter. We present here some basic ideas about frustration in
biomolecules and how the frustration concept leads to a better appreciation of
many aspects of the architecture of biomolecules, and how structure connects to
function. These ideas are simultaneously both seductively simple and perilously
subtle to grasp completely. The energy landscape theory of protein folding
provides a framework for quantifying frustration in large systems and has been
implemented at many levels of description. We first review the notion of
frustration from the areas of abstract logic and its uses in simple condensed
matter systems. We then discuss how the frustration concept applies
specifically to heteropolymers, testing folding landscape theory in computer
simulations of protein models and in experimentally accessible systems.
Studying the aspects of frustration averaged over many proteins provides ways
to infer energy functions useful for reliable structure prediction. We discuss
how frustration affects folding, how a large part of the biological functions
of proteins are related to subtle local frustration effects and how frustration
influences the appearance of metastable states, the nature of binding
processes, catalysis and allosteric transitions. We hope to illustrate how
frustration is a fundamental concept in relating function to structural
biology.
Comment: 97 pages, 30 figures
Protein Structure Prediction Using Basin-Hopping
Associative memory Hamiltonian structure prediction potentials are not overly
rugged, thereby suggesting their landscapes are like those of actual proteins.
In the present contribution we show how basin-hopping global optimization can
identify low-lying minima for the corresponding mildly frustrated energy
landscapes. For small systems the basin-hopping algorithm succeeds in locating
both lower minima and conformations closer to the experimental structure than
does molecular dynamics with simulated annealing. For large systems the
efficiency of basin-hopping decreases for our initial implementation, where the
steps consist of random perturbations to the Cartesian coordinates. We
implemented umbrella sampling using basin-hopping to further confirm when the
global minima are reached. We have also improved the energy surface by
employing bioinformatic techniques for reducing the roughness or variance of
the energy surface. Finally, the basin-hopping calculations have guided
improvements in the excluded volume of the Hamiltonian, producing better
structures. These results suggest a novel and transferable optimization scheme
for future energy function development
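The basin-hopping scheme mentioned in this abstract can be sketched in a few lines. The code below is a minimal version under stated assumptions, not the authors' implementation: steps are random perturbations of the coordinates (as the abstract describes for the initial implementation), each perturbed point is relaxed with a simple derivative-free compass search, and the relaxed energies are accepted or rejected with a Metropolis test. The one-dimensional double-well function stands in for a real energy surface, and all names and parameter values are hypothetical.

```python
import math
import random


def local_minimize(energy, x, step=0.5, tol=1e-6):
    """Compass search: try +/- step on each coordinate, halve step when stuck."""
    x = list(x)
    e = energy(x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[i] += delta
                e_trial = energy(trial)
                if e_trial < e:
                    x, e, improved = trial, e_trial, True
        if not improved:
            step *= 0.5
    return x, e


def basin_hopping(energy, x0, n_hops=200, max_move=1.5, temperature=1.0, seed=0):
    """Perturb, relax to a local minimum, then apply a Metropolis test."""
    rng = random.Random(seed)
    x, e = local_minimize(energy, x0)
    best_x, best_e = x, e
    for _ in range(n_hops):
        trial = [xi + rng.uniform(-max_move, max_move) for xi in x]
        x_new, e_new = local_minimize(energy, trial)
        # Downhill moves are always accepted; uphill moves sometimes.
        if e_new < e or rng.random() < math.exp(-(e_new - e) / temperature):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


def double_well(x):
    """Toy energy with a shallow minimum at x > 0 and a deeper one at x < 0."""
    return x[0] ** 4 - 4 * x[0] ** 2 + x[0]


# Start in the shallow well; the search should end in the deeper one.
best_x, best_e = basin_hopping(double_well, [1.3])
```

Because only relaxed energies are compared, the search effectively moves on a staircase of local minima rather than on the raw surface, which is why barriers between wells matter less than they do in plain molecular dynamics.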