The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing.
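To make the two named patterns concrete, the following minimal sketch counts k-mers with a hash table and then sorts them by frequency. It is not taken from the paper: the function names `count_kmers` and `sorted_kmers`, the toy reads, and the choice k = 3 are assumptions made only for illustration.

```python
from collections import Counter

def count_kmers(reads, k):
    """Hashing step: count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def sorted_kmers(counts):
    """Sorting step: order k-mers by descending count, ties alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy input (hypothetical reads, not real data).
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, 3)
ranked = sorted_kmers(counts)
```

In a distributed setting the hash table would be partitioned across nodes, which is where the irregular, asynchronous updates mentioned in the abstract would arise; this single-process sketch only shows the access pattern.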
Statistical modeling of RNA structure profiling experiments enables parsimonious reconstruction of structure landscapes.
RNA plays key regulatory roles in diverse cellular processes, where its functionality often derives from folding into and converting between structures. Many RNAs further rely on co-existence of alternative structures, which govern their response to cellular signals. However, characterizing heterogeneous landscapes is difficult, both experimentally and computationally. Recently, structure profiling experiments have emerged as powerful and affordable structure characterization methods, which improve computational structure prediction. To date, efforts have centered on predicting one optimal structure, with much less progress made on multiple-structure prediction. Here, we report a probabilistic modeling approach that predicts a parsimonious set of co-existing structures and estimates their abundances from structure profiling data. We demonstrate robust landscape reconstruction and quantitative insights into structural dynamics by analyzing numerous data sets. This work establishes a framework for data-directed characterization of structure landscapes to aid experimentalists in performing structure-function studies.
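The idea of estimating abundances from a profile can be illustrated with a deliberately simplified toy model. The sketch below assumes (this is not stated in the abstract) that the observed profile is a linear mixture of two idealized per-structure profiles, with unpaired positions scoring 1 and paired positions scoring 0, and recovers the mixing weight by least squares. The function names and the dot-bracket inputs are hypothetical; the paper's actual probabilistic model is not reproduced here.

```python
def ideal_profile(dot_bracket):
    """Idealized reactivity: 1.0 for unpaired ('.'), 0.0 for paired positions."""
    return [1.0 if c == "." else 0.0 for c in dot_bracket]

def mix(w, prof_a, prof_b):
    """Forward model (assumed): profile of a population with fraction w in structure A."""
    return [w * a + (1.0 - w) * b for a, b in zip(prof_a, prof_b)]

def estimate_abundance(observed, prof_a, prof_b):
    """Least-squares estimate of w in observed ~ w*A + (1-w)*B, clipped to [0, 1]."""
    diff = [a - b for a, b in zip(prof_a, prof_b)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        raise ValueError("the two profiles are identical; w is not identifiable")
    num = sum((y - b) * d for y, b, d in zip(observed, prof_b, diff))
    return min(1.0, max(0.0, num / denom))

# Two hypothetical structures of a 6-nucleotide RNA.
hairpin = ideal_profile("((..))")
unfolded = ideal_profile("......")
observed = mix(0.75, hairpin, unfolded)
w = estimate_abundance(observed, hairpin, unfolded)
```

With more than two candidate structures the same forward model leads to a constrained regression over a weight vector, and a sparsity preference over that vector is one way to express parsimony.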
Selection of sequence motifs and generative Hopfield-Potts models for protein families
Statistical models for families of evolutionarily related proteins have
recently gained interest: in particular pairwise Potts models, such as those
inferred by the Direct-Coupling Analysis, have been able to extract information
about the three-dimensional structure of folded proteins, and about the effect
of amino-acid substitutions in proteins. These models are typically required
to reproduce the one- and two-point statistics of the amino-acid usage in a
protein family, i.e. to capture the so-called residue conservation and
covariation statistics of proteins of common evolutionary origin. Pairwise
Potts models are the maximum-entropy models achieving this. Although
successful, these models depend on huge numbers of ad hoc
parameters, which have to be estimated from a finite amount of data and whose
biophysical interpretation remains unclear. Here we propose an approach to
parameter reduction, which is based on selecting collective sequence motifs. It
naturally leads to the formulation of statistical sequence models in terms of
Hopfield-Potts models. These models can be accurately inferred using a mapping
to restricted Boltzmann machines and persistent contrastive divergence. We show
that, when applied to protein data, even 20-40 patterns are sufficient to
obtain statistically close-to-generative models. The Hopfield patterns form
interpretable sequence motifs and may be used to cluster amino-acid
sequences into functional sub-families. However, the distributed collective
nature of these motifs intrinsically limits the ability of Hopfield-Potts
models to predict contact maps, showing the necessity of developing models
going beyond the Hopfield-Potts models discussed here.
Comment: 26 pages, 16 figures, to app. in PR
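The parameter reduction described above can be sketched numerically. The code below assumes the common low-rank form in which couplings are built from p patterns as J_ij(a,b) = sum over mu of xi_i^mu(a) * xi_j^mu(b); the prefactor and gauge conventions vary between papers and are omitted here. The function names, the toy pattern, and the sizes L = 100, q = 21, p = 30 are illustrative assumptions, not values reported by the authors.

```python
def hopfield_potts_couplings(patterns):
    """Build J[i][j][a][b] = sum_mu xi[mu][i][a] * xi[mu][j][b] for i != j.

    patterns[mu][i][a] is the weight of pattern mu for state a at site i.
    Any normalization prefactor is omitted (assumption).
    """
    L = len(patterns[0])
    q = len(patterns[0][0])
    J = [[[[0.0] * q for _ in range(q)] for _ in range(L)] for _ in range(L)]
    for xi in patterns:
        for i in range(L):
            for j in range(L):
                if i == j:
                    continue  # no self-couplings
                for a in range(q):
                    for b in range(q):
                        J[i][j][a][b] += xi[i][a] * xi[j][b]
    return J

def pairwise_coupling_count(L, q):
    """Coupling entries of a full pairwise Potts model (gauge freedom ignored)."""
    return L * (L - 1) // 2 * q * q

def pattern_entry_count(p, L, q):
    """Pattern entries of a Hopfield-Potts model with p patterns."""
    return p * L * q

# One toy pattern over L = 3 sites and q = 2 states (hypothetical values).
toy_patterns = [[[1.0, -1.0], [0.5, 0.0], [0.0, 2.0]]]
J = hopfield_potts_couplings(toy_patterns)
```

For L = 100, q = 21 and p = 30, `pairwise_coupling_count` gives 2,182,950 entries against 63,000 from `pattern_entry_count`, which conveys the scale of the reduction under these assumed sizes.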
Subgraphs in random networks
Understanding the subgraph distribution in random networks is important for
modelling complex systems. In classic Erdős networks, which exhibit a
Poissonian degree distribution, the number of appearances of a subgraph G with
n nodes and g edges scales with network size as \mean{G} ~ N^{n-g}. However,
many natural networks have a non-Poissonian degree distribution. Here we
present approximate equations for the average number of subgraphs in an
ensemble of random sparse directed networks, characterized by an arbitrary
degree sequence. We find new scaling rules for the commonly occurring case of
directed scale-free networks, in which the outgoing degree distribution scales
as P(k) ~ k^{-\gamma}. Considering the power exponent of the degree
distribution, \gamma, as a control parameter, we show that random networks
exhibit transitions between three regimes. In each regime the subgraph number
of appearances follows a different scaling law, \mean{G} ~ N^{\alpha}, where
\alpha=n-g+s-1 for \gamma<2, \alpha=n-g+s+1-\gamma for 2<\gamma<\gamma_c, and
\alpha=n-g for \gamma>\gamma_c, s is the maximal outdegree in the subgraph, and
\gamma_c=s+1. We find that certain subgraphs appear much more frequently than
in Erdős networks. These results are in very good agreement with numerical
simulations. This has implications for detecting network motifs, subgraphs that
occur in natural networks significantly more than in their randomized
counterparts.
Comment: 8 pages, 5 figures
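The three scaling regimes quoted in the abstract can be written as a small function. This is only a transcription of the stated formulas; the function name `subgraph_exponent` is hypothetical, and the treatment of the boundary values gamma = 2 and gamma = gamma_c (which the abstract leaves open) is an assumption justified by the fact that adjacent formulas agree there.

```python
def subgraph_exponent(n, g, s, gamma):
    """Exponent alpha in <G> ~ N**alpha for a subgraph with n nodes, g edges
    and maximal outdegree s, in a directed network with P(k) ~ k**(-gamma).
    """
    gamma_c = s + 1
    if gamma < 2:
        return n - g + s - 1
    if gamma < gamma_c:
        return n - g + s + 1 - gamma
    return n - g  # same scaling as the Poissonian (Erdos) case

# Feed-forward loop: 3 nodes, 3 edges, one node with two outgoing edges.
ffl = dict(n=3, g=3, s=2)
alpha_low = subgraph_exponent(gamma=1.5, **ffl)
alpha_mid = subgraph_exponent(gamma=2.5, **ffl)
alpha_high = subgraph_exponent(gamma=3.5, **ffl)
```

For the feed-forward loop the exponent falls from 1 through 0.5 to 0 as gamma increases across the three regimes, so the subgraph is expected to be far more common in networks with a heavy-tailed outdegree distribution than in the Poissonian case.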
Rapid Sequence Identification of Potential Pathogens Using Techniques from Sparse Linear Algebra
The decreasing costs and increasing speed and accuracy of DNA sample
collection, preparation, and sequencing have rapidly produced an enormous volume
of genetic data. However, fast and accurate analysis of the samples remains a
bottleneck. Here we present DRAGenS, a genetic sequence identification
algorithm that exhibits the Big Data handling and computational power of the
Dynamic Distributed Dimensional Data Model (D4M). The method leverages linear
algebra and statistical properties to increase computational performance while
retaining accuracy by subsampling the data. Two run modes, Fast and Wise, yield
speed and precision tradeoffs, with applications in biodefense and medical
diagnostics. The DRAGenS analysis algorithm is tested over several
datasets, including three utilized for the Defense Threat Reduction Agency
(DTRA) metagenomic algorithm contest.
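One way sequence matching can be phrased as sparse linear algebra is sketched below. It is not the DRAGenS algorithm and does not use D4M: it assumes each sequence is a row of a binary sparse matrix whose columns are k-mers, so that the product of the sample matrix with the transposed reference matrix counts shared k-mers. The function names, sequences, and k = 4 are hypothetical, and no subsampling is performed.

```python
def kmer_row(seq, k):
    """Sparse binary row: the set of k-mers (column labels) present in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sparse_product(rows_a, rows_b):
    """Nonzero entries of A @ B.T for binary sparse matrices.

    Each matrix is stored as {row_id: set_of_column_labels}; the entry for
    (ra, rb) is the number of columns the two rows share.
    """
    out = {}
    for ra, cols_a in rows_a.items():
        for rb, cols_b in rows_b.items():
            shared = len(cols_a & cols_b)
            if shared:
                out[(ra, rb)] = shared
    return out

# Toy reference and sample sequences (hypothetical, not real genomes).
k = 4
references = {"refA": kmer_row("ACGTACGT", k), "refB": kmer_row("TTTTGGGG", k)}
samples = {"read1": kmer_row("CGTACG", k)}
matches = sparse_product(samples, references)
```

Here `matches` contains a single nonzero entry linking `read1` to `refA`, since they share three 4-mers, while `refB` shares none. A production version would use a sparse matrix library rather than nested loops over rows.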