The Parallelism Motifs of Genomic Data Analysis

Author: Awan Muaaz
Azad Ariful
Brock Benjamin
Buluc Aydin
Egan Rob
Ekanayake Saliya
Ellis Marquita
Georganas Evangelos
Guidi Giulia
Hofmeyr Steven
Oliker Leonid
Selvitopi Oguz
Teodoropol Cristina
Yelick Katherine
Publication venue: 'The Royal Society'
Publication date: 20/01/2020

Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.
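As a rough illustration of the two patterns the abstract singles out, the sketch below counts k-mers (length-k substrings of reads) once with a hash table and once by sorting. It is not taken from the paper: the function names, the toy reads, and the choice of k-mer counting as the example task are all assumptions made here for illustration, and the code is serial rather than parallel.

```python
from collections import Counter
from itertools import groupby


def kmers(seq, k):
    """Yield every length-k substring of seq (hypothetical helper)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kmers_hashing(reads, k):
    """Hashing motif: each k-mer updates a shared hash table of counts."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return dict(counts)


def count_kmers_sorting(reads, k):
    """Sorting motif: sort all k-mers, then count runs of equal neighbours."""
    all_kmers = sorted(km for read in reads for km in kmers(read, k))
    return {km: sum(1 for _ in group) for km, group in groupby(all_kmers)}


# Toy input, invented for this example.
reads = ["ACGTAC", "GTACGT"]
hashed = count_kmers_hashing(reads, 3)
sorted_counts = count_kmers_sorting(reads, 3)
```

Both functions return the same counts. In a distributed setting the hash-table version would correspond to the asynchronous updates to shared data structures that the abstract mentions, while the sorting version would trade those updates for a bulk data exchange.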

arXiv.org e-Print Archive

eScholarship - University of California

Statistical modeling of RNA structure profiling experiments enables parsimonious reconstruction of structure landscapes.

Author: Aviran Sharon
Li Hua
Publication venue: eScholarship, University of California
Publication date: 01/02/2018

RNA plays key regulatory roles in diverse cellular processes, where its functionality often derives from folding into and converting between structures. Many RNAs further rely on co-existence of alternative structures, which govern their response to cellular signals. However, characterizing heterogeneous landscapes is difficult, both experimentally and computationally. Recently, structure profiling experiments have emerged as powerful and affordable structure characterization methods, which improve computational structure prediction. To date, efforts have centered on predicting one optimal structure, with much less progress made on multiple-structure prediction. Here, we report a probabilistic modeling approach that predicts a parsimonious set of co-existing structures and estimates their abundances from structure profiling data. We demonstrate robust landscape reconstruction and quantitative insights into structural dynamics by analyzing numerous data sets. This work establishes a framework for data-directed characterization of structure landscapes to aid experimentalists in performing structure-function studies.

Directory of Open Access Journals

eScholarship - University of California

Selection of sequence motifs and generative Hopfield-Potts models for protein families

Author: Shimagaki Kai
Weigt Martin
Publication venue: 'American Physical Society (APS)'
Publication date: 01/01/2019

Statistical models for families of evolutionary related proteins have recently gained interest: in particular pairwise Potts models, as those inferred by the Direct-Coupling Analysis, have been able to extract information about the three-dimensional structure of folded proteins, and about the effect of amino-acid substitutions in proteins. These models are typically requested to reproduce the one- and two-point statistics of the amino-acid usage in a protein family, i.e. to capture the so-called residue conservation and covariation statistics of proteins of common evolutionary origin. Pairwise Potts models are the maximum-entropy models achieving this. While being successful, these models depend on huge numbers of ad hoc introduced parameters, which have to be estimated from finite amount of data and whose biophysical interpretation remains unclear. Here we propose an approach to parameter reduction, which is based on selecting collective sequence motifs. It naturally leads to the formulation of statistical sequence models in terms of Hopfield-Potts models. These models can be accurately inferred using a mapping to restricted Boltzmann machines and persistent contrastive divergence. We show that, when applied to protein data, even 20-40 patterns are sufficient to obtain statistically close-to-generative models. The Hopfield patterns form interpretable sequence motifs and may be used to clusterize amino-acid sequences into functional sub-families. However, the distributed collective nature of these motifs intrinsically limits the ability of Hopfield-Potts models in predicting contact maps, showing the necessity of developing models going beyond the Hopfield-Potts models discussed here.

Comment: 26 pages, 16 figures, to app. in PR

arXiv.org e-Print Archive

HAL Descartes

HAL-INSU

Hal-Diderot

Subgraphs in random networks

Author: A. Wagner
A.L. Barabasi
B.A. Huberman
C.A. Ouzounis
D. Watts
E. Bender
E. Ravasz
F. Chung
F. Harary
G. Bianconi
G. Ziv
J. Berg
J. Eckmann
J.G. White
L. Amaral
M. Faloutsos
M. Molloy
M. Newman
N. Guelzim
N. Kashtan
P. Collet
P. Erdős
P.L. Krapivsky
P.W. Holland
R. Albert
R. Cohen
R. Ferrer i Cancho
R. Milo
S. Itzkovitz
S. Maslov
S. Redner
S. Shen-Orr
S.H. Strogatz
S.N. Dorogovtsev
U. Alon
W. Aiello
Z. Burda
Publication venue: 'American Physical Society (APS)'
Publication date: 26/08/2003

Understanding the subgraph distribution in random networks is important for modelling complex systems. In classic Erdős networks, which exhibit a Poissonian degree distribution, the number of appearances of a subgraph G with n nodes and g edges scales with network size as $\langle G \rangle \sim N^{n-g}$. However, many natural networks have a non-Poissonian degree distribution. Here we present approximate equations for the average number of subgraphs in an ensemble of random sparse directed networks, characterized by an arbitrary degree sequence. We find new scaling rules for the commonly occurring case of directed scale-free networks, in which the outgoing degree distribution scales as $P(k) \sim k^{-\gamma}$. Considering the power exponent of the degree distribution, $\gamma$, as a control parameter, we show that random networks exhibit transitions between three regimes. In each regime the subgraph number of appearances follows a different scaling law, $\langle G \rangle \sim N^{\alpha}$, where $\alpha = n-g+s-1$ for $\gamma < 2$, $\alpha = n-g+s+1-\gamma$ for $2 < \gamma < \gamma_c$, and $\alpha = n-g$ for $\gamma > \gamma_c$; s is the maximal outdegree in the subgraph, and $\gamma_c = s+1$. We find that certain subgraphs appear much more frequently than in Erdős networks. These results are in very good agreement with numerical simulations. This has implications for detecting network motifs, subgraphs that occur in natural networks significantly more than in their randomized counterparts.

Comment: 8 pages, 5 figures
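The three scaling regimes quoted in the abstract can be evaluated with a few lines of Python. This is a minimal sketch that only transcribes the formulas as stated there; the function name and the feed-forward-loop example values are assumptions added here. The abstract does not say which formula applies exactly at the boundaries, but adjacent formulas agree at gamma = 2 and at gamma = gamma_c, so the choice made in the code does not change the result.

```python
def scaling_exponent(n, g, s, gamma):
    """Exponent alpha in <G> ~ N^alpha for a subgraph with n nodes, g edges
    and maximal outdegree s, in a directed scale-free network whose
    out-degree distribution scales as P(k) ~ k^(-gamma).

    Hypothetical helper: it transcribes the three regimes from the abstract.
    """
    gamma_c = s + 1  # critical exponent given in the abstract
    if gamma <= 2:
        return n - g + s - 1
    if gamma < gamma_c:
        return n - g + s + 1 - gamma
    return n - g  # same exponent as in Erdős networks


# Example subgraph (an assumption for illustration): a feed-forward loop
# X->Y, X->Z, Y->Z has n = 3 nodes, g = 3 edges and maximal outdegree s = 2.
alpha_low = scaling_exponent(3, 3, 2, 1.5)   # gamma < 2
alpha_mid = scaling_exponent(3, 3, 2, 2.5)   # 2 < gamma < gamma_c = 3
alpha_high = scaling_exponent(3, 3, 2, 3.5)  # gamma > gamma_c
```

For this example the exponent is 1 at gamma = 1.5, 0.5 at gamma = 2.5, and 0 at gamma = 3.5, so under the stated formulas the subgraph count grows with network size for gamma below gamma_c and matches the Erdős scaling above it.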

arXiv.org e-Print Archive

Crossref

Rapid Sequence Identification of Potential Pathogens Using Techniques from Sparse Linear Algebra

Author: Chiu Nelson
Dodson Stephanie
Kepner Jeremy
Ricke Darrell O.
Shcherbina Anna
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 21/01/2015

The decreasing costs and increasing speed and accuracy of DNA sample collection, preparation, and sequencing has rapidly produced an enormous volume of genetic data. However, fast and accurate analysis of the samples remains a bottleneck. Here we present D^{4}RAGenS, a genetic sequence identification algorithm that exhibits the Big Data handling and computational power of the Dynamic Distributed Dimensional Data Model (D4M). The method leverages linear algebra and statistical properties to increase computational performance while retaining accuracy by subsampling the data. Two run modes, Fast and Wise, yield speed and precision tradeoffs, with applications in biodefense and medical diagnostics. The D^{4}RAGenS analysis algorithm is tested over several datasets, including three utilized for the Defense Threat Reduction Agency (DTRA) metagenomic algorithm contest.

arXiv.org e-Print Archive

Crossref