Accurate Profiling of Microbial Communities from Massively Parallel Sequencing using Convex Optimization
We describe the Microbial Community Reconstruction ({\bf MCR}) Problem, which
is fundamental for microbiome analysis. In this problem, the goal is to
reconstruct the identity and frequency of species comprising a microbial
community, using short sequence reads from Massively Parallel Sequencing (MPS)
data obtained for specified genomic regions. We formulate the problem
mathematically as a convex optimization problem and provide sufficient
conditions for identifiability, namely the ability to reconstruct species
identity and frequency correctly when the data size (number of reads) grows to
infinity. We discuss different metrics for assessing the quality of the
reconstructed solution, including a novel phylogenetically-aware metric based
on the Mahalanobis distance, and give upper-bounds on the reconstruction error
for a finite number of reads under different metrics. We propose a scalable
divide-and-conquer algorithm for the problem using convex optimization, which
enables us to handle large problems with many species. We show using
numerical simulations that for realistic scenarios, where the microbial
communities are sparse, our algorithm yields highly accurate solutions, both
in terms of frequency estimates and in terms of species-level phylogenetic
resolution.
Comment: To appear in SPIRE 1
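A minimal sketch of the convex reconstruction idea described in this abstract, assuming a toy species-signature matrix and a simulated read sample; the matrix, frequencies, and nonnegative least-squares formulation here are illustrative assumptions, not the paper's actual model or data:

```python
import numpy as np
from scipy.optimize import nnls

# Columns of A: hypothetical per-species probabilities of a read falling
# into each of 4 sequence "bins" within the targeted genomic region.
A = np.array([
    [0.7, 0.1, 0.0],
    [0.2, 0.6, 0.1],
    [0.1, 0.2, 0.2],
    [0.0, 0.1, 0.7],
])
true_freq = np.array([0.5, 0.3, 0.2])            # assumed ground-truth community
rng = np.random.default_rng(0)
reads = rng.multinomial(100_000, A @ true_freq)  # simulated MPS read counts
y = reads / reads.sum()                          # empirical bin frequencies

x, _ = nnls(A, y)                                # convex: min ||Ax - y||, x >= 0
x /= x.sum()                                     # normalize to frequencies
print(np.round(x, 3))
```

With a large read count the recovered frequencies approach the ground truth, matching the abstract's identifiability claim as the number of reads grows.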
Convex Relaxations for Permutation Problems
Seriation seeks to reconstruct a linear order between variables using
unsorted, pairwise similarity information. It has direct applications in
archeology and shotgun gene sequencing for example. We write seriation as an
optimization problem by proving the equivalence between the seriation and
combinatorial 2-SUM problems on similarity matrices (2-SUM is a quadratic
minimization problem over permutations). The seriation problem can be solved
exactly by a spectral algorithm in the noiseless case and we derive several
convex relaxations for 2-SUM to improve the robustness of seriation solutions
in noisy settings. These convex relaxations also allow us to impose structural
constraints on the solution, hence solve semi-supervised seriation problems. We
derive new approximation bounds for some of these relaxations and present
numerical experiments on archeological data, Markov chains and DNA assembly
from shotgun gene sequencing data.
Comment: Final journal version, a few typos and references fixed.
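A hedged sketch of the noiseless spectral step mentioned in this abstract: sorting items by the Fiedler vector (the Laplacian eigenvector with second-smallest eigenvalue) of a similarity matrix recovers a hidden linear order. The similarity model is a toy assumption; the paper's 2-SUM convex relaxations are not reproduced here:

```python
import numpy as np

n = 8
# Similarity that decays with distance along a hidden linear order.
hidden = np.arange(n)
S_true = np.exp(-np.abs(hidden[:, None] - hidden[None, :]))

rng = np.random.default_rng(1)
perm = rng.permutation(n)                 # scramble the order
S = S_true[np.ix_(perm, perm)]

D = np.diag(S.sum(axis=1))
L = D - S                                 # graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
fiedler = eigvecs[:, 1]                   # Fiedler vector
order = np.argsort(fiedler)               # seriation estimate

recovered = perm[order]                   # hidden positions, in estimated order
print(recovered)
```

The estimated order is monotone in the hidden order (up to a reversal, since the Fiedler vector's sign is arbitrary), which is the exact-recovery property the noiseless spectral algorithm relies on.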
Minimizing value-at-risk in the single-machine total weighted tardiness problem
The vast majority of the machine scheduling literature focuses on deterministic
problems, in which all data are known with certainty a priori. This may be a reasonable assumption when the variability in the problem parameters is low. However, as variability in the parameters increases, incorporating this uncertainty explicitly into a scheduling model becomes essential to mitigate the resulting adverse effects. In this paper, we consider the celebrated single-machine total weighted tardiness (TWT) problem in the presence of uncertain problem parameters. We impose a probabilistic constraint on the random TWT and introduce a risk-averse stochastic programming model. In particular, the objective of the proposed model is to find a non-preemptive static job processing sequence that minimizes the value-at-risk (VaR) measure on the random
TWT at a specified confidence level. Furthermore, we develop a lower bound on the optimal VaR that may also benefit alternate solution approaches in the future. In this study, we implement a tabu-search heuristic to obtain reasonably good feasible solutions and present results that demonstrate the effect of the risk parameter and the value of the proposed model with respect to a corresponding risk-neutral approach.
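A toy illustration of the objective this abstract describes: estimating the VaR of total weighted tardiness for one fixed job sequence by Monte Carlo. The job data and normal processing-time model are assumptions for illustration; the paper uses a stochastic programming model and tabu search, not this simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
mean_p = np.array([4.0, 6.0, 3.0])    # assumed mean processing times
sd_p = np.array([1.0, 2.0, 0.5])      # assumed standard deviations
due = np.array([5.0, 12.0, 9.0])      # due dates
w = np.array([2.0, 1.0, 3.0])         # tardiness weights
seq = [0, 2, 1]                       # fixed non-preemptive static sequence

def twt(p):
    """Total weighted tardiness of `seq` for processing-time realization p."""
    t, total = 0.0, 0.0
    for j in seq:
        t += p[j]
        total += w[j] * max(0.0, t - due[j])
    return total

samples = np.array([twt(np.maximum(rng.normal(mean_p, sd_p), 0.0))
                    for _ in range(20_000)])
var_95 = np.quantile(samples, 0.95)   # VaR at the 95% confidence level
print(round(var_95, 2))
```

Minimizing this quantile over all sequences, rather than the mean TWT, is what distinguishes the risk-averse model from the risk-neutral one.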
QuASeR -- Quantum Accelerated De Novo DNA Sequence Reconstruction
In this article, we present QuASeR, a reference-free DNA sequence
reconstruction implementation via de novo assembly on both gate-based and
quantum annealing platforms. Each of the four steps of the implementation
(TSP, QUBO, Hamiltonians and QAOA) is explained with simple proof-of-concept
examples to target both the genomics research community and quantum application
developers in a self-contained manner. The details of the implementation are
discussed for the various layers of the quantum full-stack accelerator design.
We also highlight the limitations of current classical simulation and available
quantum hardware systems. The implementation is open-source and can be found on
https://github.com/prince-ph0en1x/QuASeR.
Comment: 24 pages.
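A proof-of-concept of the TSP-to-QUBO step named in this abstract: encode a 3-city tour with binary variables x[city, position], add penalty terms enforcing one city per position and one position per city, and brute-force the minimum. The distance matrix and penalty weight are assumptions; this checks the encoding classically and does not involve QAOA or annealing:

```python
import itertools
import numpy as np

d = np.array([[0, 1, 2],
              [1, 0, 3],
              [2, 3, 0]])              # assumed symmetric distance matrix
n = 3
P = 10.0                               # penalty weight (exceeds any tour cost)

def qubo_energy(x):
    """QUBO objective for flattened binary x, where x[c, t] = city c at slot t."""
    x = x.reshape(n, n)
    cost = sum(d[c1, c2] * x[c1, t] * x[c2, (t + 1) % n]
               for c1 in range(n) for c2 in range(n) for t in range(n))
    pen = sum((x[c].sum() - 1) ** 2 for c in range(n))        # each city once
    pen += sum((x[:, t].sum() - 1) ** 2 for t in range(n))    # each slot once
    return cost + P * pen

best = min(itertools.product([0, 1], repeat=n * n),
           key=lambda bits: qubo_energy(np.array(bits)))
best = np.array(best).reshape(n, n)
print(best)
```

The minimizer is a permutation matrix (a valid tour), and its energy equals the tour length, which is exactly the property that lets a Hamiltonian built from this QUBO be handed to a gate-based or annealing solver.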
Ensemble Analysis of Adaptive Compressed Genome Sequencing Strategies
Acquiring genomes at single-cell resolution has many applications such as in
the study of microbiota. However, deep sequencing and assembly of all of the
millions of cells in a sample is prohibitively costly. A property that can
come to the rescue is that deep sequencing of every cell should not be
capture all distinct genomes, as the majority of cells are biological
replicates. Biologically important samples are often sparse in that sense. In
this paper, we propose an adaptive compressed method, also known as distilled
sensing, to capture all distinct genomes in a sparse microbial community with
reduced sequencing effort. As opposed to group testing in which the number of
distinct events is often constant and sparsity is equivalent to rarity of an
event, sparsity in our case means scarcity of distinct events in comparison to
the data size. Previously, we introduced the problem and proposed a distilled
sensing solution based on the breadth-first search strategy. We simulated the
whole process, which constrained our ability to study the behavior of the
algorithm for the entire ensemble due to its computational intensity. In this
paper, we modify our previous breadth-first search strategy and introduce the
depth-first search strategy. Instead of simulating the entire process, which is
intractable for a large number of experiments, we provide a dynamic programming
algorithm to analyze the behavior of the method for the entire ensemble. The
ensemble analysis algorithm recursively calculates the probability of capturing
every distinct genome and also the expected total sequenced nucleotides for a
given population profile. Our results suggest that the expected total number of
sequenced nucleotides grows with the number of cells and linearly with the
number of distinct genomes.
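A small, hedged sketch of one quantity an ensemble analysis of this kind tracks: the probability that sampling k cells (with replacement) from a given population profile captures at least one cell of every distinct genome. It is computed here by inclusion-exclusion over missed genomes; the profile is an assumption, and this is not the paper's recursive dynamic program:

```python
import itertools

def prob_capture_all(freqs, k):
    """P(all genomes captured): freqs are genome frequencies summing to 1,
    k is the number of cells sampled with replacement."""
    n = len(freqs)
    total = 0.0
    for r in range(n + 1):
        for missed in itertools.combinations(range(n), r):
            p_missed = 1.0 - sum(freqs[i] for i in missed)
            total += (-1) ** r * p_missed ** k   # inclusion-exclusion term
    return total

profile = [0.7, 0.2, 0.1]   # assumed sparse community: 3 distinct genomes
print(round(prob_capture_all(profile, 30), 4))
```

The capture probability increases with k, so the rarest genome's frequency governs how much sequencing effort a sparse community requires.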
Challenges of Big Data Analysis
Big Data bring new opportunities to modern society and challenges to data
scientists. On one hand, Big Data hold great promises for discovering subtle
population patterns and heterogeneities that are not possible with small-scale
data. On the other hand, the massive sample size and high dimensionality of Big
Data introduce unique computational and statistical challenges, including
scalability and storage bottleneck, noise accumulation, spurious correlation,
incidental endogeneity, and measurement errors. These challenges are
distinctive and require new computational and statistical paradigms. This
article gives an overview of the salient features of Big Data and how these
features drive a paradigm change in statistical and computational methods as
well as computing architectures. We also provide various new perspectives on
Big Data analysis and computation. In particular, we emphasize the
viability of the sparsest solution in a high-confidence set and point out that
exogeneity assumptions in most statistical methods for Big Data cannot be
validated due to incidental endogeneity; they can lead to wrong statistical
inferences and, consequently, wrong scientific conclusions.
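A quick synthetic illustration of the spurious correlation challenge this abstract names: with many independent noise features and few samples, the maximum sample correlation with an unrelated response is large purely by chance. The dimensions and data are assumptions, not from the article:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 5_000                      # few samples, high dimension
X = rng.standard_normal((n, p))       # noise features, independent of y
y = rng.standard_normal(n)            # response, independent of X

# Standardize, then compute all p sample correlations at once.
y_c = (y - y.mean()) / y.std()
X_c = (X - X.mean(axis=0)) / X.std(axis=0)
corrs = np.abs(X_c.T @ y_c) / n       # |sample correlation| per feature
print(round(corrs.max(), 3))          # large despite zero true correlation
```

The maximum grows roughly like sqrt(2 log p / n), which is why naive feature screening in high dimensions can flag variables that carry no signal at all.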