Accurate Profiling of Microbial Communities from Massively Parallel Sequencing using Convex Optimization
We describe the Microbial Community Reconstruction ({\bf MCR}) Problem, which
is fundamental for microbiome analysis. In this problem, the goal is to
reconstruct the identity and frequency of species comprising a microbial
community, using short sequence reads from Massively Parallel Sequencing (MPS)
data obtained for specified genomic regions. We formulate the problem
mathematically as a convex optimization problem and provide sufficient
conditions for identifiability, namely the ability to reconstruct species
identity and frequency correctly when the data size (number of reads) grows to
infinity. We discuss different metrics for assessing the quality of the
reconstructed solution, including a novel phylogenetically-aware metric based
on the Mahalanobis distance, and give upper-bounds on the reconstruction error
for a finite number of reads under different metrics. We propose a scalable
divide-and-conquer algorithm for the problem using convex optimization, which
enables us to handle large problems with many species. We show using
numerical simulations that for realistic scenarios, where the microbial
communities are sparse, our algorithm yields highly accurate solutions, both
in terms of frequency estimates and in terms of species-level phylogenetic
resolution.
Comment: To appear in SPIRE 1
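A minimal sketch of the convex reconstruction idea described in this abstract, assuming a toy species-signature matrix and a simulated read sample; the matrix, frequencies, and nonnegative least-squares formulation here are illustrative assumptions, not the paper's actual model or data:

```python
import numpy as np
from scipy.optimize import nnls

# Columns of A: hypothetical per-species probabilities of a read falling
# into each of 4 sequence "bins" within the targeted genomic region.
A = np.array([
    [0.7, 0.1, 0.0],
    [0.2, 0.6, 0.1],
    [0.1, 0.2, 0.2],
    [0.0, 0.1, 0.7],
])
true_freq = np.array([0.5, 0.3, 0.2])            # assumed ground-truth community
rng = np.random.default_rng(0)
reads = rng.multinomial(100_000, A @ true_freq)  # simulated MPS read counts
y = reads / reads.sum()                          # empirical bin frequencies

x, _ = nnls(A, y)                                # convex: min ||Ax - y||, x >= 0
x /= x.sum()                                     # normalize to frequencies
print(np.round(x, 3))
```

With a large read count the recovered frequencies approach the ground truth, matching the abstract's identifiability claim as the number of reads grows.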
Convex Relaxations for Permutation Problems
Seriation seeks to reconstruct a linear order between variables using
unsorted, pairwise similarity information. It has direct applications in
archeology and shotgun gene sequencing for example. We write seriation as an
optimization problem by proving the equivalence between the seriation and
combinatorial 2-SUM problems on similarity matrices (2-SUM is a quadratic
minimization problem over permutations). The seriation problem can be solved
exactly by a spectral algorithm in the noiseless case and we derive several
convex relaxations for 2-SUM to improve the robustness of seriation solutions
in noisy settings. These convex relaxations also allow us to impose structural
constraints on the solution, hence solve semi-supervised seriation problems. We
derive new approximation bounds for some of these relaxations and present
numerical experiments on archeological data, Markov chains and DNA assembly
from shotgun gene sequencing data.
Comment: Final journal version, a few typos and references fixed.
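A hedged sketch of the noiseless spectral step mentioned in this abstract: sorting items by the Fiedler vector (the Laplacian eigenvector with second-smallest eigenvalue) of a similarity matrix recovers a hidden linear order. The similarity model is a toy assumption; the paper's 2-SUM convex relaxations are not reproduced here:

```python
import numpy as np

n = 8
# Similarity that decays with distance along a hidden linear order.
hidden = np.arange(n)
S_true = np.exp(-np.abs(hidden[:, None] - hidden[None, :]))

rng = np.random.default_rng(1)
perm = rng.permutation(n)                 # scramble the order
S = S_true[np.ix_(perm, perm)]

D = np.diag(S.sum(axis=1))
L = D - S                                 # graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
fiedler = eigvecs[:, 1]                   # Fiedler vector
order = np.argsort(fiedler)               # seriation estimate

recovered = perm[order]                   # hidden positions, in estimated order
print(recovered)
```

The estimated order is monotone in the hidden order (up to a reversal, since the Fiedler vector's sign is arbitrary), which is the exact-recovery property the noiseless spectral algorithm relies on.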
Minimizing value-at-risk in the single-machine total weighted tardiness problem
The vast majority of the machine scheduling literature focuses on deterministic
problems, in which all data are known with certainty a priori. This may be a reasonable assumption when the variability in the problem parameters is low. However, as variability in the parameters increases, incorporating this uncertainty explicitly into a scheduling model becomes essential to mitigate the resulting adverse effects. In this paper, we consider the celebrated single-machine total weighted tardiness (TWT) problem in the presence of uncertain problem parameters. We impose a probabilistic constraint on the random TWT and introduce a risk-averse stochastic programming model. In particular, the objective of the proposed model is to find a non-preemptive static job processing sequence that minimizes the value-at-risk (VaR) measure on the random
TWT at a specified confidence level. Furthermore, we develop a lower bound on the optimal VaR that may also benefit alternate solution approaches in the future. In this study, we implement a tabu-search heuristic to obtain reasonably good feasible solutions and present results that demonstrate the effect of the risk parameter and the value of the proposed model with respect to a corresponding risk-neutral approach.
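A toy illustration of the objective this abstract describes: estimating the VaR of total weighted tardiness for one fixed job sequence by Monte Carlo. The job data and normal processing-time model are assumptions for illustration; the paper uses a stochastic programming model and tabu search, not this simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
mean_p = np.array([4.0, 6.0, 3.0])    # assumed mean processing times
sd_p = np.array([1.0, 2.0, 0.5])      # assumed standard deviations
due = np.array([5.0, 12.0, 9.0])      # due dates
w = np.array([2.0, 1.0, 3.0])         # tardiness weights
seq = [0, 2, 1]                       # fixed non-preemptive static sequence

def twt(p):
    """Total weighted tardiness of `seq` for processing-time realization p."""
    t, total = 0.0, 0.0
    for j in seq:
        t += p[j]
        total += w[j] * max(0.0, t - due[j])
    return total

samples = np.array([twt(np.maximum(rng.normal(mean_p, sd_p), 0.0))
                    for _ in range(20_000)])
var_95 = np.quantile(samples, 0.95)   # VaR at the 95% confidence level
print(round(var_95, 2))
```

Minimizing this quantile over all sequences, rather than the mean TWT, is what distinguishes the risk-averse model from the risk-neutral one.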
QuASeR -- Quantum Accelerated De Novo DNA Sequence Reconstruction
In this article, we present QuASeR, a reference-free DNA sequence
reconstruction implementation via de novo assembly on both gate-based and
quantum annealing platforms. Each of the four steps of the implementation
(TSP, QUBO, Hamiltonians and QAOA) is explained with simple proof-of-concept
examples to target both the genomics research community and quantum application
developers in a self-contained manner. The details of the implementation are
discussed for the various layers of the quantum full-stack accelerator design.
We also highlight the limitations of current classical simulation and available
quantum hardware systems. The implementation is open-source and can be found on
https://github.com/prince-ph0en1x/QuASeR.
Comment: 24 pages.
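A proof-of-concept of the TSP-to-QUBO step named in this abstract: encode a 3-city tour with binary variables x[city, position], add penalty terms enforcing one city per position and one position per city, and brute-force the minimum. The distance matrix and penalty weight are assumptions; this checks the encoding classically and does not involve QAOA or annealing:

```python
import itertools
import numpy as np

d = np.array([[0, 1, 2],
              [1, 0, 3],
              [2, 3, 0]])              # assumed symmetric distance matrix
n = 3
P = 10.0                               # penalty weight (exceeds any tour cost)

def qubo_energy(x):
    """QUBO objective for flattened binary x, where x[c, t] = city c at slot t."""
    x = x.reshape(n, n)
    cost = sum(d[c1, c2] * x[c1, t] * x[c2, (t + 1) % n]
               for c1 in range(n) for c2 in range(n) for t in range(n))
    pen = sum((x[c].sum() - 1) ** 2 for c in range(n))        # each city once
    pen += sum((x[:, t].sum() - 1) ** 2 for t in range(n))    # each slot once
    return cost + P * pen

best = min(itertools.product([0, 1], repeat=n * n),
           key=lambda bits: qubo_energy(np.array(bits)))
best = np.array(best).reshape(n, n)
print(best)
```

The minimizer is a permutation matrix (a valid tour), and its energy equals the tour length, which is exactly the property that lets a Hamiltonian built from this QUBO be handed to a gate-based or annealing solver.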
Ensemble Analysis of Adaptive Compressed Genome Sequencing Strategies
Acquiring genomes at single-cell resolution has many applications such as in
the study of microbiota. However, deep sequencing and assembly of all of the
millions of cells in a sample is prohibitively costly. A property that can
come to the rescue is that deep sequencing of every cell should not be
capture all distinct genomes, as the majority of cells are biological
replicates. Biologically important samples are often sparse in that sense. In
this paper, we propose an adaptive compressed method, also known as distilled
sensing, to capture all distinct genomes in a sparse microbial community with
reduced sequencing effort. As opposed to group testing in which the number of
distinct events is often constant and sparsity is equivalent to rarity of an
event, sparsity in our case means scarcity of distinct events in comparison to
the data size. Previously, we introduced the problem and proposed a distilled
sensing solution based on the breadth-first search strategy. We simulated the
whole process, which constrained our ability to study the behavior of the
algorithm for the entire ensemble due to its computational intensity. In this
paper, we modify our previous breadth-first search strategy and introduce the
depth-first search strategy. Instead of simulating the entire process, which is
intractable for a large number of experiments, we provide a dynamic programming
algorithm to analyze the behavior of the method for the entire ensemble. The
ensemble analysis algorithm recursively calculates the probability of capturing
every distinct genome and also the expected total sequenced nucleotides for a
given population profile. Our results suggest that the expected total number of
sequenced nucleotides grows with the number of cells and linearly with the
number of distinct genomes.
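A small, hedged sketch of one quantity an ensemble analysis of this kind tracks: the probability that sampling k cells (with replacement) from a given population profile captures at least one cell of every distinct genome. It is computed here by inclusion-exclusion over missed genomes; the profile is an assumption, and this is not the paper's recursive dynamic program:

```python
import itertools

def prob_capture_all(freqs, k):
    """P(all genomes captured): freqs are genome frequencies summing to 1,
    k is the number of cells sampled with replacement."""
    n = len(freqs)
    total = 0.0
    for r in range(n + 1):
        for missed in itertools.combinations(range(n), r):
            p_missed = 1.0 - sum(freqs[i] for i in missed)
            total += (-1) ** r * p_missed ** k   # inclusion-exclusion term
    return total

profile = [0.7, 0.2, 0.1]   # assumed sparse community: 3 distinct genomes
print(round(prob_capture_all(profile, 30), 4))
```

The capture probability increases with k, so the rarest genome's frequency governs how much sequencing effort a sparse community requires.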
Challenges of Big Data Analysis
Big Data bring new opportunities to modern society and challenges to data
scientists. On one hand, Big Data hold great promises for discovering subtle
population patterns and heterogeneities that are not possible with small-scale
data. On the other hand, the massive sample size and high dimensionality of Big
Data introduce unique computational and statistical challenges, including
scalability and storage bottleneck, noise accumulation, spurious correlation,
incidental endogeneity, and measurement errors. These challenges are
distinctive and require new computational and statistical paradigms. This
article gives an overview of the salient features of Big Data and how these
features drive a paradigm change in statistical and computational methods as
well as computing architectures. We also provide various new perspectives on
Big Data analysis and computation. In particular, we emphasize the
viability of the sparsest solution in a high-confidence set and point out that
exogeneity assumptions in most statistical methods for Big Data cannot be
validated due to incidental endogeneity; they can lead to wrong statistical
inferences and, consequently, wrong scientific conclusions.
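A quick synthetic illustration of the spurious correlation challenge this abstract names: with many independent noise features and few samples, the maximum sample correlation with an unrelated response is large purely by chance. The dimensions and data are assumptions, not from the article:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 5_000                      # few samples, high dimension
X = rng.standard_normal((n, p))       # noise features, independent of y
y = rng.standard_normal(n)            # response, independent of X

# Standardize, then compute all p sample correlations at once.
y_c = (y - y.mean()) / y.std()
X_c = (X - X.mean(axis=0)) / X.std(axis=0)
corrs = np.abs(X_c.T @ y_c) / n       # |sample correlation| per feature
print(round(corrs.max(), 3))          # large despite zero true correlation
```

The maximum grows roughly like sqrt(2 log p / n), which is why naive feature screening in high dimensions can flag variables that carry no signal at all.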