63 research outputs found
Models and information-theoretic bounds for nanopore sequencing
Nanopore sequencing is an emerging technology for sequencing DNA that can
read long fragments of DNA (~50,000 bases), in contrast to most current
short-read sequencing technologies, which can only read hundreds of bases. While
nanopore sequencers can acquire long reads, the high error rates (20%-30%) pose
a technical challenge. In a nanopore sequencer, a DNA molecule is migrated
through a nanopore and current variations are measured. The DNA sequence is inferred from
this observed current pattern using an algorithm called a base-caller. In this
paper, we propose a mathematical model for the "channel" from the input DNA
sequence to the observed current, and calculate bounds on the information
extraction capacity of the nanopore sequencer. This model incorporates
impairments such as (non-linear) inter-symbol interference, deletions, and
random response. These information bounds have two-fold application: (1) The
decoding rate with a uniform input distribution can be used to calculate the
average size of the plausible list of DNA sequences given an observed current
trace. This bound can be used to benchmark existing base-calling algorithms, as
well as to serve as a performance objective for designing better nanopores. (2) When
the nanopore sequencer is used as a reader in a DNA storage system, the storage
capacity is quantified by our bounds.
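As an illustration of the kind of channel the model captures, here is a minimal Python sketch (my own toy construction, not the paper's actual model) in which the observed current depends on a window of k bases (inter-symbol interference), positions are randomly deleted, and the response is perturbed by Gaussian noise; the k-mer-to-current map and all parameters are hypothetical:

```python
import random

BASES = "ACGT"

def kmer_level(kmer):
    # Deterministic toy mapping from a k-mer to a mean current level.
    return sum(BASES.index(b) * (4 ** i) for i, b in enumerate(kmer)) % 100

def nanopore_channel(seq, k=3, p_del=0.1, noise_sd=2.0, rng=None):
    """Map a DNA string to a noisy current trace with deletions."""
    rng = rng or random.Random(0)
    trace = []
    for i in range(len(seq) - k + 1):
        if rng.random() < p_del:          # deletion: this position is skipped
            continue
        level = kmer_level(seq[i:i + k])  # ISI: level depends on k bases
        trace.append(level + rng.gauss(0.0, noise_sd))  # random response
    return trace

trace = nanopore_channel("ACGTACGTTAGC")
print(len(trace))  # at most len(seq) - k + 1 samples survive deletions
```

A base-caller faces the inverse problem: recovering `seq` from `trace`, which is what the information bounds in the paper quantify.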
Learning Temporal Dependence from Time-Series Data with Latent Variables
We consider the setting where a collection of time series, modeled as random
processes, evolve in a causal manner, and one is interested in learning the
graph governing the relationships of these processes. A special case of wide
interest and applicability is the setting where the noise is Gaussian and
relationships are Markov and linear. We study this setting with two additional
features: firstly, each random process has a hidden (latent) state, which we
use to model the internal memory possessed by the variables (similar to hidden
Markov models). Secondly, each variable can depend on its latent memory state
through a random lag (rather than a fixed lag), thus modeling memory recall
with differing lags at distinct times. Under this setting, we develop an
estimator and prove that under a genericity assumption, the parameters of the
model can be learned consistently. We also propose a practical adaptation of
this estimator, which demonstrates significant performance gains on both
synthetic and real-world datasets.
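The generative setting described above can be illustrated with a toy simulation (a sketch under assumed coefficients, not the paper's estimator): a latent memory state evolves as a linear Gaussian Markov process, and each observation recalls that memory at a random lag of one or two steps:

```python
import random

def simulate(T=200, a=0.9, b=0.7, noise=0.1, rng=None):
    """Toy single-series model with a hidden state and random-lag recall."""
    rng = rng or random.Random(1)
    z = [0.0]  # latent (hidden) memory state
    x = [0.0]  # observed series
    for t in range(1, T):
        z.append(a * z[-1] + rng.gauss(0.0, noise))       # latent Markov dynamics
        lag = rng.choice([1, 2])                          # random memory-recall lag
        z_recalled = z[t - lag] if t - lag >= 0 else 0.0
        x.append(b * z_recalled + rng.gauss(0.0, noise))  # observation via latent memory
    return x, z

x, z = simulate()
print(len(x), len(z))  # both of length T
```

The estimation problem in the paper is the reverse direction: observing only `x` (across several such series) and recovering the graph and dynamics that generated it.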
ClusterGAN : Latent Space Clustering in Generative Adversarial Networks
Generative Adversarial Networks (GANs) have obtained remarkable success in
many unsupervised learning tasks, and clustering is unarguably an important
unsupervised learning problem. While one can potentially exploit the
latent-space back-projection in GANs to cluster, we demonstrate that the
cluster structure is not retained in the GAN latent space.
In this paper, we propose ClusterGAN as a new mechanism for clustering using
GANs. By sampling latent variables from a mixture of one-hot encoded variables
and continuous latent variables, coupled with an inverse network (which
projects the data to the latent space) trained jointly with a clustering
specific loss, we are able to achieve clustering in the latent space. Our
results show a remarkable phenomenon that GANs can preserve latent space
interpolation across categories, even though the discriminator is never exposed
to such vectors. We compare our results with various clustering baselines and
demonstrate superior performance on both synthetic and real datasets.
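The latent sampling scheme described above can be sketched in a few lines (an illustrative simplification: the cluster count, continuous dimension, and variance below are hypothetical choices, and the generator and inverse networks are omitted):

```python
import random

def sample_latent(n_clusters=10, z_dim=30, sigma=0.1, rng=None):
    """Draw a ClusterGAN-style latent vector: one-hot code + continuous part."""
    rng = rng or random.Random(0)
    k = rng.randrange(n_clusters)                        # cluster identity
    one_hot = [1.0 if i == k else 0.0 for i in range(n_clusters)]
    z_n = [rng.gauss(0.0, sigma) for _ in range(z_dim)]  # continuous part
    return one_hot + z_n, k                              # latent vector, cluster id

z, k = sample_latent()
print(len(z), k)  # 40-dimensional latent vector and its cluster index
```

The discrete one-hot block is what lets the inverse network read a cluster assignment directly off the latent space.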
Minimum HGR Correlation Principle: From Marginals to Joint Distribution
Given low order moment information over the random variables X and Y, what
distribution minimizes the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation
coefficient between X and Y, while remaining faithful to the given moments? The
answer to this question is important especially in order to fit models over
(X, Y) with minimum dependence among the random variables X and Y. In this
paper, we investigate this question first in the
continuous setting by showing that the jointly Gaussian distribution achieves
the minimum HGR correlation coefficient among distributions with the given
first and second order moments. Then, we pose a similar question in the
discrete scenario by fixing the pairwise marginals of the random variables X
and Y. To answer this question in the discrete setting, we first
derive a lower bound for the HGR correlation coefficient over the class of
distributions with fixed pairwise marginals. Then we show that this lower bound
is tight if there exists a distribution with a certain additive structure
satisfying the given pairwise marginals. Moreover, the distribution with the
additive structure achieves the minimum HGR correlation coefficient. Finally,
we conclude by showing that the set of pairwise marginals that admit an
additive structured distribution has positive Lebesgue measure over the
probability simplex.
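For intuition about the quantity being minimized, here is a small sketch computing the HGR maximal correlation in the simplest discrete case, binary X and Y, where it reduces to the absolute Pearson correlation (in general it is the second-largest singular value of the matrix Q[x, y] = P(x, y) / sqrt(P(x) P(y))); the joint pmfs below are toy examples:

```python
import math

def hgr_binary(p):
    """p[x][y]: joint pmf of binary X, Y. Returns the HGR maximal correlation."""
    px1 = p[1][0] + p[1][1]            # P(X = 1)
    py1 = p[0][1] + p[1][1]            # P(Y = 1)
    cov = p[1][1] - px1 * py1          # E[XY] - E[X] E[Y]
    sx = math.sqrt(px1 * (1 - px1))
    sy = math.sqrt(py1 * (1 - py1))
    return abs(cov) / (sx * sy)

# Independent variables give HGR = 0; a fully dependent pair gives 1.
print(hgr_binary([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
print(hgr_binary([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
```

The minimization in the paper asks which joint pmf, among all those matching the fixed marginals, drives this coefficient as low as possible.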