160 research outputs found
Learning Topic Models - Going beyond SVD
Topic Modeling is an approach used for automatic comprehension and
classification of data in a variety of settings, and perhaps the canonical
application is in uncovering thematic structure in a corpus of documents. A
number of foundational works both in machine learning and in theory have
suggested a probabilistic model for documents, whereby documents arise as a
convex combination of (i.e. distribution on) a small number of topic vectors,
each topic vector being a distribution on words (i.e. a vector of
word-frequencies). Similar models have since been used in a variety of
application areas; the Latent Dirichlet Allocation or LDA model of Blei et al.
is especially popular.
Theoretical studies of topic modeling focus on learning the model's
parameters assuming the data is actually generated from it. Existing approaches
for the most part rely on Singular Value Decomposition (SVD), and consequently
have one of two limitations: these works need to either assume that each
document contains only one topic, or else can only recover the span of the
topic vectors instead of the topic vectors themselves.
This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main
tool in this context, which is an analog of SVD where all vectors are
nonnegative. Using this tool we give the first polynomial-time algorithm for
learning topic models without the above two limitations. The algorithm uses a
fairly mild assumption about the underlying topic matrix called separability,
which is usually found to hold in real-life data. A compelling feature of our
algorithm is that it generalizes to models that incorporate topic-topic
correlations, such as the Correlated Topic Model and the Pachinko Allocation
Model.
We hope that this paper will motivate further theoretical results that use
NMF as a replacement for SVD - just as NMF has come to replace SVD in many
applications.
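The document model described in this abstract can be illustrated with a small sketch. It assumes a toy vocabulary of four words and two topics; the function names `mix_topics` and `anchor_words` are hypothetical, and separability is taken here to mean that each topic has at least one "anchor" word with positive probability under that topic only. This is an illustration of the model, not the paper's learning algorithm.

```python
def mix_topics(topics, weights):
    """Word distribution of a document: a convex combination of topic vectors."""
    n_words = len(topics[0])
    return [sum(w * t[j] for w, t in zip(weights, topics)) for j in range(n_words)]


def anchor_words(topics):
    """For each topic, list the words that have positive probability in that topic only."""
    anchors = []
    for k, topic in enumerate(topics):
        own = [
            j
            for j, p in enumerate(topic)
            if p > 0 and all(other[j] == 0 for i, other in enumerate(topics) if i != k)
        ]
        anchors.append(own)
    return anchors


# Toy topic matrix (assumed values): each row is a distribution over 4 words.
topics = [
    [0.5, 0.0, 0.3, 0.2],  # word 0 appears only in topic 0
    [0.0, 0.4, 0.3, 0.3],  # word 1 appears only in topic 1
]
weights = [0.25, 0.75]  # topic proportions of one document

doc = mix_topics(topics, weights)   # still a distribution over words
anchors = anchor_words(topics)      # one list of anchor words per topic
```

In this toy case every topic has an anchor word, so the topic matrix is separable in the sense assumed above.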
Provable ICA with Unknown Gaussian Noise, and Implications for Gaussian Mixtures and Autoencoders
We present a new algorithm for Independent Component Analysis (ICA) which has
provable performance guarantees. In particular, suppose we are given samples of
the form $y = Ax + \eta$ where $A$ is an unknown $n \times n$ matrix and $x$ is
a random variable whose components are independent and have a fourth moment
strictly less than that of a standard Gaussian random variable and $\eta$ is an
$n$-dimensional Gaussian random variable with unknown covariance $\Sigma$. We
give an algorithm that provably recovers $A$ and $\Sigma$ up to an additive
$\epsilon$ and whose running time and sample complexity are polynomial in $n$
and $1/\epsilon$. To accomplish this, we introduce a novel "quasi-whitening"
step that may be useful in other contexts in which the covariance of Gaussian
noise is not known in advance. We also give a general framework for finding all
local optima of a function (given an oracle for approximately finding just one)
and this is a crucial step in our algorithm, one that has been overlooked in
previous attempts, and allows us to control the accumulation of error when we
find the columns of $A$ one by one via local search.
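As a rough illustration of the setting, the sketch below generates one noisy observation and checks the fourth-moment condition empirically. The notation is an assumption made for this example: an observation is written as y = A x + eta, the components of x are independent random signs (fourth moment 1, below the value 3 of a standard Gaussian), and the Gaussian noise eta is given a diagonal covariance for simplicity. The helper names are hypothetical and nothing here implements the paper's recovery algorithm.

```python
import random


def fourth_moment(samples):
    """Empirical fourth moment of a list of scalar samples."""
    return sum(s ** 4 for s in samples) / len(samples)


def sample_observation(A, noise_std, rng):
    """Draw x with independent +/-1 entries and return (x, y) with y = A x + eta.

    eta is Gaussian with independent coordinates (an assumed, simplified
    covariance); noise_std[i] is the standard deviation of coordinate i.
    """
    n = len(A)
    x = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    eta = [rng.gauss(0.0, noise_std[i]) for i in range(n)]
    y = [sum(A[i][j] * x[j] for j in range(n)) + eta[i] for i in range(n)]
    return x, y


rng = random.Random(0)
A = [[2.0, 1.0], [0.5, 3.0]]  # assumed mixing matrix for the example

signs = [rng.choice((-1.0, 1.0)) for _ in range(1000)]
gauss = [rng.gauss(0.0, 1.0) for _ in range(20000)]

m4_signs = fourth_moment(signs)  # exactly 1 for +/-1 variables
m4_gauss = fourth_moment(gauss)  # close to 3 for a standard Gaussian
x, y = sample_observation(A, [0.1, 0.2], rng)
```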
Simple, Efficient, and Neural Algorithms for Sparse Coding
Sparse coding is a basic task in many fields including signal processing,
neuroscience and machine learning where the goal is to learn a basis that
enables a sparse representation of a given set of data, if one exists. Its
standard formulation is as a non-convex optimization problem which is solved in
practice by heuristics based on alternating minimization. Recent work has
resulted in several algorithms for sparse coding with provable guarantees, but
somewhat surprisingly these are outperformed by the simple alternating
minimization heuristics. Here we give a general framework for understanding
alternating minimization which we leverage to analyze existing heuristics and
to design new ones also with provable guarantees. Some of these algorithms seem
implementable on simple neural architectures, which was the original motivation
of Olshausen and Field (1997a) in introducing sparse coding. We also give the
first efficient algorithm for sparse coding that works almost up to the
information theoretic limit for sparse recovery on incoherent dictionaries. All
previous algorithms that approached or surpassed this limit run in time
exponential in some natural parameter. Finally, our algorithms improve upon the
sample complexity of existing approaches. We believe that our analysis
framework will have applications in other settings where simple iterative
algorithms are used. Comment: 37 pages, 1 figure
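A generic alternating-minimization loop for sparse coding can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithms analyzed in the paper: the decoding step is assumed to be hard thresholding of the correlations between a sample and the dictionary columns, the update step is a single gradient step on the squared reconstruction error followed by column renormalization, and all function and parameter names are hypothetical.

```python
import math


def decode(A, y, threshold):
    """Sparse code for y: correlate y with each column of A, keep only large entries."""
    n, m = len(A), len(A[0])
    corr = [sum(A[i][k] * y[i] for i in range(n)) for k in range(m)]
    return [c if abs(c) > threshold else 0.0 for c in corr]


def update(A, samples, codes, step):
    """Move A along the averaged descent direction (y - A x) x^T, then renormalize columns."""
    n, m = len(A), len(A[0])
    direction = [[0.0] * m for _ in range(n)]
    for y, x in zip(samples, codes):
        residual = [y[i] - sum(A[i][k] * x[k] for k in range(m)) for i in range(n)]
        for i in range(n):
            for k in range(m):
                direction[i][k] += residual[i] * x[k] / len(samples)
    B = [[A[i][k] + step * direction[i][k] for k in range(m)] for i in range(n)]
    for k in range(m):
        # Guard against a zero column so the division below is always defined.
        norm = math.sqrt(sum(B[i][k] ** 2 for i in range(n))) or 1.0
        for i in range(n):
            B[i][k] /= norm
    return B


def alternating_minimization(A, samples, threshold, step, iterations):
    """Alternate between decoding all samples and updating the dictionary."""
    for _ in range(iterations):
        codes = [decode(A, y, threshold) for y in samples]
        A = update(A, samples, codes, step)
    return A


# Toy usage: samples that are exactly sparse in the identity dictionary.
identity = [[1.0, 0.0], [0.0, 1.0]]
samples = [[2.0, 0.0], [0.0, -3.0]]
learned = alternating_minimization(identity, samples, threshold=0.5, step=0.1, iterations=5)
```

Starting from the dictionary that generated the samples, every residual is zero, so the loop leaves the dictionary unchanged; from other starting points the behavior depends on the threshold and step size, which would need tuning.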
Computing a Nonnegative Matrix Factorization -- Provably
In the Nonnegative Matrix Factorization (NMF) problem we are given an $n \times m$ nonnegative matrix $M$ and an integer $r > 0$. Our goal is to express
$M$ as $AW$ where $A$ and $W$ are nonnegative matrices of size $n \times r$
and $r \times m$ respectively. In some applications, it makes sense to ask
instead for the product $AW$ to approximate $M$ -- i.e. (approximately)
minimize $\norm{M - AW}_F$ where $\norm{}_F$ denotes the Frobenius norm; we
refer to this as Approximate NMF. This problem has a rich history spanning
quantum mechanics, probability theory, data analysis, polyhedral combinatorics,
communication complexity, demography, chemometrics, etc. In the past decade NMF
has become enormously popular in machine learning, where $A$ and $W$ are
computed using a variety of local search heuristics. Vavasis proved that this
problem is NP-complete. We initiate a study of when this problem is solvable in
polynomial time:
1. We give a polynomial-time algorithm for exact and approximate NMF for
every constant $r$. Indeed NMF is most interesting in applications precisely
when $r$ is small.
2. We complement this with a hardness result, that if exact NMF can be solved
in time $(nm)^{o(r)}$, 3-SAT has a sub-exponential time algorithm. This rules
out substantial improvements to the above algorithm.
3. We give an algorithm that runs in time polynomial in $n$, $m$ and $r$
under the separability condition identified by Donoho and Stodden in 2003. The
algorithm may be practical since it is simple and noise tolerant (under benign
assumptions). Separability is believed to hold in many practical settings.
To the best of our knowledge, this last result is the first example of a
polynomial-time algorithm that provably works under a non-trivial condition on
the input and we believe that this will be an interesting and important
direction for future work. Comment: 29 pages, 3 figures
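The objective and the separability condition can be made concrete with a small sketch. The notation is assumed for this example: M is the nonnegative input matrix and A, W are the nonnegative factors, with the error measured in the Frobenius norm. Separability is taken to mean that every column of A has an "anchor" row whose only nonzero entry lies in that column, so the corresponding row of M is a scaled copy of a row of W. The helper names are hypothetical, and the sketch reads the anchors off a known A for illustration only; a real algorithm has to locate them from M alone.

```python
import math


def matmul(A, W):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(A[i][k] * W[k][j] for k in range(len(W))) for j in range(len(W[0]))]
        for i in range(len(A))
    ]


def frobenius_error(M, A, W):
    """Frobenius norm of M - AW."""
    P = matmul(A, W)
    return math.sqrt(
        sum((M[i][j] - P[i][j]) ** 2 for i in range(len(M)) for j in range(len(M[0])))
    )


def anchor_rows(A):
    """For each column k of A, the rows whose only nonzero entry is in column k."""
    r = len(A[0])
    return [
        [
            i
            for i, row in enumerate(A)
            if row[k] != 0 and all(row[j] == 0 for j in range(r) if j != k)
        ]
        for k in range(r)
    ]


# Toy separable instance (assumed values): rows 0 and 1 of A are anchors.
A = [[2, 0], [0, 3], [1, 1]]
W = [[1, 0, 2], [0, 1, 1]]
M = matmul(A, W)

anchors = anchor_rows(A)               # one list of anchor rows per column of A
exact_error = frobenius_error(M, A, W)  # zero for an exact factorization
```

Here row 0 of M equals 2 times row 0 of W and row 1 of M equals 3 times row 1 of W, which is the structure that separability exposes.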
Sampling U(1) gauge theory using a re-trainable conditional flow-based model
Sampling topological quantities in the Monte Carlo simulation of Lattice
Gauge Theory becomes challenging as we approach the continuum limit of the
theory. In this work, we introduce a Conditional Normalizing Flow (C-NF) model
to sample U(1) gauge theory in two dimensions, aiming to mitigate the impact of
topological freezing when dealing with smaller values of the U(1) bare
coupling. To train the conditional flow model, we utilize samples generated by
the Hybrid Monte Carlo (HMC) method, ensuring that the autocorrelation in
topological quantities remains low. Subsequently, we employ the trained model
to interpolate the coupling parameter to values where training was not
performed. We thoroughly examine the quality of the model in this region and
generate uncorrelated samples, significantly reducing the occurrence of
topological freezing. Furthermore, we propose a re-trainable approach that
utilizes the model's own samples to enhance the generalization capability of
the conditional model. This method enables sampling for coupling values that
are far beyond the initial training region, expanding the applicability of the
model.
Simple, efficient, and neural algorithms for sparse coding
Sparse coding is a basic task in many fields including signal processing, neuroscience and machine learning where the goal is to learn a basis that enables a sparse representation of a given set of data, if one exists. Its standard formulation is as a non-convex optimization problem which is solved in practice by heuristics based on alternating minimization. Recent work has resulted in several algorithms for sparse coding with provable guarantees, but somewhat surprisingly these are outperformed by the simple alternating minimization heuristics. Here we give a general framework for understanding alternating minimization which we leverage to analyze existing heuristics and to design new ones also with provable guarantees. Some of these algorithms seem implementable on simple neural architectures, which was the original motivation of Olshausen and Field (1997a) in introducing sparse coding. We also give the first efficient algorithm for sparse coding that works almost up to the information theoretic limit for sparse recovery on incoherent dictionaries. All previous algorithms that approached or surpassed this limit run in time exponential in some natural parameter. Finally, our algorithms improve upon the sample complexity of existing approaches. We believe that our analysis framework will have applications in other settings where simple iterative algorithms are used.
A Survey to Estimate the Prevalence of Tooth Loss and Denture Wearers in Subjects of Different Age Groups of South Coastal Karnataka Region
INTRODUCTION: Advancing age brings countless new health problems along with the exacerbation of existing ones. Dental awareness has led to a decrease in edentulousness in elderly people. The reasons for tooth loss also differ between age groups. AIM & OBJECTIVES: The objectives of this study were to estimate the prevalence of tooth loss and denture wearers in various age groups and to evaluate the reasons for tooth loss. MATERIALS & METHOD: This questionnaire-based study was conducted among patients visiting the department of Prosthodontics Crown and Bridge & Implantology. Subjects were interviewed and examined clinically by a single examiner. A representative convenience sample of 150 patients in the age groups 40-50, 50-60 and 60-70 years was included in the study. Descriptive statistics were applied and the Chi-square test was used to analyse the findings using SPSS version 17.0. RESULTS: Tooth loss was found to be highest in the 60-70 year age group and almost 64% wore complete dentures. Patients in the 40-50 year age group had the highest percentage of natural teeth (60%). Poor periodontal support was the main cause of tooth loss in almost 74% of patients in the 60-70 year age group. Caries was the predominant cause of tooth loss in the 40-50 year age group. CONCLUSION: The prevalence of tooth loss and denture wearers is highest in the older age groups. Loss of periodontal support is the main cause of tooth loss as age advances, while caries is the major cause in younger individuals.
A Practical Algorithm for Topic Modeling with Provable Guarantees
Topic models provide a useful method for dimensionality reduction and
exploratory data analysis in large text corpora. Most approaches to topic model
inference have been based on a maximum likelihood objective. Efficient
algorithms exist that approximate this objective, but they have no provable
guarantees. Recently, algorithms have been introduced that provide provable
bounds, but these algorithms are not practical because they are inefficient and
not robust to violations of model assumptions. In this paper we present an
algorithm for topic model inference that is both provable and practical. The
algorithm produces results comparable to the best MCMC implementations while
running orders of magnitude faster. Comment: 26 pages
Provable algorithms for inference in topic models
Recently, there has been considerable progress on designing algorithms with provable guarantees - typically using linear algebraic methods - for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be more challenging. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us to construct simple linear estimators for the unknown topic proportions that have small variance, and consequently can work with short documents. Our estimators also correspond to finding an estimate around which the posterior is well-concentrated. We show lower bounds establishing that for shorter documents it can be information-theoretically impossible to find the hidden topics. Finally, we give empirical results that demonstrate that our algorithm works on realistic topic models. It yields good solutions on synthetic data and runs in time comparable to a single iteration of Gibbs sampling. Funding: National Science Foundation (U.S.) (CAREER Award CCF1453261); Google (Firm) (Faculty Research Award); NEC Corporation
- …