
160 research outputs found

Learning Topic Models - Going beyond SVD

Author: Arora Sanjeev
Ge Rong
Moitra Ankur
Publication date: 01/01/2012

Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et al. is especially popular. Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves. This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main tool in this context, which is an analog of SVD where all vectors are nonnegative. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. A compelling feature of our algorithm is that it generalizes to models that incorporate topic-topic correlations, such as the Correlated Topic Model and the Pachinko Allocation Model. We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD, just as NMF has come to replace SVD in many applications.
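The generative model and the separability assumption described in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names, the toy vocabulary of four words and the two topic vectors are all hypothetical, and separability is taken here to mean that every topic has at least one "anchor" word that has nonzero probability in that topic only.

```python
def document_word_distribution(topics, proportions):
    """Mix topic vectors (each a distribution over words) by a document's topic proportions."""
    vocab_size = len(topics[0])
    return [
        sum(p * topic[w] for p, topic in zip(proportions, topics))
        for w in range(vocab_size)
    ]


def is_separable(topics):
    """Return True if every topic has a word that is nonzero in that topic and zero in all others."""
    for k, topic in enumerate(topics):
        has_anchor = any(
            topic[w] > 0
            and all(other[w] == 0 for j, other in enumerate(topics) if j != k)
            for w in range(len(topic))
        )
        if not has_anchor:
            return False
    return True


# Hypothetical toy data: 2 topics over a 4-word vocabulary.
topics = [
    [0.5, 0.0, 0.3, 0.2],  # word 0 appears only in topic 0 (anchor word)
    [0.0, 0.4, 0.3, 0.3],  # word 1 appears only in topic 1 (anchor word)
]
proportions = [0.25, 0.75]  # the document's distribution over topics

doc = document_word_distribution(topics, proportions)
print(doc)                   # word distribution of the document
print(is_separable(topics))  # whether the toy topic matrix is separable
```

Because the proportions and each topic vector sum to one, the mixed vector is again a distribution over words, which is the "convex combination" reading of the model.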

arXiv.org e-Print Archive

CiteSeerX

Princeton University Open Access Repository

Crossref

Provable ICA with Unknown Gaussian Noise, and Implications for Gaussian Mixtures and Autoencoders

Author: Arora Sanjeev
Ge Rong
Moitra Ankur
Sachdeva Sushant
Publication date: 01/01/2012

We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form $y = Ax + \eta$, where $A$ is an unknown $n \times n$ matrix, $x$ is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable, and $\eta$ is an $n$-dimensional Gaussian random variable with unknown covariance $\Sigma$. We give an algorithm that provably recovers $A$ and $\Sigma$ up to an additive $\epsilon$ and whose running time and sample complexity are polynomial in $n$ and $1/\epsilon$. To accomplish this, we introduce a novel "quasi-whitening" step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one) and this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and allows us to control the accumulation of error when we find the columns of $A$ one by one via local search.
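To make the sampling model in this abstract concrete, here is a small sketch that draws one sample of the form y = Ax + η. It only generates data and does not implement the recovery algorithm or the quasi-whitening step. Several choices are assumptions made for illustration: the function names, the use of independent ±1 components for x (their fourth moment is 1, below the value 3 of a standard Gaussian), and spherical noise with a single standard deviation in place of a general unknown covariance Σ.

```python
import random


def sample_ica(A, noise_std, rng):
    """Draw one sample y = A x + eta, with independent +/-1 entries in x."""
    n = len(A)
    # Assumption: Rademacher sources, which have unit variance and fourth moment 1.
    x = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    # Assumption: spherical Gaussian noise; the abstract allows any covariance Sigma.
    eta = [rng.gauss(0.0, noise_std) for _ in range(n)]
    y = [sum(A[i][j] * x[j] for j in range(n)) + eta[i] for i in range(n)]
    return x, y


def fourth_moment(values):
    """Empirical fourth moment of a list of numbers."""
    return sum(v ** 4 for v in values) / len(values)


rng = random.Random(0)
A = [[2.0, 1.0], [0.0, 1.0]]  # hypothetical mixing matrix
x, y = sample_ica(A, noise_std=0.1, rng=rng)
print(x, y)
print(fourth_moment(x))  # fourth moment of the +/-1 sources
```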

arXiv.org e-Print Archive

CiteSeerX

Simple, Efficient, and Neural Algorithms for Sparse Coding

Author: Arora Sanjeev
Ge Rong
Ma Tengyu
Moitra Ankur
Publication date: 01/01/2015

Sparse coding is a basic task in many fields including signal processing, neuroscience and machine learning where the goal is to learn a basis that enables a sparse representation of a given set of data, if one exists. Its standard formulation is as a non-convex optimization problem which is solved in practice by heuristics based on alternating minimization. Recent work has resulted in several algorithms for sparse coding with provable guarantees, but somewhat surprisingly these are outperformed by the simple alternating minimization heuristics. Here we give a general framework for understanding alternating minimization which we leverage to analyze existing heuristics and to design new ones also with provable guarantees. Some of these algorithms seem implementable on simple neural architectures, which was the original motivation of Olshausen and Field (1997a) in introducing sparse coding. We also give the first efficient algorithm for sparse coding that works almost up to the information theoretic limit for sparse recovery on incoherent dictionaries. All previous algorithms that approached or surpassed this limit run in time exponential in some natural parameter. Finally, our algorithms improve upon the sample complexity of existing approaches. We believe that our analysis framework will have applications in other settings where simple iterative algorithms are used. Comment: 37 pages, 1 figure
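The alternating minimization heuristic mentioned in this abstract can be sketched generically as two alternating steps: estimate a sparse code with the dictionary held fixed, then adjust the dictionary with the code held fixed. The sketch below is an illustration under assumptions and not the algorithm analyzed in the paper: the function names, the hard-thresholding decode rule, the plain gradient step, the step size and the toy data are all hypothetical, and column renormalization is omitted.

```python
def decode(A, y, threshold):
    """Sparse code estimate: correlate y with each dictionary column, zero out small entries."""
    n, m = len(A), len(A[0])
    corr = [sum(A[i][j] * y[i] for i in range(n)) for j in range(m)]
    return [c if abs(c) > threshold else 0.0 for c in corr]


def update_dictionary(A, y, x, step):
    """One gradient step on 0.5 * ||y - A x||^2 with respect to A, holding x fixed."""
    n, m = len(A), len(A[0])
    residual = [y[i] - sum(A[i][j] * x[j] for j in range(m)) for i in range(n)]
    return [[A[i][j] + step * residual[i] * x[j] for j in range(m)] for i in range(n)]


def squared_error(A, y, x):
    """Reconstruction error ||y - A x||^2."""
    n, m = len(A), len(A[0])
    return sum((y[i] - sum(A[i][j] * x[j] for j in range(m))) ** 2 for i in range(n))


# Hypothetical toy data: a slightly wrong dictionary and one observation.
A = [[0.9, 0.0], [0.0, 1.0]]
y = [1.0, 0.0]

x = decode(A, y, threshold=0.5)                # step 1: code with A fixed
A_next = update_dictionary(A, y, x, step=0.5)  # step 2: dictionary with x fixed
print(x)
print(squared_error(A, y, x), squared_error(A_next, y, x))
```

In practice the two steps would be repeated over many samples; a single pass is shown only to keep the example short.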

arXiv.org e-Print Archive

Princeton University Open Access Repository

Computing a Nonnegative Matrix Factorization -- Provably

Author: Arora Sanjeev
Ge Rong
Kannan Ravi
Moitra Ankur
Publication date: 03/11/2011

In the Nonnegative Matrix Factorization (NMF) problem we are given an $n \times m$ nonnegative matrix $M$ and an integer $r > 0$. Our goal is to express $M$ as $AW$ where $A$ and $W$ are nonnegative matrices of size $n \times r$ and $r \times m$ respectively. In some applications, it makes sense to ask instead for the product $AW$ to approximate $M$, i.e. (approximately) minimize $\|M - AW\|_F$ where $\|\cdot\|_F$ denotes the Frobenius norm; we refer to this as Approximate NMF. This problem has a rich history spanning quantum mechanics, probability theory, data analysis, polyhedral combinatorics, communication complexity, demography, chemometrics, etc. In the past decade NMF has become enormously popular in machine learning, where $A$ and $W$ are computed using a variety of local search heuristics. Vavasis proved that this problem is NP-complete. We initiate a study of when this problem is solvable in polynomial time: 1. We give a polynomial-time algorithm for exact and approximate NMF for every constant $r$. Indeed NMF is most interesting in applications precisely when $r$ is small. 2. We complement this with a hardness result, that if exact NMF can be solved in time $(nm)^{o(r)}$, 3-SAT has a sub-exponential time algorithm. This rules out substantial improvements to the above algorithm. 3. We give an algorithm that runs in time polynomial in $n$, $m$ and $r$ under the separability condition identified by Donoho and Stodden in 2003. The algorithm may be practical since it is simple and noise tolerant (under benign assumptions). Separability is believed to hold in many practical settings. To the best of our knowledge, this last result is the first example of a polynomial-time algorithm that provably works under a non-trivial condition on the input and we believe that this will be an interesting and important direction for future work. Comment: 29 pages, 3 figures
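A small worked example may help with the problem statement and the separability condition in this abstract. It is a sketch under assumptions, not the paper's algorithm: the helper names and the toy matrices are hypothetical, and separability is read here as "for every column of A there is a row of A whose only nonzero entry lies in that column". Under that reading, the corresponding row of M is a scaled copy of a row of W.

```python
def matmul(A, W):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a * w for a, w in zip(row, col)) for col in zip(*W)] for row in A]


def frobenius_error(M, A, W):
    """Frobenius norm of M - A W."""
    P = matmul(A, W)
    return sum((m - p) ** 2 for rm, rp in zip(M, P) for m, p in zip(rm, rp)) ** 0.5


def anchor_rows(A):
    """For each column k, index of a row whose only nonzero entry is in column k (None if absent)."""
    r = len(A[0])
    anchors = []
    for k in range(r):
        found = None
        for i, row in enumerate(A):
            if row[k] > 0 and all(row[j] == 0 for j in range(r) if j != k):
                found = i
                break
        anchors.append(found)
    return anchors


# Hypothetical toy instance with n = 3, m = 3, r = 2.
A = [[2, 0], [0, 3], [1, 1]]
W = [[1, 0, 2], [0, 1, 1]]
M = matmul(A, W)

print(M)                         # the nonnegative matrix to be factored
print(frobenius_error(M, A, W))  # zero for an exact factorization
print(anchor_rows(A))            # rows of A that witness separability
```

Here row 0 of M equals 2 times row 0 of W and row 1 of M equals 3 times row 1 of W, so the rows of W can be read off from M up to scaling.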

arXiv.org e-Print Archive

CiteSeerX

Princeton University Open Access Repository

Crossref

Sampling U(1) gauge theory using a re-trainable conditional flow-based model

Author: Arora Vipul
Chakrabarti Dipankar
Singha Ankur
Publication date: 01/06/2023

Sampling topological quantities in the Monte Carlo simulation of Lattice Gauge Theory becomes challenging as we approach the continuum limit of the theory. In this work, we introduce a Conditional Normalizing Flow (C-NF) model to sample U(1) gauge theory in two dimensions, aiming to mitigate the impact of topological freezing when dealing with smaller values of the U(1) bare coupling. To train the conditional flow model, we utilize samples generated by the Hybrid Monte Carlo (HMC) method, ensuring that the autocorrelation in topological quantities remains low. Subsequently, we employ the trained model to interpolate the coupling parameter to values where training was not performed. We thoroughly examine the quality of the model in this region and generate uncorrelated samples, significantly reducing the occurrence of topological freezing. Furthermore, we propose a re-trainable approach that utilizes the model's own samples to enhance the generalization capability of the conditional model. This method enables sampling for coupling values that are far beyond the initial training region, expanding the applicability of the model.

arXiv.org e-Print Archive

Simple, efficient, and neural algorithms for sparse coding

Author: Arora Sanjeev
Ge Rong
Ma Tengyu
Moitra Ankur
Publication venue: Proceedings of Machine Learning Research
Publication date: 29/05/2018

DSpace@MIT

A Survey to Estimate the Prevalence of Tooth Loss and Denture Wearers in Subjects of Different Age Groups of South Coastal Karnataka Region

Author: Ankur Sabharwal
Ravneet Malhi
Sameksha Arora
Shreya Sabharwal
Publication venue: Vatsul Sharma
Publication date: 10/06/2017

INTRODUCTION: Advancing age brings countless new health problems along with the exacerbation of existing ones. Dental awareness has led to a decrease in edentulousness in elderly people. The reasons for tooth loss also differ between age groups. AIM & OBJECTIVES: The objectives of this study were to estimate the prevalence of tooth loss and denture wearers in various age groups and to evaluate the reasons for tooth loss. MATERIALS & METHOD: This questionnaire-based study was conducted among patients visiting the Department of Prosthodontics, Crown and Bridge & Implantology. Subjects were interviewed and examined clinically by a single examiner. A representative convenience sample of 150 patients in the age groups of 40-50, 50-60 and 60-70 years was included in the study. Descriptive statistics were applied and the Chi-square test was used to analyse the findings using SPSS version 17.0. RESULTS: Tooth loss was found to be highest in the age group of 60-70 years, and almost 64% wore complete dentures. The patients in the age group of 40-50 years had the highest percentage of natural teeth (60%). Poor periodontal support was the main cause of tooth loss in almost 74% of patients in the age group of 60-70 years. Caries was the predominant cause of tooth loss in the age group of 40-50 years. CONCLUSION: The prevalence of tooth loss and denture wearers is highest in the older age groups. Loss of periodontal support is the main cause of tooth loss as age advances, while caries is the major cause in younger individuals.

International Healthcare Research Journal (IHRJ)

A Practical Algorithm for Topic Modeling with Provable Guarantees

Author: Arora Sanjeev
Ge Rong
Halpern Yoni
Mimno David
Moitra Ankur
Sontag David
Wu Yichen
Zhu Michael
Publication date: 19/12/2012

Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model inference have been based on a maximum likelihood objective. Efficient algorithms exist that approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this paper we present an algorithm for topic model inference that is both provable and practical. The algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster. Comment: 26 pages

arXiv.org e-Print Archive

CiteSeerX

Princeton University Open Access Repository

Provable algorithms for inference in topic models

Author: Arora Sanjeev
Ge Rong
Koehler Frederic
Ma Tengyu
Moitra Ankur
Publication venue: PMLR
Publication date: 29/05/2018

Recently, there has been considerable progress on designing algorithms with provable guarantees - typically using linear algebraic methods - for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be more challenging. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us to construct simple linear estimators for the unknown topic proportions that have small variance, and consequently can work with short documents. Our estimators also correspond to finding an estimate around which the posterior is well-concentrated. We show lower bounds that for shorter documents it can be information theoretically impossible to find the hidden topics. Finally, we give empirical results that demonstrate that our algorithm works on realistic topic models. It yields good solutions on synthetic data and runs in time comparable to a single iteration of Gibbs sampling. Funding: National Science Foundation (U.S.) (CAREER Award CCF1453261); Google (Firm) (Faculty Research Award); NEC Corporation

DSpace@MIT