A new SVD approach to optimal topic estimation
In probabilistic topic models, the quantity of interest, a low-rank
matrix consisting of topic vectors, is hidden in the text corpus matrix,
masked by noise, and Singular Value Decomposition (SVD) is a potentially useful
tool for learning such a matrix. However, different rows and columns of the
matrix are usually on very different scales, and the connection between this
matrix and the singular vectors of the text corpus matrix is usually
complicated and hard to spell out, so using SVD to learn topic models
faces challenges.
We overcome the challenges by introducing a proper Pre-SVD normalization of
the text corpus matrix and a proper column-wise scaling for the matrix of
interest, and by revealing a surprising Post-SVD low-dimensional {\it simplex}
structure. The simplex structure, together with the Pre-SVD normalization and
column-wise scaling, allows us to conveniently reconstruct the matrix of
interest, and motivates a new SVD-based approach to learning topic models.
We show that under the popular probabilistic topic model \citep{hofmann1999},
our method has a faster rate of convergence than existing methods in a wide
variety of cases. In particular, for cases where documents are long or where
one of the problem dimensions is much larger than the other, our method
achieves the optimal rate. At the heart of the proofs is a tight element-wise
bound on the singular vectors of a multinomially distributed data matrix,
which does not exist in the literature and which we derive ourselves.
We have applied our method to two data sets, Associated Press (AP) and
Statistics Literature Abstract (SLA), with encouraging results. In particular,
there is a clear simplex structure associated with the SVD of the data
matrices, which largely validates our discovery.
Comment: 73 pages, 8 figures, 6 tables; considers two different VH algorithms,
OVH and GVH, and provides a theoretical analysis for each; re-organizes the
upper bound theory part; adds a subsection comparing error rates with other
existing methods; provides another improved version of the error analysis
through a Bernstein inequality for martingales
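The pre-SVD normalization, rank-K SVD, and post-SVD simplex can be sketched on synthetic multinomial data. The normalization below (column-normalizing documents, then scaling word rows by an inverse square-root frequency) is an illustrative assumption, not necessarily the paper's exact choice, and the toy corpus and all dimensions are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: p words x n documents drawn from K topics
# (a synthetic stand-in, not the paper's AP/SLA data).
p, n, K = 50, 200, 3
topics = rng.dirichlet(np.ones(p) * 0.3, size=K)    # K x p topic vectors
weights = rng.dirichlet(np.ones(K), size=n)         # n x K document weights
probs = weights @ topics                            # n x p word probabilities
counts = np.array([rng.multinomial(500, q) for q in probs]).T  # p x n
counts = counts[counts.sum(axis=1) > 0]             # drop unseen words, if any

# Hypothetical pre-SVD normalization: column-normalize documents, then
# scale each word row by an inverse square-root overall frequency.
freq = counts.sum(axis=1) / counts.sum()
D = counts / counts.sum(axis=0, keepdims=True)
M = D / np.sqrt(freq)[:, None]

# Rank-K SVD; the post-SVD entrywise ratios xi_k / xi_1 give each word
# a (K-1)-dimensional point, which should scatter inside a simplex
# with K vertices (one per topic).
U, s, Vt = np.linalg.svd(M, full_matrices=False)
ratios = U[:, 1:K] / U[:, [0]]
print(ratios.shape)   # one (K-1)-dimensional point per retained word
```

Plotting the rows of `ratios` on such data shows the point cloud concentrating inside a triangle (K = 3), whose vertices correspond to the topics.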
Formal Verification of Probabilistic SystemC Models with Statistical Model Checking
Transaction-level modeling with SystemC has been very successful in
describing the behavior of embedded systems by providing high-level executable
models, many of which have inherent probabilistic behaviors, e.g.,
random data and unreliable components. It is thus crucial to have both
quantitative and qualitative analysis of the probabilities of system
properties. Such analysis can be conducted by constructing a formal model of
the system under verification and using Probabilistic Model Checking (PMC).
However, this method is infeasible for large systems due to state space
explosion. In this article, we demonstrate the successful use of Statistical
Model Checking (SMC) to carry out such analysis directly from large SystemC
models while allowing designers to express a wide range of useful properties. The
first contribution of this work is a framework to verify properties expressed
in Bounded Linear Temporal Logic (BLTL) for SystemC models with both timed and
probabilistic characteristics. Second, the framework allows users to expose a
rich set of user-code primitives as atomic propositions in BLTL. Moreover,
users can define their own fine-grained time resolution rather than being
restricted to clock-cycle boundaries in the SystemC simulation. The third contribution is
an implementation of a statistical model checker. It contains automatic
monitor generation for producing execution traces of the
model-under-verification (MUV), a mechanism for automatically instrumenting
the MUV, and the interaction with statistical model checking algorithms.
Comment: Journal of Software: Evolution and Process, Wiley, 2017. arXiv admin
note: substantial text overlap with arXiv:1507.0818
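The SMC workflow the article describes, sampling execution traces from an executable model, checking a bounded property on each, and aggregating with a statistical guarantee, can be sketched in a few lines. The toy model, the property, and all parameter names below are illustrative assumptions, not the paper's framework; the sample-size formula is the standard Chernoff-Hoeffding (Okamoto) bound:

```python
import math
import random

# Toy probabilistic "model": a channel that drops each packet
# independently with probability 0.1. The bounded property is a
# BLTL-style "no drop within the first 20 transmissions".

def simulate_trace(steps=20, p_drop=0.1, rng=random):
    # One execution trace; True iff the bounded property holds on it.
    return all(rng.random() >= p_drop for _ in range(steps))

def smc_estimate(eps=0.01, delta=0.05, seed=1):
    # Chernoff-Hoeffding (Okamoto) bound: n samples suffice for an
    # estimate within eps of the true probability with confidence
    # at least 1 - delta.
    n = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    rng = random.Random(seed)
    hits = sum(simulate_trace(rng=rng) for _ in range(n))
    return hits / n, n

p_hat, n = smc_estimate()
print(f"{n} traces, estimated P ~ {p_hat:.3f}")  # true value is 0.9**20 ~ 0.122
```

The point of SMC is visible here: the cost depends only on the number of sampled traces, never on the model's state space, which is what makes it applicable to large SystemC models where exhaustive PMC is infeasible.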
Robust Principal Component Analysis for Exponential Family Distributions
Robust Principal Component Analysis (RPCA) is a widely used method for
recovering low-rank structure from data matrices corrupted by significant and
sparse outliers. These corruptions may arise from occlusions, malicious
tampering, or other sources of anomalies, and the joint identification of such
corruptions with the low-rank background is critical for process monitoring and
diagnosis. However, existing RPCA methods and their extensions largely do not
account for the underlying probabilistic distribution of the data matrices,
which in many applications is known and can be highly non-Gaussian. We thus
propose a new method, Robust Principal Component Analysis for Exponential
Family distributions, which can perform the desired decomposition into
low-rank and sparse matrices when such a distribution falls within the
exponential family. We present a novel alternating direction method of
multipliers optimization algorithm for efficient decomposition. The
effectiveness of the method is then demonstrated in two applications: the
first for steel sheet defect detection, and the second for crime activity
monitoring in the Atlanta metropolitan area.
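For contrast with the exponential-family variant proposed here, the standard Gaussian RPCA decomposition M ≈ L + S can be sketched with a plain ADMM loop: singular-value thresholding for the low-rank part and entrywise soft thresholding for the sparse part. This is the classical baseline, not the paper's algorithm; the parameter defaults follow common convention and the test data is synthetic:

```python
import numpy as np

def soft(x, t):
    # Entrywise soft-thresholding operator.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def rpca_admm(M, lam=None, mu=None, n_iter=300):
    """Standard (Gaussian) RPCA, M ~ L + S, via a fixed-penalty ADMM
    loop; a baseline sketch, not the exponential-family method."""
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else m * n / (4.0 * np.abs(M).sum())
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(n_iter):
        # L-update: singular value thresholding
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ np.diag(soft(sig, 1.0 / mu)) @ Vt
        # S-update: entrywise soft thresholding
        S = soft(M - L + Y / mu, lam / mu)
        # Dual update on the residual
        Y += mu * (M - L - S)
    return L, S

rng = np.random.default_rng(0)
L0 = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 60))  # rank 5
S0 = np.zeros((60, 60))
mask = rng.random((60, 60)) < 0.05          # 5% sparse corruptions
S0[mask] = 10 * rng.standard_normal(mask.sum())
L, S = rpca_admm(L0 + S0)
print(np.linalg.norm(L - L0) / np.linalg.norm(L0))  # should be small
```

The exponential-family extension replaces the implicit Gaussian data-fit term with the appropriate exponential-family likelihood, which changes the L- and S-updates accordingly.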
Feature extraction for document image segmentation by pLSA model
In this paper, we propose a method for document image segmentation based on the pLSA (probabilistic latent semantic analysis) model. The pLSA model was originally developed for topic discovery in text analysis using the "bag-of-words" document representation. The model is also useful for image analysis via a "bag-of-visual-words" image representation. The performance of the method depends on the visual vocabulary generated by feature extraction from the document image. We compare several feature extraction and description methods and examine their relation to segmentation performance. Through the experiments, we show that accurate content-based document segmentation is made possible by the pLSA-based method.
Article: The Eighth IAPR Workshop on Document Analysis Systems, conference paper
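The underlying pLSA model on a "bag-of-visual-words" count matrix can be sketched with vanilla EM. This is a generic illustration of the model, not the paper's segmentation pipeline, and the toy counts are invented:

```python
import numpy as np

def plsa(counts, K, n_iter=100, seed=0):
    """Vanilla pLSA via EM on a word x document count matrix: a generic
    sketch of the underlying model, not the paper's segmentation code."""
    rng = np.random.default_rng(seed)
    W, D = counts.shape
    p_w_z = rng.random((W, K)); p_w_z /= p_w_z.sum(0)   # P(word | topic)
    p_z_d = rng.random((K, D)); p_z_d /= p_z_d.sum(0)   # P(topic | doc)
    for _ in range(n_iter):
        # E-step: responsibilities P(z | w, d), shape (W, D, K)
        joint = p_w_z[:, None, :] * p_z_d.T[None, :, :]
        resp = joint / (joint.sum(-1, keepdims=True) + 1e-12)
        # M-step: re-estimate both distributions from expected counts
        nk = counts[:, :, None] * resp
        p_w_z = nk.sum(1) + 1e-12; p_w_z /= p_w_z.sum(0, keepdims=True)
        p_z_d = nk.sum(0).T + 1e-12; p_z_d /= p_z_d.sum(0, keepdims=True)
    return p_w_z, p_z_d

# Invented "visual word" counts for 3 image regions and 2 latent topics.
counts = np.array([[30, 28, 1],
                   [25, 30, 0],
                   [1,  2, 40],
                   [0,  1, 35]], dtype=float)
p_w_z, p_z_d = plsa(counts, K=2)
labels = p_z_d.argmax(0)   # dominant topic per region: a crude segmentation
print(labels)
```

In the image setting, each "document" is a region and each "word" is a quantized local descriptor; labeling each region by its dominant topic yields a content-based segmentation, which is why the visual vocabulary quality matters so much.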
Temporal analysis of text data using latent variable models
Detection and tracking of temporal data is an important task in multiple applications. In this paper we study temporal text mining methods for Music Information Retrieval. We compare two ways of detecting the temporal latent semantics of a corpus extracted from Wikipedia: a stepwise Probabilistic Latent Semantic Analysis (PLSA) approach and a global multiway PLSA method. The analysis indicates that the global method is able to identify relevant trends which are difficult to obtain using a step-by-step approach. Furthermore, we show that inspection of PLSA models with different numbers of factors may reveal the stability of temporal clusters, making it possible to choose the relevant number of factors.
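One reason the stepwise approach is harder than global multiway PLSA is that topics fitted independently on each time slice come back in arbitrary order, so they must be aligned before trends can be tracked. A minimal (assumed) alignment step using cosine similarity between the P(word | topic) columns of two consecutive fits:

```python
import numpy as np

def match_topics(p_w_z_t, p_w_z_next):
    # Normalize each topic column, compare all pairs by cosine
    # similarity, and greedily pick the best match per old topic.
    a = p_w_z_t / np.linalg.norm(p_w_z_t, axis=0)
    b = p_w_z_next / np.linalg.norm(p_w_z_next, axis=0)
    sim = a.T @ b                 # K x K cosine similarity matrix
    return sim.argmax(axis=1)

t0 = np.array([[0.8, 0.1], [0.1, 0.8], [0.1, 0.1]])   # invented fit at time t
t1 = np.array([[0.1, 0.7], [0.7, 0.2], [0.2, 0.1]])   # columns permuted at t+1
print(match_topics(t0, t1))   # topic 0 now appears as column 1, and vice versa
```

A global multiway PLSA fit avoids this alignment problem entirely by sharing one set of topics across all time slices, which is consistent with the paper's finding that the global method recovers trends the stepwise approach misses.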