170,852 research outputs found

    A new SVD approach to optimal topic estimation

    Full text link
    In the probabilistic topic models, the quantity of interest---a low-rank matrix consisting of topic vectors---is hidden in the text corpus matrix, masked by noise, and Singular Value Decomposition (SVD) is a potentially useful tool for learning such a matrix. However, different rows and columns of the matrix are usually in very different scales and the connection between this matrix and the singular vectors of the text corpus matrix are usually complicated and hard to spell out, so how to use SVD for learning topic models faces challenges. We overcome the challenges by introducing a proper Pre-SVD normalization of the text corpus matrix and a proper column-wise scaling for the matrix of interest, and by revealing a surprising Post-SVD low-dimensional {\it simplex} structure. The simplex structure, together with the Pre-SVD normalization and column-wise scaling, allows us to conveniently reconstruct the matrix of interest, and motivates a new SVD-based approach to learning topic models. We show that under the popular probabilistic topic model \citep{hofmann1999}, our method has a faster rate of convergence than existing methods in a wide variety of cases. In particular, for cases where documents are long or nn is much larger than pp, our method achieves the optimal rate. At the heart of the proofs is a tight element-wise bound on singular vectors of a multinomially distributed data matrix, which do not exist in literature and we have to derive by ourself. We have applied our method to two data sets, Associated Process (AP) and Statistics Literature Abstract (SLA), with encouraging results. In particular, there is a clear simplex structure associated with the SVD of the data matrices, which largely validates our discovery.Comment: 73 pages, 8 figures, 6 tables; considered two different VH algorithm, OVH and GVH, and provided theoretical analysis for each algorithm; re-organized upper bound theory part; added the subsection of comparing error rate with other existing methods; provided another improved version of error analysis through Bernstein inequality for martingale

    Formal Verification of Probabilistic SystemC Models with Statistical Model Checking

    Full text link
    Transaction-level modeling with SystemC has been very successful in describing the behavior of embedded systems by providing high-level executable models, in which many of them have inherent probabilistic behaviors, e.g., random data and unreliable components. It thus is crucial to have both quantitative and qualitative analysis of the probabilities of system properties. Such analysis can be conducted by constructing a formal model of the system under verification and using Probabilistic Model Checking (PMC). However, this method is infeasible for large systems, due to the state space explosion. In this article, we demonstrate the successful use of Statistical Model Checking (SMC) to carry out such analysis directly from large SystemC models and allow designers to express a wide range of useful properties. The first contribution of this work is a framework to verify properties expressed in Bounded Linear Temporal Logic (BLTL) for SystemC models with both timed and probabilistic characteristics. Second, the framework allows users to expose a rich set of user-code primitives as atomic propositions in BLTL. Moreover, users can define their own fine-grained time resolution rather than the boundary of clock cycles in the SystemC simulation. The third contribution is an implementation of a statistical model checker. It contains an automatic monitor generation for producing execution traces of the model-under-verification (MUV), the mechanism for automatically instrumenting the MUV, and the interaction with statistical model checking algorithms.Comment: Journal of Software: Evolution and Process. Wiley, 2017. arXiv admin note: substantial text overlap with arXiv:1507.0818

    eRPCAe^{\text{RPCA}}: Robust Principal Component Analysis for Exponential Family Distributions

    Full text link
    Robust Principal Component Analysis (RPCA) is a widely used method for recovering low-rank structure from data matrices corrupted by significant and sparse outliers. These corruptions may arise from occlusions, malicious tampering, or other causes for anomalies, and the joint identification of such corruptions with low-rank background is critical for process monitoring and diagnosis. However, existing RPCA methods and their extensions largely do not account for the underlying probabilistic distribution for the data matrices, which in many applications are known and can be highly non-Gaussian. We thus propose a new method called Robust Principal Component Analysis for Exponential Family distributions (eRPCAe^{\text{RPCA}}), which can perform the desired decomposition into low-rank and sparse matrices when such a distribution falls within the exponential family. We present a novel alternating direction method of multiplier optimization algorithm for efficient eRPCAe^{\text{RPCA}} decomposition. The effectiveness of eRPCAe^{\text{RPCA}} is then demonstrated in two applications: the first for steel sheet defect detection, and the second for crime activity monitoring in the Atlanta metropolitan area

    Feature extraction for document image segmentation by pLSA model

    Get PDF
    In this paper, we propose a method for document image segmentation based on pLSA (probabilistic latent semantic analysis) model. The pLSA model is originally developed for topic discovery in text analysis using "bag-of-words" document representation. The model is useful for image analysis by "bag-of-visual words" image representation. The performance of the method depends on the visual vocabulary generated by feature extraction from the document image. We compare several feature extraction and description methods, and examine the relations to segmentation performance. Through the experiments, we show accurate content-based document segmentation is made possible by using pLSA-based method.ArticleThe Eighth IAPR Workshop on Document Analysis Systemsconference pape

    Temporal analysis of text data using latent variable models

    Get PDF
    Detecting and tracking of temporal data is an important task in multiple applications. In this paper we study temporal text mining methods for Music Information Retrieval. We compare two ways of detecting the temporal latent semantics of a corpus extracted from Wikipedia, using a stepwise Probabilistic Latent Semantic Analysis (PLSA) approach and a global multiway PLSA method. The analysis indicates that the global analysis method is able to identify relevant trends which are difficult to get using a step-by-step approach. Furthermore we show that inspection of PLSA models with different number of factors may reveal the stability of temporal clusters making it possible to choose the relevant number of factors. 1
    • …
    corecore