795 research outputs found

    Robust Near-Separable Nonnegative Matrix Factorization Using Linear Optimization

    Full text link
    Nonnegative matrix factorization (NMF) has been shown recently to be tractable under the separability assumption, under which all the columns of the input data matrix belong to the convex cone generated by only a few of these columns. Bittorf, Recht, R\'e and Tropp (`Factoring nonnegative matrices with linear programs', NIPS 2012) proposed a linear programming (LP) model, referred to as Hottopixx, which is robust under any small perturbation of the input matrix. However, Hottopixx has two important drawbacks: (i) the input matrix has to be normalized, and (ii) the factorization rank has to be known in advance. In this paper, we generalize Hottopixx in order to resolve these two drawbacks, that is, we propose a new LP model which does not require normalization and detects the factorization rank automatically. Moreover, the new LP model is more flexible, significantly more tolerant to noise, and can easily be adapted to handle outliers and other noise models. Finally, we show on several synthetic datasets that it outperforms Hottopixx while competing favorably with two state-of-the-art methods.Comment: 27 page; 4 figures. New Example, new experiment on the Swimmer data se

    A D.C. Programming Approach to the Sparse Generalized Eigenvalue Problem

    Full text link
    In this paper, we consider the sparse eigenvalue problem wherein the goal is to obtain a sparse solution to the generalized eigenvalue problem. We achieve this by constraining the cardinality of the solution to the generalized eigenvalue problem and obtain sparse principal component analysis (PCA), sparse canonical correlation analysis (CCA) and sparse Fisher discriminant analysis (FDA) as special cases. Unlike the â„“1\ell_1-norm approximation to the cardinality constraint, which previous methods have used in the context of sparse PCA, we propose a tighter approximation that is related to the negative log-likelihood of a Student's t-distribution. The problem is then framed as a d.c. (difference of convex functions) program and is solved as a sequence of convex programs by invoking the majorization-minimization method. The resulting algorithm is proved to exhibit \emph{global convergence} behavior, i.e., for any random initialization, the sequence (subsequence) of iterates generated by the algorithm converges to a stationary point of the d.c. program. The performance of the algorithm is empirically demonstrated on both sparse PCA (finding few relevant genes that explain as much variance as possible in a high-dimensional gene dataset) and sparse CCA (cross-language document retrieval and vocabulary selection for music retrieval) applications.Comment: 40 page

    Facts and Fabrications about Ebola: A Twitter Based Study

    Full text link
    Microblogging websites like Twitter have been shown to be immensely useful for spreading information on a global scale within seconds. The detrimental effect, however, of such platforms is that misinformation and rumors are also as likely to spread on the network as credible, verified information. From a public health standpoint, the spread of misinformation creates unnecessary panic for the public. We recently witnessed several such scenarios during the outbreak of Ebola in 2014 [14, 1]. In order to effectively counter the medical misinformation in a timely manner, our goal here is to study the nature of such misinformation and rumors in the United States during fall 2014 when a handful of Ebola cases were confirmed in North America. It is a well known convention on Twitter to use hashtags to give context to a Twitter message (a tweet). In this study, we collected approximately 47M tweets from the Twitter streaming API related to Ebola. Based on hashtags, we propose a method to classify the tweets into two sets: credible and speculative. We analyze these two sets and study how they differ in terms of a number of features extracted from the Twitter API. In conclusion, we infer several interesting differences between the two sets. We outline further potential directions to using this material for monitoring and separating speculative tweets from credible ones, to enable improved public health information.Comment: Appears in SIGKDD BigCHat Workshop 201
    • …
    corecore