12 research outputs found

    Discriminating Variable Star Candidates in Large Image Databases from the HiTS Survey Using NMF

    New instruments and technologies are allowing the acquisition of large amounts of data from astronomical surveys, and there is a pressing need for autonomous methods to discriminate the interesting astronomical objects in the vast sky. The High Cadence Transient Survey (HiTS) is an astronomical survey that aims to find a rare transient event that occurs during the first instants of a supernova. In this paper we propose an autonomous method to discriminate stellar variability in the HiTS database, using a feature extraction scheme based on non-negative matrix factorization (NMF). NMF yields dictionaries of image prototypes that represent the data compactly. The projections of the dataset onto these dictionaries are fed into a random forest classifier. NMF is compared with other feature extraction schemes on a subset of 500,000 transient candidates from the HiTS survey. NMF gives better class separability at the feature level, which significantly improves classification accuracy. Using the NMF features, fewer than 4% of the true stellar transients are lost, at a manageable false positive rate of 0.1%.

    Scalable and distributed constrained low rank approximations

    Low rank approximation is the problem of finding two low rank factors W and H such that rank(WH) << rank(A) and A ≈ WH. The factors W and H can be constrained to allow a meaningful physical interpretation, in which case the problem is referred to as Constrained Low Rank Approximation (CLRA). Like most constrained optimization problems, CLRA can be more computationally expensive than its unconstrained counterpart. A widely used CLRA is Non-negative Matrix Factorization (NMF), which enforces non-negativity constraints on each of its low rank factors W and H. In this thesis, I focus on scalable and distributed CLRA algorithms for constraints such as boundedness and non-negativity, applied to large real-world matrices that include text, High Definition (HD) video, social networks and recommender systems. First, I begin with the Bounded Matrix Low Rank Approximation (BMA), which imposes a lower and an upper bound on every element of the lower rank matrix. BMA is more challenging than NMF because it imposes bounds on the product WH rather than on each of the low rank factors W and H. For very large input matrices, we extend our BMA algorithm to Block BMA, which can scale to a large number of processors. In applications such as HD video, where the input matrix to be factored is extremely large, distributed computation is inevitable and network communication becomes a major performance bottleneck. To address this, we propose a novel distributed Communication Avoiding NMF (CANMF) algorithm that communicates only the right low rank factor to its neighboring machine. Finally, we present a general distributed HPC-NMF framework that applies HPC techniques to communication-intensive NMF operations and is suitable for a broader class of NMF algorithms. Ph.D.
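
The factorization A ≈ WH with non-negative W and H described above can be illustrated with a minimal sketch. It uses the classic multiplicative-update rule for the Frobenius-norm objective, which is a standard NMF method and not the BMA, CANMF or HPC-NMF algorithms of the thesis. The function names (`nmf`, `matmul`, `frobenius_error`), the small example matrix and the parameters `iters`, `seed` and `eps` are assumptions made for illustration only.

```python
import random


def matmul(X, Y):
    """Multiply two dense matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def transpose(X):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]


def frobenius_error(A, W, H):
    """Return ||A - WH||_F, the error of the low rank approximation."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5


def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative m x n matrix A into W (m x k) and H (k x n).

    Both factors start strictly positive and are only ever multiplied by
    non-negative ratios, so they stay non-negative throughout.
    """
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, A)
        den = matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(A, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return W, H


if __name__ == "__main__":
    # Hypothetical 4 x 3 non-negative matrix of rank 2.
    A = [[1, 2, 0], [0, 1, 3], [1, 3, 3], [2, 5, 3]]
    W, H = nmf(A, k=2)
    print("approximation error:", frobenius_error(A, W, H))
```

This dense, single-process version is only meant to make the constraint concrete. The distributed algorithms in the thesis address the case where A is too large for one machine and communication cost dominates.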

    Partitioning and Communication Strategies for Sparse Non-negative Matrix Factorization

    Non-negative matrix factorization (NMF), the problem of finding two non-negative low-rank factors whose product approximates an input matrix, is a useful tool for many data mining and scientific applications such as topic modeling in text mining and blind source separation in microscopy. In this paper, we focus on scaling NMF algorithms to very large sparse datasets and massively parallel machines by employing effective algorithms, communication patterns, and partitioning schemes that leverage the sparsity of the input matrix. In a machine learning workflow, the computations after sparse-dense matrix multiplication (SpMM) must deal with dense matrices, because SpMM produces a dense result. Hence, a partitioning strategy that considers only SpMM causes a large imbalance in the overall workflow, especially in the computations after SpMM, which for NMF are the non-negative least squares computations. To address this, we consider two previous works developed for related problems: one that uses a fine-grained partitioning strategy with a point-to-point communication pattern, and one that uses a checkerboard partitioning strategy with a collective-based communication pattern. We show that a combination of the two approaches balances the demands of the various computations within NMF algorithms and achieves high efficiency and scalability. In our experiments, the proposed algorithm communicates at least 4x less than the collective-based approach and achieves up to a 100x speedup over the baseline FAUN on real-world datasets. We ran the algorithm on two different supercomputing platforms and scaled up to 32,000 processors on Blue Gene/Q.