Orthogonal NMF through Subspace Exploration
Orthogonal Nonnegative Matrix Factorization (ONMF) aims to approximate a nonnegative matrix as the product of two k-dimensional nonnegative factors, one of which has orthonormal columns. It yields potentially useful data representations as a superposition of disjoint parts, and it has been shown to work well for clustering tasks where traditional methods underperform. Existing algorithms rely mostly on heuristics which, despite their good empirical performance, lack provable performance guarantees. We present a new ONMF algorithm with provable approximation guarantees. For any constant dimension k, we obtain an additive EPTAS without any assumptions on the input. Our algorithm relies on a novel approximation to the related Nonnegative Principal Component Analysis (NNPCA) problem; given an arbitrary data matrix, NNPCA seeks k nonnegative components that jointly capture most of the variance. Our NNPCA algorithm is of independent interest and generalizes previous work that could only obtain guarantees for a single component. We evaluate our algorithms on several real and synthetic datasets and show that their performance matches or exceeds that of the state of the art.
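For intuition, the ONMF model X ≈ WH (W ≥ 0 with near-orthonormal columns, H ≥ 0) can be sketched with a generic multiplicative-update heuristic in NumPy. This is in the spirit of Ding et al.'s ONMF updates, not the EPTAS algorithm of this paper; the column normalization and iteration count are illustrative assumptions.

```python
import numpy as np

def onmf(X, k, iters=300, eps=1e-9, seed=0):
    """Heuristic ONMF sketch: X (m x n, nonnegative) ~ W @ H,
    with W, H >= 0 and the columns of W pushed toward orthonormality."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        # standard multiplicative update for H
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        # orthogonality-aware update for W (Ding et al. style)
        W *= (X @ H.T) / (W @ W.T @ X @ H.T + eps)
        # renormalize columns of W (illustrative choice)
        W /= np.linalg.norm(W, axis=0, keepdims=True) + eps
    return W, H
```

Passing `iters=0` returns the random initialization, which is convenient for checking that the updates actually reduce the reconstruction error.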
Hyperspectral Unmixing Overview: Geometrical, Statistical, and Sparse Regression-Based Approaches
Imaging spectrometers measure electromagnetic energy scattered in their
instantaneous field of view in hundreds or thousands of spectral channels with
higher spectral resolution than multispectral cameras. Imaging spectrometers
are therefore often referred to as hyperspectral cameras (HSCs). Higher
spectral resolution enables material identification via spectroscopic analysis,
which facilitates countless applications that require identifying materials in
scenarios unsuitable for classical spectroscopic analysis. Due to low spatial
resolution of HSCs, microscopic material mixing, and multiple scattering,
spectra measured by HSCs are mixtures of spectra of materials in a scene. Thus,
accurate estimation requires unmixing. Pixels are assumed to be mixtures of a
few materials, called endmembers. Unmixing involves estimating all or some of:
the number of endmembers, their spectral signatures, and their abundances at
each pixel. Unmixing is a challenging, ill-posed inverse problem because of
model inaccuracies, observation noise, environmental conditions, endmember
variability, and data set size. Researchers have devised and investigated many
models searching for robust, stable, tractable, and accurate unmixing
algorithms. This paper presents an overview of unmixing methods from the time
of Keshava and Mustard's unmixing tutorial [1] to the present. Mixing models
are first discussed. Signal-subspace, geometrical, statistical, sparsity-based,
and spatial-contextual unmixing algorithms are described. Mathematical problems
and potential solutions are described. Algorithm characteristics are
illustrated experimentally.

Comment: This work has been accepted for publication in the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.
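Under the linear mixing model above, estimating a pixel's abundances given known endmember signatures reduces to least squares under nonnegativity and sum-to-one constraints. A minimal sketch using the common NNLS augmentation trick; the weight `delta`, the function name, and the data shapes are illustrative assumptions, not a method from the survey:

```python
import numpy as np
from scipy.optimize import nnls

def unmix_pixel(E, y, delta=1e3):
    """Fully constrained abundance estimation for one pixel.
    E: (bands x p) endmember signature matrix, y: (bands,) pixel spectrum.
    Appends a heavily weighted row of ones so NNLS also enforces sum-to-one."""
    bands, p = E.shape
    E_aug = np.vstack([E, delta * np.ones((1, p))])  # sum-to-one row
    y_aug = np.append(y, delta)
    a, _residual = nnls(E_aug, y_aug)                # nonnegativity from NNLS
    return a
```

For a noiseless pixel generated from the endmembers, this recovers the true abundance vector; with noise, `delta` trades off spectral fit against the sum-to-one constraint.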
Data Clustering And Visualization Through Matrix Factorization
Clustering is traditionally an unsupervised task that finds natural groupings, or clusters, in multidimensional data based on perceived similarities among the patterns. The purpose of clustering is to extract useful information from unlabeled data.
To present the knowledge extracted by clustering in a meaningful way, data visualization has become a popular and growing area of research. Visualization can provide a qualitative overview of large and complex data sets, which helps us gain the insight needed to truly understand the phenomena of interest in the data.
The contribution of this dissertation is two-fold: Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for data clustering/co-clustering and Exemplar-based data Visualization (EV) through matrix factorization. Compared to traditional data mining models, matrix-based methods are fast and easy to understand and implement, and they are especially suitable for large-scale problems in text mining, image grouping, medical diagnosis, and bioinformatics.
In this dissertation, we present two effective matrix-based solutions in the new directions of data clustering and visualization.
First, in many practical learning domains there is a large supply of unlabeled data but limited labeled data, and in most cases it is expensive to generate large amounts of labeled data. Traditional clustering algorithms completely ignore this valuable labeled data and are thus ill suited to these problems. Consequently, semi-supervised clustering, which can incorporate domain knowledge to guide a clustering algorithm, has become a topic of significant recent interest.
Thus, we develop a Non-negative Matrix Factorization
(NMF) based framework to incorporate prior knowledge into data clustering. Moreover, with the fast growth of the Internet and computational technologies in the past decade, many data mining applications have advanced swiftly from the simple clustering of one data type to the co-clustering of multiple data types, usually involving high heterogeneity. To this end, we extend SS-NMF to perform heterogeneous data co-clustering. From a theoretical perspective, SS-NMF for data clustering/co-clustering is mathematically rigorous: the convergence and correctness of our algorithms are proved.
In addition, we discuss the relationship between SS-NMF and other well-known clustering and co-clustering models.
Second, most current clustering models provide only the centroids (e.g., the mathematical means of the clusters) without inferring representative exemplars from the real data, so they cannot effectively summarize or visualize the raw data.
We propose a new method, Exemplar-based Visualization (EV), to cluster and visualize extremely large-scale data.
Capitalizing on recent advances in matrix approximation and factorization, EV provides a means
to visualize large scale data with high accuracy (in
retaining neighbor relations), high efficiency (in computation), and
high flexibility (through the use of exemplars).
Empirically, we demonstrate the superior performance of our matrix-based data clustering and visualization models through extensive experiments on publicly available large-scale data sets.
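The SS-NMF idea of letting a few labels guide the factorization can be sketched in NumPy. This is not the dissertation's SS-NMF algorithm (whose convergence and correctness the author proves); it is a plain multiplicative-update NMF whose basis matrix is seeded from the labeled samples' means, to illustrate one way prior knowledge can enter the factorization. The function name, seeding scheme, and data layout are illustrative assumptions.

```python
import numpy as np

def seeded_nmf_cluster(X, k, labels, iters=300, eps=1e-9, seed=0):
    """X: (features x samples), nonnegative. labels[j] is a cluster id in
    0..k-1 for the few labeled samples, or -1 for unlabeled ones."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    for c in range(k):
        idx = np.where(labels == c)[0]
        if idx.size:                      # seed basis column c with labeled mean
            W[:, c] = X[:, idx].mean(axis=1) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):                # standard Frobenius multiplicative updates
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return np.argmax(H, axis=0)           # hard cluster assignment per sample
```

On well-separated data, seeding the basis from even one labeled sample per cluster is enough to anchor the components to the intended groups.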
Mineral identification using data-mining in hyperspectral infrared imagery
The geological applications of hyperspectral infrared imagery mainly consist of mineral identification, mapping, and core logging, using airborne or portable instruments. Finding mineral indicators offers considerable benefits for mineralogy and mineral exploration, which usually involve portable instruments and core logging. Moreover, the development of faster and more mechanized systems increases the precision of identifying mineral indicators and avoids possible misclassification. Therefore, the objective of this thesis was to create a tool that uses hyperspectral infrared imagery and processes the data through image analysis and machine learning methods to identify small mineral grains used as mineral indicators. Such a system could assist geological analysis and mineral exploration in a variety of circumstances. The experiments were conducted in laboratory conditions in the long-wave infrared (7.7 μm to 11.8 μm, LWIR), with an LWIR macro lens (to improve spatial resolution), an Infragold plate, and a heating source. The process began with a method to calculate the continuum removal. The approach applies Non-negative Matrix Factorization (NMF): a rank-1 NMF is extracted to estimate the down-welling radiance, which is then compared with other conventional methods. The results indicate successful suppression of the continuum from the spectra, enabling the spectra to be compared with spectral libraries.
Afterwards, to obtain an automated system, supervised and unsupervised approaches were tested for the identification of pyrope, olivine, and quartz grains. The results indicated that the unsupervised approach was more suitable because it does not depend on a training stage. Once these results were obtained, two algorithms were tested to create False Color Composites (FCC) using a clustering approach. The results of this comparison indicate significant computational efficiency (more than 20 times faster) and promising performance for mineral identification. Finally, the reliability of automated LWIR hyperspectral mineral identification was tested, and the difficulty of identifying irregular grain surfaces and mineral aggregates was verified. The results were compared against two different Ground Truths (GT), rigid-GT (manual labeling of regions) and observed-GT (manual labeling of pixels), for quantitative evaluation; observed-GT improved accuracy up to 1.5-fold over rigid-GT. The samples were also examined by Micro X-ray Fluorescence (XRF) and Scanning Electron Microscopy (SEM) in order to retrieve information on the mineral aggregates and the grain surfaces (biotite, epidote, goethite, diopside, smithsonite, tourmaline, kyanite, scheelite, pyrope, olivine, and quartz). The XRF imagery was compared with automatic mineral identification techniques using ArcGIS, showed promising performance for automatic identification, and was used for GT validation. Overall, the four methods in this thesis (1. continuum removal; 2. classification or clustering methods for mineral identification; 3. two algorithms for clustering of mineral spectra; 4. reliability verification) represent beneficial methodologies for identifying minerals. These methods have the advantage of being non-destructive and relatively accurate, and they have low computational complexity, which could qualify them for identifying and assessing mineral grains in laboratory conditions or in the field.
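The continuum-removal step described above can be sketched with a rank-1 NMF: treat the best rank-1 nonnegative fit of the bands × pixels spectra matrix as the shared continuum (down-welling) estimate and divide it out, leaving relative absorption features. This is an illustrative reading of the abstract, not the thesis's exact procedure; the iteration count is an arbitrary choice.

```python
import numpy as np

def continuum_remove(S, iters=500, eps=1e-9, seed=0):
    """S: (bands x pixels), nonnegative spectra. Fit S ~ w @ h with
    w (bands x 1), h (1 x pixels) by multiplicative updates, then
    divide out the rank-1 reconstruction (the continuum estimate)."""
    rng = np.random.default_rng(seed)
    bands, pixels = S.shape
    w = rng.random((bands, 1)) + eps
    h = rng.random((1, pixels)) + eps
    for _ in range(iters):
        h *= (w.T @ S) / (w.T @ w @ h + eps)
        w *= (S @ h.T) / (w @ (h @ h.T) + eps)
    return S / (w @ h + eps)   # continuum-removed spectra
```

On a matrix that is exactly rank-1 (pure continuum, no absorption features), the output is flat at 1, which is a convenient sanity check.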
Function Space Tensor Decomposition and its Application in Sports Analytics
Recent advancements in sports information and technology systems have ushered in a new age of applications of both supervised and unsupervised analytical techniques in the sports domain. These automated systems capture large volumes of data points about competitors during live competition. As a result, multi-relational analyses are gaining popularity in the field of Sports Analytics. We review two case studies applied in sports: dimensionality reduction with Principal Component Analysis and latent factor analysis with Non-Negative Matrix Factorization. We also review a framework for extending these techniques to higher-order data structures. The primary scope of this thesis is to further extend the concept of tensor decomposition through the use of function spaces. In doing so, we address the limitation of PCA to vector and matrix representations, and of the CP decomposition to tensor representations. Lastly, we provide an application in the context of professional stock car racing.
Tensor Decompositions for Signal Processing Applications: From Two-way to Multiway Component Analysis
The widespread use of multi-sensor technology and the emergence of big
datasets have highlighted the limitations of standard flat-view matrix models
and the necessity to move towards more versatile data analysis tools. We show
that higher-order tensors (i.e., multiway arrays) enable such a fundamental
paradigm shift towards models that are essentially polynomial and whose
uniqueness, unlike that of matrix methods, is guaranteed under very mild and natural
conditions. Benefiting from the power of multilinear algebra as their mathematical
backbone, data analysis techniques using tensor decompositions are shown to
have great flexibility in the choice of constraints that match data properties,
and to find more general latent components in the data than matrix-based
methods. A comprehensive introduction to tensor decompositions is provided from
a signal processing perspective, starting from the algebraic foundations, via
basic Canonical Polyadic and Tucker models, through to advanced cause-effect
and multi-view data analysis schemes. We show that tensor decompositions enable
natural generalizations of some commonly used signal processing paradigms, such
as canonical correlation and subspace techniques, signal separation, linear
regression, feature extraction and classification. We also cover computational
aspects, and point out how ideas from compressed sensing and scientific
computing may be used for addressing the otherwise unmanageable storage and
manipulation problems associated with big datasets. The concepts are supported
by illustrative real-world case studies illuminating the benefits of the tensor
framework as an efficient and promising tool for modern signal processing, data
analysis, and machine learning applications; these benefits also extend to
vector/matrix data through tensorization.

Keywords: ICA, NMF, CPD, Tucker decomposition, HOSVD, tensor networks, Tensor Train
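As a concrete instance of the basic Canonical Polyadic model discussed above, here is a minimal, unconstrained CP-ALS sketch for a 3-way tensor. The pinv-based solves, unfolding convention, and iteration count are illustrative choices, not a specific algorithm from the paper:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x r) and B (J x r) -> (I*J x r)."""
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def cp_als(T, r, iters=200, seed=0):
    """Rank-r CP decomposition T[i,j,k] ~ sum_r A[i,r]*B[j,r]*C[k,r]
    by alternating least squares on the three mode unfoldings."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.random((I, r))
    B = rng.random((J, r))
    C = rng.random((K, r))
    # Mode unfoldings consistent with C-order reshape: column index j*K + k, etc.
    T1 = T.reshape(I, -1)
    T2 = np.moveaxis(T, 1, 0).reshape(J, -1)
    T3 = np.moveaxis(T, 2, 0).reshape(K, -1)
    for _ in range(iters):
        A = T1 @ np.linalg.pinv(khatri_rao(B, C).T)
        B = T2 @ np.linalg.pinv(khatri_rao(A, C).T)
        C = T3 @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C
```

On a tensor built exactly from rank-r factors, ALS typically recovers a reconstruction with negligible error; constrained variants (nonnegativity, orthogonality) replace the pinv solves with constrained least-squares steps.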