Kernel Multivariate Analysis Framework for Supervised Subspace Learning: A Tutorial on Linear and Kernel Multivariate Methods
Feature extraction and dimensionality reduction are important tasks in many
fields of science dealing with signal processing and analysis. The relevance of
these techniques is increasing as current sensory devices are developed with
ever higher resolution, and problems involving multimodal data sources become
more common. A plethora of feature extraction methods are available in the
literature collectively grouped under the field of Multivariate Analysis (MVA).
This paper provides a uniform treatment of several methods: Principal Component
Analysis (PCA), Partial Least Squares (PLS), Canonical Correlation Analysis
(CCA) and Orthonormalized PLS (OPLS), as well as their non-linear extensions
derived by means of the theory of reproducing kernel Hilbert spaces. We also
review their connections to other methods for classification and statistical
dependence estimation, and introduce some recent developments to deal with the
extreme cases of large-scale and low-sized problems. To illustrate the wide
applicability of these methods in both classification and regression problems,
we analyze their performance in a benchmark of publicly available data sets,
and pay special attention to specific real applications involving audio
processing for music genre prediction and hyperspectral satellite images for
Earth and climate monitoring.
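As a minimal illustration of the linear/kernel contrast surveyed in this tutorial, the sketch below implements plain PCA and an RBF kernel PCA in numpy. It is a didactic sketch, not the paper's framework; the kernel bandwidth `gamma` is an arbitrary choice.

```python
import numpy as np

def pca(X, k):
    """Linear PCA: project centred data onto the top-k principal directions."""
    Xc = X - X.mean(axis=0)
    # SVD of the centred data matrix yields the principal directions in Vt.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T, Vt[:k]

def kernel_pca(X, k, gamma=1.0):
    """Kernel PCA with an RBF kernel: eigendecompose the centred Gram matrix."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    Kc = J @ K @ J
    vals, vecs = np.linalg.eigh(Kc)          # eigenvalues in ascending order
    vals, vecs = vals[::-1][:k], vecs[:, ::-1][:, :k]
    # Projections of the training points onto the k leading kernel components.
    return vecs * np.sqrt(np.maximum(vals, 0))
```

The kernelised variant never forms features explicitly: all geometry enters through the Gram matrix, which is the reproducing-kernel-Hilbert-space view the tutorial develops for PLS, CCA and OPLS as well.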
Tensor Decompositions for Signal Processing Applications: From Two-way to Multiway Component Analysis
The widespread use of multi-sensor technology and the emergence of big
datasets has highlighted the limitations of standard flat-view matrix models
and the necessity to move towards more versatile data analysis tools. We show
that higher-order tensors (i.e., multiway arrays) enable such a fundamental
paradigm shift towards models that are essentially polynomial and whose
uniqueness, unlike that of matrix methods, is guaranteed under very mild and natural
conditions. Benefiting from the power of multilinear algebra as their mathematical
backbone, data analysis techniques using tensor decompositions are shown to
have great flexibility in the choice of constraints that match data properties,
and to find more general latent components in the data than matrix-based
methods. A comprehensive introduction to tensor decompositions is provided from
a signal processing perspective, starting from the algebraic foundations, via
basic Canonical Polyadic and Tucker models, through to advanced cause-effect
and multi-view data analysis schemes. We show that tensor decompositions enable
natural generalizations of some commonly used signal processing paradigms, such
as canonical correlation and subspace techniques, signal separation, linear
regression, feature extraction and classification. We also cover computational
aspects, and point out how ideas from compressed sensing and scientific
computing may be used for addressing the otherwise unmanageable storage and
manipulation problems associated with big datasets. The concepts are supported
by illustrative real world case studies illuminating the benefits of the tensor
framework, as efficient and promising tools for modern signal processing, data
analysis and machine learning applications; these benefits also extend to
vector/matrix data through tensorization.
Keywords: ICA, NMF, CPD, Tucker decomposition, HOSVD, tensor networks, Tensor Train
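The basic Tucker model mentioned in this abstract can be sketched via the truncated higher-order SVD (HOSVD). The numpy version below is a minimal illustration only (no iterative refinement such as HOOI), not the tutorial's reference code.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    """Truncated HOSVD: one SVD per mode gives the factor matrices, then the core."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        # Multiply the core by U^T along `mode`.
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

def reconstruct(core, factors):
    """Rebuild the tensor from the Tucker core and factor matrices."""
    T = core
    for mode, U in enumerate(factors):
        T = np.moveaxis(np.tensordot(U, np.moveaxis(T, mode, 0), axes=1), 0, mode)
    return T
```

With full multilinear ranks the reconstruction is exact; truncating the ranks gives the compressed, uniqueness-friendly representation the paper advocates over flat-view matrix models.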
Statistical Workflow for Feature Selection in Human Metabolomics Data.
High-throughput metabolomics investigations, when conducted in large human cohorts, represent a potentially powerful tool for elucidating the biochemical diversity underlying human health and disease. Large-scale metabolomics data sources, generated using either targeted or nontargeted platforms, are becoming more common. Appropriate statistical analysis of these complex high-dimensional data will be critical for extracting meaningful results from such large-scale human metabolomics studies. Therefore, we consider the statistical analytical approaches that have been employed in prior human metabolomics studies. Based on the lessons learned and collective experience to date in the field, we offer a step-by-step framework for pursuing statistical analyses of cohort-based human metabolomics data, with a focus on feature selection. We discuss the range of options and approaches that may be employed at each stage of data management, analysis, and interpretation and offer guidance on the analytical decisions that need to be considered over the course of implementing a data analysis workflow. Certain pervasive analytical challenges facing the field warrant ongoing focused research. Addressing these challenges, particularly those related to analyzing human metabolomics data, will allow for greater standardization of, as well as advances in, how research in the field is practiced. In turn, such major analytical advances will lead to substantial improvements in the overall contributions of human metabolomics investigations.
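One common ingredient of such a feature-selection workflow is multiple-testing control over per-metabolite p-values. Below is a minimal Benjamini-Hochberg step-up sketch in numpy; it illustrates only this single screening step, not the paper's full workflow, and the example p-values in the usage note are invented.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns a boolean rejection mask."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the i-th smallest p-value against alpha * i / m.
    thresh = alpha * (np.arange(1, m + 1) / m)
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest index meeting its threshold
        reject[order[:k + 1]] = True       # reject all hypotheses up to k
    return reject
```

For example, `benjamini_hochberg([0.01, 0.02, 0.03, 0.9], alpha=0.05)` flags the first three features and keeps the fourth, whereas a plain Bonferroni cut at 0.05/4 would flag only the first two.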
Statistical process monitoring of a multiphase flow facility
Industrial needs are evolving fast towards more flexible manufacturing schemes. As a consequence, it is often required to adapt plant production to the demand, which can be volatile depending on the application. This is why it is important to develop tools that can monitor the condition of a process working under varying operational conditions. Canonical Variate Analysis (CVA) is a multivariate data-driven methodology which has been demonstrated to be superior to other methods, particularly under dynamically changing operational conditions. These comparative studies normally use computer-simulated data in benchmark case studies such as the Tennessee Eastman Process Plant (Ricker, N.L. Tennessee Eastman Challenge Archive, Available at 〈http://depts.washington.edu/control/LARRY/TE/download.html〉 Accessed 21.03.2014).
The aim of this work is to provide a benchmark case to demonstrate the ability of different monitoring techniques to detect and diagnose artificially seeded faults in an industrial-scale multiphase flow experimental rig. The changing operational conditions and the size and complexity of the test rig make this case study an ideal candidate for a benchmark that provides a test bed for evaluating the performance of novel multivariate process monitoring techniques using real experimental data. In this paper, the capabilities of CVA to detect and diagnose faults in a real system working under changing operating conditions are assessed and compared with other methodologies. The results obtained demonstrate that CVA can be effectively applied for the detection and diagnosis of faults in real complex systems, and reinforce the idea that the performance of CVA is superior to that of other algorithms.
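For readers wanting a concrete picture of multivariate statistical monitoring, the sketch below scores new samples with a Hotelling T² statistic in a PCA subspace. This is a deliberately simpler stand-in for the CVA method evaluated in the paper (CVA additionally exploits canonical correlations between past and future measurements), and the fault magnitude in the usage note is invented.

```python
import numpy as np

def fit_monitor(X_train, k):
    """Fit a PCA model on normal-operation data; return parameters for T^2 scoring."""
    mu = X_train.mean(axis=0)
    Xc = X_train - mu
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Variance captured by each retained component (used to scale the scores).
    var = (s[:k] ** 2) / (len(X_train) - 1)
    return mu, Vt[:k], var

def t_squared(x, mu, P, var):
    """Hotelling T^2 of a new sample in the retained subspace."""
    t = P @ (x - mu)
    return float(np.sum(t ** 2 / var))
```

A sample drawn from normal operation scores near zero; a faulty sample shifted away from the training distribution yields a large T², which is the basic detection logic shared by PCA-, PLS- and CVA-based monitoring schemes.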
Incremental online learning in high dimensions
this article, however, is problematic, as it requires a careful selection of initial ridge regression parameters to stabilize the highly rank-deficient full covariance matrix of the input data, and it is easy to create too much bias or too little numerical stabilization initially, which can trap the local distance metric adaptation in local minima. While the LWPR algorithm computes only about a factor of 10 longer for the 20D experiment in comparison to the 2D experiment, RFWR requires a 1000-fold increase in computation time, thus rendering that algorithm unsuitable for high-dimensional regression. In order to compare LWPR's results to other popular regression methods, we evaluated the 2D, 10D, and 20D cross data sets with gaussian process regression (GP) and support vector (SVM) regression in addition to our LWPR method. It should be noted that neither SVM nor GP is an incremental method, although they can be considered state-of-the-art for batch regression under relatively small numbers of training data and reasonable input dimensionality. The computational complexity of these methods is prohibitively high for real-time applications. The GP algorithm (Gibbs & MacKay, 1997) used a generic covariance function and optimized over the hyperparameters. The SVM regression was performed using a standard available package (Saunders et al., 1998) and optimized for kernel choices. Figure 6 compares the performance of LWPR and gaussian processes for the above-mentioned data sets using 100, 300, and 500 training data points.
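A flavour of the locally weighted regression family that LWPR belongs to can be given in a few lines. The sketch below is plain, batch, locally weighted linear regression, not LWPR itself: there are no incremental updates, no learned projection directions, and the Gaussian bandwidth is a fixed, arbitrary choice.

```python
import numpy as np

def lwr_predict(X, y, x_query, bandwidth=0.3, ridge=1e-8):
    """Locally weighted linear regression: fit a weighted affine model per query."""
    # Gaussian weights: points near the query dominate the local fit.
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * bandwidth ** 2))
    A = np.column_stack([np.ones(len(X)), X])            # local affine model
    a_q = np.concatenate([[1.0], np.atleast_1d(x_query)])
    W = np.diag(w)
    # Weighted ridge-regularized least squares for the local coefficients.
    beta = np.linalg.solve(A.T @ W @ A + ridge * np.eye(A.shape[1]), A.T @ W @ y)
    return float(a_q @ beta)
```

LWPR's contribution over this baseline is precisely what the excerpt discusses: incremental updates of many such local models and low-dimensional projections that keep the cost manageable as the input dimensionality grows.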
Incremental Online Learning in High Dimensions
Locally weighted projection regression (LWPR) is a new algorithm for incremental non-linear function approximation in high dimensional spaces with redundant and irrelevant input dimensions. At its core …
Gradients in urban material composition: A new concept to map cities with spaceborne imaging spectroscopy data
To understand processes in urban environments, such as urban energy fluxes or surface temperature patterns, it is important to map urban surface materials. Airborne imaging spectroscopy data have been successfully used to identify urban surface materials, mainly based on unmixing algorithms. Upcoming spaceborne Imaging Spectrometers (IS), such as the Environmental Mapping and Analysis Program (EnMAP), will reduce the time- and cost-critical limitations of airborne systems for Earth Observation (EO). However, the spatial resolution of all current and planned spaceborne IS will not be higher than 20 to 30 m and, thus, the detection of pure Endmember (EM) candidates in urban areas, a requirement for spectral unmixing, is very limited. Gradient analysis could be an alternative method for retrieving urban surface material compositions in pixels from spaceborne IS. The gradient concept is well known in ecology to identify plant species assemblages formed by similar environmental conditions but has never been tested for urban materials. However, urban areas also contain neighbourhoods with similar physical, compositional and structural characteristics. Based on this assumption, this study investigated (1) whether cover fractions of surface materials change gradually in urban areas and (2) whether these gradients can be adequately mapped and interpreted using imaging spectroscopy data (e.g. EnMAP) with 30 m spatial resolution.
Similarities of material compositions were analysed on the basis of 153 systematically distributed samples on a detailed surface material map using Detrended Correspondence Analysis (DCA). Determined gradient scores for the first two gradients were regressed against the corresponding mean reflectance of simulated EnMAP spectra using Partial Least Squares regression models. Results show strong correlations with R2 = 0.85 and R2 = 0.71 and an RMSE of 0.24 and 0.21 for the first and second axis, respectively. The subsequent mapping of the first gradient reveals patterns that correspond to the transition from predominantly vegetation classes to the dominance of artificial materials. Patterns resulting from the second gradient are associated with surface material compositions that are related to finer structural differences in urban structures. The composite gradient map shows patterns of common surface material compositions that can be related to urban land use classes such as Urban Structure Types (UST). By linking the knowledge of typical material compositions with urban structures, gradient analysis seems to be a powerful tool to map characteristic material compositions in 30 m imaging spectroscopy data of urban areas.
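The regression step described here (gradient scores regressed on mean reflectance via PLS) can be sketched with a minimal single-response PLS1/NIPALS implementation. This is a generic numpy illustration, not the study's processing chain, and the synthetic data in the test bear no relation to EnMAP spectra.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 via NIPALS deflation; returns regression vector and intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xc.T @ yc                  # weight: covariance direction with y
        w /= np.linalg.norm(w)
        t = Xc @ w                     # score vector
        tt = t @ t
        p = Xc.T @ t / tt              # X loading
        qk = yc @ t / tt               # y loading
        Xc = Xc - np.outer(t, p)       # deflate X
        yc = yc - qk * t               # deflate y
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    # Convert component-wise loadings back to coefficients in the original X space.
    B = W @ np.linalg.solve(P.T @ W, q)
    return B, y_mean - x_mean @ B

def pls1_predict(X, B, b0):
    return X @ B + b0
```

In the study's setting, `X` would hold the mean EnMAP reflectance per sample and `y` a DCA gradient score; the number of components is typically chosen by cross-validation rather than fixed in advance.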