5,390 research outputs found
Projection Based Models for High Dimensional Data
In recent years, many machine learning applications have arisen which deal with the
problem of finding patterns in high dimensional data. Principal component analysis
(PCA) has become ubiquitous in this setting. PCA performs dimensionality reduction
by estimating latent factors which minimise the reconstruction error between
the original data and its low-dimensional projection. We initially consider a situation
where influential observations exist within the dataset which have a large,
adverse affect on the estimated PCA model. We propose a measure of “predictive
influence” to detect these points based on the contribution of each point to the
leave-one-out reconstruction error of the model using an analytic PRedicted REsidual
Sum of Squares (PRESS) statistic. We then develop a robust alternative to PCA
to deal with the presence of influential observations and outliers which minimizes
the predictive reconstruction error.
In some applications there may be unobserved clusters in the data, for which
fitting PCA models to subsets of the data would provide a better fit. This is known
as the subspace clustering problem. We develop a novel algorithm for subspace
clustering which iteratively fits PCA models to subsets of the data and assigns observations
to clusters based on their predictive influence on the reconstruction error.
We study the convergence of the algorithm and compare its performance to a number
of subspace clustering methods on simulated data and in real applications from
computer vision involving clustering object trajectories in video sequences and images
of faces.
We extend our predictive clustering framework to a setting where two high-dimensional
views of data have been obtained. Often, only either clustering or predictive modelling is performed between the views. Instead, we aim to recover
clusters which are maximally predictive between the views. In this setting two block
partial least squares (TB-PLS) is a useful model. TB-PLS performs dimensionality
reduction in both views by estimating latent factors that are highly predictive. We
fit TB-PLS models to subsets of data and assign points to clusters based on their
predictive influence under each model which is evaluated using a PRESS statistic.
We compare our method to state of the art algorithms in real applications in webpage
and document clustering and find that our approach to predictive clustering
yields superior results.
Finally, we propose a method for dynamically tracking multivariate data streams
based on PLS. Our method learns a linear regression function from multivariate
input and output streaming data in an incremental fashion while also performing
dimensionality reduction and variable selection. Moreover, the recursive regression
model is able to adapt to sudden changes in the data generating mechanism and also
identifies the number of latent factors. We apply our method to the enhanced index
tracking problem in computational finance
Tensor Analysis and Fusion of Multimodal Brain Images
Current high-throughput data acquisition technologies probe dynamical systems
with different imaging modalities, generating massive data sets at different
spatial and temporal resolutions posing challenging problems in multimodal data
fusion. A case in point is the attempt to parse out the brain structures and
networks that underpin human cognitive processes by analysis of different
neuroimaging modalities (functional MRI, EEG, NIRS etc.). We emphasize that the
multimodal, multi-scale nature of neuroimaging data is well reflected by a
multi-way (tensor) structure where the underlying processes can be summarized
by a relatively small number of components or "atoms". We introduce
Markov-Penrose diagrams - an integration of Bayesian DAG and tensor network
notation in order to analyze these models. These diagrams not only clarify
matrix and tensor EEG and fMRI time/frequency analysis and inverse problems,
but also help understand multimodal fusion via Multiway Partial Least Squares
and Coupled Matrix-Tensor Factorization. We show here, for the first time, that
Granger causal analysis of brain networks is a tensor regression problem, thus
allowing the atomic decomposition of brain networks. Analysis of EEG and fMRI
recordings shows the potential of the methods and suggests their use in other
scientific domains.Comment: 23 pages, 15 figures, submitted to Proceedings of the IEE
Sparse Learning for Variable Selection with Structures and Nonlinearities
In this thesis we discuss machine learning methods performing automated
variable selection for learning sparse predictive models. There are multiple
reasons for promoting sparsity in the predictive models. By relying on a
limited set of input variables the models naturally counteract the overfitting
problem ubiquitous in learning from finite sets of training points. Sparse
models are cheaper to use for predictions, they usually require lower
computational resources and by relying on smaller sets of inputs can possibly
reduce costs for data collection and storage. Sparse models can also contribute
to better understanding of the investigated phenomenons as they are easier to
interpret than full models.Comment: PhD thesi
Wavelength Selection Method Based on Partial Least Square from Hyperspectral Unmanned Aerial Vehicle Orthomosaic of Irrigated Olive Orchards
Identifying and mapping irrigated areas is essential for a variety of applications such as agricultural planning and water resource management. Irrigated plots are mainly identified using supervised classification of multispectral images from satellite or manned aerial platforms. Recently, hyperspectral sensors on-board Unmanned Aerial Vehicles (UAV) have proven to be useful analytical tools in agriculture due to their high spectral resolution. However, few efforts have been made to identify which wavelengths could be applied to provide relevant information in specific scenarios. In this study, hyperspectral reflectance data from UAV were used to compare the performance of several wavelength selection methods based on Partial Least Square (PLS) regression with the purpose of discriminating two systems of irrigation commonly used in olive orchards. The tested PLS methods include filter methods (Loading Weights, Regression Coefficient and Variable Importance in Projection); Wrapper methods (Genetic Algorithm-PLS, Uninformative Variable Elimination-PLS, Backward Variable Elimination-PLS, Sub-window Permutation Analysis-PLS, Iterative Predictive Weighting-PLS, Regularized Elimination Procedure-PLS, Backward Interval-PLS, Forward Interval-PLS and Competitive Adaptive Reweighted Sampling-PLS); and an Embedded method (Sparse-PLS). In addition, two non-PLS based methods, Lasso and Boruta, were also used. Linear Discriminant Analysis and nonlinear K-Nearest Neighbors techniques were established for identification and assessment. The results indicate that wavelength selection methods, commonly used in other disciplines, provide utility in remote sensing for agronomical purposes, the identification of irrigation techniques being one such example. In addition to the aforementioned, these PLS and non-PLS based methods can play an important role in multivariate analysis, which can be used for subsequent model analysis. Of all the methods evaluated, Genetic Algorithm-PLS and Boruta eliminated nearly 90% of the original spectral wavelengths acquired from a hyperspectral sensor onboard a UAV while increasing the identification accuracy of the classification
Distributed Supervised Statistical Learning
We live in the era of big data, nowadays, many companies face data of massive size that, in most cases, cannot be stored and processed on a single computer. Often such data has to be distributed over multiple computers which then makes the storage, pre-processing, and data analysis possible in practice. In the age of big data, distributed learning has gained popularity as a method to manage enormous datasets. In this thesis, we focus on distributed supervised statistical learning where sparse linear regression analysis is performed in a distributed framework. These methods are frequently applied in a variety of disciplines tackling large scale datasets analysis, including engineering, economics, and finance. In distributed learning, one key question is, for example, how to efficiently aggregate multiple estimators that are obtained based on data subsets stored on multiple computers. We investigate recent studies on distributed statistical inferences. There have been many efforts to propose efficient ways of aggregating local estimates, most popular methods are discussed in this thesis. Recently, an important question about the number of machines to deploy is addressed for several estimation methods, notable answers to the question are reviewed in this literature. We have considered a specific class of Liu-type shrinkage estimation methods for distributed statistical inference. We also conduct a Monte Carlo simulation study to assess performance of the Liu-type shrinkage estimation methods in a distributed framework
- …