4 research outputs found
Projection Based Models for High Dimensional Data
In recent years, many machine learning applications have arisen which deal with the
problem of finding patterns in high dimensional data. Principal component analysis
(PCA) has become ubiquitous in this setting. PCA performs dimensionality reduction
by estimating latent factors which minimise the reconstruction error between
the original data and its low-dimensional projection. We initially consider a situation
where influential observations exist within the dataset which have a large,
adverse affect on the estimated PCA model. We propose a measure of “predictive
influence” to detect these points based on the contribution of each point to the
leave-one-out reconstruction error of the model using an analytic PRedicted REsidual
Sum of Squares (PRESS) statistic. We then develop a robust alternative to PCA
to deal with the presence of influential observations and outliers which minimizes
the predictive reconstruction error.
In some applications there may be unobserved clusters in the data, for which
fitting PCA models to subsets of the data would provide a better fit. This is known
as the subspace clustering problem. We develop a novel algorithm for subspace
clustering which iteratively fits PCA models to subsets of the data and assigns observations
to clusters based on their predictive influence on the reconstruction error.
We study the convergence of the algorithm and compare its performance to a number
of subspace clustering methods on simulated data and in real applications from
computer vision involving clustering object trajectories in video sequences and images
of faces.
We extend our predictive clustering framework to a setting where two high-dimensional
views of data have been obtained. Often, only either clustering or predictive modelling is performed between the views. Instead, we aim to recover
clusters which are maximally predictive between the views. In this setting two block
partial least squares (TB-PLS) is a useful model. TB-PLS performs dimensionality
reduction in both views by estimating latent factors that are highly predictive. We
fit TB-PLS models to subsets of data and assign points to clusters based on their
predictive influence under each model which is evaluated using a PRESS statistic.
We compare our method to state of the art algorithms in real applications in webpage
and document clustering and find that our approach to predictive clustering
yields superior results.
Finally, we propose a method for dynamically tracking multivariate data streams
based on PLS. Our method learns a linear regression function from multivariate
input and output streaming data in an incremental fashion while also performing
dimensionality reduction and variable selection. Moreover, the recursive regression
model is able to adapt to sudden changes in the data generating mechanism and also
identifies the number of latent factors. We apply our method to the enhanced index
tracking problem in computational finance
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval
Data mining is an essential process for identifying the patterns in large datasets through machine learning techniques and database systems. Clustering of high dimensional data is becoming very challenging process due to curse of dimensionality. In addition, space complexity and data retrieval performance was not improved. In order to overcome the limitation, Spectral Clustering Based VP Tree Indexing Technique is introduced. The technique clusters and indexes the densely populated high dimensional data points for effective data retrieval based on user query. A Normalized Spectral Clustering Algorithm is used to group similar high dimensional data points. After that, Vantage Point Tree is constructed for indexing the clustered data points with minimum space complexity. At last, indexed data gets retrieved based on user query using Vantage Point Tree based Data Retrieval Algorithm. This in turn helps to improve true positive rate with minimum retrieval time. The performance is measured in terms of space complexity, true positive rate and data retrieval time with El Nino weather data sets from UCI Machine Learning Repository. An experimental result shows that the proposed technique is able to reduce the space complexity by 33% and also reduces the data retrieval time by 24% when compared to state-of-the-art-works
Automatic Adaptation of SOA Systems Supported by Machine Learning
Part 3: Service OrientationInternational audienceRecent advances in the development of information systems have led to increased complexity and cost in terms of the required maintenance and management. On the other hand, systems built in accordance with modern architectural paradigms, such as Service Oriented Architecture (SOA), posses features enabling extensive adaptation, not present in traditional systems. Automatic adaptation mechanisms can be used to facilitate system management. The goal of this work is to show that automatic adaptation can be effectively implemented in SOA systems using machine learning algorithms. The presented concept relies on a combination of clustering and reinforcement learning algorithms. The paper discusses assumptions which are necessary to apply machine learning algorithms to automatic adaptation of SOA systems, and presents a machine learning-based management framework prototype. Possible benefits and disadvantages of the presented approach are discussed and the approach itself is validated with a representative case study