Dimensionality Reduction Methods in fMRI Analysis and Visualization
The need to model and understand high-dimensional, noisy data sets is common in many domains these days, among them neuroimaging and fMRI analysis. Dimensionality reduction and variable selection are two common strategies for dealing with high-dimensional data, either as a pre-processing step prior to further analysis or as an analysis step in itself.
This thesis discusses both dimensionality reduction and variable selection, with a focus on fMRI analysis, visualization, and applications of visualization in fMRI analysis. Three new algorithms are introduced.
The first algorithm uses a sparse Canonical Correlation Analysis model and a high-dimensional stimulus representation to find relevant voxels (variables) in fMRI experiments with complex natural stimuli. Experiments on a data set involving music show that the algorithm successfully retrieves voxels relevant to the experimental condition.
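The sparse CCA idea can be sketched numerically. The following is a minimal illustration rather than the thesis's algorithm: it estimates one pair of canonical-style directions by alternating power iterations on the voxel-stimulus cross-covariance, with soft-thresholding (in the spirit of penalized matrix decomposition approaches to sparse CCA) inducing voxel sparsity. All data and variable names are synthetic.

```python
import numpy as np

def sparse_cca(X, Y, penalty=0.5, n_iter=50):
    """One pair of sparse canonical-style directions via alternating power
    iterations on the cross-covariance, soft-thresholding the X-side weights
    relative to their largest magnitude (a simplified sparse-CCA sketch)."""
    C = X.T @ Y                                       # voxel-stimulus cross-covariance
    v = np.linalg.svd(C, full_matrices=False)[2][0]   # init: leading right singular vector
    u = np.zeros(X.shape[1])
    for _ in range(n_iter):
        u = C @ v
        u = np.sign(u) * np.maximum(np.abs(u) - penalty * np.abs(u).max(), 0.0)
        u /= np.linalg.norm(u)
        v = C.T @ u
        v /= np.linalg.norm(v)
    return u, v                                       # nonzero entries of u mark selected voxels

rng = np.random.default_rng(0)
stim = rng.standard_normal((100, 5))                  # synthetic stimulus feature time courses
X = rng.standard_normal((100, 50))                    # 50 synthetic "voxels"
X[:, :3] += stim[:, :1]                               # voxels 0-2 track stimulus feature 0
u, v = sparse_cca(X - X.mean(0), stim - stim.mean(0))
selected = np.flatnonzero(u)                          # voxels with nonzero weight
```

On this toy data the thresholded weight vector concentrates on the three stimulus-driven voxels, mirroring how the thesis's method flags voxels relevant to the experimental condition.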
The second algorithm, NeRV, is a dimensionality reduction method for visualizing high-dimensional data using scatterplots. A simple abstract model of the way a human studies a scatterplot is formulated, and NeRV is derived as an algorithm for producing optimal visualizations in terms of this model. Experiments show that NeRV outperforms conventional dimensionality reduction methods under this model. NeRV is also used to perform a novel form of exploratory data analysis on the fMRI voxels selected by the first algorithm; the analysis simultaneously demonstrates the usefulness of NeRV in practice and offers further insights into the performance of the voxel selection algorithm.
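For reference, the abstract model leads to a cost that interpolates between recall- and precision-type errors. In the commonly stated formulation, the NeRV objective reads

$$ E \;=\; \lambda \sum_i D_{\mathrm{KL}}\!\left(p_i \,\|\, q_i\right) \;+\; (1-\lambda) \sum_i D_{\mathrm{KL}}\!\left(q_i \,\|\, p_i\right), $$

where $p_i$ and $q_i$ are probabilistic neighborhoods of point $i$ in the input space and in the display, respectively, and $\lambda$ sets the trade-off between missing true neighbors (recall) and showing false neighbors (precision).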
The third algorithm, LDA-NeRV, combines a Bayesian latent-variable model for graphs with NeRV to produce one of the first principled graph drawing methods. Experiments show that LDA-NeRV is capable of visualizing structure that conventional graph drawing methods fail to reveal.
Projection Based Models for High Dimensional Data
In recent years, many machine learning applications have arisen which deal with the
problem of finding patterns in high dimensional data. Principal component analysis
(PCA) has become ubiquitous in this setting. PCA performs dimensionality reduction
by estimating latent factors which minimise the reconstruction error between
the original data and its low-dimensional projection. We initially consider a situation
where influential observations exist within the dataset which have a large,
adverse effect on the estimated PCA model. We propose a measure of “predictive
influence” to detect these points based on the contribution of each point to the
leave-one-out reconstruction error of the model using an analytic PRedicted REsidual
Sum of Squares (PRESS) statistic. We then develop a robust alternative to PCA
to deal with the presence of influential observations and outliers which minimizes
the predictive reconstruction error.
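As a sketch of the idea (not the thesis's analytic PRESS derivation, which avoids refitting), predictive influence can be computed naively by refitting PCA with each point held out and scoring its out-of-sample reconstruction error:

```python
import numpy as np

def pca_reconstruct(train, x, k):
    """Reconstruct x from the top-k principal subspace estimated on train."""
    mu = train.mean(axis=0)
    _, _, Vt = np.linalg.svd(train - mu, full_matrices=False)
    W = Vt[:k].T                          # principal directions as columns
    return mu + (x - mu) @ W @ W.T

def press_influence(X, k=1):
    """Leave-one-out PRESS contributions, computed naively: each point's
    influence is its reconstruction error under a PCA model fit without it."""
    errs = np.empty(len(X))
    for i in range(len(X)):
        train = np.delete(X, i, axis=0)
        errs[i] = np.sum((X[i] - pca_reconstruct(train, X[i], k)) ** 2)
    return errs

rng = np.random.default_rng(1)
t = rng.standard_normal(60)
X = np.outer(t, [2.0, 1.0, 0.0]) + 0.05 * rng.standard_normal((60, 3))
X[0] = [0.0, 0.0, 10.0]                   # influential point off the main subspace
errs = press_influence(X, k=1)            # the held-out error exposes the outlier
```

Points lying near the principal subspace reconstruct well from a model fit without them; an influential observation does not, so its leave-one-out error stands out.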
In some applications there may be unobserved clusters in the data, for which
fitting PCA models to subsets of the data would provide a better fit. This is known
as the subspace clustering problem. We develop a novel algorithm for subspace
clustering which iteratively fits PCA models to subsets of the data and assigns observations
to clusters based on their predictive influence on the reconstruction error.
We study the convergence of the algorithm and compare its performance to a number
of subspace clustering methods on simulated data and in real applications from
computer vision involving clustering object trajectories in video sequences and images
of faces.
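A minimal sketch of such an iterative subspace clustering loop, substituting plain reconstruction error for the PRESS-based predictive-influence criterion the thesis uses, and with synthetic data in place of the vision benchmarks:

```python
import numpy as np

def fit_pca(X, k):
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k].T

def sq_residual(X, mu, W):
    R = (X - mu) - (X - mu) @ W @ W.T     # reconstruction residuals per point
    return np.sum(R ** 2, axis=1)

def k_subspaces(X, n_clusters=2, k=1, n_iter=30, seed=0):
    """Alternate between fitting one PCA model per cluster and reassigning
    each point to the model that reconstructs it best."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=len(X))
    for _ in range(n_iter):
        for c in range(n_clusters):
            if not np.any(labels == c):   # guard: reseed an emptied cluster
                labels[rng.integers(len(X))] = c
        models = [fit_pca(X[labels == c], k) for c in range(n_clusters)]
        errs = np.column_stack([sq_residual(X, mu, W) for mu, W in models])
        new = errs.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

rng = np.random.default_rng(2)
t = rng.standard_normal(40)
A = np.column_stack([t, 0.05 * rng.standard_normal(40)])   # points near the x-axis
B = np.column_stack([0.05 * rng.standard_normal(40), t])   # points near the y-axis
labels = k_subspaces(np.vstack([A, B]))
```

On the two synthetic lines the loop converges in a few iterations, assigning each point to its own one-dimensional subspace (up to a label permutation).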
We extend our predictive clustering framework to a setting where two high-dimensional
views of data have been obtained. Typically, either clustering or predictive modelling alone is performed between the views. Instead, we aim to recover
clusters which are maximally predictive between the views. In this setting two block
partial least squares (TB-PLS) is a useful model. TB-PLS performs dimensionality
reduction in both views by estimating latent factors that are highly predictive. We
fit TB-PLS models to subsets of data and assign points to clusters based on their
predictive influence under each model which is evaluated using a PRESS statistic.
We compare our method to state of the art algorithms in real applications in webpage
and document clustering and find that our approach to predictive clustering
yields superior results.
Finally, we propose a method for dynamically tracking multivariate data streams
based on PLS. Our method learns a linear regression function from multivariate
input and output streaming data in an incremental fashion while also performing
dimensionality reduction and variable selection. Moreover, the recursive regression
model is able to adapt to sudden changes in the data generating mechanism and also
identifies the number of latent factors. We apply our method to the enhanced index
tracking problem in computational finance.
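The flavour of incremental, drift-adaptive regression can be illustrated with recursive least squares under a forgetting factor; note this is a generic sketch, not the PLS-based method the thesis proposes (which also performs dimensionality reduction and variable selection):

```python
import numpy as np

class RecursiveLS:
    """Recursive least squares with a forgetting factor: each update corrects
    the weights by the prediction error, while the forgetting factor lam < 1
    discounts old samples so the model tracks sudden regime changes."""
    def __init__(self, dim, forget=0.95):
        self.w = np.zeros(dim)
        self.P = 1e3 * np.eye(dim)        # (scaled) inverse covariance estimate
        self.lam = forget

    def update(self, x, y):
        Px = self.P @ x
        g = Px / (self.lam + x @ Px)      # gain vector
        self.w += g * (y - x @ self.w)    # correct by the prediction error
        self.P = (self.P - np.outer(g, Px)) / self.lam

rng = np.random.default_rng(3)
model = RecursiveLS(dim=2)
w_true = np.array([1.0, -2.0])
for t in range(400):
    if t == 200:
        w_true = np.array([3.0, 0.5])     # sudden change in the data generator
    x = rng.standard_normal(2)
    model.update(x, x @ w_true + 0.01 * rng.standard_normal())
```

After the simulated change point, the discounting lets the weight estimate re-converge to the new regression coefficients within a few effective window lengths.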
scikit-fda: A Python Package for Functional Data Analysis
The library scikit-fda is a Python package for Functional Data Analysis
(FDA). It provides a comprehensive set of tools for representation,
preprocessing, and exploratory analysis of functional data. The library is
built upon and integrated into Python's scientific ecosystem. In particular, it
conforms to the scikit-learn application programming interface so as to take
advantage of the functionality for machine learning provided by this package:
pipelines, model selection, and hyperparameter tuning, among others. The
scikit-fda package has been released as free and open-source software under a
3-Clause BSD license and is open to contributions from the FDA community. The
library's extensive documentation includes step-by-step tutorials and detailed
examples of use.
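To keep the illustration dependency-free, the following numpy sketch mirrors what scikit-fda's discretized representation and functional PCA tools provide: curves sampled on a common grid, a pointwise mean function, and principal modes of variation. The data and names are illustrative, not part of the package.

```python
import numpy as np

# Functional data on a common grid: each row is one discretized curve,
# mirroring the values-plus-grid-points representation of an FDataGrid.
grid = np.linspace(0, 1, 50)
rng = np.random.default_rng(4)
amps = 1.0 + 0.3 * rng.standard_normal(30)            # random per-curve amplitudes
curves = (amps[:, None] * np.sin(2 * np.pi * grid)
          + 0.02 * rng.standard_normal((30, 50)))     # 30 noisy sine curves

mean_curve = curves.mean(axis=0)                      # pointwise mean function
centered = curves - mean_curve
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)                   # variance ratio per component
fpc1 = Vt[0]                                          # first functional principal component
```

Because the curves differ essentially only in amplitude, the first component captures nearly all variation; in scikit-fda the same exploratory analysis is a few calls on an `FDataGrid` object.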
Peak selection in metabolic profiles using functional data analysis
In this thesis we describe sparse principal component analysis (PCA) methods and apply
them to the analysis of short multivariate time series in order to perform both dimensionality
reduction and variable selection. We take a functional data analysis (FDA) modelling
approach in which each time series is treated as a continuous smooth function of time or
curve.
These techniques have been applied to analyse time series data arising in the area
of metabonomics. Metabonomics studies chemical processes involving small molecule
metabolites in a cell. We use experimental data obtained from the COnsortium for MEtabonomic
Toxicology (COMET) project which is formed by six pharmaceutical companies and
Imperial College London, UK. In the COMET project, repeated measurements of several
metabolites over time were collected from rats subjected to different drug
treatments. The aim of our study is to detect important metabolites by analysing the multivariate
time series.
Multivariate functional PCA is an exploratory technique to describe the observed time
series. In its standard form, PCA involves linear combinations of all variables (i.e. metabolite
peaks) and does not perform variable selection. In order to select a subset of important
metabolites we introduce sparsity into the model. We develop a novel functional Sparse
Grouped Principal Component Analysis (SGPCA) algorithm using ideas related to Least
Absolute Shrinkage and Selection Operator (LASSO), a regularized regression technique,
with grouped variables. This SGPCA algorithm detects a sparse linear combination of
metabolites which explain a large proportion of the variance. Apart from SGPCA, we also propose two alternative approaches for metabolite selection. The first one is based on
thresholding the multivariate functional PCA solution, while the second method computes
the variance of each metabolite curve independently and then ranks these curves
in decreasing order of importance. To the best of our knowledge, this is the first application
of sparse functional PCA methods to the problem of modelling multivariate metabonomic
time series data and selecting a subset of metabolite peaks.
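A toy version of group-level sparsity in PCA (a simplified stand-in for SGPCA's group-lasso penalty, not the thesis's algorithm) can be written as a thresholded power iteration that keeps or drops whole groups of loadings at once; here a "group" plays the role of one metabolite's coefficients:

```python
import numpy as np

def grouped_sparse_pc(X, groups, thresh=0.3, n_iter=100):
    """Leading sparse principal component with group-level selection: power
    iteration on X^T X, zeroing every group of loadings whose norm falls
    below a threshold, so selection happens metabolite-by-metabolite."""
    C = X.T @ X
    v = np.linalg.svd(X, full_matrices=False)[2][0]
    for _ in range(n_iter):
        v = C @ v
        v /= np.linalg.norm(v)
        for g in np.unique(groups):
            idx = groups == g
            if np.linalg.norm(v[idx]) < thresh:   # drop the whole group
                v[idx] = 0.0
        n = np.linalg.norm(v)
        if n == 0:
            break
        v /= n
    return v

rng = np.random.default_rng(5)
groups = np.repeat(np.arange(4), 5)               # 4 "metabolites" x 5 coefficients
f = rng.standard_normal(80)
X = 0.3 * rng.standard_normal((80, 20))
X[:, :5] += np.outer(f, rng.uniform(0.8, 1.2, 5)) # metabolite 0 drives the variance
v = grouped_sparse_pc(X - X.mean(0), groups)
active = np.unique(groups[np.abs(v) > 0])         # groups surviving selection
```

The thresholding acts on whole groups rather than individual loadings, which is the essential difference between grouped and plain sparse PCA.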
We present comprehensive experimental results using simulated data and COMET project
data for different multivariate and functional PCA variants from the literature and for
SGPCA. Simulation results show that the SGPCA algorithm recovers a high proportion
of truly important metabolite variables. Furthermore, in the case of SGPCA applied to the
COMET dataset we identify a small number of important metabolites independently for
two different treatment conditions. A comparison of selected metabolites in both treatment
conditions reveals an overlap of over 75 percent.
Evolutionary Computation and QSAR Research
The successful high-throughput screening of molecule libraries for a specific biological property is one of the main improvements in drug discovery. Virtual molecular filtering and screening rely greatly on quantitative structure-activity relationship (QSAR) analysis, a mathematical model that correlates the activity of a molecule with molecular descriptors. QSAR models have the potential to reduce the costly failure of drug candidates in advanced (clinical) stages by filtering combinatorial libraries, eliminating candidates with a predicted toxic effect or poor pharmacokinetic profile, and reducing the number of experiments. To obtain a predictive and reliable QSAR model, scientists use methods from various fields such as molecular modeling, pattern recognition, machine learning, and artificial intelligence. QSAR modeling relies on three main steps: codification of molecular structure into molecular descriptors, selection of variables relevant to the analyzed activity, and search for the optimal mathematical model that correlates the molecular descriptors with a specific activity. Since a variety of techniques from statistics and artificial intelligence can aid the variable selection and model building steps, this review focuses on the evolutionary computation methods supporting these tasks. Thus, this review explains the basics of genetic algorithms and genetic programming as evolutionary computation approaches, selection methods for high-dimensional data in QSAR, methods to build QSAR models, current evolutionary feature selection methods and applications in QSAR, and future trends in joint or multi-task feature selection methods.
Funding: Instituto de Salud Carlos III (PIO52048; RD07/0067/0005); Ministerio de Industria, Comercio y Turismo (TSI-020110-2009-53); Galicia, Consellería de Economía e Industria (10SIN105004P).
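A genetic algorithm for descriptor selection of the kind this review surveys can be sketched as follows. The fitness function, penalty weight, and data are illustrative stand-ins for a real QSAR scoring step (e.g. cross-validated model quality):

```python
import numpy as np

def ga_select(X, y, pop=30, gens=40, seed=6):
    """Toy genetic algorithm for descriptor selection: individuals are binary
    masks over variables; fitness is the (penalized) least-squares fit quality
    of the masked linear model. Selection, crossover, and mutation follow the
    textbook GA recipe."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    popn = rng.random((pop, n)) < 0.2                 # initial random masks

    def fitness(mask):
        if not mask.any():
            return -np.inf
        Xm = X[:, mask]
        beta, *_ = np.linalg.lstsq(Xm, y, rcond=None)
        resid = y - Xm @ beta
        return -np.sum(resid ** 2) - 0.5 * mask.sum() # penalize large subsets

    for _ in range(gens):
        scores = np.array([fitness(m) for m in popn])
        popn = popn[np.argsort(scores)[::-1]]         # rank by fitness (elitism)
        children = []
        for _ in range(pop // 2):
            a, b = popn[rng.integers(pop // 2, size=2)]  # parents from fitter half
            cut = rng.integers(1, n)
            child = np.concatenate([a[:cut], b[cut:]])   # one-point crossover
            flip = rng.random(n) < 1.0 / n               # bit-flip mutation
            children.append(np.where(flip, ~child, child))
        popn = np.vstack([popn[: pop - len(children)], children])
    scores = np.array([fitness(m) for m in popn])
    return popn[np.argmax(scores)]

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 15))                    # 15 synthetic descriptors
y = 2 * X[:, 0] - 3 * X[:, 1] + 0.1 * rng.standard_normal(100)
best = ga_select(X, y)                                # mask of selected descriptors
```

Elitism keeps the best masks alive across generations, so once a mask containing the truly informative descriptors appears it is retained.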
Computational Intelligence Techniques for OES Data Analysis
Semiconductor manufacturers are forced by market demand to continually
deliver lower cost and faster devices. This results in complex industrial processes
that, with continuous evolution, aim to improve quality and reduce
costs. Plasma etching processes have been identified as a critical part of the
production of semiconductor devices. It is therefore important to have good
control over plasma etching but this is a challenging task due to the complex
physics involved.
Optical Emission Spectroscopy (OES) measurements can be collected
non-intrusively during wafer processing and are being used more and more
in semiconductor manufacturing as they provide real time plasma chemical
information. However, the use of OES measurements is challenging due to
their complexity, high dimensionality, and the presence of many redundant variables.
The development of advanced analysis algorithms for virtual metrology,
anomaly detection, and variable selection is fundamental in order to
effectively use OES measurements in a production process.
This thesis focuses on computational intelligence techniques for OES data
analysis in semiconductor manufacturing presenting both theoretical results
and industrial application studies. To begin with, a spectrum alignment
algorithm is developed to align OES measurements from different sensors.
Then supervised variable selection algorithms are developed. These are designed
as improved versions of the LASSO estimator with a view to selecting
a more stable set of variables and achieving better prediction performance in virtual
metrology applications. After this, the focus of the thesis moves to the unsupervised
variable selection problem. The Forward Selection Component
Analysis (FSCA) algorithm is improved with the introduction of computationally
efficient implementations and different refinement procedures. Nonlinear
extensions of FSCA are also proposed. Finally, the fundamental topic
of anomaly detection is investigated and an unsupervised variable selection
algorithm tailored to anomaly detection is developed. In addition, it is shown
how OES data can be effectively used for semi-supervised anomaly detection
in a semiconductor manufacturing process.
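The greedy core of FSCA can be sketched as follows; this is a simplified reading of the algorithm (pick the variable whose column explains the most variance of the whole data matrix, deflate, repeat) that omits the computational shortcuts and refinement procedures the thesis develops:

```python
import numpy as np

def fsca(X, n_select):
    """Greedy Forward Selection Component Analysis sketch: at each step,
    score every variable by how much variance of the (deflated) matrix its
    column explains, select the best, then deflate by its contribution."""
    R = X - X.mean(axis=0)
    selected = []
    for _ in range(n_select):
        norms = np.sum(R ** 2, axis=0)
        # score[j] = ||R^T x_j||^2 / ||x_j||^2 : variance explained by column j
        scores = np.sum((R.T @ R) ** 2, axis=1) / np.where(norms > 0, norms, np.inf)
        scores[selected] = -np.inf            # never reselect a chosen variable
        j = int(np.argmax(scores))
        selected.append(j)
        x = R[:, [j]]
        R = R - x @ (x.T @ R) / (x.T @ x)     # deflate: remove x_j's contribution
    return selected

rng = np.random.default_rng(8)
f = rng.standard_normal((200, 2))             # two independent latent factors
loads = np.zeros((2, 10))
loads[0, :5] = 1.0                            # variables 0-4 copy factor 0
loads[1, 5:] = 1.0                            # variables 5-9 copy factor 1
X = f @ loads + 0.1 * rng.standard_normal((200, 10))
sel = fsca(X, 2)                              # one representative per factor
```

Because deflation removes everything the first selected variable explains, the second pick necessarily comes from the other redundant block, which is exactly the behaviour that makes FSCA useful for pruning redundant OES channels.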
The developed algorithms open up opportunities for the effective use of
OES data for advanced process control. All the developed methodologies
require minimal user intervention and provide easy to interpret models. This
makes them practical for engineers to use during production for process monitoring
and for in-line detection and diagnosis of process issues, thereby resulting
in an overall improvement in production performance.