25 research outputs found

    Dimensionality reduction methods in fMRI analysis and visualization

    The need to model and understand high-dimensional, noisy data sets is common in many domains these days, among them neuroimaging and fMRI analysis. Dimensionality reduction and variable selection are two common strategies for dealing with high-dimensional data, either as a pre-processing step prior to further analysis or as an analysis step in itself. This thesis discusses both dimensionality reduction and variable selection, with a focus on fMRI analysis, visualization, and applications of visualization in fMRI analysis. Three new algorithms are introduced. The first algorithm uses a sparse Canonical Correlation Analysis (CCA) model and a high-dimensional stimulus representation to find relevant voxels (variables) in fMRI experiments with complex natural stimuli. Experiments on a data set involving music show that the algorithm successfully retrieves voxels relevant to the experimental condition. The second algorithm, NeRV, is a dimensionality reduction method for visualizing high-dimensional data as scatterplots. A simple abstract model of the way a human studies a scatterplot is formulated, and NeRV is derived as an algorithm for producing optimal visualizations in terms of this model. Experiments show that NeRV is superior to conventional dimensionality reduction methods in terms of this model. NeRV is also used to perform a novel form of exploratory data analysis on the fMRI voxels selected by the first algorithm; the analysis simultaneously demonstrates the usefulness of NeRV in practice and offers further insights into the performance of the voxel selection algorithm. The third algorithm, LDA-NeRV, combines a Bayesian latent-variable model for graphs with NeRV to produce one of the first principled graph drawing methods. Experiments show that LDA-NeRV is capable of visualizing structure that conventional graph drawing methods fail to reveal.
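
    As a rough, self-contained illustration of the visualization model behind NeRV, the sketch below scores a given low-dimensional embedding by forming Gaussian neighborhood probabilities in the original and embedded spaces and trading off the two Kullback-Leibler divergences, which penalize missed and false neighbors respectively. The function names, the fixed neighborhood width sigma, and the use of numpy/scipy are illustrative assumptions, not the thesis implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def neighborhood_probs(X, sigma=1.0):
    """Row-stochastic neighborhood probabilities p(j|i) from squared Euclidean distances."""
    d2 = squareform(pdist(X, "sqeuclidean"))
    np.fill_diagonal(d2, np.inf)                      # a point is never its own neighbor
    logits = -d2 / (2.0 * sigma ** 2)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def nerv_style_cost(X_high, X_low, lam=0.5, sigma=1.0, eps=1e-12):
    """lam * KL(p||q) penalizes missed neighbors (recall);
    (1 - lam) * KL(q||p) penalizes false neighbors (precision)."""
    p = neighborhood_probs(X_high, sigma)
    q = neighborhood_probs(X_low, sigma)
    kl_pq = np.sum(p * np.log((p + eps) / (q + eps)), axis=1)
    kl_qp = np.sum(q * np.log((q + eps) / (p + eps)), axis=1)
    return float(np.mean(lam * kl_pq + (1.0 - lam) * kl_qp))
```

    A lower cost indicates a scatterplot whose visible neighbors better match the true high-dimensional neighbors; lam controls the precision/recall trade-off.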

    Projection Based Models for High Dimensional Data

    In recent years, many machine learning applications have arisen which deal with the problem of finding patterns in high dimensional data. Principal component analysis (PCA) has become ubiquitous in this setting. PCA performs dimensionality reduction by estimating latent factors which minimise the reconstruction error between the original data and its low-dimensional projection. We initially consider a situation where influential observations exist within the dataset which have a large, adverse effect on the estimated PCA model. We propose a measure of “predictive influence” to detect these points based on the contribution of each point to the leave-one-out reconstruction error of the model, using an analytic PRedicted REsidual Sum of Squares (PRESS) statistic. We then develop a robust alternative to PCA which minimises the predictive reconstruction error in order to deal with the presence of influential observations and outliers. In some applications there may be unobserved clusters in the data, for which fitting PCA models to subsets of the data would provide a better fit. This is known as the subspace clustering problem. We develop a novel algorithm for subspace clustering which iteratively fits PCA models to subsets of the data and assigns observations to clusters based on their predictive influence on the reconstruction error. We study the convergence of the algorithm and compare its performance to a number of subspace clustering methods on simulated data and in real applications from computer vision involving clustering object trajectories in video sequences and images of faces. We extend our predictive clustering framework to a setting where two high-dimensional views of the data have been obtained. Often, only clustering or only predictive modelling is performed between the views. Instead, we aim to recover clusters which are maximally predictive between the views. In this setting, two-block partial least squares (TB-PLS) is a useful model. TB-PLS performs dimensionality reduction in both views by estimating latent factors that are highly predictive. We fit TB-PLS models to subsets of the data and assign points to clusters based on their predictive influence under each model, which is evaluated using a PRESS statistic. We compare our method to state-of-the-art algorithms in real applications in webpage and document clustering and find that our approach to predictive clustering yields superior results. Finally, we propose a method for dynamically tracking multivariate data streams based on PLS. Our method learns a linear regression function from multivariate input and output streaming data in an incremental fashion while also performing dimensionality reduction and variable selection. Moreover, the recursive regression model is able to adapt to sudden changes in the data generating mechanism and also identifies the number of latent factors. We apply our method to the enhanced index tracking problem in computational finance.
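
    The thesis derives an analytic PRESS statistic; as a rough illustration of the underlying idea of predictive influence, the sketch below instead uses an explicit leave-one-out loop, refitting PCA without each observation and measuring how badly the held-out point is reconstructed. The function name and the use of scikit-learn's PCA are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def loo_reconstruction_errors(X, n_components=2):
    """Naive leave-one-out PCA reconstruction error for each observation.
    Large values flag candidate influential observations or outliers."""
    n = X.shape[0]
    errors = np.empty(n)
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False                                   # hold out observation i
        pca = PCA(n_components=n_components).fit(X[mask])
        x_hat = pca.inverse_transform(pca.transform(X[i:i + 1]))
        errors[i] = np.sum((X[i:i + 1] - x_hat) ** 2)     # squared reconstruction error
    return errors
```

    The analytic PRESS statistic described above serves the same purpose without the n explicit model refits of this naive loop.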

    Forward Selection Component Analysis: Algorithms and Applications


    scikit-fda: A Python Package for Functional Data Analysis

    The library scikit-fda is a Python package for Functional Data Analysis (FDA). It provides a comprehensive set of tools for representation, preprocessing, and exploratory analysis of functional data. The library is built upon and integrated in Python's scientific ecosystem. In particular, it conforms to the scikit-learn application programming interface so as to take advantage of the functionality for machine learning provided by this package: pipelines, model selection, and hyperparameter tuning, among others. The scikit-fda package has been released as free and open-source software under a 3-Clause BSD license and is open to contributions from the FDA community. The library's extensive documentation includes step-by-step tutorials and detailed examples of use
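
    A minimal usage sketch, assuming the FDataGrid container and its plot/mean helpers behave as in the documentation; exact module paths and keyword names may differ between scikit-fda releases.

```python
import numpy as np
from skfda import FDataGrid

# Toy functional data set: 20 noisy sinusoidal curves observed on a common grid.
grid_points = np.linspace(0, 1, 50)
data_matrix = np.sin(2 * np.pi * grid_points) + 0.1 * np.random.randn(20, 50)

fd = FDataGrid(data_matrix=data_matrix, grid_points=grid_points)
mean_curve = fd.mean()   # pointwise mean function, itself an FDataGrid
fd.plot()                # one line per sample curve (requires matplotlib)
```

    Because the package conforms to the scikit-learn interface, objects like this can be passed through pipelines of preprocessing, dimensionality reduction, and model selection steps.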

    Peak selection in metabolic profiles using functional data analysis

    In this thesis we describe sparse principal component analysis (PCA) methods and apply them to the analysis of short multivariate time series in order to perform both dimensionality reduction and variable selection. We take a functional data analysis (FDA) modelling approach in which each time series is treated as a continuous smooth function of time, or curve. These techniques have been applied to analyse time series data arising in the area of metabonomics. Metabonomics studies chemical processes involving small-molecule metabolites in a cell. We use experimental data obtained from the COnsortium for MEtabonomic Toxicology (COMET) project, which was formed by six pharmaceutical companies and Imperial College London, UK. In the COMET project, repeated measurements of several metabolites were collected over time from rats subjected to different drug treatments. The aim of our study is to detect important metabolites by analysing the multivariate time series. Multivariate functional PCA is an exploratory technique for describing the observed time series. In its standard form, PCA involves linear combinations of all variables (i.e. metabolite peaks) and does not perform variable selection. In order to select a subset of important metabolites we introduce sparsity into the model. We develop a novel functional Sparse Grouped Principal Component Analysis (SGPCA) algorithm using ideas related to the Least Absolute Shrinkage and Selection Operator (LASSO), a regularized regression technique, with grouped variables. This SGPCA algorithm detects a sparse linear combination of metabolites which explains a large proportion of the variance. Apart from SGPCA, we also propose two alternative approaches for metabolite selection. The first is based on thresholding the multivariate functional PCA solution, while the second computes the variance of each metabolite curve independently and then ranks the curves in decreasing order of importance. To the best of our knowledge, this is the first application of sparse functional PCA methods to the problem of modelling multivariate metabonomic time series data and selecting a subset of metabolite peaks. We present comprehensive experimental results using simulated data and COMET project data for different multivariate and functional PCA variants from the literature and for SGPCA. Simulation results show that the SGPCA algorithm recovers a high proportion of the truly important metabolite variables. Furthermore, when SGPCA is applied to the COMET dataset we identify a small number of important metabolites independently for two different treatment conditions. A comparison of the selected metabolites in both treatment conditions reveals an overlap of over 75 percent.
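
    SGPCA itself involves a grouped LASSO-type penalty; as a simpler, self-contained illustration of the two alternative selection strategies mentioned above, the sketch below (a) ranks metabolite curves by their variance and (b) thresholds ordinary PCA loadings computed on the flattened curves. The array layout, function names, and threshold value are illustrative assumptions, not the thesis code.

```python
import numpy as np
from sklearn.decomposition import PCA

def rank_by_curve_variance(curves):
    """curves: array of shape (n_samples, n_timepoints, n_metabolites).
    Rank metabolites by the total variance of their time courses across samples."""
    variance_per_metabolite = curves.var(axis=0).sum(axis=0)   # pointwise variance summed over time
    return np.argsort(variance_per_metabolite)[::-1]

def select_by_thresholded_loadings(curves, n_components=2, threshold=0.1):
    """Fit PCA to the flattened curves and keep metabolites whose largest absolute
    loading on the leading components exceeds the threshold."""
    n, t, m = curves.shape
    pca = PCA(n_components=n_components).fit(curves.reshape(n, t * m))
    loadings = np.abs(pca.components_).reshape(n_components, t, m).max(axis=(0, 1))
    return np.where(loadings > threshold)[0]
```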

    Evolutionary Computation and QSAR Research

    The successful high-throughput screening of molecule libraries for a specific biological property is one of the main improvements in drug discovery. Virtual molecular filtering and screening relies greatly on quantitative structure-activity relationship (QSAR) analysis, a mathematical model that correlates the activity of a molecule with molecular descriptors. QSAR models have the potential to reduce the costly failure of drug candidates in advanced (clinical) stages by filtering combinatorial libraries, eliminating candidates with a predicted toxic effect or poor pharmacokinetic profile, and reducing the number of experiments. To obtain a predictive and reliable QSAR model, scientists use methods from various fields such as molecular modeling, pattern recognition, machine learning or artificial intelligence. QSAR modeling relies on three main steps: codification of the molecular structure into molecular descriptors, selection of the variables relevant to the analyzed activity, and search for the optimal mathematical model that correlates the molecular descriptors with a specific activity. Since a variety of techniques from statistics and artificial intelligence can aid the variable selection and model building steps, this review focuses on the evolutionary computation methods supporting these tasks. Thus, this review explains the basics of genetic algorithms and genetic programming as evolutionary computation approaches, selection methods for high-dimensional data in QSAR, methods to build QSAR models, current evolutionary feature selection methods and applications in QSAR, and future trends in joint or multi-task feature selection methods. Funding: Instituto de Salud Carlos III (PIO52048; RD07/0067/0005); Ministerio de Industria, Comercio y Turismo (TSI-020110-2009-53); Galicia, Consellería de Economía e Industria (10SIN105004P).
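
    As a toy illustration of the evolutionary feature selection idea reviewed here, the sketch below wraps a simple genetic algorithm around a regression model: each binary chromosome switches individual molecular descriptors on or off, and fitness is the cross-validated score of the model restricted to the selected descriptors. The GA settings, the ridge-regression fitness, and all names are illustrative assumptions, not a specific method from the review.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def ga_select_descriptors(X, y, n_generations=30, pop_size=40, p_mutate=0.02, seed=0):
    """Binary-chromosome GA for descriptor selection (fitness = cross-validated R^2)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    pop = rng.random((pop_size, n_features)) < 0.5        # random initial population

    def fitness(mask):
        if not mask.any():
            return -np.inf                                # empty descriptor subsets are invalid
        return cross_val_score(Ridge(), X[:, mask], y, cv=5).mean()

    for _ in range(n_generations):
        scores = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]   # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_features)             # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_features) < p_mutate    # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, children])
    best = pop[np.argmax([fitness(ind) for ind in pop])]
    return np.where(best)[0]                              # indices of the selected descriptors
```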

    Computational Intelligence Techniques for OES Data Analysis

    Semiconductor manufacturers are forced by market demand to continually deliver lower cost and faster devices. This results in complex industrial processes that, with continuous evolution, aim to improve quality and reduce costs. Plasma etching processes have been identified as a critical part of the production of semiconductor devices. It is therefore important to have good control over plasma etching, but this is a challenging task due to the complex physics involved. Optical Emission Spectroscopy (OES) measurements can be collected non-intrusively during wafer processing and are used increasingly in semiconductor manufacturing as they provide real-time information on the plasma chemistry. However, the use of OES measurements is challenging due to their complexity, high dimensionality, and the presence of many redundant variables. The development of advanced analysis algorithms for virtual metrology, anomaly detection, and variable selection is fundamental to using OES measurements effectively in a production process. This thesis focuses on computational intelligence techniques for OES data analysis in semiconductor manufacturing, presenting both theoretical results and industrial application studies. To begin with, a spectrum alignment algorithm is developed to align OES measurements from different sensors. Then supervised variable selection algorithms are developed. These are defined as improved versions of the LASSO estimator with a view to selecting a more stable set of variables and achieving better prediction performance in virtual metrology applications. After this, the focus of the thesis moves to the unsupervised variable selection problem. The Forward Selection Component Analysis (FSCA) algorithm is improved with the introduction of computationally efficient implementations and different refinement procedures. Nonlinear extensions of FSCA are also proposed. Finally, the fundamental topic of anomaly detection is investigated and an unsupervised variable selection algorithm tailored to anomaly detection is developed. In addition, it is shown how OES data can be effectively used for semi-supervised anomaly detection in a semiconductor manufacturing process. The developed algorithms open up opportunities for the effective use of OES data for advanced process control. All the developed methodologies require minimal user intervention and provide easy-to-interpret models. This makes them practical for engineers to use during production for process monitoring and for in-line detection and diagnosis of process issues, thereby resulting in an overall improvement in production performance.
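
    FSCA greedily picks the raw variables (here, individual OES wavelengths) that best reconstruct the full data matrix in a least-squares sense; the sketch below is a direct, unoptimized reading of that idea rather than the computationally efficient implementations developed in the thesis, and the function name is illustrative.

```python
import numpy as np

def fsca(X, n_selected):
    """Greedy forward selection: at each step add the column of X that most increases
    the variance of X explained by least-squares regression on the selected columns."""
    Xc = X - X.mean(axis=0)                      # centre the data
    total_ss = np.sum(Xc ** 2)
    selected = []
    for _ in range(n_selected):
        best_j, best_explained = None, -np.inf
        for j in range(Xc.shape[1]):
            if j in selected:
                continue
            S = Xc[:, selected + [j]]
            coef, *_ = np.linalg.lstsq(S, Xc, rcond=None)
            explained = total_ss - np.sum((Xc - S @ coef) ** 2)
            if explained > best_explained:
                best_j, best_explained = j, explained
        selected.append(best_j)
    return selected                              # indices of the chosen variables
```

    For example, fsca(oes_spectra, 10) would return the indices of ten wavelengths, where oes_spectra is a hypothetical samples-by-wavelengths matrix of OES measurements.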