430 research outputs found

    Tensor Regression

    Regression analysis is a key area of data analysis and machine learning devoted to exploring the dependencies between variables, often represented as vectors. The emergence of high-dimensional data in fields such as neuroimaging, computer vision, climatology and social networks has challenged traditional data representation methods. Tensors, as higher-order generalizations of vectors, are considered natural representations of such data. In this book, the authors provide a systematic study and analysis of tensor-based regression models and their recent applications. The book groups and illustrates the existing tensor-based regression methods and covers the basics, core ideas, and theoretical characteristics of most of them. In addition, readers can learn how to use existing tensor-based regression methods to solve specific regression tasks with multiway data, which datasets can be selected, and which software packages are available to start related work as soon as possible. Tensor Regression is the first thorough overview of the fundamentals, motivations, popular algorithms, strategies for efficient implementation, related applications, available datasets, and software resources for tensor-based regression analysis. It is essential reading for all students, researchers and practitioners working on high-dimensional data. Comment: 187 pages, 32 figures, 10 tables
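As a rough illustration of what a tensor-based regression model can look like, the sketch below fits a scalar response to a matrix-valued (second-order tensor) covariate, with the coefficient matrix constrained to rank one. This is not taken from the book: the function name `fit_rank1_regression`, the alternating least squares scheme, and the synthetic data are all assumptions chosen for brevity.

```python
import numpy as np

def fit_rank1_regression(X, y, n_iter=50, seed=0):
    """Fit y_i ~ <X_i, u v^T> by alternating least squares.

    X has shape (n, p, q): one p-by-q covariate matrix per sample.
    y has shape (n,). Returns the factors u (length p) and v (length q).
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[2])
    u = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # With v fixed, the model is linear in u with features X_i v.
        Zu = X @ v                              # shape (n, p)
        u = np.linalg.lstsq(Zu, y, rcond=None)[0]
        # With u fixed, the model is linear in v with features X_i^T u.
        Zv = np.einsum("ipq,p->iq", X, u)       # shape (n, q)
        v = np.linalg.lstsq(Zv, y, rcond=None)[0]
    return u, v

# Synthetic, noise-free check (illustrative data only).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4, 3))
B_true = np.outer([1.0, -2.0, 0.5, 3.0], [2.0, 1.0, -1.0])
y = np.einsum("ipq,pq->i", X, B_true)

u_hat, v_hat = fit_rank1_regression(X, y)
B_hat = np.outer(u_hat, v_hat)
```

The rank-one constraint replaces the 12 free entries of a 4-by-3 coefficient matrix with 4 + 3 factor entries; that kind of parameter saving is the usual motivation for low-rank structure on multiway data.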

    Sparse reduced-rank regression for imaging genetics studies: models and applications

    We present a novel statistical technique, the sparse reduced-rank regression (sRRR) model, a strategy for multivariate modelling of high-dimensional imaging responses and genetic predictors. By adopting penalisation techniques, the model enforces sparsity in the regression coefficients, identifying subsets of genetic markers that best explain the variability observed in subsets of the phenotypes. To properly exploit the rich structure present in each of the imaging and genetics domains, we additionally propose the use of several structured penalties within the sRRR model. Using simulation procedures that accurately reflect realistic imaging genetics data, we present detailed evaluations of the sRRR method in comparison with the more traditional univariate linear modelling approach. In all settings considered, we show that sRRR possesses better power to detect the deleterious genetic variants. Moreover, using a simple genetic model, we demonstrate the potential benefits, in terms of statistical power, of carrying out voxel-wise searches as opposed to extracting averages over regions of interest in the brain. Since this entails the use of phenotypic vectors of enormous dimensionality, we suggest the use of a sparse classification model as a de-noising step prior to the imaging genetics study. We also present the application of a data re-sampling technique within the sRRR model for model selection. Using this approach we are able to rank the genetic markers in order of importance of association to the phenotypes, and similarly rank the phenotypes in order of importance to the genetic markers. Finally, we illustrate the application of the proposed statistical models to three real imaging genetics datasets and highlight some potential associations.
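For orientation, the rank constraint at the heart of reduced-rank regression can be sketched with the classical, non-sparse estimator: fit ordinary least squares, then project the coefficients onto the leading right singular vectors of the fitted responses. This is not the authors' sRRR model, which adds sparsity-inducing and structured penalties; the function name and the identity weighting of responses are assumptions.

```python
import numpy as np

def reduced_rank_regression(X, Y, rank):
    """Classical reduced-rank regression with identity response weighting.

    X: (n, p) predictors, Y: (n, q) responses.
    Returns a (p, q) coefficient matrix of rank at most `rank`.
    """
    # Unconstrained least-squares coefficients.
    B_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
    # Leading right singular vectors of the fitted responses.
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    V_r = Vt[:rank].T                      # shape (q, rank)
    # Project the coefficients onto that rank-r response subspace.
    return B_ols @ V_r @ V_r.T
```

In an imaging genetics setting, `X` would hold the genetic markers and `Y` the imaging phenotypes; the low rank expresses the assumption that a few latent factors link the two.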

    Investigation of Multi-dimensional Tensor Multi-task Learning for Modeling Alzheimer's Disease Progression

    Machine learning (ML) techniques for predicting Alzheimer's disease (AD) progression can significantly assist clinicians and researchers in constructing effective AD prevention and treatment strategies. The main constraints on the performance of current ML approaches are limited prediction accuracy and stability on small medical datasets, monotonic data formats (which lose the multi-dimensional structure of the data and the correlations between biomarkers), and limited biomarker interpretability. This thesis investigates how multi-dimensional information and knowledge from biomarker data can be integrated with multi-task learning approaches to predict AD progression. Firstly, a novel similarity-based quantification approach is proposed with two components: multi-dimensional knowledge vector construction and amalgamated magnitude-direction quantification of brain structural variation. The approach considers both the magnitude and the directional correlations of structural variation between brain biomarkers and encodes the quantified data as a third-order tensor, addressing the problem of monotonic data form. Secondly, multi-task learning regression algorithms that integrate multi-dimensional tensor data and mine MRI data for spatio-temporal structural variation information were designed and constructed to improve the accuracy, stability and interpretability of AD progression prediction on small medical datasets. The algorithm consists of three components: supervised symmetric tensor decomposition for extracting biomarker latent factors, tensor multi-task learning regression, and algorithmic regularisation terms. The proposed algorithm aims to extract a set of first-order latent factors from the raw data, each represented by its first-biomarker, second-biomarker and patient-sample dimensions, to elucidate in an interpretable manner the potential factors affecting the variability of the data. These factors serve as predictor variables for training a prediction model that treats the prediction for each patient as a task, with all tasks sharing the set of biomarker latent factors obtained from the tensor decomposition. Knowledge sharing between tasks improves the generalisation ability of the model and addresses the problem of sparse medical data. The experimental results demonstrate that the proposed approach achieves superior accuracy and stability in predicting various cognitive scores of AD progression compared to single-task learning, benchmarks and state-of-the-art multi-task regression methods. The proposed approach identifies brain structural variations in patients, and the important brain biomarker correlations revealed by the experiments can be utilised as potential indicators for early identification of AD.

    Bayesian Analysis of Ultra-High Dimensional Neuroimaging Data

    Medical imaging technologies have been generating extremely complex data sets. This dissertation makes further contributions to the development of statistical tools motivated by modern biomedical challenges. Specifically, we develop methods to characterize varying associations between ultra-high dimensional imaging data and low-dimensional clinical outcomes. The first part of this dissertation is motivated by the major limitations of traditional voxel-wise models, in which voxels are commonly treated as independent units and the assumption that neuroimaging measurements follow a Gaussian distribution is usually flawed. We develop a class of hierarchical spatial transformation models to model the spatially varying associations between imaging measurements in a three-dimensional (3D) volume (or 2D surface) and a set of covariates. The proposed approaches include a spatially varying Box-Cox transformation model and a Gaussian Markov random field model. The second part is motivated by the challenges of ultra-high dimensional datasets. In particular, we introduce a method to predict clinical outcomes from ultra-high dimensional covariates. The proposed models reduce dimensionality to a manageable level and further apply dimension reduction techniques, e.g. principal components analysis and tensor decompositions, to extract and select important low-dimensional features. Doctor of Philosophy
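The Box-Cox transformation named above has a standard closed form, sketched here for reference. The abstract does not say how the transformation parameter varies over space, so the per-voxel list of `lambdas` and the helper name `transform_voxels` are illustrative assumptions, not the dissertation's model.

```python
import math

def box_cox(y, lam):
    """Standard Box-Cox transform of a strictly positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires a strictly positive value")
    if lam == 0:
        # The limit of (y**lam - 1) / lam as lam -> 0 is log(y).
        return math.log(y)
    return (y ** lam - 1.0) / lam

def transform_voxels(values, lambdas):
    """Apply a separate (assumed per-voxel) lambda to each measurement."""
    return [box_cox(y, lam) for y, lam in zip(values, lambdas)]
```

For example, `transform_voxels([1.0, 4.0], [2.0, 0.5])` applies a square-type transform to the first voxel and a square-root-type transform to the second.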

    Causal Mediation Analysis with a Three-Dimensional Image Mediator

    Causal mediation analysis is increasingly common in biology, psychology, epidemiology and related fields. In particular, with the advent of the big data era, the issue of high-dimensional mediators is becoming more prevalent. In neuroscience, with the widespread application of magnetic resonance technology to brain imaging, studies in which an image acts as the mediator have emerged. In this study, a novel causal mediation analysis method with a three-dimensional image mediator is proposed. We define the average causal effects under the potential outcome framework, explore several sufficient conditions for valid identification, and develop techniques for estimation and inference. To verify the effectiveness of the proposed method, a series of simulations under various scenarios is performed. Finally, the proposed method is applied to a study on the causal effect of a mother's delivery mode on her child's IQ development. It is found that the white matter in certain regions of the frontal-temporal areas has mediating effects. Comment: 35 pages, 9 figures

    Machine learning approaches for high-dimensional genome-wide association studies

    Genome-wide association studies (GWAS) aim to find statistical associations between genetic variants and traits of interest. Genetic variants that explain a large share of the variation in genome-wide gene expression may lead to confounding in expression quantitative trait loci (eQTL) analyses. To account for these confounding factors, in Article I we proposed LVREML, a method conceptually analogous to estimating fixed and random effects in linear mixed models (LMM). We showed that the maximum-likelihood latent variables can always be chosen orthogonal to the known factors (such as genetic variants). This indicates that the maximum-likelihood variables explain the sample covariance that is not already explained by the genetic variants in the model. To identify which traits are affected by the identified genetic variants, we need to reverse the functional relation between genotypes and traits. In this regard, multi-trait approaches are more advantageous than studying the traits individually. Multi-trait approaches benefit from increased power, because they consider cross-trait covariances, and from a reduced multiple testing burden, because a single test suffices to test for associations with a set of traits. In Article II, we analyzed various machine learning methods (ridge regression, Naive Bayes/independent univariate correlation, random forests and support vector machines) for reverse regression in multi-trait GWAS, using genotypes, gene expression data and ground-truth transcriptional regulatory networks from the DREAM5 SysGen Challenge and from a cross between two yeast strains to evaluate the methods. In Article III, we extended this approach to a human dataset. An important difference between the data in Article II and Article III is that no ground-truth data are available for the latter. We used genotypes and brain-imaging features extracted from MRIs obtained from the ADNI database. The results from both Article II and Article III showed that genotype prediction performance varied across genetic variants. This helped in identifying genomic regions that are associated with a high number of traits in high-dimensional phenotypic data. We also observed that the feature coefficients of fitted machine learning models correlated with the strength of association between variants and traits. Our results also showed that non-linear machine learning methods such as random forests identified genetic variants distinct from those identified by the linear methods. In particular, we observed in Article III that random forests were able to identify single-nucleotide polymorphisms (SNPs) distinct from the ones identified by ridge and lasso regression. Further analysis showed that the identified SNPs belonged to genes previously associated with brain-related disorders. Doctoral thesis
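The reverse-regression idea described above, predicting a genotype from many traits and using the prediction performance as evidence of association, can be sketched with ridge regression, one of the methods the abstract names. Everything else here is an assumption for illustration: the function names, the held-out correlation used as the score, and the synthetic data.

```python
import numpy as np

def ridge_fit(T, g, alpha=1.0):
    """Closed-form ridge weights for centred traits T and centred genotype g."""
    k = T.shape[1]
    return np.linalg.solve(T.T @ T + alpha * np.eye(k), T.T @ g)

def genotype_prediction_score(traits, genotype, n_train, alpha=1.0):
    """Correlation between predicted and true genotype on held-out samples."""
    T_tr, T_te = traits[:n_train], traits[n_train:]
    g_tr, g_te = genotype[:n_train], genotype[n_train:]
    t_mean, g_mean = T_tr.mean(axis=0), g_tr.mean()
    w = ridge_fit(T_tr - t_mean, g_tr - g_mean, alpha)
    g_pred = (T_te - t_mean) @ w + g_mean
    return np.corrcoef(g_pred, g_te)[0, 1]

# Synthetic data: one variant shifts every trait, the other shifts none.
rng = np.random.default_rng(0)
n, k = 300, 20
g_assoc = rng.binomial(2, 0.3, size=n).astype(float)
g_null = rng.binomial(2, 0.3, size=n).astype(float)
traits = np.outer(g_assoc, np.ones(k)) + rng.standard_normal((n, k))

score_assoc = genotype_prediction_score(traits, g_assoc, n_train=200)
score_null = genotype_prediction_score(traits, g_null, n_train=200)
```

A variant that influences many traits is predictable from them, while an unrelated variant is not, so ranking variants by such a score points to the regions associated with many traits.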