19 research outputs found

    Extending principal covariates regression for high-dimensional multi-block data

    This dissertation addresses the challenge of deciphering extensive datasets collected from multiple sources, such as health habits and genetic information, in the context of studying complex issues like depression. A data analysis method known as Principal Covariate Regression (PCovR) provides a strong basis for addressing this challenge. Yet, analyzing these intricate datasets is far from straightforward. The data often contain redundant and irrelevant variables, making it difficult to extract meaningful insights. Furthermore, these data may involve different types of outcome variables (for instance, the variable pertaining to depression could manifest as a score from a depression scale or a binary diagnosis (yes/no) from a medical professional), adding another layer of complexity. To overcome these obstacles, novel adaptations of PCovR are proposed in this dissertation. The methods automatically select important variables, categorize insights into those originating from a single source or multiple sources, and accommodate various outcome variable types. The effectiveness of these methods is demonstrated in predicting outcomes and revealing the subtle relationships within data from multiple sources. Moreover, the dissertation offers a glimpse of future directions in enhancing PCovR. Implications of extending the method such that it selects important variables are critically examined. Also, an algorithm that has the potential to yield optimal results is suggested. In conclusion, this dissertation proposes methods to tackle the complexity of large data from multiple sources, and points towards where opportunities may lie in the next line of research

    Modeling Multiple-Subject and Discrete-Valued High-Dimensional Time Series

    This thesis focuses on two separate topics in modeling of high-dimensional time series (HDTS) with several structures and their various applications. The first topic is on modeling HDTS from multiple subjects. Here, the structures of interest include model components that are shared by all subjects and that are individual to subjects or their groups. A running theme in this modeling is the heterogeneity of subjects. Dealing with heterogeneous data has been of particular interest recently in social, health, behavioral, and other sciences. The second topic is on modeling HDTS that are discrete-valued, including binary, categorical, and non-negative count observations. Compared with continuous time series modeling where autoregressive-type models dominate, there are no generally preferred models in the discrete setting. The models considered in this thesis are based on latent Gaussian processes, which drive the dynamics of the observed discrete-valued series. The models have the advantages of allowing negative autocorrelations, and flexible choices of marginal distributions of discrete observations. The thesis consists of four projects, with two on each topic. The first project proposes a stratified Lasso (multi-task learning) formulation for vector autoregressive (VAR) models from multiple subjects. The VAR transition matrices are decomposed additively into the common components shared across all subjects and individual components specific to each subject. An efficient estimation procedure combined with cross-validation for several tuning parameters is designed. The simulation study shows that the approach performs well in the presence of heterogeneity across individual dynamics for the different levels of sparsity. The model is applied to intensive longitudinal data of the emotional states to reveal common and individual temporal dependences of daily emotions across study participants. 
The proposed model enhances interpretability and forecasting performance, which are expected to be beneficial in assessing conflicting evidence from empirical studies and establishing universal explanations of the studied phenomenon. The second project develops integrative dynamic factor models (DFMs) for multiple subjects in several groups. The models have components that allow one to explore the inter-differences across subjects (and groups). At the same time, the intra-differences can be investigated by reconstructing the individual temporal dynamics of different subjects. A flexible identifiability condition on the factor covariance is adopted, which expands the scope of heterogeneity and contributes to better model interpretation and forecasting results. From a methodological standpoint, a novel algorithm that combines non-iterative block segmentation, efficient rank selection, and variants of PCA for multiple subjects, is suggested. Simulations under various scenarios and analysis of resting-state functional MRI data collected from multiple subjects are conducted. The third project concerns latent Gaussian DFMs for count HDTS. The proposed estimation procedure combines the classical PCA, Yule-Walker equations, and link functions, which are pairwise mappings of the second-order properties of the latent and observed time series. The forecasting is carried out through a particle-based sequential Monte Carlo method, which approximates predictions of counts, driven by the latent DFM generated through Kalman recursions. Simulation results reveal that the estimation approach performs similarly to the usual DFMs, and the model provides better forecasting results than the considered benchmarks. The model is applied to item response data from psychology, where the existence of latent factors has been verified but their temporal dependence has not been studied yet. 
The fourth project considers the analogous models for count HDTS but where the latent Gaussian time series follows a sparse VAR. A penalized estimation procedure based on Lasso and its adaptive form is explored for latent Gaussian VAR. An alternative proposed formulation leverages the second-order properties of the latent process directly. Along with the estimation of link functions, we suggest a data-splitting strategy, which can select tuning parameters for penalization. Simulations under various marginal count distributions and patterns of transition matrices are performed. A data example of major depressive disorder in psychiatry is considered to illustrate the modeling approach. Doctor of Philosoph

    Fundamentals

    Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters

    Generative adversarial networks for sequential learning

    Generative modelling aims to learn the data generating mechanism from observations without supervision. It is a desirable and natural approach for learning from unlabelled data, which is easily accessible. Deep generative models refer to a class of generative models combined with the usage of deep learning techniques, taking advantage of the intuitive principles of generative models as well as the expressiveness and flexibility of neural networks. The applications of generative modelling include image, audio, and video synthesis, text summarisation and translation, and so on. The methods developed in this thesis place particular emphasis on domains involving data of sequential nature, such as video generation and prediction, weather forecasting, and dynamic 3D reconstruction. Firstly, we introduce a new adversarial algorithm for training generative models suitable for sequential data. This algorithm is built on the theory of Causal Optimal Transport (COT), which constrains the transport plans to respect the temporal dependencies exhibited in the data. Secondly, the algorithm is extended to learn conditional sequences, that is, how a sequence is likely to evolve given the observation of its past evolution. Meanwhile, we work with the modified empirical measures to guarantee the convergence of the COT distance when the sequences do not overlap at any time step. Thirdly, we show that state-of-the-art results in complex spatio-temporal modelling using GANs can be further improved by leveraging prior knowledge of the spatio-temporal correlation in the domain of weather forecasting. Finally, we demonstrate how deep generative models can be adopted to address a classical statistical problem of conditional independence testing. A class of classic approaches for such a task requires computing a test statistic using samples drawn from two unknown conditional distributions. 
We therefore present a double-GAN framework to learn two generative models that approximate both conditional distributions. The success of this approach sheds light on how certain challenging statistical problems can benefit from the adequate learning results as well as the efficient sampling procedure of deep generative models

    Joint spectral embeddings of random dot product graphs

    Multiplex networks describe a set of entities, with multiple relationships among them, as a collection of networks over a common vertex set. Multiplex networks naturally describe complex systems where units connect across different modalities, whereas single-network data permit only a single relationship type. Joint spectral embedding methods facilitate analysis of multiplex network data by simultaneously mapping vertices in each network to points in Euclidean space, termed node embeddings, where statistical inference is then performed. This mapping is performed by spectrally decomposing a matrix that summarizes the multiplex network. Different methods decompose different matrices and hence yield different node embeddings. This dissertation analyzes a class of joint spectral embedding methods which provides a foundation to compare these different approaches to multiple network inference. We compare joint spectral embedding methods in three ways. First, we extend the Random Dot Product Graph model to multiplex network data and establish the statistical properties of node embeddings produced by each method under this model. This analysis facilitates a full bias-variance analysis of each method and uncovers connections between these methods and methods for dimensionality reduction. Second, we compare the accuracy of algorithms which utilize these different node embeddings in a variety of multiple network inference tasks including community detection, vertex anomaly detection, and graph hypothesis testing. Finally, we perform a time and space complexity analysis of each method and present a case study in which we analyze interactions between New England sports fans on the social news aggregation and discussion website, Reddit. These findings provide a theoretical and practical guide to compare joint spectral embedding techniques and highlight the benefits and drawbacks of utilizing each method in practice
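
The idea of "spectrally decomposing a matrix that summarizes the multiplex network" can be illustrated with a minimal sketch. It assumes the simplest possible summary, the entrywise mean of the layers' adjacency matrices, which is only one illustrative choice and is not claimed to be any of the methods the dissertation analyzes. The function names `spectral_embedding`, `mean_joint_embedding`, and `sample_graph`, and the simulated latent positions, are hypothetical and introduced here for illustration only.

```python
import numpy as np


def spectral_embedding(M, d):
    """Embed the rows of a symmetric matrix M into R^d.

    Uses the d eigenpairs of largest absolute eigenvalue and scales each
    eigenvector by the square root of that absolute eigenvalue.
    """
    eigvals, eigvecs = np.linalg.eigh(M)      # symmetric eigendecomposition
    top = np.argsort(-np.abs(eigvals))[:d]    # indices of the d largest |eigenvalues|
    return eigvecs[:, top] * np.sqrt(np.abs(eigvals[top]))


def mean_joint_embedding(adjacencies, d):
    """Assumed summary matrix: the entrywise mean of all layers."""
    mean_adj = np.mean(np.stack(adjacencies), axis=0)
    return spectral_embedding(mean_adj, d)


def sample_graph(P, rng):
    """Draw a symmetric, hollow 0/1 adjacency matrix with edge probabilities P."""
    upper = np.triu(rng.random(P.shape) < P, k=1).astype(float)
    return upper + upper.T


# Simulated multiplex network: three layers sharing one set of latent positions.
rng = np.random.default_rng(0)
X = rng.uniform(0.2, 0.6, size=(50, 2))   # latent positions, so X @ X.T lies in (0, 1)
P = X @ X.T                               # random dot product edge probabilities
layers = [sample_graph(P, rng) for _ in range(3)]

X_hat = mean_joint_embedding(layers, d=2)
print(X_hat.shape)  # (50, 2)
```

The estimated positions are only identifiable up to an orthogonal transformation, so `X_hat` should be compared with `X` through quantities such as `X_hat @ X_hat.T` rather than entry by entry.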

    Associating Multi-modal Brain Imaging Phenotypes and Genetic Risk Factors via A Dirty Multi-task Learning Method

    Brain imaging genetics, which integrates genetic variations and brain structures or functions to study the genetic basis of brain disorders, is becoming increasingly important in brain science. The multi-modal imaging data collected by different technologies, measuring the same brain distinctly, might carry complementary information. Unfortunately, we do not know the extent to which the phenotypic variance is shared among multiple imaging modalities, which further might trace back to the complex genetic mechanism. In this paper, we propose a novel dirty multi-task sparse canonical correlation analysis (SCCA) to study imaging genetic problems with multi-modal brain imaging quantitative traits (QTs) involved. The proposed method takes advantage of multi-task learning and parameter decomposition. It can not only identify the shared imaging QTs and genetic loci across multiple modalities, but also identify the modality-specific imaging QTs and genetic loci, exhibiting a flexible capability of identifying complex multi-SNP-multi-QT associations. Compared with the state-of-the-art multi-view SCCA and multi-task SCCA, the proposed method shows better or comparable canonical correlation coefficients and canonical weights on both synthetic and real neuroimaging genetic data. In addition, the identified modality-consistent biomarkers, as well as the modality-specific biomarkers, provide meaningful and interesting information, demonstrating that the dirty multi-task SCCA could be a powerful alternative method in multi-modal brain imaging genetics

    Detecting genetic associations with brain imaging phenotypes in Alzheimer’s disease via a novel structured SCCA approach

    Brain imaging genetics has become an important research topic since it can reveal complex associations between genetic factors and the structures or functions of the human brain. Sparse canonical correlation analysis (SCCA) is a popular bi-multivariate association identification method. To mine the complex genetic basis of brain imaging phenotypes, many SCCA methods have arisen with a variety of norms for incorporating different structures of interest. They often use the group lasso, fused lasso, or graph/network guided fused lasso penalties. However, the group lasso methods have limited capability because prior knowledge is often incomplete or unavailable in real applications. The fused lasso and graph/network guided methods are sensitive to the sign of the sample correlation, which may be incorrectly estimated. In this paper, we introduce two new penalties to improve the fused lasso and the graph/network guided lasso penalties in structured sparse learning. We impose both penalties on the SCCA model and propose an optimization algorithm to solve it. The proposed SCCA method has a strong upper bound of grouping effects for both positively and negatively highly correlated variables. We show that, on both synthetic and real neuroimaging genetics data, the proposed SCCA method performs at least as well as the conventional methods using fused lasso or graph/network guided fused lasso. In particular, the proposed method identifies higher canonical correlation coefficients and captures clearer canonical weight patterns, demonstrating its promising capability in revealing biologically meaningful imaging genetic associations