
    Advanced Analysis Methods for Large-Scale Structured Data

    In the era of ‘big data’, advanced storage and computing technologies allow people to build and process large-scale datasets, promoting the development of many fields such as speech recognition, natural language processing and computer vision. Traditional approaches cannot handle the heterogeneity and complexity of some novel data structures. In this dissertation, we explore how to combine different tools to develop new methodologies for analyzing certain kinds of structured data, motivated by real-world problems. Studies with a multi-group design, such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI), recruit subjects based on their multi-class primary disease status, while extensive secondary outcomes are also collected. Analysis by standard approaches is usually distorted by the unequal sampling rates of the different classes. In the first part of the dissertation, we develop a general regression framework for the analysis of secondary phenotypes collected in multi-group association studies. Our regression framework is built on a conditional model for the secondary outcome given the multi-group status and covariates, and on its relationship with the population regression of interest, that of the secondary outcome given the covariates. We then develop generalized estimating equations to estimate the parameters of interest. We use simulations and a large-scale imaging genetics analysis of the ADNI data to evaluate the effect of the multi-group sampling scheme on standard genome-wide association analyses based on linear regression methods, and compare these with our statistical methods, which appropriately adjust for the multi-group sampling scheme. In the past few decades, network data has been increasingly collected and studied in diverse areas, including neuroimaging, social networks and knowledge graphs.
In the second part of the dissertation, we investigate the graph-based semi-supervised learning problem with nonignorable nonresponse. We propose a Graph-based joint model with Nonignorable Missingness (GNM) and develop an imputation and inverse probability weighting estimation approach. We further use graph neural networks (GNN) to model nonlinear link functions and then use a gradient descent (GD) algorithm to estimate all the parameters of GNM. We propose a novel identifiability result for the GNM model with neural network structures, and validate its predictive performance in both simulations and real data analysis by comparing it with models that ignore or misspecify the missingness mechanism. Our method achieves up to a 7.5% improvement over the baseline model for the document classification task on the Cora dataset. Prediction of origin-destination (OD) flow data is an important instrument in transportation studies. However, most existing methods ignore the network structure of OD flow data. In the last part of the dissertation, we propose a spatial-temporal origin-destination (STOD) model, with a novel CNN filter to learn the spatial features from the perspective of graphs and an attention mechanism to capture the long-term periodicity. Experiments on a real customer request dataset with available OD information from a ride-sharing platform demonstrate the advantage of STOD in achieving more accurate and stable prediction performance than some state-of-the-art methods.

    Doctor of Philosophy