Statistical Tools for Network Data: Prediction and Resampling
Advances in data collection and social media have made
network data increasingly common in diverse areas, such
as the social sciences, the internet, transportation, and biology.
This thesis develops new principled statistical tools for network analysis,
with emphasis on both appealing statistical properties and
computational efficiency.
Our first project focuses on building prediction models for
network-linked data. Prediction algorithms typically assume the
training data are independent samples, but in many modern applications
samples come from individuals connected by a network. For example, in
adolescent health studies of risk-taking behaviors, information on the
subjects' social network is often available and plays an important
role through network cohesion, the empirically observed phenomenon of
friends behaving similarly. Taking cohesion into account in
prediction models should allow us to improve their performance. We propose a network-based penalty on individual node effects to encourage similarity between predictions for linked nodes, and show that incorporating it into prediction leads to improvement over
traditional models both theoretically and empirically when network
cohesion is present. The penalty can be used with many loss-based
prediction methods, such as regression, generalized linear models, and
Cox's proportional hazards model. Applications to predicting levels of
recreational activity and marijuana usage among teenagers from the
AddHealth study based on both demographic covariates and friendship
networks are discussed in detail. Our approach to taking
friendships into account can significantly improve predictions of
behavior while providing interpretable estimates of covariate effects.
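The cohesion penalty described above can be made concrete for the regression case. The sketch below is an illustrative reconstruction under assumed notation (per-node effects alpha, covariate effects beta, and a graph-Laplacian penalty coupling linked nodes), not the thesis's actual implementation:

```python
import numpy as np

def fit_network_cohesion_regression(X, y, edges, lam=1.0):
    """Least squares with per-node effects alpha, encouraged to be similar for
    linked nodes:
      minimize ||y - X beta - alpha||^2 + lam * sum_{(i,j) in E} (alpha_i - alpha_j)^2
    Illustrative sketch only; function and variable names are assumptions.
    """
    n, p = X.shape
    # Graph Laplacian L = D - A; the cohesion penalty equals lam * alpha' L alpha.
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0; L[j, j] += 1.0
        L[i, j] -= 1.0; L[j, i] -= 1.0
    # Stack theta = (alpha, beta) and solve the penalized normal equations.
    M = np.hstack([np.eye(n), X])
    P = np.zeros((n + p, n + p))
    P[:n, :n] = lam * L
    theta = np.linalg.solve(M.T @ M + P, M.T @ y)
    return theta[:n], theta[n:]  # node effects alpha, covariate effects beta
```

With a larger penalty weight, the fitted node effects vary smoothly over the network, which is how information from friends improves each individual's prediction.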
Resampling, data splitting, and cross-validation are powerful general strategies in statistical inference, but resampling from a network remains
a challenging problem. Many statistical models and methods for networks require model selection and parameter tuning, which could be done by cross-validation if we had a good way to split network data; however, splitting
network nodes into groups requires deleting edges and destroys some of
the structure. Here we propose a new network cross-validation
strategy based on splitting edges rather than nodes, which avoids
losing information and is applicable to a wide range of network
models. We provide a theoretical justification for our method in a
general setting and demonstrate how our method can be used in a
number of specific model selection and parameter tuning tasks, with extensive
numerical results on simulated networks. We also apply the method to analysis of a citation
network of statisticians and obtain meaningful research communities.
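The edge-splitting idea can be illustrated with a toy version for choosing the rank of a low-rank network model. This is a simplified sketch under assumptions of my own (mean-imputation of held-out pairs, squared-error loss), not the exact cross-validation procedure from the thesis:

```python
import numpy as np

def edge_cv_rank(A, ranks, n_folds=3, seed=0):
    """Cross-validate by splitting node *pairs* (edges and non-edges) rather
    than nodes: hold out a fold of pairs, refit a low-rank approximation on
    the rest, and score the held-out pairs. Simplified sketch -- held-out
    entries are mean-imputed here rather than handled by matrix completion.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    rng.shuffle(pairs)
    errors = {r: 0.0 for r in ranks}
    for k in range(n_folds):
        mask = np.zeros((n, n), dtype=bool)
        for i, j in pairs[k::n_folds]:
            mask[i, j] = True
        A_train = A.astype(float)
        A_train[mask] = A_train[~mask].mean()  # crude imputation of held-out pairs
        U, s, Vt = np.linalg.svd(A_train)
        for r in ranks:
            A_hat = (U[:, :r] * s[:r]) @ Vt[:r]  # rank-r approximation
            errors[r] += float(((A_hat - A)[mask] ** 2).sum())
    best = min(ranks, key=lambda r: errors[r])
    return best, errors
```

For a two-block stochastic block model, ranks below the number of blocks miss the community structure and incur visibly higher held-out error, which is the signal the cross-validation exploits.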
Finally, we consider the problem of community detection on partially
observed networks. In practice, network data are often collected through sampling
mechanisms, such as survey questionnaires, instead of direct
observation. The noise and bias introduced by such sampling mechanisms can obscure the community structure and invalidate the assumptions of standard community detection
methods. We propose a model to
incorporate neighborhood sampling, through a model reflective of survey designs, into community detection for directed networks, since friendship networks obtained from surveys are naturally directed. We model the edge sampling probabilities as a function of both individual preferences and community parameters, and fit the model by a combination of spectral clustering and the method of
moments. The algorithm is computationally efficient and comes with a theoretical guarantee of consistency. We evaluate the proposed
model in extensive simulation studies and apply it to a
faculty hiring dataset, discovering a meaningful hierarchy of communities among US business schools.
PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/145894/1/tianxili_1.pd
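The spectral clustering step for a directed network can be sketched as follows. This uses a standard SVD-based embedding (left and right singular vectors capture out-link and in-link profiles) with a minimal k-means; the method-of-moments fit of the sampling probabilities is omitted, and all names here are illustrative assumptions rather than the thesis's code:

```python
import numpy as np

def directed_spectral_clustering(A, k, n_iter=50):
    """Cluster nodes of a directed network: embed each node by the top-k left
    and right singular vectors of the adjacency matrix (out- and in-link
    profiles), then run a small Lloyd's k-means with farthest-point seeding.
    Standard spectral step only; the edge-sampling model is not fitted here.
    """
    U, s, Vt = np.linalg.svd(A.astype(float))
    emb = np.hstack([U[:, :k] * s[:k], Vt[:k].T * s[:k]])
    # Farthest-point initialization keeps this toy k-means deterministic.
    centers = [emb[0]]
    for _ in range(1, k):
        d = np.min([((emb - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(emb[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = emb[labels == c].mean(0)
    return labels
```

Using both left and right singular vectors matters for directed graphs: a node's sending and receiving patterns can differ, and concatenating the two embeddings lets the clustering use both.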
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
- …