Variable selection and regression analysis for graph-structured covariates with an application to genomics
Graphs and networks are common ways of depicting biological information. In
biology, many different biological processes are represented by graphs, such as
regulatory networks, metabolic pathways and protein--protein interaction
networks. This kind of a priori use of graphs is a useful supplement to the
standard numerical data such as microarray gene expression data. In this paper
we consider the problem of regression analysis and variable selection when the
covariates are linked on a graph. We study a graph-constrained regularization
procedure and its theoretical properties for regression analysis to take into
account the neighborhood information of the variables measured on a graph. This
procedure involves a smoothness penalty on the coefficients that is defined as
a quadratic form of the Laplacian matrix associated with the graph. We
establish estimation and model selection consistency results and provide
estimation bounds for both fixed and diverging numbers of parameters in
regression models. We demonstrate by simulations and a real data set that the
proposed procedure can lead to better variable selection and prediction than
existing methods that ignore the graph information associated with the
covariates.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS332 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
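
The smoothness penalty described in this abstract, a quadratic form of the graph Laplacian applied to the regression coefficients, can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below assumes an unnormalized Laplacian L = D - A and a plain least-squares loss, and it leaves out any sparsity term that variable selection would require. The function names, the toy graph, and the simulated data are all hypothetical; this is not the authors' procedure.

```python
import numpy as np

def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A of an undirected graph (an assumption)."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

def laplacian_penalized_fit(X, y, laplacian, lam):
    """Minimize ||y - X b||^2 + lam * b' L b using its closed-form solution."""
    gram = X.T @ X + lam * laplacian
    return np.linalg.solve(gram, X.T @ y)

# Toy chain graph 0 - 1 - 2 plus an isolated variable 3 (illustrative only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]])
L = graph_laplacian(A)

# Simulated data: the three linked covariates share the same true effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
beta_true = np.array([1.0, 1.0, 1.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=50)

beta_ols = laplacian_penalized_fit(X, y, L, lam=0.0)      # no graph information
beta_smooth = laplacian_penalized_fit(X, y, L, lam=10.0)  # graph-smoothed fit

def roughness(b):
    """Value of the quadratic form b' L b, i.e. squared differences over edges."""
    return float(b @ L @ b)
```

Because b' L b equals the sum of squared coefficient differences over the edges of the graph, a larger `lam` pulls the coefficients of linked covariates toward each other, while the isolated variable is left unpenalized.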
A hierarchical Bayesian model for predicting ecological interactions using scaled evolutionary relationships
Identifying undocumented or potential future interactions among species is a
challenge facing modern ecologists. Recent link prediction methods rely on
trait data; however, large species interaction databases are typically sparse
and covariates are limited to only a fraction of species. On the other hand,
evolutionary relationships, encoded as phylogenetic trees, can act as proxies
for underlying traits and historical patterns of parasite sharing among hosts.
We show that using a network-based conditional model, phylogenetic information
provides strong predictive power in a recently published global database of
host-parasite interactions. By scaling the phylogeny using an evolutionary
model, our method allows for biological interpretation often missing from
latent variable models. To further improve on the phylogeny-only model, we
combine a hierarchical Bayesian latent score framework for bipartite graphs
that accounts for the number of interactions per species with the host
dependence informed by phylogeny. Combining the two information sources yields
significant improvement in predictive accuracy over each of the submodels
alone. As many interaction networks are constructed from presence-only data, we
extend the model by integrating a correction mechanism for missing
interactions, which proves valuable in reducing uncertainty in unobserved
interactions.
Comment: To appear in the Annals of Applied Statistics
Unifying Amplitude and Phase Analysis: A Compositional Data Approach to Functional Multivariate Mixed-Effects Modeling of Mandarin Chinese
Mandarin Chinese is a tonal language; the pitch (or
F0) of its utterances carries considerable linguistic information. However,
speech samples from different individuals are subject to changes in amplitude
and phase which must be accounted for in any analysis which attempts to provide
a linguistically meaningful description of the language. A joint model for
amplitude, phase and duration is presented which combines elements from
Functional Data Analysis, Compositional Data Analysis and Linear Mixed Effects
Models. By decomposing functions via a functional principal component analysis,
and connecting registration functions to compositional data analysis, a joint
multivariate mixed effect model can be formulated which gives insights into the
relationship between the different modes of variation as well as their
dependence on linguistic and non-linguistic covariates. The model is applied to
the COSPRO-1 data set, a comprehensive database of spoken Taiwanese Mandarin,
containing approximately 50 thousand phonetically diverse sample contours
(syllables), and reveals that phonetic information is jointly carried by both
amplitude and phase variation.
Comment: 49 pages, 13 figures, small changes to discussion
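
The functional principal component step mentioned in this abstract can be sketched on discretized curves. The code below is only an illustration under stated assumptions: the curves are already aligned and sampled on a common uniform grid (so quadrature weights are ignored), the data are synthetic, and the function name `fpca` is hypothetical. It does not cover the registration, compositional, or mixed-effects parts of the model described above.

```python
import numpy as np

def fpca(curves, n_components):
    """Discretized FPCA via SVD.

    curves: array of shape (n_curves, n_grid), sampled on a common uniform grid.
    Returns the mean curve, the leading components, the per-curve scores,
    and the fraction of variance explained by each returned component.
    """
    mean_curve = curves.mean(axis=0)
    centered = curves - mean_curve
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]        # orthonormal rows
    scores = centered @ components.T      # projection of each curve
    explained = s**2 / np.sum(s**2)
    return mean_curve, components, scores, explained[:n_components]

# Synthetic contours: two modes of variation plus small noise (illustrative only).
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 100)
a = 2.0 * rng.normal(size=(200, 1))
b = rng.normal(size=(200, 1))
curves = a * np.sin(2 * np.pi * t) + b * np.cos(2 * np.pi * t)
curves += 0.01 * rng.normal(size=curves.shape)

mean_curve, components, scores, explained = fpca(curves, n_components=2)

# Approximate each curve from its two scores.
reconstruction = mean_curve + scores @ components
```

In a joint model of the kind the abstract describes, scores such as these would become the multivariate response of a mixed-effects model; here they only summarize the synthetic curves.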