Distributed Logistic Regression for Massive Data with Rare Events
Large-scale rare events data are commonly encountered in practice. To tackle
the massive rare events data, we propose a novel distributed estimation method
for logistic regression in a distributed system. For a distributed framework,
we face the following two challenges. The first challenge is how to distribute
the data. In this regard, two different distribution strategies (i.e., the
RANDOM strategy and the COPY strategy) are investigated. The second challenge
is how to select an appropriate type of objective function so that the best
asymptotic efficiency can be achieved. To this end, the under-sampled (US) and inverse
probability weighted (IPW) types of objective functions are considered. Our
results suggest that the COPY strategy together with the IPW objective function
is the best solution for distributed logistic regression with rare events. The
finite sample performance of the distributed methods is demonstrated by
simulation studies and a real-world Sweden Traffic Sign dataset.
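
The contrast between the US and IPW objective functions can be illustrated with a minimal sketch. It assumes that negatives are subsampled independently with a fixed probability and that the IPW objective reweights each kept negative by the inverse of that probability. The function names, the sampling scheme, and the parameter keep_prob are illustrative assumptions, not the paper's exact formulation, and the distributed RANDOM and COPY strategies are not modeled here.

```python
import math
import random


def undersample_negatives(y, keep_prob, seed=0):
    """Keep every positive case; keep each negative independently with
    probability keep_prob (an assumed sampling scheme)."""
    rng = random.Random(seed)
    return [i for i, yi in enumerate(y) if yi == 1 or rng.random() < keep_prob]


def weighted_neg_loglik(theta, X, y, idx, neg_weight=1.0):
    """Negative logistic log-likelihood over the rows in idx.

    neg_weight = 1.0 gives an under-sampled (US) style objective;
    neg_weight = 1 / keep_prob gives an inverse probability weighted
    (IPW) style objective.
    """
    total = 0.0
    for i in idx:
        z = sum(t * x for t, x in zip(theta, X[i]))
        # Numerically stable log(1 + exp(z)).
        log1pexp = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        if y[i] == 1:
            total += log1pexp - z          # -log sigmoid(z)
        else:
            total += neg_weight * log1pexp  # -log(1 - sigmoid(z)), reweighted
    return total
```

With neg_weight = 1 / keep_prob, the objective on the subsample is an unbiased estimate of the full-data objective, whereas the unweighted US version downweights the negatives and therefore targets a shifted intercept.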
Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis with Limited Computational Resources
Modern statistical analysis often encounters datasets with large sizes. For
these datasets, conventional estimation methods can hardly be used immediately
because practitioners often suffer from limited computational resources. In
most cases, they do not have powerful computational resources (e.g., Hadoop or
Spark). How to practically analyze large datasets with limited computational
resources then becomes a problem of great importance. To solve this problem, we
propose here a novel subsampling-based method with jackknifing. The key idea is
to treat the whole sample data as if they were the population. Then, multiple
subsamples with greatly reduced sizes are obtained by the method of simple
random sampling with replacement. Notably, we do not recommend
sampling methods without replacement because this would incur a significant
cost for data processing on the hard drive. Such cost does not exist if the
data are processed in memory. Because subsampled data have relatively small
sizes, they can be comfortably read into computer memory as a whole and then
processed easily. Based on subsampled datasets, jackknife-debiased estimators
can be obtained for the target parameter. The resulting estimators are
statistically consistent, with an extremely small bias. Finally, the
jackknife-debiased estimators from different subsamples are averaged together
to form the final estimator. We theoretically show that the final estimator is
consistent and asymptotically normal. Its asymptotic statistical efficiency can
be as good as that of the whole sample estimator under very mild conditions.
The proposed method is simple enough to be easily implemented on most practical
computer systems and thus should have very wide applicability.
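
The procedure described above (draw small subsamples with replacement, debias each subsample estimate by the jackknife, then average) can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the target parameter is taken to be the variance, the plug-in variance serves as the biased base estimator, and all function names are hypothetical. The paper's estimators and its hard-drive sampling scheme may differ.

```python
import random
import statistics


def plugin_variance(xs):
    """Biased plug-in variance estimator (divides by n, not n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def jackknife_debias(xs, estimator):
    """Classical jackknife bias correction:
    n * theta_hat - (n - 1) * mean of the leave-one-out estimates."""
    n = len(xs)
    full = estimator(xs)
    leave_one_out = [estimator(xs[:i] + xs[i + 1:]) for i in range(n)]
    return n * full - (n - 1) * sum(leave_one_out) / n


def subsample_jackknife(data, n_sub, k, estimator, seed=0):
    """Average jackknife-debiased estimates over k subsamples of size n_sub,
    each drawn by simple random sampling with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(k):
        sub = rng.choices(data, k=n_sub)  # sampling with replacement
        estimates.append(jackknife_debias(sub, estimator))
    return sum(estimates) / k
```

For this particular base estimator the jackknife correction removes the bias exactly, so each debiased subsample estimate coincides with the usual unbiased sample variance of that subsample.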
CoGANPPIS: Coevolution-enhanced Global Attention Neural Network for Protein-Protein Interaction Site Prediction
Protein-protein interactions are essential in biochemical processes. Accurate
prediction of protein-protein interaction sites (PPIs) deepens our
understanding of biological mechanisms and is crucial for new drug design.
However, conventional experimental methods for PPIs prediction are costly and
time-consuming, so many computational approaches, especially ML-based
methods, have been developed recently. Although these approaches have achieved
promising results, two limitations remain: (1) most models extract some
useful input features but fail to take coevolutionary features into account,
even though these could provide clues about inter-residue relationships;
(2) attention-based models allocate attention weights only to neighboring
residues rather than globally, neglecting that some residues far from the
target residue might also matter.
We propose a coevolution-enhanced global attention neural network, a
sequence-based deep learning model for PPIs prediction, called CoGANPPIS. It
utilizes three layers in parallel for feature extraction: (1) Local-level
representation aggregation layer, which aggregates the neighboring residues'
features; (2) Global-level representation learning layer, which employs a novel
coevolution-enhanced global attention mechanism to allocate attention weights
to all the residues on the same protein sequences; (3) Coevolutionary
information learning layer, which applies CNN & pooling to coevolutionary
information to obtain the coevolutionary profile representation. Then, the
three outputs are concatenated and passed into several fully connected layers
for the final prediction. Experiments on two benchmark datasets demonstrate the
state-of-the-art performance of our model. The source code is publicly
available at https://github.com/Slam1423/CoGANPPIS_source_code
Distributed Estimation and Inference for Spatial Autoregression Model with Large Scale Networks
The rapid growth of online network platforms generates large-scale network
data, which poses great challenges for statistical analysis using the spatial
autoregression (SAR) model. In this work, we develop a novel distributed
estimation and statistical inference framework for the SAR model on a
distributed system. We first propose a distributed network least squares
approximation (DNLSA) method. This enables us to obtain a one-step estimator by
taking a weighted average of local estimators on each worker. Afterwards, a
refined two-step estimation is designed to further reduce the estimation bias.
For statistical inference, we utilize a random projection method to reduce the
expensive communication cost. Theoretically, we show the consistency and
asymptotic normality of both the one-step and two-step estimators. In addition,
we provide a theoretical guarantee for the distributed statistical inference
procedure. The theoretical findings and computational advantages are validated
by several numerical simulations implemented on the Spark system. Lastly, an
experiment on the Yelp dataset further illustrates the usefulness of the
proposed methodology.
Group Network Hawkes Process
In this work, we study the event occurrences of individuals interacting in a
network. To characterize the dynamic interactions among the individuals, we
propose a group network Hawkes process (GNHP) model whose network structure is
observed and fixed. In particular, we introduce a latent group structure among
individuals to account for the heterogeneous user-specific characteristics. A
maximum likelihood approach is proposed to simultaneously cluster individuals
in the network and estimate model parameters. A fast EM algorithm is
subsequently developed by utilizing the branching representation of the
proposed GNHP model. Theoretical properties of the resulting estimators of
group memberships and model parameters are investigated in both settings,
where the number of latent groups is either over-specified or correctly specified.
A data-driven criterion that can consistently identify the true number of latent groups under mild
conditions is derived. Extensive simulation studies and an application to a
data set collected from Sina Weibo are used to illustrate the effectiveness of
the proposed methodology.
Comment: 35 pages
An Asymptotic Analysis of Minibatch-Based Momentum Methods for Linear Regression Models
Momentum methods have been shown to accelerate the convergence of the
standard gradient descent algorithm in practice and theory. In particular, the
minibatch-based gradient descent methods with momentum (MGDM) are widely used
to solve large-scale optimization problems with massive datasets. Despite the
success of the MGDM methods in practice, their theoretical properties are still
underexplored. To this end, we investigate the theoretical properties of MGDM
methods based on the linear regression models. We first study the numerical
convergence properties of the MGDM algorithm and further provide the
theoretically optimal tuning parameter specification to achieve a faster
convergence rate. In addition, we explore the relationship between the
statistical properties of the resulting MGDM estimator and the tuning
parameters. Based on these theoretical findings, we give the conditions for the
resulting estimator to achieve the optimal statistical efficiency. Finally,
extensive numerical experiments are conducted to verify our theoretical
results.
Comment: 45 pages, 5 figures
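
As a concrete point of reference, a minibatch gradient descent update with momentum for a linear regression model can be sketched as below. The sketch assumes the common heavy-ball form of the update; the learning rate, momentum coefficient, and batch size are arbitrary illustrative values, not the theoretically optimal tuning parameters derived in the paper, and the function name is hypothetical.

```python
import random


def mgdm_linear_regression(X, y, lr=0.05, momentum=0.9,
                           batch_size=32, epochs=20, seed=0):
    """Minibatch gradient descent with (heavy-ball) momentum for least squares.

    Update rule assumed here:
        velocity <- momentum * velocity - lr * minibatch_gradient
        theta    <- theta + velocity
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    velocity = [0.0] * p
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)  # new random partition into minibatches each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # Gradient of the average squared loss (1/2)(x'theta - y)^2 on the batch.
            grad = [0.0] * p
            for i in batch:
                resid = sum(X[i][j] * theta[j] for j in range(p)) - y[i]
                for j in range(p):
                    grad[j] += resid * X[i][j] / len(batch)
            for j in range(p):
                velocity[j] = momentum * velocity[j] - lr * grad[j]
                theta[j] += velocity[j]
    return theta
```

With a constant learning rate the iterates do not settle exactly at the least squares solution but fluctuate around it, which is one reason the choice of tuning parameters affects the statistical properties of the resulting estimator.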
Modeling Social Media User Content Generation Using Interpretable Point Process Models
In this article, we study the activity patterns of modern social media users
on platforms such as Twitter and Facebook. To characterize the complex patterns
we observe in users' interactions with social media, we describe a new class of
point process models. The components in the model have straightforward
interpretations and can thus provide meaningful insights into user activity
patterns. A composite likelihood approach and a composite EM estimation
procedure are developed to overcome the challenges that arise in parameter
estimation. Using the proposed method, we analyze Donald Trump's Twitter data
and study if and how his tweeting behavior evolved before, during and after the
presidential campaign. Additionally, we analyze a large-scale social media dataset
from Sina Weibo and identify interesting groups of users with distinct
behaviors; in this analysis, we also discuss the effect of social ties on a
user's online content generating behavior.