The rapid growth of online network platforms generates large-scale network
data, which poses great challenges for statistical analysis using the spatial
autoregression (SAR) model. In this work, we develop a novel distributed
estimation and statistical inference framework for the SAR model on a
distributed system. We first propose a distributed network least squares
approximation (DNLSA) method. This enables us to obtain a one-step estimator by
taking a weighted average of local estimators on each worker. Afterwards, a
refined two-step estimator is designed to further reduce the estimation bias.
For statistical inference, we utilize a random projection method to reduce the
expensive communication cost. Theoretically, we show the consistency and
asymptotic normality of both the one-step and two-step estimators. In addition,
we provide a theoretical guarantee for the distributed statistical inference
procedure. The theoretical findings and computational advantages are validated
by several numerical simulations implemented on the Spark system. Lastly, an
experiment on the Yelp dataset further illustrates the usefulness of the
proposed methodology.
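
To make the idea of a one-step estimator formed as a weighted average of local estimators concrete, the following is a minimal sketch under simplifying assumptions. It uses ordinary linear regression as a stand-in rather than the SAR model, and it is not the DNLSA procedure itself. The function names `local_fit` and `combine`, the choice of weights (each worker's Gram matrix X_k^T X_k), and the simulated data are all hypothetical and chosen for illustration only.

```python
import numpy as np

def local_fit(X, y):
    """Fit OLS on one worker's shard; return the estimate and its weight matrix.

    Assumption: the weight is the local Gram matrix X^T X, which plays the
    role of a local information matrix in this linear stand-in.
    """
    gram = X.T @ X
    theta = np.linalg.solve(gram, X.T @ y)
    return theta, gram

def combine(local_results):
    """Weighted average of local estimates: (sum_k W_k)^{-1} sum_k W_k theta_k."""
    total_weight = sum(w for _, w in local_results)
    weighted_sum = sum(w @ t for t, w in local_results)
    return np.linalg.solve(total_weight, weighted_sum)

# Simulated data (hypothetical): 3000 observations split across 5 workers.
rng = np.random.default_rng(0)
n, p, n_workers = 3000, 3, 5
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(scale=0.1, size=n)

shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
one_step = combine([local_fit(Xk, yk) for Xk, yk in shards])

# Reference fit on the pooled data, for comparison only.
global_fit, _ = local_fit(X, y)
```

In this linear stand-in, each worker only needs to send a p-vector and a p-by-p matrix to the master, and the weighted average reproduces the pooled least squares fit up to floating-point error. In the SAR setting the observations are coupled through the network, so no such exact equivalence should be expected from this sketch.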