On association in regression: the coefficient of determination revisited
Universal coefficients of determination are investigated which quantify the strength of the relation between a vector of dependent variables Y and a vector of independent covariates X. They are defined as measures of dependence between Y and X through theta(x), with theta(x) parameterizing the conditional distribution of Y given X=x. If theta(x) involves unknown coefficients gamma, the definition is conditional on gamma, so in practice gamma, and hence the coefficient of determination, has to be estimated. The proposed quantities generalize R^2 from classical linear regression and are related to other definitions suggested previously. They apply to generalized regression models with arbitrary link functions as well as to multivariate and nonparametric regression. The definition and use of the proposed coefficients of determination are illustrated for several regression problems with simulated and real data sets.
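For concreteness, a deviance-based pseudo-R^2 is one widely used generalization of R^2 to models with non-identity links. The sketch below is illustrative, not the paper's universal coefficient; it computes the share of null deviance explained by a single covariate in a Poisson regression:

```python
# A minimal sketch (not the paper's universal coefficient): a deviance-based
# pseudo-R^2 for a GLM, one common generalization of the classical R^2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.poisson(np.exp(0.3 + 0.7 * x))          # Poisson response, log link

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Deviance-based pseudo-R^2: share of null deviance explained by the covariate.
r2_dev = 1.0 - fit.deviance / fit.null_deviance
print(f"deviance pseudo-R^2: {r2_dev:.3f}")
```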
High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
We consider the problem of high-dimensional Gaussian graphical model
selection. We identify a set of graphs for which an efficient estimation
algorithm exists, and this algorithm is based on thresholding of empirical
conditional covariances. Under a set of transparent conditions, we establish
structural consistency (or sparsistency) for the proposed algorithm, when the
number of samples satisfies n = omega(J_{min}^{-2} log p), where p is the number of
variables and J_{min} is the minimum (absolute) edge potential of the graphical
model. The sufficient conditions for sparsistency are based on the notion of
walk-summability of the model and the presence of sparse local vertex
separators in the underlying graph. We also derive novel non-asymptotic
necessary conditions on the number of samples required for sparsistency.
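The algorithmic core can be sketched as follows. This is an illustrative implementation: the threshold xi and the conditioning-set bound eta are treated as free tuning parameters here, whereas the paper's conditions tie them to n, p, and J_min:

```python
# Sketch of graph selection by thresholding empirical conditional covariances
# over small conditioning sets (illustrative parameters eta, xi).
import numpy as np
from itertools import combinations

def conditional_cov(S_hat, i, j, cond):
    """Empirical covariance of X_i and X_j given the variables in `cond`."""
    if not cond:
        return S_hat[i, j]
    C = list(cond)
    A = S_hat[np.ix_([i], C)]                              # Sigma_{i,S}
    B = np.linalg.solve(S_hat[np.ix_(C, C)], S_hat[np.ix_(C, [j])])
    return S_hat[i, j] - (A @ B)[0, 0]                     # Sigma_{ij|S}

def select_graph(X, eta=2, xi=0.1):
    """Keep edge (i,j) iff the conditional covariance stays above xi for
    every conditioning set of size at most eta (local-separator idea)."""
    n, p = X.shape
    S_hat = np.cov(X, rowvar=False)
    edges = set()
    for i, j in combinations(range(p), 2):
        rest = [k for k in range(p) if k not in (i, j)]
        stats = [abs(conditional_cov(S_hat, i, j, S))
                 for size in range(eta + 1)
                 for S in combinations(rest, size)]
        if min(stats) > xi:
            edges.add((i, j))
    return edges
```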
Breaking the self-averaging properties of spatial galaxy fluctuations in the Sloan Digital Sky Survey - Data Release Six
Statistical analyses of finite sample distributions usually assume that
fluctuations are self-averaging, i.e. that they are statistically similar in
different regions of the given sample volume. By using the scale-length method,
we test whether this assumption is satisfied in several samples of the Sloan
Digital Sky Survey Data Release Six. We find that the probability density
function (PDF) of conditional fluctuations, filtered on large enough spatial
scales (i.e., r>30 Mpc/h), shows relevant systematic variations in different
sub-volumes of the survey. For scales r<30 Mpc/h, instead, the PDF is
statistically stable, and its first moment exhibits scaling behavior with a
negative exponent of about one. Thus, while galaxy structures have
well-defined power-law correlations up to 30 Mpc/h, on larger scales it is not
possible to regard whole-sample average quantities as meaningful and useful
statistical descriptors. This is because galaxy structures correspond to
density fluctuations that are too large in amplitude and too extended in space
to be self-averaging on such scales inside the sample volumes: the galaxy
distribution is inhomogeneous up to the largest scales probed by the SDSS
samples, i.e. r ~ 100 Mpc/h. We show that cosmological corrections, such as
K-corrections and standard evolutionary corrections, do not qualitatively
change these behaviors. Finally, we show that the large-amplitude galaxy
fluctuations observed in the SDSS samples are at odds with the predictions of
the standard LCDM model of structure formation. (Abridged.)
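As an illustration of the underlying test (a minimal sketch, not the authors' scale-length pipeline, and ignoring survey geometry and edge effects), one can compare the PDF of conditional counts-in-spheres between two sub-volumes of a catalogue:

```python
# Sketch of a self-averaging check: do conditional counts-in-spheres have the
# same distribution in different halves of the sample volume?
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import ks_2samp

def counts_in_spheres(points, r):
    """N(<r) around each point, excluding the centre itself."""
    tree = cKDTree(points)
    return np.array(tree.query_ball_point(points, r, return_length=True)) - 1

rng = np.random.default_rng(1)
pts = rng.uniform(0, 200, size=(5000, 3))        # mock catalogue, Mpc/h units

for r in (10.0, 50.0):
    counts = counts_in_spheres(pts, r)
    half = pts[:, 0] < 100.0                     # split the volume in two
    stat, pval = ks_2samp(counts[half], counts[~half])
    print(f"r = {r:5.1f} Mpc/h  KS p-value = {pval:.3f}")
```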
Nonlinear Time Series Modeling: A Unified Perspective, Algorithm, and Application
A new comprehensive approach to nonlinear time series analysis and modeling
is developed in this paper. We introduce novel data-specific,
mid-distribution-based Legendre Polynomial (LP) nonlinear transformations of
the original time series Y(t) that enable us to adapt existing stationary
linear Gaussian time series modeling strategies and make them applicable to
non-Gaussian and nonlinear processes in a robust fashion. The emphasis of the
paper is on empirical time series modeling via the LPTime algorithm. We
demonstrate the effectiveness of our theoretical framework using daily S&P 500
return data between Jan/2/1963 and Dec/31/2009. The proposed LPTime algorithm
systematically discovers, all at once, the `stylized facts' of financial time
series that were previously noted by many researchers one at a time.
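A minimal sketch of the kind of transformation involved (the exact LPTime construction may differ in normalization details): a mid-distribution rank transform of the series, followed by Legendre-polynomial features that can feed a standard linear model:

```python
# Sketch: mid-distribution transform F_mid(y) = F(y) - 0.5*p(y), then
# Legendre-polynomial features on [-1, 1] (illustrative construction).
import numpy as np
from numpy.polynomial import legendre

def mid_distribution(y):
    """Robust rank transform of the series; handles ties via the pmf term."""
    y = np.asarray(y)
    n = len(y)
    F = np.searchsorted(np.sort(y), y, side="right") / n   # empirical CDF
    p = np.array([(y == v).mean() for v in y])             # empirical pmf
    return F - 0.5 * p

def lp_features(y, degree=4):
    """Legendre-polynomial features of the mid-distribution transform,
    mapped to [-1, 1] where the polynomials are orthogonal."""
    u = 2.0 * mid_distribution(y) - 1.0
    feats = [legendre.Legendre.basis(k)(u) for k in range(1, degree + 1)]
    return np.column_stack(feats)

returns = np.random.default_rng(2).standard_t(df=4, size=500)  # heavy tails
Z = lp_features(returns)        # feed into a linear (e.g. AR-type) model
```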
Stochastic Biasing and Galaxy-Mass Density Relation in the Weakly Non-linear Regime
The biasing of galaxies is believed to play an important role in
understanding the large-scale structure of the universe. In general, the
biasing of galaxy formation could be stochastic. Furthermore, future galaxy
surveys may allow us to explore the time evolution of the galaxy distribution.
In this paper, an analytic study of the galaxy-mass density relation and its
time evolution is presented within the framework of stochastic biasing. In
the weakly non-linear regime, we derive a general formula for the galaxy-mass
density relation as a conditional mean using the Edgeworth expansion. The
resulting expression contains the joint moments of the total mass and galaxy
distributions. Using perturbation theory, we investigate the time evolution
of the joint moments and examine the influence of the initial stochasticity on
the galaxy-mass density relation. The analysis shows that the galaxy-mass
density relation is well approximated by the linear relation. Compared
with the skewness of the galaxy distribution, we find that the estimation of
the higher-order moments using the conditional mean can be affected by the
stochasticity. Therefore, the galaxy-mass density relation as a conditional
mean should be used with caution as a tool for estimating the skewness and
the kurtosis.
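To make the leading-order structure explicit: under a nearly Gaussian joint PDF for the galaxy and mass density contrasts, the conditional mean is linear, and the Edgeworth corrections discussed in the paper enter at the next order through the third-order joint moments. The following is a sketch of the form of the result, not the paper's full expression:

```latex
% Sketch: leading-order conditional mean under nearly Gaussian joint
% statistics; the paper's Edgeworth expansion supplies the explicit
% O(delta_m^2) terms from the third-order joint moments.
\begin{equation}
  \langle \delta_g \mid \delta_m \rangle
  = \frac{\langle \delta_g \delta_m \rangle}{\langle \delta_m^2 \rangle}\,
    \delta_m + \mathcal{O}(\delta_m^2)
  = b_{\mathrm{var}}\, r_{gm}\, \delta_m + \mathcal{O}(\delta_m^2),
  \qquad
  b_{\mathrm{var}} \equiv \frac{\sigma_g}{\sigma_m},
  \quad
  r_{gm} \equiv \frac{\langle \delta_g \delta_m \rangle}{\sigma_g \sigma_m}.
\end{equation}
```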
Decentralized learning with budgeted network load using Gaussian copulas and classifier ensembles
We examine a network of learners that address the same classification task
but must learn from different data sets. The learners cannot share data;
instead they share their models, and models are shared only once in order to
limit the network load. We introduce DELCO (Decentralized Ensemble
Learning with COpulas), a new approach for aggregating the predictions of
the classifiers trained by each learner. The proposed method aggregates the
base classifiers using a probabilistic model relying on Gaussian copulas.
Experiments on ensembles of logistic regressors demonstrate competitive
accuracy and increased robustness in the case of dependent classifiers. A
companion Python implementation can be downloaded at https://github.com/john-klein/DELC
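The aggregation idea can be sketched as follows (illustrative only; the exact DELCO model is in the linked repository). Per class, the dependence between the base classifiers' scores is captured by a Gaussian copula fitted on validation data, and a test point is assigned to the class under which its score vector is most plausible:

```python
# Sketch of copula-based ensemble aggregation: per-class Gaussian copula over
# classifier scores, scored by a pseudo-likelihood (not DELCO's exact model).
import numpy as np
from scipy.stats import norm, multivariate_normal

class GaussianCopulaAggregator:
    def fit(self, scores, y):
        """scores: (n_samples, n_classifiers) validation scores; y: labels."""
        self.params = {}
        for c in np.unique(y):
            s = scores[y == c]
            z = norm.ppf(self._ecdf(s, s))       # normal scores per class
            self.params[c] = (s, np.corrcoef(z, rowvar=False))
        return self

    @staticmethod
    def _ecdf(ref, s):
        """Per-column empirical CDF of s w.r.t. a reference sample, in (0,1)."""
        n = ref.shape[0]
        u = np.stack([np.searchsorted(np.sort(ref[:, j]), s[:, j], side="right")
                      for j in range(s.shape[1])], axis=1)
        return np.clip(u / (n + 1.0), 1e-6, 1 - 1e-6)

    def predict(self, scores):
        """Pick the class whose fitted copula gives the score vector the
        highest joint pseudo-likelihood (marginal densities omitted)."""
        logls = []
        for c, (ref, R) in self.params.items():
            z = norm.ppf(self._ecdf(ref, scores))
            mvn = multivariate_normal(mean=np.zeros(R.shape[0]),
                                      cov=R, allow_singular=True)
            logls.append(mvn.logpdf(z))
        classes = list(self.params)
        return np.array(classes)[np.argmax(np.stack(logls, axis=1), axis=1)]
```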
Gaussian process hyper-parameter estimation using parallel asymptotically independent Markov sampling
Gaussian process emulators of computationally expensive computer codes
provide fast statistical approximations to model physical processes. The
training of these surrogates depends on the set of design points chosen to run
the simulator. Due to computational cost, such a training set is bound to be
limited and quantifying the resulting uncertainty in the hyper-parameters of
the emulator by uni-modal distributions is likely to induce bias. In order to
quantify this uncertainty, this paper proposes a computationally efficient
sampler based on an extension of Asymptotically Independent Markov Sampling, a
recently developed algorithm for Bayesian inference. Structural uncertainty of
the emulator is obtained as a by-product of the Bayesian treatment of the
hyper-parameters. Additionally, the user can choose to perform stochastic
optimisation to sample from a neighbourhood of the Maximum a Posteriori
estimate, even in the presence of multimodality. Model uncertainty is also
acknowledged through numerical stabilisation measures by including a nugget
term in the formulation of the probability model. The efficiency of the
proposed sampler is illustrated in examples where multi-modal distributions are
encountered. For the purpose of reproducibility, further development, and use
in other applications the code used to generate the examples is freely
available for download at https://github.com/agarbuno/paims_codes
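The role of the nugget term can be illustrated with a minimal sketch (not the AIMS sampler itself, which is in the linked repository): a GP log-marginal likelihood stabilised by a nugget, explored here with plain random-walk Metropolis over the log hyper-parameters:

```python
# Sketch: nugget-stabilised GP marginal likelihood, sampled with a basic
# Metropolis walk (a stand-in for the paper's AIMS sampler).
import numpy as np

def log_marginal(theta, X, y, nugget=1e-6):
    """theta = (log signal var, log length-scale, log noise var)."""
    sv, ls, nv = np.exp(theta)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = sv * np.exp(-0.5 * d2 / ls**2) + (nv + nugget) * np.eye(len(X))
    L = np.linalg.cholesky(K)                   # nugget keeps this stable
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ a - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2 * np.pi))

def metropolis(X, y, n_steps=2000, step=0.2, seed=3):
    """Random-walk Metropolis with a flat prior on the log hyper-parameters."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(3)
    logp = log_marginal(theta, X, y)
    chain = []
    for _ in range(n_steps):
        prop = theta + step * rng.normal(size=3)
        logp_prop = log_marginal(prop, X, y)
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = prop, logp_prop
        chain.append(theta.copy())
    return np.array(chain)

X = np.linspace(0, 1, 30)[:, None]              # toy 1-D design
y = np.sin(6 * X[:, 0]) + 0.1 * np.random.default_rng(0).normal(size=30)
chain = metropolis(X, y)                        # posterior samples of theta
```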