
    On association in regression: the coefficient of determination revisited

    Universal coefficients of determination are investigated which quantify the strength of the relation between a vector of dependent variables Y and a vector of independent covariates X. They are defined as measures of dependence between Y and X through theta(x), where theta(x) parameterizes the conditional distribution of Y given X = x. If theta(x) involves unknown coefficients gamma, the definition is conditional on gamma, and in practice gamma, and hence the coefficient of determination, must be estimated. The quantities we propose generalize R^2 in classical linear regression and are also related to other definitions previously suggested. Our definitions apply to generalized regression models with arbitrary link functions as well as to multivariate and nonparametric regression. The definition and use of the proposed coefficients of determination are illustrated for several regression problems with simulated and real data sets.
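    As a concrete baseline for what these universal coefficients generalize, the classical R^2 is the fraction of variance of Y explained by a linear fit on X. A minimal sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def r_squared(X, y):
    """Classical coefficient of determination for ordinary least squares:
    R^2 = 1 - SS_res / SS_tot, the fraction of the variance of y
    explained by a linear fit on the covariates X."""
    X = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS coefficients
    residuals = y - X @ beta
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
print(r_squared(x, y))  # close to 1 for a strong linear relation
```

    The generalized coefficients in the paper reduce to this quantity in the classical linear-Gaussian case.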

    High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion

    We consider the problem of high-dimensional Gaussian graphical model selection. We identify a set of graphs for which an efficient estimation algorithm exists; the algorithm is based on thresholding of empirical conditional covariances. Under a set of transparent conditions, we establish structural consistency (or sparsistency) for the proposed algorithm when the number of samples satisfies n = Omega(J_{min}^{-2} log p), where p is the number of variables and J_{min} is the minimum (absolute) edge potential of the graphical model. The sufficient conditions for sparsistency are based on the notion of walk-summability of the model and on the presence of sparse local vertex separators in the underlying graph. We also derive novel non-asymptotic necessary conditions on the number of samples required for sparsistency.
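    The thresholding idea can be sketched as follows: for each pair of variables, search over small conditioning sets, and declare an edge only when the empirical conditional covariance cannot be driven below a threshold. This is an illustrative reading of the approach, not the paper's exact procedure; the parameter names eta (bound on the separator size) and xi (threshold) are ours:

```python
import itertools
import numpy as np

def select_edges(samples, eta=2, xi=0.2):
    """Sketch of conditional-covariance thresholding for graph selection.

    For every pair (i, j), minimize the absolute empirical conditional
    covariance |Sigma_ij - Sigma_iS Sigma_SS^{-1} Sigma_Sj| over
    conditioning sets S of size at most eta; keep the edge (i, j) only if
    that minimum still exceeds the threshold xi."""
    S_hat = np.cov(samples, rowvar=False)
    p = S_hat.shape[0]
    edges = set()
    for i, j in itertools.combinations(range(p), 2):
        rest = [k for k in range(p) if k not in (i, j)]
        best = abs(S_hat[i, j])                      # empty conditioning set
        for size in range(1, eta + 1):
            for S in itertools.combinations(rest, size):
                S = list(S)
                cond = S_hat[i, j] - S_hat[i, S] @ np.linalg.solve(
                    S_hat[np.ix_(S, S)], S_hat[S, j])
                best = min(best, abs(cond))
        if best > xi:
            edges.add((i, j))
    return edges
```

    For a Gaussian chain graph, conditioning on a separator node drives the conditional covariance of non-adjacent pairs to zero, so only the true edges survive the threshold.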

    Breaking the self-averaging properties of spatial galaxy fluctuations in the Sloan Digital Sky Survey - Data Release Six

    Statistical analyses of finite sample distributions usually assume that fluctuations are self-averaging, i.e. that they are statistically similar in different regions of the given sample volume. By using the scale-length method, we test whether this assumption is satisfied in several samples of the Sloan Digital Sky Survey Data Release Six. We find that the probability density function (PDF) of conditional fluctuations, filtered on large enough spatial scales (i.e., r > 30 Mpc/h), shows relevant systematic variations in different sub-volumes of the survey. Instead, for scales r < 30 Mpc/h, the PDF is statistically stable, and its first moment exhibits scaling behavior with a negative exponent close to one. Thus, while up to 30 Mpc/h galaxy structures have well-defined power-law correlations, on larger scales it is not possible to consider whole-sample average quantities as meaningful and useful statistical descriptors. This situation is due to the fact that galaxy structures correspond to density fluctuations which are too large in amplitude and too extended in space to be self-averaging on such large scales inside the sample volumes: the galaxy distribution is inhomogeneous up to the largest scales, i.e. r ~ 100 Mpc/h, probed by the SDSS samples. We show that cosmological corrections, such as K-corrections and standard evolutionary corrections, do not qualitatively change these behaviors. Finally, we show that the large-amplitude galaxy fluctuations observed in the SDSS samples are at odds with the predictions of the standard LCDM model of structure formation. (Abridged version.)
    Comment: 32 pages, 28 figures, accepted for publication in Astronomy and Astrophysics. A higher-resolution version is available at http://pil.phys.uniroma1.it/~sylos/fsl_highlights.html . Version v2 has been corrected to match the published one.
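    The conditional statistics underlying this kind of analysis can be illustrated with a simple counts-in-spheres estimator: average the density seen in spheres of radius r centred on sample points, then fit the power-law exponent of its scaling with r. This is a schematic sketch, not the authors' pipeline; an exponent gamma near 0 indicates homogeneity on the probed scales, while gamma near 1 matches the small-scale behavior quoted above:

```python
import numpy as np

def conditional_density_exponent(points, centers, radii):
    """Average conditional density in spheres of radius r centred on
    sample points, and the fitted exponent gamma of n(<r) ~ r^{-gamma}.
    `centers` should be a subset of `points` well inside the volume so
    that spheres do not cross the sample boundary."""
    dens = []
    for r in radii:
        counts = []
        for c in centers:
            d = np.linalg.norm(points - c, axis=1)
            counts.append(np.sum(d < r) - 1)   # exclude the centre itself
        vol = 4.0 / 3.0 * np.pi * r ** 3
        dens.append(np.mean(counts) / vol)
    slope, _ = np.polyfit(np.log(radii), np.log(dens), 1)
    return -slope
```

    On a statistically homogeneous point set the fitted exponent fluctuates around zero; a persistent positive exponent signals power-law correlations on those scales.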

    Nonlinear Time Series Modeling: A Unified Perspective, Algorithm, and Application

    A new comprehensive approach to nonlinear time series analysis and modeling is developed in the present paper. We introduce novel data-specific, mid-distribution-based, Legendre-polynomial-like (LP) nonlinear transformations of the original time series Y(t) that enable us to adapt the existing stationary linear Gaussian time series modeling strategy and make it applicable to non-Gaussian and nonlinear processes in a robust fashion. The emphasis of the present paper is on empirical time series modeling via the LPTime algorithm. We demonstrate the effectiveness of our theoretical framework using daily S&P 500 return data between Jan/2/1963 and Dec/31/2009. Our proposed LPTime algorithm systematically discovers, automatically and all at once, all the `stylized facts' of the financial time series that were previously noted by many researchers one at a time.
    Comment: Major restructuring has been done.
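    The starting point of LP-type constructions is the mid-distribution transform F_mid(y) = F(y) - 0.5 p(y), evaluated empirically and then standardized; higher-order LP basis functions are built from it. A hedged sketch of that first step (our implementation, not the paper's code):

```python
import numpy as np

def mid_rank_transform(y):
    """Empirical mid-distribution transform F_mid(y) = F(y) - 0.5 * p(y),
    centred and scaled to unit variance.  The point-mass term p(y) makes
    the transform well defined even when the series contains ties."""
    y = np.asarray(y)
    F = np.array([np.mean(y <= v) for v in y])   # empirical CDF at each value
    p = np.array([np.mean(y == v) for v in y])   # empirical point mass
    u = F - 0.5 * p                              # mid-distribution values
    return (u - u.mean()) / u.std()
```

    Because the transform is rank-based, it is monotone in the data and insensitive to heavy tails, which is what lets linear-Gaussian machinery be reused on non-Gaussian series.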

    Stochastic Biasing and Galaxy-Mass Density Relation in the Weakly Non-linear Regime

    It is believed that the biasing of the galaxies plays an important role in understanding the large-scale structure of the universe. In general, the biasing of galaxy formation could be stochastic. Furthermore, future galaxy surveys might allow us to explore the time evolution of the galaxy distribution. In this paper, an analytic study of the galaxy-mass density relation and its time evolution is presented within the framework of stochastic biasing. In the weakly non-linear regime, we derive a general formula for the galaxy-mass density relation as a conditional mean using the Edgeworth expansion. The resulting expression contains the joint moments of the total mass and galaxy distributions. Using perturbation theory, we investigate the time evolution of the joint moments and examine the influence of the initial stochasticity on the galaxy-mass density relation. The analysis shows that the galaxy-mass density relation can be well approximated by the linear relation. Compared with the skewness of the galaxy distribution, we find that the estimation of the higher-order moments using the conditional mean can be affected by the stochasticity. Therefore, the galaxy-mass density relation as a conditional mean should be used with caution as a tool for estimating the skewness and the kurtosis.
    Comment: 22 pages, 7 Encapsulated PostScript figures, aastex. The title and the structure of the paper have been changed; results and conclusions unchanged. Accepted for publication in ApJ.
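    The structure of such a conditional-mean relation can be made concrete at leading order. For jointly weakly non-Gaussian density fields delta_g (galaxies) and delta_m (mass), the Gaussian term of the conditional mean is the familiar linear regression, and the Edgeworth expansion adds corrections built from joint third-order moments (a schematic form, not the paper's exact expression):

```latex
\left\langle \delta_g \,\middle|\, \delta_m \right\rangle
  = \frac{\langle \delta_g \delta_m \rangle}{\langle \delta_m^2 \rangle}\,
    \delta_m
  \;+\; \text{(corrections from joint third-order moments)},
```

    which is why the abstract reports that the relation is well approximated by a linear one, with the stochasticity entering mainly through the higher-order correction terms.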

    Decentralized learning with budgeted network load using Gaussian copulas and classifier ensembles

    We examine a network of learners which address the same classification task but must learn from different data sets. The learners cannot share data but instead share their models. Models are shared only once so as to limit the network load. We introduce DELCO (standing for Decentralized Ensemble Learning with COpulas), a new approach for aggregating the predictions of the classifiers trained by each learner. The proposed method aggregates the base classifiers using a probabilistic model relying on Gaussian copulas. Experiments on logistic regression ensembles demonstrate competitive accuracy and increased robustness in the case of dependent classifiers. A companion Python implementation can be downloaded at https://github.com/john-klein/DELC
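    The dependence model at the core of this approach is the Gaussian copula, whose density couples the base classifiers' output probabilities through a correlation matrix R. A generic sketch of that density (not the full DELCO aggregation rule):

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_density(u, R):
    """Density of a Gaussian copula with correlation matrix R evaluated at
    a vector u of marginal probabilities (e.g. base-classifier scores):
    c(u) = |R|^{-1/2} exp(-0.5 * z^T (R^{-1} - I) z),  z = Phi^{-1}(u)."""
    z = norm.ppf(np.clip(u, 1e-12, 1 - 1e-12))   # probit of each margin
    Rinv = np.linalg.inv(R)
    quad = z @ (Rinv - np.eye(len(R))) @ z
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(R))
```

    With R = I the density is identically 1 (independent classifiers); off-diagonal correlation re-weights joint score configurations, which is what lets the ensemble stay robust when the base classifiers are dependent.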

    Gaussian process hyper-parameter estimation using parallel asymptotically independent Markov sampling

    Gaussian process emulators of computationally expensive computer codes provide fast statistical approximations to model physical processes. The training of these surrogates depends on the set of design points chosen to run the simulator. Due to computational cost, such a training set is bound to be limited, and quantifying the resulting uncertainty in the hyper-parameters of the emulator by uni-modal distributions is likely to induce bias. In order to quantify this uncertainty, this paper proposes a computationally efficient sampler based on an extension of Asymptotically Independent Markov Sampling, a recently developed algorithm for Bayesian inference. Structural uncertainty of the emulator is obtained as a by-product of the Bayesian treatment of the hyper-parameters. Additionally, the user can choose to perform stochastic optimisation to sample from a neighbourhood of the Maximum a Posteriori estimate, even in the presence of multimodality. Model uncertainty is also acknowledged through numerical stabilisation measures by including a nugget term in the formulation of the probability model. The efficiency of the proposed sampler is illustrated in examples where multi-modal distributions are encountered. For the purpose of reproducibility, further development, and use in other applications, the code used to generate the examples is freely available for download at https://github.com/agarbuno/paims_codes
    Comment: Computational Statistics & Data Analysis, Volume 103, November 2016.
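    The target that such a sampler explores is the (often multi-modal) posterior of the hyper-parameters, whose central ingredient is the GP log marginal likelihood with the nugget term included. A hedged sketch under an illustrative parameterisation theta = (lengthscale, signal variance, nugget), not the paper's exact one:

```python
import numpy as np

def log_marginal_likelihood(theta, X, y):
    """Log marginal likelihood of a zero-mean GP with a squared-exponential
    kernel plus a nugget term.  The nugget both encodes model uncertainty
    and numerically stabilises the Cholesky factorisation."""
    ell, sf2, nugget = theta
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = sf2 * np.exp(-0.5 * d2 / ell ** 2) + nugget * np.eye(len(y))
    L = np.linalg.cholesky(K)                    # nugget keeps K well-conditioned
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))         # -0.5 * log|K|
            - 0.5 * len(y) * np.log(2 * np.pi))
```

    A sampler such as the proposed extension of Asymptotically Independent Markov Sampling would evaluate this quantity (plus a prior) at each proposed theta, so that all modes of the hyper-parameter posterior contribute to the emulator's uncertainty.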