
    How Many Communities Are There?

    Stochastic blockmodels and variants thereof are among the most widely used approaches to community detection for social networks and relational data. A stochastic blockmodel partitions the nodes of a network into disjoint sets, called communities. The approach is inherently related to clustering with mixture models, and it raises a similar model selection problem for the number of communities. The Bayesian information criterion (BIC) is a popular solution; for stochastic blockmodels, however, the assumption that edges are conditionally independent given the community memberships of their endpoints is usually violated in practice. In this regard, we propose composite likelihood BIC (CL-BIC) to select the number of communities, and we show it is robust against possible misspecifications in the underlying stochastic blockmodel assumptions. We derive the requisite methodology and illustrate the approach using both simulated and real data. Supplementary materials containing the relevant computer code are available online. Comment: 26 pages, 3 figures
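    To make the model-selection step concrete, the following is a minimal sketch, not the paper's CL-BIC implementation: hard community labels are estimated by spectral clustering (an assumed choice), and each candidate number of communities K is scored with an ordinary BIC-style penalised Bernoulli profile log-likelihood. All function names and tuning constants are illustrative.

        import numpy as np
        from sklearn.cluster import KMeans

        def sbm_log_likelihood(A, labels, K):
            # Profile Bernoulli log-likelihood of an undirected adjacency matrix A
            # given hard labels; block probabilities are replaced by their MLEs.
            ll = 0.0
            for a in range(K):
                for b in range(K):
                    ia = np.where(labels == a)[0]
                    ib = np.where(labels == b)[0]
                    pairs = ia.size * ib.size if a != b else ia.size * (ia.size - 1)
                    if pairs == 0:
                        continue
                    edges = A[np.ix_(ia, ib)].sum()      # diagonal of A assumed zero
                    p = np.clip(edges / pairs, 1e-10, 1 - 1e-10)
                    ll += edges * np.log(p) + (pairs - edges) * np.log(1 - p)
            return ll / 2.0                              # each unordered pair counted twice

        def choose_num_communities(A, K_max=6):
            # BIC-style selection: for each K, cluster the leading eigenvectors of A
            # and keep the K that maximises the penalised profile log-likelihood.
            n = A.shape[0]
            _, vecs = np.linalg.eigh(A)
            best_K, best_score = 1, -np.inf
            for K in range(1, K_max + 1):
                labels = KMeans(n_clusters=K, n_init=10).fit_predict(vecs[:, -K:])
                ll = sbm_log_likelihood(A, labels, K)
                n_params = K * (K + 1) / 2               # block probabilities
                score = ll - 0.5 * n_params * np.log(n * (n - 1) / 2)
                if score > best_score:
                    best_K, best_score = K, score
            return best_K

        # toy usage: a planted two-block network
        rng = np.random.default_rng(0)
        n = 120
        z = np.repeat([0, 1], n // 2)
        P = np.where(z[:, None] == z[None, :], 0.25, 0.05)
        A = np.triu(rng.random((n, n)) < P, k=1).astype(int)
        A = A + A.T
        print(choose_num_communities(A))                 # expected to pick K = 2

    The paper's CL-BIC modifies a criterion of this kind so that it remains valid when the edge-independence assumption fails; the sketch only shows where such a score enters the selection loop.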

    A Method for Avoiding Bias from Feature Selection with Application to Naive Bayes Classification Models

    For many classification and regression problems, a large number of features are available for possible use; this is typical of DNA microarray data on gene expression, for example. Often, for computational or other reasons, only a small subset of these features is selected for use in a model, based on some simple measure such as correlation with the response variable. This procedure may introduce an optimistic bias, however, in which the response variable appears to be more predictable than it actually is, because the high correlation of the selected features with the response may be partly or wholly due to chance. We show how this bias can be avoided when using a Bayesian model for the joint distribution of features and response. The crucial insight is that even if we forget the exact values of the unselected features, we should retain, and condition on, the knowledge that their correlation with the response was too small for them to be selected. In this paper we describe how this idea can be implemented for "naive Bayes" models of binary data. Experiments with simulated data confirm that this method avoids bias due to feature selection. We also apply the naive Bayes model to subsets of data relating gene expression to colon cancer, and find that correcting for bias from feature selection does improve predictive performance.
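    The optimistic bias described above is easy to reproduce. The toy experiment below uses an assumed setup with pure-noise binary features, not the paper's colon cancer data: it selects the features most correlated with the response before fitting a naive Bayes model, and the apparent accuracy exceeds the true achievable value of 0.5. The paper's Bayesian correction, which conditions on the discarded features having been weakly correlated, is not implemented here.

        import numpy as np
        from sklearn.naive_bayes import BernoulliNB
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        n, p, k = 100, 1000, 10
        X = rng.integers(0, 2, size=(n, p))     # binary features, all noise
        y = rng.integers(0, 2, size=n)          # response independent of X

        # naive selection: keep the k features most correlated with y
        corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
        top = np.argsort(corr)[-k:]

        # cross-validation run after the selection step still looks optimistic,
        # because the selection already used all of the labels
        biased = cross_val_score(BernoulliNB(), X[:, top], y, cv=5).mean()
        print(f"apparent accuracy on pure noise: {biased:.2f} (true value ~0.5)")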

    Assessing distances and consistency of kinematics in Gaia/TGAS

    We apply the statistical methods of Schoenrich, Binney & Asplund to assess the quality of distances and kinematics in the RAVE-TGAS and LAMOST-TGAS samples of Solar neighbourhood stars. These methods yield a nominal distance accuracy of 1-2%. Unlike common tests of parallax accuracy, they directly test distance estimates, including the effects of distance priors. We show how to construct these priors, including the survey selection functions (SSFs), directly from the data. We demonstrate that neglecting the SSFs causes severe distance biases. Because the SSFs decline with distance, the simple 1/parallax estimate only mildly underestimates distances. We test the accuracy of measured line-of-sight velocities (v_los) by binning the samples in the nominal v_los uncertainties. We find: a) the LAMOST v_los have a ~ -5 km/s offset; b) the average LAMOST measurement error for v_los is ~7 km/s, significantly smaller than, and nearly uncorrelated with, the nominal LAMOST estimates. The RAVE sample shows either a moderate distance underestimate or an unaccounted source of v_los dispersion (e_v) from measurement errors and binary stars. For a subsample of suspected binary stars in RAVE, our methods indicate significant distance underestimates. Separating a sample by metallicity or kinematics to select thick-disc/halo stars discriminates between distance bias and e_v. For LAMOST, this separation yields consistency with pure v_los measurement errors. We find an anomaly near longitude l ~ (300+/-60) deg and distance s ~ (0.32+/-0.03) kpc on both sides of the Galactic plane, which could be explained by either a localised distance error or a breathing mode. Comment: 21 pages, 14 figures; accepted by MNRAS; now also includes comparison to Astraatmadja & Bailer-Jones distances
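    As a toy illustration of why distance priors matter (assumed numbers, and a simple constant-space-density prior rather than the SSF-based priors constructed in the paper), the sketch below compares the naive 1/parallax estimate with the mode of a prior-weighted distance posterior:

        import numpy as np

        plx, sigma = 2.0, 0.5             # parallax and its uncertainty [mas]
        s = np.linspace(0.05, 3.0, 5000)  # distance grid [kpc]

        likelihood = np.exp(-0.5 * ((plx - 1.0 / s) / sigma) ** 2)
        prior = s ** 2                    # constant space density (toy prior)

        naive = 1.0 / plx
        post_mode = s[np.argmax(likelihood * prior)]
        print(f"1/parallax = {naive:.2f} kpc, posterior mode = {post_mode:.2f} kpc")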

    Real-valued feature selection for process approximation and prediction

    The selection of features for classification, clustering and approximation is an important task in pattern recognition, data mining and soft computing. For real-valued features, this contribution shows how feature selection for a large number of features can be implemented using mutual information. In particular, the common problem in mutual information computation of estimating joint probabilities over many dimensions from only a few samples is addressed by using the Rényi mutual information of order two as the computational basis. For this, the Grassberger-Takens correlation integral is used, which was developed for estimating probability densities in chaos theory. Additionally, an adaptive procedure for computing the hypercube size is introduced, and for real-world applications the treatment of missing values is included. The computation is accelerated by exploiting the ranking of the set of real feature values, especially for time series. As an example, a small blackbox-glassbox problem shows how the relevant features and their time lags are determined in a time series even when the input feature time series determine the output nonlinearly. A more realistic example from the chemical industry shows that this enables a better approximation of the input-output mapping than the best neural network approach developed for an international contest. Thanks to this computationally efficient implementation, mutual information becomes an attractive tool for feature selection even for a large number of real-valued features.
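    A heavily simplified sketch of the idea follows, assuming a fixed hypercube size rather than the paper's adaptive procedure, and a plug-in Rényi order-two mutual information built from the correlation integral; the variable names and toy data are illustrative.

        import numpy as np
        from scipy.spatial.distance import pdist

        def renyi2_entropy(Z, eps):
            # -log of the correlation integral C(eps) under the Chebyshev (max) norm,
            # i.e. the fraction of sample pairs falling in the same eps-hypercube.
            d = pdist(Z, metric="chebyshev")
            c = np.mean(d < eps)
            return -np.log(max(c, 1e-12))

        def renyi2_mi(x, y, eps=0.2):
            # plug-in Rényi order-2 mutual information: H2(X) + H2(Y) - H2(X, Y)
            x = (x - x.mean()) / x.std()
            y = (y - y.mean()) / y.std()
            hx = renyi2_entropy(x[:, None], eps)
            hy = renyi2_entropy(y[:, None], eps)
            hxy = renyi2_entropy(np.column_stack([x, y]), eps)
            return hx + hy - hxy

        # rank candidate features (columns of X) by estimated MI with the output y
        rng = np.random.default_rng(0)
        X = rng.normal(size=(300, 5))
        y = np.sin(X[:, 2]) + 0.1 * rng.normal(size=300)   # only feature 2 is relevant
        scores = [renyi2_mi(X[:, j], y) for j in range(X.shape[1])]
        print(np.argsort(scores)[::-1])                    # feature 2 should rank first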

    k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)

    Perhaps the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbour Classifier -- classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because poor run-time performance is not such a problem these days, given the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification, focusing on mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours, and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods. Comment: 22 pages, 15 figures: an updated edition of an older tutorial on kNN
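    The core idea fits in a few lines. The toy sketch below (not the Python code from the paper's Appendix) classifies a query by majority vote among its k nearest neighbours under Euclidean distance:

        import numpy as np
        from collections import Counter

        def knn_predict(X_train, y_train, x_query, k=3):
            d = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
            nearest = np.argsort(d)[:k]                     # indices of k closest points
            return Counter(y_train[nearest]).most_common(1)[0][0]

        X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
        y = np.array([0, 0, 1, 1])
        print(knn_predict(X, y, np.array([0.95, 0.9])))     # -> 1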