
    From Relational Data to Graphs: Inferring Significant Links using Generalized Hypergeometric Ensembles

    The inference of network topologies from relational data is an important problem in data analysis. Example applications include the reconstruction of social ties from data on human interactions, the inference of gene co-expression networks from DNA microarray data, or the learning of semantic relationships based on co-occurrences of words in documents. Solving these problems requires techniques to infer significant links in noisy relational data. In this short paper, we propose a new statistical modeling framework to address this challenge. It builds on generalized hypergeometric ensembles, a class of generative stochastic models that give rise to analytically tractable probability spaces of directed, multi-edge graphs. We show how this framework can be used to assess the significance of links in noisy relational data. We illustrate our method in two data sets capturing spatio-temporal proximity relations between actors in a social system. The results show that our analytical framework provides a new approach to infer significant links from relational data, with interesting perspectives for the mining of data on social systems.
    Comment: 10 pages, 8 figures, accepted at SocInfo2016
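    To make the testing idea concrete: in a hypergeometric ensemble, the marginal distribution of the multi-edge count on a single dyad is hypergeometric, so a link can be scored by how surprising its observed count is under the null. The sketch below is a minimal illustration, not the authors' implementation; it assumes a configuration-style null in which dyad (i, j) contributes d_out(i) * d_in(j) slots to the combinatorial edge pool.

```python
# Minimal sketch of a hypergeometric link-significance test (assumption:
# a configuration-style null; scipy's hypergeom gives the dyad marginal).
import numpy as np
from scipy.stats import hypergeom

def link_pvalues(A):
    """A: (n, n) integer matrix of observed directed multi-edge counts."""
    d_out = A.sum(axis=1)        # total edges leaving each node
    d_in = A.sum(axis=0)         # total edges entering each node
    m = int(A.sum())             # number of edges actually drawn
    Xi = np.outer(d_out, d_in)   # "possible edges" per dyad under the null
    M = int(Xi.sum())            # size of the combinatorial edge pool
    # P(X_ij >= A_ij): sf(k) = P(X > k), so evaluate at A_ij - 1
    return hypergeom.sf(A - 1, M, Xi, m)

A = np.array([[0, 9, 1],
              [1, 0, 2],
              [0, 3, 0]])
significant = link_pvalues(A) < 0.01   # links the null cannot explain
```

    Dyads with small p-values carry more interaction events than the null model explains and would be retained as significant links.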

    Evaluating Overfit and Underfit in Models of Network Community Structure

    A common data mining task on networks is community detection, which seeks an unsupervised decomposition of a network into structural groups based on statistical regularities in the network's connectivity. Although many methods exist, the No Free Lunch theorem for community detection implies that each makes some kind of tradeoff, and no algorithm can be optimal on all inputs. Thus, different algorithms will over- or underfit on different inputs, finding more, fewer, or just different communities than is optimal, and evaluation methods that use a metadata partition as ground truth will produce misleading conclusions about general accuracy. Here, we present a broad evaluation of over- and underfitting in community detection, comparing the behavior of 16 state-of-the-art community detection algorithms on a novel and structurally diverse corpus of 406 real-world networks. We find that (i) algorithms vary widely both in the number of communities they find and in their corresponding composition, given the same input, (ii) algorithms can be clustered into distinct high-level groups based on similarities of their outputs on real-world networks, and (iii) these differences induce wide variation in accuracy on link prediction and link description tasks. We introduce a new diagnostic for evaluating overfitting and underfitting in practice, and use it to roughly divide community detection methods into general and specialized learning algorithms. Across methods and inputs, Bayesian techniques based on the stochastic block model and a minimum description length approach to regularization represent the best general learning approach, but can be outperformed under specific circumstances. These results introduce both a theoretically principled approach to evaluate over- and underfitting in models of network community structure and a realistic benchmark by which new methods may be evaluated and compared.
    Comment: 22 pages, 13 figures, 3 tables
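    The link-prediction part of such an evaluation can be paraphrased as: hide a fraction of edges, run a detector on the rest, and check whether hidden edges are ranked above random non-edges. The sketch below is a simplified stand-in for the paper's protocol; the detector, the shared-community score, and the sampled AUC are illustrative assumptions.

```python
# Minimal sketch of a link-prediction diagnostic for a community detector:
# hide edges, detect communities, and estimate AUC by sampling pairs.
import random
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def link_prediction_auc(G, detect, frac=0.2, n_samples=10_000, seed=0):
    rng = random.Random(seed)
    edges = list(G.edges())
    held_out = rng.sample(edges, int(frac * len(edges)))
    G_train = G.copy()
    G_train.remove_edges_from(held_out)      # detector never sees these

    comm_of = detect(G_train)                # dict: node -> community label
    score = lambda u, v: 1.0 if comm_of[u] == comm_of[v] else 0.0

    nodes = list(G)
    hits = 0.0
    for _ in range(n_samples):
        u, v = rng.choice(held_out)          # a true (hidden) edge
        a, b = rng.sample(nodes, 2)          # a random non-edge
        while G.has_edge(a, b):
            a, b = rng.sample(nodes, 2)
        s_true, s_rand = score(u, v), score(a, b)
        hits += 1.0 if s_true > s_rand else 0.5 if s_true == s_rand else 0.0
    return hits / n_samples                  # ~0.5 means chance level

def greedy_detect(G):
    comms = greedy_modularity_communities(G)
    return {v: i for i, c in enumerate(comms) for v in c}

G = nx.karate_club_graph()
print(link_prediction_auc(G, greedy_detect))
```

    An overfitting detector tends to shred communities and score held-out edges poorly, while an underfitting one lumps nodes together and scores random non-edges too highly; the AUC reflects both failure modes.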

    Monte Carlo optimization approach for decentralized estimation networks under communication constraints

    We consider the design of decentralized estimation schemes over bandwidth-limited communication links, with particular interest in the tradeoff between estimation accuracy and the cost of communications due to, e.g., energy consumption. We consider two classes of in-network processing strategies, which yield graph representations, modeling the sensor platforms as vertices and the communication links as edges, together with a tractable Bayesian risk that combines the cost of transmissions with a penalty for estimation errors. This approach captures a broad range of possibilities for “online” processing of observations as well as the constraints imposed, and enables a rigorous design setting in the form of a constrained optimization problem. Similar schemes, as well as the structures exhibited by the solutions to the design problem, have been studied previously in the context of decentralized detection. Under reasonable assumptions, the optimization can be carried out in a message-passing fashion. We adopt this framework for estimation; however, the corresponding optimization schemes involve integral operators that cannot, in general, be evaluated exactly. We develop an approximation framework using Monte Carlo methods and obtain particle representations and approximate computational schemes for both classes of in-network processing strategies and their optimization. The proposed Monte Carlo optimization procedures operate in a scalable and efficient fashion and, owing to their non-parametric nature, can produce results for any distributions provided that samples can be produced from the marginals. In addition, this approach exhibits graceful degradation of the estimation accuracy, asymptotically, as communication becomes more costly, through a parameterized Bayesian risk.
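    To illustrate the flavor of the approach, the toy below reduces the setting to a single sensor with a censoring rule (transmit only when the measurement is informative) and evaluates the Bayesian risk, estimation error plus a priced communication term, by sample averages rather than exact integrals. The Gaussian model, the lambda price, and the threshold search are illustrative assumptions, not the paper's message-passing scheme.

```python
# Minimal sketch: Monte Carlo evaluation of a Bayesian risk that trades
# estimation error against communication cost, for a censoring sensor.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(0.0, 1.0, n)        # state ~ N(0, 1)
y = x + rng.normal(0.0, 0.5, n)    # noisy measurement
lam = 0.3                          # price per transmission

def risk(t):
    send = np.abs(y) > t           # censoring rule: transmit if |y| > t
    # MMSE estimate when y is received, prior mean otherwise:
    # E[x|y] = y * sigma_x^2 / (sigma_x^2 + sigma_v^2) = y / 1.25 here
    xhat = np.where(send, y / 1.25, 0.0)
    return np.mean((x - xhat) ** 2) + lam * np.mean(send)

ts = np.linspace(0.0, 3.0, 61)
best = min(ts, key=risk)           # crude 1-D search over thresholds
print(f"best threshold ~ {best:.2f}, risk ~ {risk(best):.3f}")
```

    Raising lam makes transmission more expensive and pushes the optimal threshold up, which mirrors the graceful degradation of accuracy the abstract describes.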

    Catalog Matching with Astrometric Correction and its Application to the Hubble Legacy Archive

    Object cross-identification in multiple observations is often complicated by uncertainties in their astrometric calibration. Due to the lack of standard reference objects, an image with a small field of view can have significantly larger errors in its absolute positioning than the relative precision of the detected sources within it. We present a new general solution for the relative astrometry that quickly refines the World Coordinate System of overlapping fields. The efficiency comes from the use of infinitesimal 3-D rotations on the celestial sphere, which do not involve trigonometric functions. They also enable an analytic solution to an important step in making the astrometric corrections. In cases with many overlapping images, correctly identifying the detections that match across different images is difficult. We describe a new greedy Bayesian approach for selecting the best object matches across a large number of overlapping images. The methods are developed and demonstrated on the Hubble Legacy Archive, one of the most challenging data sets today. We describe a novel catalog compiled from many Hubble Space Telescope observations, in which the detections are combined into a searchable collection of matches that link the individual detections. The matches provide descriptions of astronomical objects involving multiple wavelengths and epochs. High relative positional accuracy is achieved across the Hubble images, often sub-pixel precision on the order of just a few milliarcseconds. The result is a reliable set of high-quality associations that are publicly available online.
    Comment: 9 pages, 9 figures, accepted for publication in the Astrophysical Journal
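    The trigonometry-free step can be illustrated directly: to first order, a small rotation with rotation vector omega maps a unit vector u to u + omega x u, so aligning matched detections between overlapping fields becomes a linear least-squares problem in omega with an analytic solution. A minimal sketch, with the matched unit vectors and the recovery test as my own toy setup:

```python
# Minimal sketch of the infinitesimal-rotation alignment: fit the small
# rotation omega with v ~ u + omega x u by linear least squares, no trig.
import numpy as np

def skew(u):
    """Cross-product matrix: skew(u) @ w == np.cross(u, w)."""
    return np.array([[0, -u[2], u[1]],
                     [u[2], 0, -u[0]],
                     [-u[1], u[0], 0]])

def fit_infinitesimal_rotation(U, V):
    """U, V: (n, 3) matched unit vectors. Returns rotation vector omega."""
    # omega x u = -skew(u) @ omega, so stack -skew(u_k) into a design matrix
    A = np.vstack([-skew(u) for u in U])   # (3n, 3)
    b = (V - U).ravel()                    # (3n,) positional offsets
    omega, *_ = np.linalg.lstsq(A, b, rcond=None)
    return omega                           # axis * angle, in radians

def apply_rotation(omega, U):
    return U + np.cross(omega, U)          # first-order rotation

# Toy check: recover a 2 milliarcsecond rotation about the z-axis
true = np.array([0.0, 0.0, np.deg2rad(2e-3 / 3600)])
U = np.random.default_rng(1).normal(size=(50, 3))
U /= np.linalg.norm(U, axis=1, keepdims=True)
V = apply_rotation(true, U)
print(fit_infinitesimal_rotation(U, V))    # ~ true
```

    Because the residual is linear in omega, the normal equations give the correction in closed form, which is what makes refining many overlapping fields fast.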
