1,763 research outputs found

    Spectral embedding of weighted graphs

    This paper concerns the statistical analysis of a weighted graph through spectral embedding. Under a latent position model in which the expected adjacency matrix has low rank, we prove uniform consistency and a central limit theorem for the embedded nodes, treated as latent position estimates. In the special case of a weighted stochastic block model, this result implies that the embedding follows a Gaussian mixture model with each component representing a community. We exploit this to formally evaluate different weight representations of the graph using Chernoff information. For example, in a network anomaly detection problem where we observe a p-value on each edge, we recommend against directly embedding the matrix of p-values and instead recommend using threshold or log p-values, depending on network sparsity and signal strength. Comment: 29 pages, 8 figures
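    The two weight representations named in this abstract can be illustrated with a minimal sketch. It is not the paper's code: the function names, the cutoff `alpha=0.05`, the negative sign on the logarithm, and the toy matrix are all assumptions made for illustration. The transformed matrix is what would then be passed to a spectral embedding in place of the raw p-values.

```python
import math

def threshold_weights(pvals, alpha=0.05):
    """Binary representation: 1.0 where the edge p-value is at most alpha, else 0.0.

    alpha is a hypothetical cutoff, not a value taken from the paper.
    """
    return [[1.0 if p <= alpha else 0.0 for p in row] for row in pvals]

def neg_log_weights(pvals, floor=1e-300):
    """Log representation: -log(p), so smaller p-values give larger weights.

    The sign convention and the floor (to avoid log(0)) are assumptions.
    """
    return [[-math.log(max(p, floor)) for p in row] for row in pvals]

# Toy symmetric matrix of edge p-values; p = 1.0 means "no evidence".
pvals = [
    [1.00, 0.01, 0.80],
    [0.01, 1.00, 0.30],
    [0.80, 0.30, 1.00],
]

thresholded = threshold_weights(pvals)
logged = neg_log_weights(pvals)
```

    Thresholding discards the magnitude of each p-value, while the log transform keeps it; according to the abstract, which of the two is preferable depends on network sparsity and signal strength.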

    SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets

    This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets. Unlike the well-known traditional algorithms, which assume that all the data is memory-resident, our proposed method is scalable and processes the input data chunk by chunk within the confines of a limited memory buffer. A temporary clustering model is built in the first phase; it is then gradually updated by analyzing consecutive memory loads of points. At the end of this scalable clustering, the approximate structure of the original clusters is obtained. Finally, by another scan of the entire dataset and using a suitable criterion, an outlier score called SDCOR (Scalable Density-based Clustering Outlierness Ratio) is assigned to each object. Evaluations on real-life and synthetic datasets demonstrate that the proposed method has a low linear time complexity and is more effective and efficient than the best-known conventional density-based methods, which need to load all data into memory, and than some fast distance-based methods, which can operate on disk-resident data. Comment: Highlights are each shortened to about 85 characters
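    The two-scan, chunk-by-chunk pattern described in this abstract can be sketched in a few lines. This is not SDCOR itself: a single running mean and standard deviation over one-dimensional data stands in for the density-based clustering model, and the class name, chunk size, and score are hypothetical choices made only to show the control flow (build and update a summary from bounded chunks, then rescan to score every object).

```python
import math

class RunningSummary:
    """Count, mean, and sum of squared deviations, updated one chunk at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, chunk):
        # Welford's online update: only the current chunk is held in memory.
        for x in chunk:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / self.n) if self.n > 0 else 0.0

def chunks(data, size):
    """Yield consecutive slices of at most `size` items (stand-in for disk reads)."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

def outlier_scores(data, chunk_size=3):
    summary = RunningSummary()
    for chunk in chunks(data, chunk_size):   # scan 1: build the summary model
        summary.update(chunk)
    sd = summary.std() or 1.0                # avoid dividing by zero
    scores = []
    for chunk in chunks(data, chunk_size):   # scan 2: score every object
        scores.extend(abs(x - summary.mean) / sd for x in chunk)
    return scores

data = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 9.0]
scores = outlier_scores(data)
```

    In this toy run the last point receives the largest score. A faithful implementation would replace the single summary with per-cluster models and a density-based criterion, as the abstract describes.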

    A temporal analysis system for early detection of health changes

    To make it possible for elders to live independently at home and yet get help from health care providers when small changes in health conditions take place, smart home technologies are being developed to enhance safety and monitor health conditions via noninvasive sensors and other devices. To better analyze the wealth of activity information from various kinds of sensors and to locate trends that correspond to states of wellbeing, this thesis proposes a new system that builds adaptive models for detecting health changes based on temporal analysis, including outlier detection, customization, and adaptation to new changes. Our hope is that by using more sophisticated temporal analysis methods we can produce more predictive and more customized alerts that help us detect meaningful health changes before they become big problems. Since we do not currently have full access to all the embedded sensor data from TigerPlace, the system is tested using synthetic datasets that simulate gradual changes, sudden changes, changes of baseline health condition, and system noise that might occur in real-world data. In experiments on the synthetic datasets, the system is shown to adapt to gradual changes, find anomalies, and spawn a new component for the GMM when a new normal pattern emerges. The system achieves our goals when tested on the synthetic datasets over an extended period of time. We hope that using the system in TigerPlace will help detect health changes before real health issues occur.

    A non-parametric Bayesian model for joint cell clustering and cluster matching: identification of anomalous sample phenotypes with random effects

    BACKGROUND: Flow cytometry (FC)-based computer-aided diagnostics is an emerging technique utilizing modern multiparametric cytometry systems. The major difficulty in using machine-learning approaches for classification of FC data arises from limited access to a wide variety of anomalous samples for training. In consequence, any learning with an abundance of normal cases and a limited set of specific anomalous cases is biased towards the types of anomalies represented in the training set. Such models do not accurately identify anomalies, whether previously known or unknown, that may exist in future samples tested. Although one-class classifiers trained using only normal cases would avoid such a bias, robust sample characterization is critical for a generalizable model. Owing to sample heterogeneity and instrumental variability, arbitrary characterization of samples usually introduces feature noise that may lead to poor predictive performance. Herein, we present a non-parametric Bayesian algorithm called ASPIRE (anomalous sample phenotype identification with random effects) that identifies phenotypic differences across a batch of samples in the presence of random effects. Our approach involves simultaneous clustering of cellular measurements in individual samples and matching of discovered clusters across all samples in order to recover global clusters using probabilistic sampling techniques in a systematic way. RESULTS: We demonstrate the performance of the proposed method in identifying anomalous samples in two different FC data sets, one of which represents a set of samples including acute myeloid leukemia (AML) cases, and the other a generic 5-parameter peripheral-blood immunophenotyping. Results are evaluated in terms of the area under the receiver operating characteristic curve (AUC). ASPIRE achieved AUCs of 0.99 and 1.0 on the AML and generic blood immunophenotyping data sets, respectively.
    CONCLUSIONS: These results demonstrate that anomalous samples can be identified by ASPIRE with almost perfect accuracy without a priori access to samples of anomalous subtypes in the training set. The ASPIRE approach is unique in its ability to form generalizations regarding normal and anomalous states given only very weak assumptions regarding sample characteristics and origin. Thus, ASPIRE could become highly instrumental in providing unique insights about observed biological phenomena in the absence of full information about the investigated samples.
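    For readers unfamiliar with the AUC metric reported in this abstract, the following minimal sketch computes it from anomaly scores and binary labels using the pairwise (Mann-Whitney) definition. The function name and the toy scores are illustrative assumptions and have no connection to the ASPIRE data sets or code.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen anomalous sample (label 1)
    scores higher than a randomly chosen normal sample (label 0); ties count half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: two normal samples (0) and two anomalous samples (1).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(auc(toy_scores, toy_labels))  # 0.75
```

    An AUC of 1.0 means every anomalous sample scored above every normal one, and 0.5 corresponds to random ranking.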