10,331 research outputs found

    A clustering algorithm for multivariate data streams with correlated components

    Get PDF
    Common clustering algorithms require multiple scans of all the data to achieve convergence, and this is prohibitive when large databases, with data arriving in streams, must be processed. Some algorithms to extend the popular K-means method to the analysis of streaming data are present in literature since 1998 (Bradley et al. in Scaling clustering algorithms to large databases. In: KDD. p. 9-15, 1998; O'Callaghan et al. in Streaming-data algorithms for high-quality clustering. In: Proceedings of IEEE international conference on data engineering. p. 685, 2001), based on the memorization and recursive update of a small number of summary statistics, but they either don't take into account the specific variability of the clusters, or assume that the random vectors which are processed and grouped have uncorrelated components. Unfortunately this is not the case in many practical situations. We here propose a new algorithm to process data streams, with data having correlated components and coming from clusters with different covariance matrices. Such covariance matrices are estimated via an optimal double shrinkage method, which provides positive definite estimates even in presence of a few data points, or of data having components with small variance. This is needed to invert the matrices and compute the Mahalanobis distances that we use for the data assignment to the clusters. We also estimate the total number of clusters from the data.Comment: title changed, rewritte

    Fronthaul-Constrained Cloud Radio Access Networks: Insights and Challenges

    Full text link
    As a promising paradigm for fifth generation (5G) wireless communication systems, cloud radio access networks (C-RANs) have been shown to reduce both capital and operating expenditures, as well as to provide high spectral efficiency (SE) and energy efficiency (EE). The fronthaul in such networks, defined as the transmission link between a baseband unit (BBU) and a remote radio head (RRH), requires high capacity, but is often constrained. This article comprehensively surveys recent advances in fronthaul-constrained C-RANs, including system architectures and key techniques. In particular, key techniques for alleviating the impact of constrained fronthaul on SE/EE and quality of service for users, including compression and quantization, large-scale coordinated processing and clustering, and resource allocation optimization, are discussed. Open issues in terms of software-defined networking, network function virtualization, and partial centralization are also identified.Comment: 5 Figures, accepted by IEEE Wireless Communications. arXiv admin note: text overlap with arXiv:1407.3855 by other author

    Multivariate Correlation Discovery in Streaming Data

    Get PDF

    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    Get PDF
    The term "outlier" can generally be defined as an observation that is significantly different from the other values in a data set. The outliers may be instances of error or indicate events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and further discover interesting and useful knowledge about unusual events within numerous applications domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees to select the most suitable technique based on data set. Furthermore, we highlight the advantages, disadvantages and performance issues of each class of outlier detection techniques under this taxonomy framework

    Improving multivariate data streams clustering.

    Get PDF
    Clustering data streams is an important task in data mining research. Recently, some algorithms have been proposed to cluster data streams as a whole, but just few of them deal with multivariate data streams. Even so, these algorithms merely aggregate the attributes without touching upon the correlation among them. In order to overcome this issue, we propose a new framework to cluster multivariate data streams based on their evolving behavior over time, exploring the correlations among their attributes by computing the fractal dimension. Experimental results with climate data streams show that the clusters' quality and compactness can be improved compared to the competing method, leading to the thoughtfulness that attributes correlations cannot be put aside. In fact, the clusters' compactness are 7 to 25 times better using our method. Our framework also proves to be an useful tool to assist meteorologists in understanding the climate behavior along a period of time.Edição dos Proceedings do 16th International Conference on Computational Science, San Diego, 2016

    Framework for real-time, autonomous anomaly detection over voluminous time-series geospatial data streams, A

    Get PDF
    2014 Summer.Includes bibliographical references.In this research work we present an approach encompassing both algorithm and system design to detect anomalies in data streams. Individual observations within these streams are multidimensional, with each dimension corresponding to a feature of interest. We consider time-series geospatial datasets generated by remote and in situ observational devices. Three aspects make this problem particularly challenging: (1) the cumulative volume and rates of data arrivals, (2) anomalies evolve over time, and (3) there are spatio-temporal correlations associated with the data. Therefore, anomaly detections must be accurate and performed in real time. Given the data volumes involved, solutions must minimize user intervention and be amenable to distributed processing to ensure scalability. Our approach achieves accurate, high throughput classications in real time. We rely on Expectation Maximization (EM) to build Gaussian Mixture Models (GMMs) that model the densities of the training data. Rather than one all-encompassing model, our approach involves multiple model instances, each of which is responsible for a particular geographical extent and can also adapt as data evolves. We have incorporated these algorithms into our distributed storage platform, Galileo, and proled their suitability through empirical analysis which demonstrates high throughput (10,000 observations per-second, per-node) and low latency on real-world datasets
    corecore