A clustering algorithm for multivariate data streams with correlated components
Common clustering algorithms require multiple scans of all the data to
achieve convergence, and this is prohibitive when large databases, with data
arriving in streams, must be processed. Some algorithms that extend the popular
K-means method to the analysis of streaming data have appeared in the literature
since 1998 (Bradley et al. in Scaling clustering algorithms to large databases.
In: KDD. p. 9-15, 1998; O'Callaghan et al. in Streaming-data algorithms for
high-quality clustering. In: Proceedings of IEEE international conference on
data engineering. p. 685, 2001), based on the memorization and recursive update
of a small number of summary statistics, but they either don't take into
account the specific variability of the clusters, or assume that the random
vectors which are processed and grouped have uncorrelated components.
Unfortunately this is not the case in many practical situations. We here
propose a new algorithm to process data streams, with data having correlated
components and coming from clusters with different covariance matrices. Such
covariance matrices are estimated via an optimal double shrinkage method, which
provides positive definite estimates even in the presence of only a few data points, or
of data having components with small variance. This is needed to invert the
matrices and compute the Mahalanobis distances that we use for the data
assignment to the clusters. We also estimate the total number of clusters from
the data.
Comment: title changed, rewritte
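The ingredients named in the abstract above (recursively updated summary statistics, a shrunk covariance estimate that stays invertible, and assignment by Mahalanobis distance) can be sketched as follows. This is an illustrative sketch only, not the authors' algorithm: the abstract does not specify the "optimal double shrinkage" estimator, so the code substitutes a plain fixed-intensity shrinkage toward a scaled identity matrix. The names `ClusterSummary`, `shrunk_covariance`, `mahalanobis_sq`, `assign`, and the parameters `alpha` and `floor` are all hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


class ClusterSummary:
    """Running count, mean and scatter matrix for one cluster (hypothetical helper)."""

    def __init__(self, dim: int) -> None:
        self.n = 0
        self.mean: Vector = [0.0] * dim
        self.scatter: Matrix = [[0.0] * dim for _ in range(dim)]

    def update(self, x: Vector) -> None:
        """Fold one observation into the summary without storing it (Welford-style)."""
        self.n += 1
        before = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + di / self.n for mi, di in zip(self.mean, before)]
        after = [xi - mi for xi, mi in zip(x, self.mean)]
        for i, di in enumerate(before):
            for j, dj in enumerate(after):
                self.scatter[i][j] += di * dj

    def shrunk_covariance(self, alpha: float = 0.1, floor: float = 1e-6) -> Matrix:
        """Sample covariance shrunk toward mu * I.

        ASSUMPTION: fixed intensity `alpha`, standing in for the paper's
        optimal double shrinkage, which the abstract does not describe.
        """
        d = len(self.mean)
        denom = max(self.n - 1, 1)
        cov = [[s / denom for s in row] for row in self.scatter]
        mu = max(sum(cov[i][i] for i in range(d)) / d, floor)
        return [
            [(1 - alpha) * cov[i][j] + (alpha * mu if i == j else 0.0) for j in range(d)]
            for i in range(d)
        ]


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def mahalanobis_sq(x: Vector, summary: ClusterSummary, alpha: float = 0.1) -> float:
    """Squared Mahalanobis distance from x to the cluster mean."""
    diff = [xi - mi for xi, mi in zip(x, summary.mean)]
    y = solve(summary.shrunk_covariance(alpha), diff)
    return sum(di * yi for di, yi in zip(diff, y))


def assign(x: Vector, clusters: List[ClusterSummary], alpha: float = 0.1) -> int:
    """Index of the cluster with the smallest Mahalanobis distance to x."""
    return min(range(len(clusters)), key=lambda k: mahalanobis_sq(x, clusters[k], alpha))
```

With a cluster stretched along the line y = x and a second, compact cluster elsewhere, a point far out along that line can be assigned to the stretched cluster even when the compact cluster's mean is closer in Euclidean terms. That is the effect of accounting for correlated components.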
Fronthaul-Constrained Cloud Radio Access Networks: Insights and Challenges
As a promising paradigm for fifth generation (5G) wireless communication
systems, cloud radio access networks (C-RANs) have been shown to reduce both
capital and operating expenditures, as well as to provide high spectral
efficiency (SE) and energy efficiency (EE). The fronthaul in such networks,
defined as the transmission link between a baseband unit (BBU) and a remote
radio head (RRH), requires high capacity, but is often constrained. This
article comprehensively surveys recent advances in fronthaul-constrained
C-RANs, including system architectures and key techniques. In particular, key
techniques for alleviating the impact of constrained fronthaul on SE/EE and
quality of service for users, including compression and quantization,
large-scale coordinated processing and clustering, and resource allocation
optimization, are discussed. Open issues in terms of software-defined
networking, network function virtualization, and partial centralization are
also identified.
Comment: 5 Figures, accepted by IEEE Wireless Communications. arXiv admin
note: text overlap with arXiv:1407.3855 by other author
A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. The outliers may be instances of error or indicate events. The
task of outlier detection aims at identifying such outliers in order to improve the analysis of
data and further discover interesting and useful knowledge about unusual events within numerous
application domains. In this paper, we report on contemporary unsupervised outlier detection
techniques for multiple types of data sets and provide a comprehensive taxonomy framework and
two decision trees to select the most suitable technique based on the data set. Furthermore, we
highlight the advantages, disadvantages and performance issues of each class of outlier detection
techniques under this taxonomy framework.
Improving multivariate data streams clustering.
Clustering data streams is an important task in data mining research. Recently, some algorithms have been proposed to cluster data streams as a whole, but only a few of them deal with multivariate data streams. Even so, these algorithms merely aggregate the attributes without considering the correlation among them. To overcome this issue, we propose a new framework to cluster multivariate data streams based on their evolving behavior over time, exploring the correlations among their attributes by computing the fractal dimension. Experimental results with climate data streams show that the clusters' quality and compactness can be improved compared to the competing method, suggesting that attribute correlations cannot be ignored. In fact, the clusters' compactness is 7 to 25 times better using our method. Our framework also proves to be a useful tool to assist meteorologists in understanding climate behavior over a period of time. Published in the Proceedings of the 16th International Conference on Computational Science, San Diego, 2016.
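To illustrate why a fractal dimension can reflect correlation among attributes, here is a minimal sketch. The abstract does not say which estimator the framework uses, so this assumes the box-counting estimate of the correlation fractal dimension: the slope of log S(r) against log r, where S(r) is the sum of squared cell occupancies on a grid of cell size r. The function name `correlation_fractal_dimension` and the choice of radii are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def correlation_fractal_dimension(
    points: Sequence[Sequence[float]], radii: Sequence[float]
) -> float:
    """Box-counting estimate of the correlation fractal dimension.

    For each cell size r, points are binned on a regular grid and
    S(r) = sum of squared cell counts is computed. The estimate is the
    least-squares slope of log S(r) versus log r.
    """
    log_r: List[float] = []
    log_s: List[float] = []
    for r in radii:
        cells = Counter(tuple(math.floor(c / r) for c in p) for p in points)
        s = sum(count * count for count in cells.values())
        log_r.append(math.log(r))
        log_s.append(math.log(s))
    mean_x = sum(log_r) / len(log_r)
    mean_y = sum(log_s) / len(log_s)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_r, log_s))
    den = sum((x - mean_x) ** 2 for x in log_r)
    return num / den


# Two perfectly correlated attributes lie on a line (dimension near 1),
# while two unrelated attributes fill the plane (dimension near 2).
radii = [1 / 2, 1 / 4, 1 / 8, 1 / 16]
line = [(i / 1024, i / 1024) for i in range(1024)]
grid = [((i + 0.5) / 32, (j + 0.5) / 32) for i in range(32) for j in range(32)]
print(correlation_fractal_dimension(line, radii))
print(correlation_fractal_dimension(grid, radii))
```

The closer the estimate is to the number of attributes, the less redundant those attributes are. An estimate well below it indicates that the attributes are correlated.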
A framework for real-time, autonomous anomaly detection over voluminous time-series geospatial data streams
2014 Summer. Includes bibliographical references. In this research work we present an approach encompassing both algorithm and system design to detect anomalies in data streams. Individual observations within these streams are multidimensional, with each dimension corresponding to a feature of interest. We consider time-series geospatial datasets generated by remote and in situ observational devices. Three aspects make this problem particularly challenging: (1) the cumulative volume and rates of data arrivals, (2) the evolution of anomalies over time, and (3) the spatio-temporal correlations associated with the data. Anomaly detection must therefore be accurate and performed in real time. Given the data volumes involved, solutions must minimize user intervention and be amenable to distributed processing to ensure scalability. Our approach achieves accurate, high-throughput classifications in real time. We rely on Expectation Maximization (EM) to build Gaussian Mixture Models (GMMs) that model the densities of the training data. Rather than one all-encompassing model, our approach involves multiple model instances, each of which is responsible for a particular geographical extent and can also adapt as data evolves. We have incorporated these algorithms into our distributed storage platform, Galileo, and profiled their suitability through empirical analysis, which demonstrates high throughput (10,000 observations per second, per node) and low latency on real-world datasets.
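The core idea of fitting a Gaussian mixture with EM and flagging low-density observations can be sketched as below. This is a deliberately simplified, one-dimensional illustration and not the Galileo implementation: the abstract describes multidimensional observations and multiple per-region model instances, neither of which is modelled here. The function names, the quantile-based initialisation, and the density `threshold` are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Component = Tuple[float, float, float]  # (weight, mean, variance)


def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_density(x: float, comps: Sequence[Component]) -> float:
    """Density of the mixture at x."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in comps)


def fit_gmm_1d(
    data: Sequence[float], k: int = 2, iters: int = 50, min_var: float = 1e-3
) -> List[Component]:
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    ordered = sorted(data)
    n = len(ordered)
    mean_all = sum(data) / n
    var_all = max(sum((x - mean_all) ** 2 for x in data) / n, min_var)
    # ASSUMPTION: deterministic start with means at evenly spaced quantiles.
    comps: List[Component] = [
        (1.0 / k, ordered[(2 * j + 1) * n // (2 * k)], var_all) for j in range(k)
    ]
    for _ in range(iters):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v) for w, m, v in comps]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weight, mean and variance of each component.
        updated: List[Component] = []
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                updated.append(comps[j])  # keep an empty component unchanged
                continue
            mj = sum(r[j] * x for r, x in zip(resp, data)) / nj
            vj = max(sum(r[j] * (x - mj) ** 2 for r, x in zip(resp, data)) / nj, min_var)
            updated.append((nj / n, mj, vj))
        comps = updated
    return comps


def is_anomaly(x: float, comps: Sequence[Component], threshold: float = 1e-4) -> bool:
    """Flag x when the fitted density falls below a (hypothetical) threshold."""
    return mixture_density(x, comps) < threshold
```

A model would be trained once per region on historical readings, for example `model = fit_gmm_1d(training_values)`, and each new observation checked with `is_anomaly(value, model)`. Adapting as the data evolves would require refitting or incremental updates, which this sketch omits.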