8,395 research outputs found
Dynamic feature selection for clustering high dimensional data streams
Open access article. Change in a data stream can occur at the concept level and at the feature level. Change at the feature level can occur if new, additional features appear in the stream or if the importance and relevance of a feature changes as the stream progresses. This type of change has not received as much attention as concept-level change. Furthermore, many of the methods proposed for clustering streams (density-based, graph-based, and grid-based) rely on some form of distance as a similarity metric, and this is problematic in high-dimensional data, where the curse of dimensionality renders distance measurements and any concept of “density” difficult. To address these two challenges we propose combining them and framing the problem as a feature selection problem, specifically a dynamic feature selection problem. We propose a dynamic feature mask for clustering high-dimensional data streams. Redundant features are masked and clustering is performed along unmasked, relevant features. If a feature's perceived importance changes, the mask is updated accordingly: previously unimportant features are unmasked and features which lose relevance become masked. The proposed method is algorithm-independent and can be used with any of the existing density-based clustering algorithms, which typically do not have a mechanism for dealing with feature drift and struggle with high-dimensional data. We evaluate the proposed method on four density-based clustering algorithms across four high-dimensional streams: two text streams and two image streams. In each case, the proposed dynamic feature mask improves clustering performance and reduces the processing time required by the underlying algorithm. Furthermore, change at the feature level can be observed and tracked.
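The masking idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how feature importance is measured, so the code below assumes, purely for illustration, that an exponentially decayed variance is the relevance score and that the top `k` features stay unmasked. The class name `DynamicFeatureMask` and the parameters `k` and `decay` are hypothetical and are not taken from the paper.

```python
class DynamicFeatureMask:
    """Illustrative sketch only: keeps the k features with the highest
    exponentially decayed variance unmasked. The relevance score is an
    assumption, not the method described in the paper."""

    def __init__(self, n_features, k, decay=0.9):
        self.k = k
        self.decay = decay
        self.mean = [0.0] * n_features
        self.var = [0.0] * n_features
        self.seen = 0

    def update(self, x):
        """Fold one stream point into the running mean and variance."""
        if self.seen == 0:
            self.mean = list(x)
        else:
            alpha = 1.0 - self.decay
            for i, xi in enumerate(x):
                delta = xi - self.mean[i]
                self.mean[i] += alpha * delta
                # Standard exponentially weighted variance update.
                self.var[i] = self.decay * (self.var[i] + alpha * delta * delta)
        self.seen += 1

    def mask(self):
        """Return one boolean per feature: True means unmasked (relevant)."""
        ranked = sorted(range(len(self.var)),
                        key=lambda i: self.var[i], reverse=True)
        keep = set(ranked[: self.k])
        return [i in keep for i in range(len(self.var))]

    def apply(self, x):
        """Project a point onto the currently unmasked features."""
        return [xi for xi, keep in zip(x, self.mask()) if keep]
```

In this sketch a downstream clustering algorithm would receive `apply(x)` instead of `x`, so a feature whose decayed variance falls behind the others drops out of the distance computation, and one that starts varying again is unmasked.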
Finding and tracking multi-density clusters in an online dynamic data stream
The file attached to this record is the author's final peer reviewed version. Change is one of the biggest challenges in dynamic stream mining. From a data-mining perspective, adapting to and tracking change is desirable in order to understand how and why change has occurred. Clustering, a form of unsupervised learning, can be used to identify the underlying patterns in a stream. Density-based clustering identifies clusters as areas of high density separated by areas of low density. This paper proposes a Multi-Density Stream Clustering (MDSC) algorithm to address two problems: the multi-density problem and the problem of discovering and tracking changes in a dynamic stream. MDSC consists of two on-line components: discovered, labelled clusters and an outlier buffer. Incoming points are assigned to a live cluster or passed to the outlier buffer. New clusters are discovered in the buffer using an ant-inspired swarm intelligence approach. Each newly discovered cluster is uniquely labelled and added to the set of live clusters. Processed data is subject to an ageing function and will disappear when it is no longer relevant. MDSC is shown to compare favourably with state-of-the-art peer stream-clustering algorithms on a range of real and synthetic data streams. Experimental results suggest that MDSC can discover qualitatively useful patterns while being scalable and robust to noise.
rEMM: Extensible Markov Model for Data Stream Clustering in R
Clustering streams of continuously arriving data has become an important application of data mining in recent years and efficient algorithms have been proposed by several researchers. However, clustering alone neglects the fact that data in a data stream is not only characterized by the proximity of data points which is used by clustering, but also by a temporal component. The extensible Markov model (EMM) adds the temporal component to data stream clustering by superimposing a dynamically adapting Markov chain. In this paper we introduce the implementation of the R extension package rEMM which implements EMM and we discuss some examples and applications.
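The combination described in this abstract, clusters with a Markov chain superimposed on them, can be sketched in a few lines. The code below is a Python illustration and not the rEMM package API. It assumes a simple threshold rule for assigning points to clusters, and the names `TinyEMM`, `threshold`, and `transition_probability` are hypothetical.

```python
import math
from collections import defaultdict


class TinyEMM:
    """Illustrative sketch: threshold-based stream clustering with
    transition counts between consecutively visited clusters."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.centroids = []                  # one centroid per state
        self.counts = []                     # points absorbed per state
        self.transitions = defaultdict(int)  # (src, dst) -> count
        self.current = None                  # state of the previous point

    def _nearest(self, x):
        best, best_d = None, math.inf
        for idx, c in enumerate(self.centroids):
            d = math.dist(x, c)
            if d < best_d:
                best, best_d = idx, d
        return best, best_d

    def update(self, x):
        """Assign x to a state, creating one if needed, and record the transition."""
        state, d = self._nearest(x)
        if state is None or d > self.threshold:
            self.centroids.append(list(x))
            self.counts.append(1)
            state = len(self.centroids) - 1
        else:
            self.counts[state] += 1
            n = self.counts[state]
            c = self.centroids[state]
            for i, xi in enumerate(x):
                c[i] += (xi - c[i]) / n      # incremental mean
        if self.current is not None:
            self.transitions[(self.current, state)] += 1
        self.current = state
        return state

    def transition_probability(self, src, dst):
        """Estimate P(dst | src) from the observed transition counts."""
        total = sum(n for (s, _), n in self.transitions.items() if s == src)
        return self.transitions.get((src, dst), 0) / total if total else 0.0
```

The transition counts carry the temporal information that clustering alone discards: two streams can produce the same clusters while visiting them in a different order, and only the counts tell them apart.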
Evaluating the Differences of Gridding Techniques for Digital Elevation Models Generation and Their Influence on the Modeling of Stony Debris Flows Routing: A Case Study From Rovina di Cancia Basin (North-Eastern Italian Alps)
Debris flows are among the most hazardous phenomena in mountain areas. To cope
with debris flow hazard, it is common to delineate the risk-prone areas through
routing models. The most important inputs to debris flow routing models are the
topographic data, usually in the form of Digital Elevation Models (DEMs). The quality
of DEMs depends on the accuracy, density, and spatial distribution of the sampled
points; on the characteristics of the surface; and on the applied gridding methodology.
Therefore, the choice of the interpolation method affects the realistic representation
of the channel and fan morphology, and thus potentially the debris flow routing
modeling outcomes. In this paper, we initially investigate the performance of common
interpolation methods (i.e., linear triangulation, natural neighbor, nearest neighbor,
Inverse Distance to a Power, ANUDEM, Radial Basis Functions, and ordinary kriging)
in building DEMs with the complex topography of a debris flow channel located
in the Venetian Dolomites (North-eastern Italian Alps), by using small footprint
full-waveform Light Detection And Ranging (LiDAR) data. The investigation is carried
out through a combination of statistical analysis of vertical accuracy, algorithm
robustness, and spatial clustering of vertical errors, and multi-criteria shape reliability
assessment. After that, we examine the influence of the tested interpolation algorithms
on the performance of a Geographic Information System (GIS)-based cell model for
simulating stony debris flow routing. In detail, we investigate both the correlation
between the DEM height uncertainty resulting from the gridding procedure and
that of the corresponding simulated erosion/deposition depths, and the effect of
interpolation algorithms on simulated areas, erosion and deposition volumes, solid-liquid
discharges, and channel morphology after the event. The comparison among the tested
interpolation methods highlights that the ANUDEM and ordinary kriging algorithms
are not suitable for building DEMs with complex topography. Conversely, linear
triangulation, the natural neighbor algorithm, and the thin-plate spline plus tension
and completely regularized spline functions ensure the best trade-off between accuracy
and shape reliability. Nevertheless, the evaluation of the effects of gridding techniques on
debris flow routing modeling reveals that the choice of the interpolation algorithm does
not significantly affect the model outcomes.
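Of the gridding methods listed in this abstract, Inverse Distance to a Power is the simplest to show in code. The sketch below uses the textbook formulation, in which each sample is weighted by one over its distance raised to a power. It is not the implementation used in the study, and the function name `idw` and the default `power=2.0` are assumptions made for illustration.

```python
import math


def idw(samples, query, power=2.0):
    """Estimate the elevation at `query` (x, y) from `samples`, an
    iterable of (x, y, z) points, by inverse distance weighting."""
    num = 0.0
    den = 0.0
    for x, y, z in samples:
        d = math.hypot(query[0] - x, query[1] - y)
        if d == 0.0:
            return z                 # query coincides with a sample
        w = 1.0 / d ** power         # closer samples weigh more
        num += w * z
        den += w
    if den == 0.0:
        raise ValueError("at least one sample point is required")
    return num / den
```

Evaluating `idw` at every cell centre of a regular grid yields a DEM. A higher `power` makes the surface follow the nearest samples more closely, which is one way the choice of gridding method and its parameters shapes the resulting channel morphology.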
Modeling the gravitational potential of a cosmological dark matter halo with stellar streams
Stellar streams result from the tidal disruption of satellites and star
clusters as they orbit a host galaxy, and can be very sensitive probes of the
gravitational potential of the host system. We select and study narrow stellar
streams formed in a Milky-Way-like dark matter halo of the Aquarius suite of
cosmological simulations, to determine if these streams can be used to
constrain the present day characteristic parameters of the halo's gravitational
potential. We find that orbits integrated in static spherical and triaxial NFW
potentials both reproduce the locations and kinematics of the various streams
reasonably well. To quantify this further, we determine the best-fit potential
parameters by maximizing the amount of clustering of the stream stars in the
space of their actions. We show that using our set of Aquarius streams, we
recover a mass profile that is consistent with the spherically-averaged dark
matter profile of the host halo, although we ignored both triaxiality and time
evolution in the fit. This gives us confidence that such methods can be applied
to the many streams that will be discovered by the Gaia mission to determine
the gravitational potential of our Galaxy. Comment: ApJ sub
- …