28 research outputs found

    Dynamic feature selection for clustering high dimensional data streams

    Open access article. Change in a data stream can occur at the concept level and at the feature level. Change at the feature level can occur if new, additional features appear in the stream or if the importance and relevance of a feature changes as the stream progresses. This type of change has not received as much attention as concept-level change. Furthermore, a lot of the methods proposed for clustering streams (density-based, graph-based, and grid-based) rely on some form of distance as a similarity metric and this is problematic in high-dimensional data where the curse of dimensionality renders distance measurements and any concept of “density” difficult. To address these two challenges we propose combining them and framing the problem as a feature selection problem, specifically a dynamic feature selection problem. We propose a dynamic feature mask for clustering high dimensional data streams. Redundant features are masked and clustering is performed along unmasked, relevant features. If a feature's perceived importance changes, the mask is updated accordingly; previously unimportant features are unmasked and features which lose relevance become masked. The proposed method is algorithm-independent and can be used with any of the existing density-based clustering algorithms which typically do not have a mechanism for dealing with feature drift and struggle with high-dimensional data. We evaluate the proposed method on four density-based clustering algorithms across four high-dimensional streams; two text streams and two image streams. In each case, the proposed dynamic feature mask improves clustering performance and reduces the processing time required by the underlying algorithm. Furthermore, change at the feature level can be observed and tracked
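
    The masking idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not say how feature importance is scored, so this sketch assumes an exponentially weighted running variance as a stand-in score, and the class name `DynamicFeatureMask` and the parameters `threshold` and `decay` are hypothetical.

```python
from typing import List, Sequence


class DynamicFeatureMask:
    """Toy feature mask for a stream of numeric vectors.

    ASSUMPTION: a feature counts as relevant while its exponentially
    weighted variance is at or above `threshold`. The real importance
    score used in the paper is not given in the abstract.
    """

    def __init__(self, n_features: int, threshold: float, decay: float = 0.9):
        self.threshold = threshold
        self.decay = decay
        self.mean = [0.0] * n_features
        self.var = [0.0] * n_features
        self.seen = 0

    def update(self, x: Sequence[float]) -> None:
        """Fold one incoming point into the running per-feature statistics."""
        if self.seen == 0:
            self.mean = list(x)
        else:
            alpha = 1.0 - self.decay
            for i, value in enumerate(x):
                diff = value - self.mean[i]
                self.mean[i] += alpha * diff
                self.var[i] = self.decay * (self.var[i] + alpha * diff * diff)
        self.seen += 1

    def mask(self) -> List[bool]:
        """True marks an unmasked (relevant) feature."""
        return [v >= self.threshold for v in self.var]

    def apply(self, x: Sequence[float]) -> List[float]:
        """Keep only the unmasked features of a point."""
        return [value for value, keep in zip(x, self.mask()) if keep]


dfm = DynamicFeatureMask(n_features=2, threshold=0.05)

# Phase one: feature 0 varies, feature 1 is constant.
for t in range(50):
    dfm.update([float(t % 2), 5.0])
phase_one = dfm.mask()

# Phase two: the roles swap, so the mask should follow the drift.
for t in range(100):
    dfm.update([1.0, 4.0 + 2.0 * (t % 2)])
phase_two = dfm.mask()
```

    Any clustering algorithm could then be run on `dfm.apply(point)` instead of the raw point, which is the sense in which such a mask is algorithm-independent.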

    Finding and tracking multi-density clusters in an online dynamic data stream

    The file attached to this record is the author's final peer reviewed version. Change is one of the biggest challenges in dynamic stream mining. From a data-mining perspective, adapting and tracking change is desirable in order to understand how and why change has occurred. Clustering, a form of unsupervised learning, can be used to identify the underlying patterns in a stream. Density-based clustering identifies clusters as areas of high density separated by areas of low density. This paper proposes a Multi-Density Stream Clustering (MDSC) algorithm to address these two problems; the multi-density problem and the problem of discovering and tracking changes in a dynamic stream. MDSC consists of two on-line components; discovered, labelled clusters and an outlier buffer. Incoming points are assigned to a live cluster or passed to the outlier buffer. New clusters are discovered in the buffer using an ant-inspired swarm intelligence approach. The newly discovered cluster is uniquely labelled and added to the set of live clusters. Processed data is subject to an ageing function and will disappear when it is no longer relevant. MDSC is shown to perform favourably to state-of-the-art peer stream-clustering algorithms on a range of real and synthetic data-streams. Experimental results suggest that MDSC can discover qualitatively useful patterns while being scalable and robust to noise
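
    The two on-line components described above (live clusters and an outlier buffer) and the ageing step can be sketched as follows. This is an illustration under stated assumptions, not MDSC itself: the ant-inspired discovery of new clusters in the buffer is omitted, the per-cluster `radius` and the halving-based decay in `age` are assumed forms, and all names (`LiveCluster`, `route`, `age`) are hypothetical.

```python
import math


class LiveCluster:
    """A labelled cluster with its own radius (a nod to multi-density data)."""

    def __init__(self, label, centre, radius):
        self.label = label
        self.centre = centre
        self.radius = radius
        self.weight = 1.0


def route(point, clusters, buffer):
    """Assign a point to the nearest live cluster that covers it.

    Points covered by no cluster are appended to the outlier buffer and
    None is returned.
    """
    best, best_distance = None, float("inf")
    for cluster in clusters:
        distance = math.dist(point, cluster.centre)
        if distance <= cluster.radius and distance < best_distance:
            best, best_distance = cluster, distance
    if best is None:
        buffer.append(point)
        return None
    best.weight += 1.0
    return best.label


def age(clusters, decay_rate, min_weight):
    """ASSUMED ageing function: weight is multiplied by 2**(-decay_rate)
    per call, and clusters whose weight falls below min_weight are dropped."""
    for cluster in clusters:
        cluster.weight *= 2 ** (-decay_rate)
    return [c for c in clusters if c.weight >= min_weight]


clusters = [
    LiveCluster("A", (0.0, 0.0), 1.0),
    LiveCluster("B", (10.0, 10.0), 3.0),
]
buffer = []
labels = [route(p, clusters, buffer) for p in [(0.5, 0.0), (11.0, 12.0), (5.0, 5.0)]]

after_one_step = age(clusters, decay_rate=1.0, min_weight=0.6)
after_two_steps = age(after_one_step, decay_rate=1.0, min_weight=0.6)
```

    In this toy run the third point lands in the buffer, and both clusters fade out after two ageing steps because no further points refresh their weight.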

    Learning in Dynamic Data-Streams with a Scarcity of Labels

    Analysing data in real-time is a natural and necessary progression from traditional data mining. However, real-time analysis presents additional challenges to batch-analysis; along with strict time and memory constraints, change is a major consideration. In a dynamic stream there is an assumption that the underlying process generating the stream is non-stationary and that concepts within the stream will drift and change over time. Adopting a false assumption that a stream is stationary will result in non-adaptive models degrading and eventually becoming obsolete. The challenge of recognising and reacting to change in a stream is compounded by the scarcity of labels problem. This refers to the very realistic situation in which the true class label of an incoming point is not immediately available (or will never be available) or in situations where manually labelling incoming points is prohibitively expensive. The goal of this thesis is to evaluate unsupervised learning as the basis for online classification in dynamic data-streams with a scarcity of labels. To realise this goal, a novel stream clustering algorithm based on the collective behaviour of ants (Ant Colony Stream Clustering (ACSC)) is proposed. This algorithm is shown to be faster and more accurate than comparative, peer stream-clustering algorithms while requiring fewer sensitive parameters. The principles of ACSC are extended in a second stream-clustering algorithm named Multi-Density Stream Clustering (MDSC). This algorithm has adaptive parameters and crucially, can track clusters and monitor their dynamic behaviour over time. A novel technique called a Dynamic Feature Mask (DFM) is proposed to “sit on top” of these stream-clustering algorithms and can be used to observe and track change at the feature level in a data stream. This Feature Mask acts as an unsupervised feature selection method allowing high-dimensional streams to be clustered.
    Finally, data-stream clustering is evaluated as an approach to one-class classification and a novel framework (named COCEL: Clustering and One class Classification Ensemble Learning) for classification in dynamic streams with a scarcity of labels is described. The proposed framework can identify and react to change in a stream and hugely reduces the number of required labels (typically less than 0.05% of the entire stream)

    The Financial Crisis and the Changing Profile of Mortgage Arrears in Ireland. ESRI Research Notes 2014/4/2

    Understanding which households go into mortgage arrears during both boom and bust periods in Ireland is of critical importance to ensure suitable policies are deployed to safeguard future financial stability. Many of the difficulties in Ireland arose from the loosening of underwriting standards by financial institutions. This led to excessive household leverage ratios and provided households with limited buffers with which to absorb shocks (McCarthy and McQuinn, 2017; Lydon and McCann, 2017). The joint effects of labour market difficulties and large falls in house prices led to a situation where nearly one-in-five mortgage loans was in arrears at the height of the crisis (McCarthy, 2014)

    Finding multi-density clusters in non-stationary data streams using an ant colony with adaptive parameters

    The file attached to this record is the author's final peer reviewed version. The Publisher's final version can be found by following the DOI link. Density based methods have been shown to be an effective approach for clustering non-stationary data streams. The number of clusters does not need to be known a priori and density methods are robust to noise and changes in the statistical properties of the data. However, most density approaches require sensitive, data dependent parameters. These parameters greatly affect the clustering performance and in a dynamic stream a good set of parameters at time t are not necessarily the best at time t+1. Furthermore, these parameters are global and so restrict the algorithm to finding clusters of the same density. In this paper, we propose a density based algorithm with adaptive parameters which are local to each discovered cluster. The algorithm, denoted Ant Colony Multi-Density Clustering (ACMDC), uses artificial ants to form nests in dense areas of the data. As the ants move between nests, their collective memory is stored in the form of pheromone trails. Clusters are identified as groups of similar nests. The proposed algorithm is evaluated across a number of synthetic data streams containing overlapping and embedded multi-density clusters. The performance of the algorithm is shown to be favourable to a leading density based stream-clustering algorithm despite requiring no tunable parameters

    A Multi-Agent System for Modelling the Spread of Lethal Wilt in Oil-Palm Plantations

    Lethal Wilt (Marchitez Letal) is a disease which affects Elaeis Guineensis, a plant used in the production of palm oil. The disease is increasingly common but the spatial dynamics of the infection spread remain poorly understood. It is particularly dangerous due to the speed at which it spreads and the speed at which infected plants show symptoms and die. Early identification, or even better, accurate prediction of areas at high risk of infection can slow the spread of the disease and limit crop waste. This study is based on data collected over a five-year period from an affected plantation in Colombia. The aim of the study is to analyse the collected data to better understand how the disease spreads and then to model the behaviour. Based on insights from the initial analysis a multi-agent-based system is proposed to model the pattern of infection. The model is comprised of two steps; first Kernel Density Estimation is used to create an estimation of the distribution from which newly infected plants are drawn and this density estimation is then used to direct agents on a biased-walk of the surrounding areas. Results show that the model can approximate the behaviour of the disease and can predict areas which are at high risk of future infection
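
    The two-step model described above (a kernel density estimate that then steers agents on a biased walk) can be sketched roughly as follows. The details are assumptions made for illustration only: a Gaussian kernel, a grid of unit steps with eight neighbours, and step probabilities proportional to the estimated density. The coordinates are made up and are not data from the study.

```python
import math
import random


def kde(point, samples, bandwidth):
    """2-D Gaussian kernel density estimate at `point`."""
    h2 = bandwidth * bandwidth
    total = 0.0
    for sx, sy in samples:
        d2 = (point[0] - sx) ** 2 + (point[1] - sy) ** 2
        total += math.exp(-d2 / (2.0 * h2))
    return total / (len(samples) * 2.0 * math.pi * h2)


def biased_walk(start, samples, bandwidth, steps, rng):
    """Walk on an integer grid, preferring neighbours with higher density.

    ASSUMPTION: each of the eight neighbouring cells is chosen with
    probability proportional to its estimated density.
    """
    path = [start]
    x, y = start
    for _ in range(steps):
        neighbours = [
            (x + dx, y + dy)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)
        ]
        # The small constant keeps the weights valid if the density underflows.
        weights = [kde(n, samples, bandwidth) + 1e-12 for n in neighbours]
        x, y = rng.choices(neighbours, weights=weights, k=1)[0]
        path.append((x, y))
    return path


# Hypothetical positions of already-infected plants.
infected = [(10, 10), (11, 10), (10, 12), (12, 11)]
rng = random.Random(0)
path = biased_walk((5, 5), infected, bandwidth=2.0, steps=30, rng=rng)
```

    Cells visited often by many such agents would be the candidates for high-risk areas in a model of this shape.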

    Monetary policy normalisation and mortgage arrears in a recovering economy: The case of the Irish residential market. ESRI WP613, March 2019

    In this paper we examine the sensitivity of mortgage arrears for Irish households to changes in mortgage interest rates under a series of plausible monetary policy normalisation scenarios. Using panel data over the period 2004 - 2016 we exploit information on current income and current mortgage repayments to link arrears to the level of, as well as shocks in, households' current debt service ratio. In doing so we address gaps in the existing literature on modelling default and stress testing. Both are found to be strong drivers of arrears indicating the level of indebtedness, as well as changes to repayment capacity, matter for households. We find that a 100 basis point increase in policy rates would lead to a 0.5 percentage point increase in new default flows. We also test for heterogeneous effects across households and find younger, low income households and those on tracker mortgage rate loans are most at risk following rate rises. This has important consequences for the distributional impacts of monetary policy

    Ant colony stream clustering: A fast density clustering algorithm for dynamic data streams

    A data stream is a continuously arriving sequence of data and clustering data streams requires additional considerations to traditional clustering. A stream is potentially unbounded, data points arrive on-line and each data point can be examined only once. This imposes limitations on available memory and processing time. Furthermore, streams can be noisy and the number of clusters in the data and their statistical properties can change over time. This paper presents an on-line, bio-inspired approach to clustering dynamic data streams. The proposed Ant-Colony Stream Clustering (ACSC) algorithm is a density based clustering algorithm, whereby clusters are identified as high-density areas of the feature space separated by low-density areas. ACSC identifies clusters as groups of micro-clusters. The tumbling window model is used to read a stream and rough clusters are incrementally formed during a single pass of a window. A stochastic method is employed to find these rough clusters; this is shown to significantly speed up the algorithm with only a minor cost to performance, as compared to a deterministic approach. The rough clusters are then refined using a method inspired by the observed sorting behaviour of ants. Ants pick up and drop items based on their similarity to the surrounding items. Artificial ants sort clusters by probabilistically picking and dropping micro-clusters based on local density and local similarity. Clusters are summarised using their constituent micro-clusters and these summary statistics are stored offline. Experimental results show that the clustering quality of ACSC is scalable, robust to noise and favourable to leading ant-clustering and stream-clustering algorithms. It also requires fewer parameters and less computational time
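
    The tumbling window model and the micro-cluster summaries mentioned above can be sketched as follows. This is not ACSC: the sketch uses a deterministic nearest-centre rule where the paper uses a stochastic method, the ant-based refinement is omitted, and the summary kept per micro-cluster (a count and a linear sum) as well as the names `tumbling_windows`, `MicroCluster`, `summarise` and `eps` are assumptions.

```python
import math
from itertools import islice


def tumbling_windows(stream, size):
    """Yield consecutive, non-overlapping windows of at most `size` points."""
    iterator = iter(stream)
    while True:
        window = list(islice(iterator, size))
        if not window:
            return
        yield window


class MicroCluster:
    """Summary of nearby points: a count and a per-dimension linear sum."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)

    def centre(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point):
        self.n += 1
        for i, value in enumerate(point):
            self.linear_sum[i] += value


def summarise(window, eps):
    """Single pass over one window: absorb each point into the nearest
    micro-cluster within `eps`, otherwise start a new micro-cluster."""
    micro = []
    for point in window:
        nearest = min(
            micro, key=lambda m: math.dist(point, m.centre()), default=None
        )
        if nearest is not None and math.dist(point, nearest.centre()) <= eps:
            nearest.absorb(point)
        else:
            micro.append(MicroCluster(point))
    return micro


stream = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0, 0.1), (9, 9), (9, 9.1)]
counts = [
    [m.n for m in summarise(window, eps=0.5)]
    for window in tumbling_windows(stream, size=4)
]
```

    Each point is read once and only the summaries survive the window, which is how a design of this kind respects the memory and single-pass constraints described in the abstract.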

    Multiview subspace clustering using low-rank representation

    The file attached to this record is the author's final peer reviewed version. The Publisher's final version can be found by following the DOI link. Multiview subspace clustering is one of the most widely used methods for exploiting the internal structures of multiview data. Most previous studies have performed the task of learning multiview representations by individually constructing an affinity matrix for each view without simultaneously exploiting the intrinsic characteristics of multiview data. In this paper, we propose a multiview low-rank representation (MLRR) method to comprehensively discover the correlation of multiview data for multiview subspace clustering. MLRR considers symmetric low-rank representations (LRRs) to be an approximately linear spatial transformation under the new base, i.e., the multiview data themselves, to fully exploit the angular information of the principal directions of LRRs, which is adopted to construct an affinity matrix for multiview subspace clustering, under a symmetric condition. MLRR takes full advantage of LRR techniques and a diversity regularization term to exploit the diversity and consistency of multiple views, respectively, and this method simultaneously imposes a symmetry constraint on LRRs. Hence, the angular information of the principal directions of rows is consistent with that of columns in symmetric LRRs. The MLRR model can be efficiently calculated by solving a convex optimization problem. Moreover, we present an intuitive fusion strategy for symmetric LRRs from the perspective of spectral clustering to obtain a compact representation, which can be shared by multiple views and comprehensively represents the intrinsic features of multiview data. Finally, the experimental results based on benchmark datasets demonstrate the effectiveness and robustness of MLRR compared with several state-of-the-art multiview subspace clustering algorithms

    Historical Data Trend Analysis in Extended Reality Education Field

    The arrival of the digital age brings Virtual Reality, Augmented Reality, and Mixed Reality technologies into our daily life. These technologies provide a brand-new user experience that is composited with real environments. Due to the development of related devices in recent years, the highly interactive connections between users and devices have gradually evolved. The paper starts from a literature review to discuss the history and social impact of Virtual Reality, Augmented Reality, and Mixed Reality. The review comprises not only a traditional historical review but also a data research study. The research focuses on a case study paper, which proposed a bright, interactive future with technology in the educational field. We compared the proposed future view and the current development. This paper collected 269 citations from 2005 to 2020 and analyzed them, assessing whether they were technical or theoretical papers. The paper uses the collected data to discuss industrial development trends and indicates the possible future view based on the data study result