
    Clustering Time Series from Mixture Polynomial Models with Discretised Data

    Clustering time series is an active research area with applications in many fields. One common feature of time series is the likely presence of outliers. These uncharacteristic data can significantly affect the quality of the clusters formed. This paper evaluates a method of overcoming the detrimental effects of outliers. We describe some of the alternative approaches to clustering time series, then specify a particular class of model for experimentation with k-means clustering and a correlation-based distance metric. For data derived from this class of model we demonstrate that discretising the data into a binary series of above and below the median improves the clustering when the data has outliers. More specifically, we show that firstly discretisation does not significantly affect the accuracy of the clusters when there are no outliers and secondly it significantly increases the accuracy in the presence of outliers, even when the probability of an outlier is very low.
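The discretisation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `discretise` and `correlation_distance`, the tie rule (values equal to the median map to 0), and the toy series are all assumptions made for illustration, and the distance is taken to be one minus the Pearson correlation.

```python
from math import sqrt
from statistics import mean, median


def discretise(series):
    """Map each value to 1 if it lies above the series median, else 0.

    Assumption: values equal to the median are mapped to 0.
    """
    m = median(series)
    return [1 if x > m else 0 for x in series]


def correlation_distance(a, b):
    """Return 1 - Pearson correlation (assumes neither series is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return 1.0 - cov / sqrt(var_a * var_b)


# Hypothetical toy data: the same trend, with one outlier in the second series.
clean = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
noisy = [1.0, 2.0, 300.0, 4.0, 5.0, 6.0]

raw_distance = correlation_distance(clean, noisy)
binary_distance = correlation_distance(discretise(clean), discretise(noisy))
```

On this toy pair the single outlier dominates the raw correlation, while the binary series differ in only two positions, so `binary_distance` comes out smaller than `raw_distance`. Either distance could then be passed to a clustering routine.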

    Clustering for Data Reduction: A Divide and Conquer Approach

    We consider the problem of reducing a potentially very large dataset to a subset of representative prototypes. Rather than searching over the entire space of prototypes, we first roughly divide the data into balanced clusters using bisecting k-means and spectral cuts, and then find the prototypes for each cluster by affinity propagation. We apply our algorithm to text data, where it runs an order of magnitude faster than simply looking for prototypes on the entire dataset. Furthermore, our "divide and conquer" approach is actually more accurate on datasets which are well bisected, as the greedy decisions of affinity propagation are confined to classes of already similar items.

    Parametric model-based clustering


    Recent Developments in Document Clustering

    This report aims to give a brief overview of the current state of document clustering research and present recent developments in a well-organized manner. Clustering algorithms are considered with two hypothetical scenarios in mind: online query clustering with tight efficiency constraints, and offline clustering with an emphasis on accuracy. A comparative analysis of the algorithms is performed along with a table summarizing important properties, and open problems as well as directions for future research are discussed.

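    Aplicación de la regresión polinómica local al análisis discriminante y análisis cluster de series de tiempo (Application of local polynomial regression to discriminant analysis and cluster analysis of time series)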

    [Abstract] The thesis focuses on discriminant analysis and cluster analysis of time series. The prevalence of this type of data in many fields, such as seismology, economics, physics, and medicine, makes discriminant analysis and cluster analysis of time series problems of great theoretical and practical interest. Although both problems have been studied exhaustively from the standpoint of classical multivariate theory, the particular characteristics of time series mean that solutions developed for classifying static data are not always suitable for classifying stochastic processes. This thesis presents new nonparametric procedures for discriminant and cluster analysis of time series in the spectral domain. The novelty of these methods lies in the use of kernel-type estimators, based on local polynomial regression techniques, to estimate the spectral density of the processes to be classified. In the context of discriminant analysis, a new criterion for classifying time series is proposed, based on a disparity measure defined between a nonparametric estimator of the spectral density of the process to be classified and the spectral density of each of the classes of processes being discriminated between. Three kernel-type estimators based on local polynomial regression techniques are proposed for estimating the spectrum. The asymptotic normality of the proposed discriminant statistic is proved, as is the convergence to zero of the misclassification probabilities, both when the theoretical density of the classes being discriminated between is known and when it must be estimated from training samples.

    Scalable, Balanced Model-based Clustering

    This paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. Partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process: iterative model re-estimation and sample re-assignment. Instead of a maximum-likelihood (ML) assignment, a balance-constrained approach is used for the sample assignment step. An efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. We demonstrate the superiority of this approach to regular ML clustering on complex data such as arbitrary-shape 2-D spatial data, high-dimensional text documents, and EEG time series.
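To make the idea of a balance-constrained assignment step concrete, here is a rough sketch. It is not the iterative bipartitioning heuristic of the paper: it swaps in a simple greedy rule with a fixed per-cluster capacity, and the name `balanced_assign`, the `capacity` parameter, and the cost matrix are hypothetical.

```python
def balanced_assign(costs, capacity):
    """Greedy capacity-constrained assignment (a simplification, see lead-in).

    costs[i][k] is the cost of placing sample i in cluster k, for example a
    negative log-likelihood. Each cluster accepts at most `capacity` samples.
    """
    n, k = len(costs), len(costs[0])
    if capacity * k < n:
        raise ValueError("capacity too small to place every sample")
    # Visit (sample, cluster) pairs from cheapest to most expensive.
    pairs = sorted((costs[i][j], i, j) for i in range(n) for j in range(k))
    labels = [None] * n
    sizes = [0] * k
    for _, i, j in pairs:
        if labels[i] is None and sizes[j] < capacity:
            labels[i] = j
            sizes[j] += 1
    return labels


# Hypothetical costs: every sample prefers cluster 0, so an unconstrained
# maximum-likelihood assignment would leave cluster 1 empty.
costs = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
labels = balanced_assign(costs, capacity=2)
```

With a capacity of two, the two samples that fit cluster 0 best keep it and the remaining two are pushed to cluster 1, giving clusters of equal size. In a full model-based loop this step would alternate with re-estimating the cluster models.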