5 research outputs found

    Divisive clustering of high dimensional data streams

    Clustering streaming data is gaining importance as automatic data acquisition technologies are deployed in diverse applications. We propose a fully incremental projected divisive clustering method for high-dimensional data streams that is motivated by high density clustering. The method is capable of identifying clusters in arbitrary subspaces, estimating the number of clusters, and detecting changes in the data distribution which necessitate a revision of the model. The empirical evaluation of the proposed method on numerous real and simulated datasets shows that it is scalable in dimension and number of clusters, is robust to noisy and irrelevant features, and is capable of handling a variety of types of non-stationarity.

    Processamento em streaming: avaliação de frameworks em contexto Big Data

    Integrated master's dissertation in Engineering and Management of Information Systems. Nowadays, the vast volume of data produced is one of the main focuses of attention of the Information Systems scientific community. Existing traditional data management tools cannot process these data in a timely manner, so more suitable technologies are needed to process larger volumes of data. In this context the term Big Data emerged, describing datasets that are large, of different types, and of varying degrees of complexity. Big Data plays an extremely important role in every business area, supporting decision making and the perception of future trends, and strengthening the competitive advantage of organizations. Despite the recognized advantages of Big Data and its associated technologies, applications that require real-time processing of large data streams have pushed these technologies to their limits. To address these limitations, new tools for streaming data processing have emerged. These tools deliver results with reduced waiting times and solve the high-latency problem of earlier processing systems. The objective of this dissertation is to benchmark the main stream processing frameworks in the Big Data context. To that end, a conceptual and technological review was carried out, covering the main concepts associated with the term Big Data as well as the main techniques and tools, with special emphasis on streaming. For the benchmark, a technological infrastructure was set up on Google Cloud Platform, and the indicators and metrics for later analysis were defined. Once all the defined tests were completed, it was possible to understand the behavior of each framework and its advantages and disadvantages with respect to the different needs of the streaming context.

    Projection methods for clustering and semi-supervised classification

    This thesis focuses on data projection methods for the purposes of clustering and semi-supervised classification, with a primary focus on clustering. A number of contributions are presented which address this problem in a principled manner; using projection pursuit formulations to identify subspaces which contain useful information for the clustering task. Projection methods are extremely useful in high dimensional applications, and situations in which the data contain irrelevant dimensions which can be counterinformative for the clustering task. The final contribution addresses high dimensionality in the context of a data stream. Data streams and high dimensionality have been identified as two of the key challenges in data clustering. The first piece of work is motivated by identifying the minimum density hyperplane separator in the finite sample setting. This objective is directly related to the problem of discovering clusters defined as connected regions of high data density, which is a widely adopted definition in non-parametric statistics and machine learning. A thorough investigation into the theoretical aspects of this method, as well as the practical task of solving the associated optimisation problem efficiently is presented. The proposed methodology is applied to both clustering and semi-supervised classification problems, and is shown to reliably find low density hyperplane separators in both contexts. The second and third contributions focus on a different approach to clustering based on graph cuts. The minimum normalised graph cut objective has gained considerable attention as relaxations of the objective have been developed, which make them solvable for reasonably well sized problems. This has been adopted by the highly popular spectral clustering methods. 
The second piece of work focuses on identifying the optimal subspace in which to perform spectral clustering, by minimising the second eigenvalue of the graph Laplacian for a graph defined over the data within that subspace. A rigorous treatment of this objective is presented, and an algorithm is proposed for its optimisation. An approximation method is proposed which allows this method to be applied to much larger problems than would otherwise be possible. An extension of this work deals with the spectral projection pursuit method for semi-supervised classification. The third body of work looks at minimising the normalised graph cut using hyperplane separators. This formulation allows for the exact normalised cut to be computed, rather than the spectral relaxation. It also allows for a computationally efficient method for optimisation. The asymptotic properties of the normalised cut based on a hyperplane separator are investigated, and shown to have similarities with the clustering objective based on low density separation. In fact, both the methods in the second and third works are shown to be connected with the first, in that all three have the same solution asymptotically, as their relative scaling parameters are reduced to zero. The final body of work addresses both problems of high dimensionality and incremental clustering in a data stream context. A principled statistical framework is adopted, in which clustering by low density separation again becomes the focal objective. A divisive hierarchical clustering model is proposed, using a collection of low density hyperplanes. The adopted framework provides well founded methodology for determining the number of clusters automatically, and also identifying changes in the data stream which are relevant to the clustering objective. It is apparent that no existing methods can make both of these claims.

    Anytime algorithms for stream data mining

    Data is collected and stored everywhere, be it images or audio files on private computers, customer data in traditional or electronic businesses, performance or control data in production sites, web traffic and click streams at internet providers, statistical data at government agencies, sensor measurements in scientific experimentation, surveillance data, etc. There are countless examples, and the amount of data is tremendous. Data mining is the process of finding useful and previously unknown patterns in data. In the examples listed above, data mining can be used for automated recommendation of audio files, business analysis and target marketing, or performance optimization and hazard warnings. While early mining algorithms only considered static data sets, research and practice in data mining must nowadays deal with continuous, possibly infinite streams of data, which are prevalent in most real world applications and scenarios. Anytime algorithms constitute a special type of algorithm that is well suited to work on data streams. They inherit their name from their ability to provide a result after any amount of processing time. The amount of time available is not known to the algorithm in advance: anytime algorithms quickly compute an initial result and strive to improve it as long as time remains. When interrupted they deliver the best result obtained until that point in time. In this thesis anytime classification is studied in depth for the Bayesian approach. New algorithmic solutions for anytime classification are developed and evaluated in extensive experimentation. The first anytime stream clustering algorithm is proposed, and an application to anytime outlier detection is presented. In addition to the algorithmic contributions, new meta-approaches are described that significantly widen the area of applications for anytime algorithms. 
The solutions and results of this thesis contribute to the state of the art in anytime algorithms and stream data mining research.
