3,289 research outputs found

    Pattern mining under different conditions

    Get PDF
    New requirements and demands on pattern mining arise in modern applications, which cannot be fulfilled using conventional methods. For example, in scientific research, scientists are more interested in unknown knowledge, which usually hides in significant but not frequent patterns. However, existing itemset mining algorithms are designed for very frequent patterns. Furthermore, scientists need to repeat an experiment many times to ensure reproducibility. A series of datasets are generated at once, waiting for clustering, which can contain an unknown number of clusters with various densities and shapes. Using existing clustering algorithms is time-consuming because parameter tuning is necessary for each dataset. Many scientific datasets are extremely noisy. They contain considerably more noises than in-cluster data points. Most existing clustering algorithms can only handle noises up to a moderate level. Temporal pattern mining is also important in scientific research. Existing temporal pattern mining algorithms only consider pointbased events. However, most activities in the real-world are interval-based with a starting and an ending timestamp. This thesis developed novel pattern mining algorithms for various data mining tasks under different conditions. The first part of this thesis investigates the problem of mining less frequent itemsets in transactional datasets. In contrast to existing frequent itemset mining algorithms, this part focus on itemsets that occurred not that frequent. Algorithms NIIMiner, RaCloMiner, and LSCMiner are proposed to identify such kind of itemsets efficiently. NIIMiner utilizes the negative itemset tree to extract all patterns that occurred less than a given support threshold in a top-down depth-first manner. RaCloMiner combines existing bottom-up frequent itemset mining algorithms with a top-down itemset mining algorithm to achieve a better performance in mining less frequent patterns. LSCMiner investigates the problem of mining less frequent closed patterns. The second part of this thesis studied the problem of interval-based temporal pattern mining in the stream environment. Interval-based temporal patterns are sequential patterns in which each event is aligned with a starting and ending temporal information. The ability to handle interval-based events and stream data is lacking in existing approaches. A novel intervalbased temporal pattern mining algorithm for stream data is described in this part. The last part of this thesis studies new problems in clustering on numeric datasets. The first problem tackled in this part is shape alternation adaptivity in clustering. In applications such as scientific data analysis, scientists need to deal with a series of datasets generated from one experiment. Cluster sizes and shapes are different in those datasets. A kNN density-based clustering algorithm, kadaClus, is proposed to provide the shape alternation adaptability so that users do not need to tune parameters for each dataset. The second problem studied in this part is clustering in an extremely noisy dataset. Many real-world datasets contain considerably more noises than in-cluster data points. A novel clustering algorithm, kenClus, is proposed to identify clusters in arbitrary shapes from extremely noisy datasets. Both clustering algorithms are kNN-based, which only require one parameter k. In each part, the efficiency and effectiveness of the presented techniques are thoroughly analyzed. Intensive experiments on synthetic and real-world datasets are conducted to show the benefits of the proposed algorithms over conventional approaches

    SOTXTSTREAM: Density-based self-organizing clustering of text streams

    Get PDF
    A streaming data clustering algorithm is presented building upon the density-based selforganizing stream clustering algorithm SOSTREAM. Many density-based clustering algorithms are limited by their inability to identify clusters with heterogeneous density. SOSTREAM addresses this limitation through the use of local (nearest neighbor-based) density determinations. Additionally, many stream clustering algorithms use a two-phase clustering approach. In the first phase, a micro-clustering solution is maintained online, while in the second phase, the micro-clustering solution is clustered offline to produce a macro solution. By performing self-organization techniques on micro-clusters in the online phase, SOSTREAM is able to maintain a macro clustering solution in a single phase. Leveraging concepts from SOSTREAM, a new density-based self-organizing text stream clustering algorithm, SOTXTSTREAM, is presented that addresses several shortcomings of SOSTREAM. Gains in clustering performance of this new algorithm are demonstrated on several real-world text stream datasets

    AMANDA : density-based adaptive model for nonstationary data under extreme verification latency scenarios

    Get PDF
    Gradual concept-drift refers to a smooth and gradual change in the relations between input and output data in the underlying distribution over time. The problem generates a model obsolescence and consequently a quality decrease in predictions. Besides, there is a challenging task during the stream: The extreme verification latency (EVL) to verify the labels. For batch scenarios, state-of-the-art methods propose an adaptation of a supervised model by using an unconstrained least squares importance fitting (uLSIF) algorithm or a semi-supervised approach along with a core support extraction (CSE) method. However, these methods do not properly tackle the mentioned problems due to their high computational time for large data volumes, lack in representing the right samples of the drift or even for having several parameters for tuning. Therefore, we propose a density-based adaptive model for nonstationary data (AMANDA), which uses a semi-supervised classifier along with a CSE method. AMANDA has two variations: AMANDA with a fixed cutting percentage (AMANDA-FCP); and AMANDA with a dynamic cutting percentage (AMANDADCP). Our results indicate that the two variations of AMANDA outperform the state-of-the-art methods for almost all synthetic datasets and real ones with an improvement up to 27.98% regarding the average error. We have found that the use of AMANDA-FCP improved the results for a gradual concept-drift even with a small size of initial labeled data. Moreover, our results indicate that SSL classifiers are improved when they work along with our static or dynamic CSE methods. Therefore, we emphasize the importance of research directions based on this approach.Concept-drift gradual refere-se à mudança suave e gradual na distribuição dos dados conforme o tempo passa. Este problema causa obsolescência no modelo de aprendizado e queda na qualidade das previsões. Além disso, existe um complicador durante o processamento dos dados: a latência de verificação extrema (LVE) para se verificar os rótulos. Métodos do estado da arte propõem uma adaptação do modelo supervisionado usando uma abordagem de estimação de importância baseado em mínimos quadrados ou usando uma abordagem semi-supervisionada em conjunto com a extração de instâncias centrais, na sigla em inglês (CSE). Entretanto, estes métodos não tratam adequadamente os problemas mencionados devido ao fato de requererem alto tempo computacional para processar grandes volumes de dados, falta de correta seleção das instâncias que representam a mudança da distribuição, ou ainda por demandarem o ajuste de grande quantidade de parâmetros. Portanto, propomos um modelo adaptativo baseado em densidades para dados não-estacionários (AMANDA), que tem como base um classificador semi-supervisionado e um método CSE baseado em densidade. AMANDA tem duas variações: percentual de corte fixo (AMANDAFCP); e percentual de corte dinâmico (AMANDA-DCP). Nossos resultados indicam que as duas variações da proposta superam o estado da arte em quase todas as bases de dados sintéticas e reais em até 27,98% em relação ao erro médio. Concluímos que a aplicação do método AMANDA-FCP faz com que a classificação melhore mesmo quando há uma pequena porção inicial de dados rotulados. Mais ainda, os classificadores semi-supervisionados são melhorados quando trabalham em conjunto com nossos métodos de CSE, estático ou dinâmico

    Efficient Data Streaming Analytic Designs for Parallel and Distributed Processing

    Get PDF
    Today, ubiquitously sensing technologies enable inter-connection of physical\ua0objects, as part of Internet of Things (IoT), and provide massive amounts of\ua0data streams. In such scenarios, the demand for timely analysis has resulted in\ua0a shift of data processing paradigms towards continuous, parallel, and multitier\ua0computing. However, these paradigms are followed by several challenges\ua0especially regarding analysis speed, precision, costs, and deterministic execution.\ua0This thesis studies a number of such challenges to enable efficient continuous\ua0processing of streams of data in a decentralized and timely manner.In the first part of the thesis, we investigate techniques aiming at speeding\ua0up the processing without a loss in precision. The focus is on continuous\ua0machine learning/data mining types of problems, appearing commonly in IoT\ua0applications, and in particular continuous clustering and monitoring, for which\ua0we present novel algorithms; (i) Lisco, a sequential algorithm to cluster data\ua0points collected by LiDAR (a distance sensor that creates a 3D mapping of the\ua0environment), (ii) p-Lisco, the parallel version of Lisco to enhance pipeline- and\ua0data-parallelism of the latter, (iii) pi-Lisco, the parallel and incremental version\ua0to reuse the information and prevent redundant computations, (iv) g-Lisco, a\ua0generalized version of Lisco to cluster any data with spatio-temporal locality\ua0by leveraging the implicit ordering of the data, and (v) Amble, a continuous\ua0monitoring solution in an industrial process.In the second part, we investigate techniques to reduce the analysis costs\ua0in addition to speeding up the processing while also supporting deterministic\ua0execution. The focus is on problems associated with availability and utilization\ua0of computing resources, namely reducing the volumes of data, involving\ua0concurrent computing elements, and adjusting the level of concurrency. For\ua0that, we propose three frameworks; (i) DRIVEN, a framework to continuously\ua0compress the data and enable efficient transmission of the compact data in the\ua0processing pipeline, (ii) STRATUM, a framework to continuously pre-process\ua0the data before transferring the later to upper tiers for further processing, and\ua0(iii) STRETCH, a framework to enable instantaneous elastic reconfigurations\ua0to adjust intra-node resources at runtime while ensuring determinism.The algorithms and frameworks presented in this thesis contribute to an\ua0efficient processing of data streams in an online manner while utilizing available\ua0resources. Using extensive evaluations, we show the efficiency and achievements\ua0of the proposed techniques for IoT representative applications that involve a\ua0wide spectrum of platforms, and illustrate that the performance of our work\ua0exceeds that of state-of-the-art techniques

    Learning from Data Streams: An Overview and Update

    Full text link
    The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory such that they cannot be met in the contexts of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, tested in unrealistic settings, and/or in isolation from related approaches in the wider literature. This puts into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; and we take a fresh look at what constitutes a supervised data-stream learning task, and a reconsideration of algorithms that may be applied to tackle such tasks. Through and in reflection of this formulation and overview, helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime; and any constraints on memory and time are not specific to streaming. Meanwhile, there exist established techniques for dealing with temporal dependence and concept drift, in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability which are increasingly relevant to learning in data streams in academic and industrial settings
    • …
    corecore