    Knowledge discovery in data streams

    Knowing what to do with the massive amounts of data they collect has long been an issue for many organizations. While data mining has been touted as the solution, it has failed to deliver the expected impact despite its successes in many areas. One reason is that data mining algorithms were not designed for the real world: they usually assume a static view of the data and a stable execution environment where resources are abundant. In reality, however, data are constantly changing and the execution environment is dynamic. Hence, it becomes difficult for data mining to deliver timely and relevant results. Recently, the processing of stream data has received much attention. Interestingly, the methodology used to design stream-based algorithms may well be the solution to the above problem. In this entry, we discuss this issue and present an overview of recent work

    Analyzing frequent patterns in data streams using a dynamic compact stream pattern algorithm

    As a result of modern technology and advances in communication, large volumes of data streams are continually generated from various online applications, devices and sources. Mining frequent patterns from these data streams is now an important research topic in the field of data mining and knowledge discovery. Traditional mining approaches may not be appropriate for a data stream environment, where the data volume is large and unbounded. They are also limited in their ability to extract recent changes in knowledge from the data stream adaptively. Many algorithms and models have been developed to address the challenging task of mining data from an infinite influx of data generated from various points over the internet. The objective of this thesis is to introduce the Dynamic Compact Pattern Stream tree (DCPS-tree) algorithm for mining recent data from a continuous data stream. Our DCPS-tree dynamically achieves a frequency-descending prefix tree structure with only a single pass over the data by applying tree restructuring techniques such as the Branch Sort Method (BSM). This causes low-frequency patterns to be maintained at the leaf-node level and high-frequency components at a higher level. As a result, there will be a considerable reduction in mining time on the datase
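
To make the idea of a frequency-descending prefix tree concrete, here is a minimal Python sketch. It is not the DCPS-tree algorithm: it counts item frequencies over a small batch first and then inserts each transaction in descending-frequency order, whereas the abstract describes a single-pass structure that is restructured dynamically with BSM. The function name `build_prefix_tree` and the nested-dictionary node layout are assumptions made for illustration only.

```python
from collections import Counter


def build_prefix_tree(transactions):
    """Build a prefix tree whose paths list items in descending frequency.

    Hypothetical helper for illustration; a batch (two-pass) build,
    not the single-pass restructuring described in the abstract.
    """
    # Count how many transactions contain each item.
    freq = Counter(item for t in transactions for item in set(t))
    root = {}
    for t in transactions:
        # Most frequent items first; ties broken by item name for determinism.
        ordered = sorted(set(t), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            child = node.setdefault(item, {"count": 0, "children": {}})
            child["count"] += 1
            node = child["children"]
    return root, freq


tree, freq = build_prefix_tree([["a", "b"], ["b", "c"], ["b"]])
```

Because frequent items sit near the root, transactions that share them also share a prefix path, which keeps the tree compact and pushes rare items towards the leaves.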

    Big continuous data: dealing with velocity by composing event streams

    The rate at which we produce data is growing steadily, creating ever larger streams of continuously evolving data. Online news, micro-blogs and search queries are just a few examples of these continuous streams of user activity. The value of these streams lies in their freshness and their relatedness to ongoing events. Modern applications consuming these streams need to extract behaviour patterns that can be obtained by aggregating and mining huge event histories, both statically and dynamically. An event is the notification that a happening of interest has occurred. Event streams must be combined or aggregated to produce more meaningful information. By combining and aggregating them, either from multiple producers or from a single one during a given period of time, a limited set of events describing meaningful situations may be notified to consumers. Event streams, with their volume and continuous production, relate mainly to two of the characteristics attributed to Big Data by the 5V's model: volume and velocity. Techniques such as complex pattern detection, event correlation, event aggregation, event mining and stream processing have been used for composing events. Nevertheless, to the best of our knowledge, few approaches integrate different composition techniques (online and post-mortem) for dealing with Big Data velocity. This chapter gives an analytical overview of event stream processing and composition approaches: complex event languages, services and event querying systems on distributed logs. Our analysis underlines the challenges introduced by Big Data velocity and volume and uses them as a reference for identifying the scope and limitations of results stemming from different disciplines: networks, distributed systems, stream databases, event composition services, and data mining on traces

    Methods for frequent pattern mining in data streams within the MOA system

    IncMine is a robust, efficient, practical, usable and extendable solution for frequent itemset mining over data streams. It is implemented under the Massive Online Analysis (MOA) framework. The work includes an analysis of its performance and its reaction to synthetic and real concept drift

    A Survey on the Integration of NAND Flash Storage in the Design of File Systems and the Host Storage Software Stack

    With the ever-increasing amount of data generated in the world, estimated to reach over 200 zettabytes by 2025, pressure on efficient data storage systems is intensifying. The shift from HDD to flash-based SSD is one of the most fundamental shifts in storage technology, increasing performance capabilities significantly. However, flash storage has different characteristics from prior HDD storage technology. As a result, existing storage software was unsuitable for leveraging the capabilities of flash storage, and a plethora of storage applications have been designed to better integrate with flash storage and align with flash characteristics. In this literature study we evaluate the effect that the introduction of flash storage has had on the design of file systems, which provide one of the most essential mechanisms for managing persistent storage. We analyze the mechanisms for effectively managing flash storage, managing the overheads of introduced design requirements, and leveraging the capabilities of flash storage. Numerous methods have been adopted in file systems; however, they predominantly revolve around similar design decisions: adhering to flash hardware constraints and limiting software intervention. The future design of storage software remains prominent given the constant growth in flash-based storage devices and interfaces, which provides increasing opportunities to enhance flash integration in the host storage software stack

    An association rule dynamics and classification approach to event detection and tracking in Twitter.

    Twitter is a microblogging application used for sending and retrieving instant online messages of not more than 140 characters. There has been a surge in Twitter activity since its launch in 2006, as well as a steady increase in event detection research on Twitter data (tweets) in recent years. With 284 million monthly active users, Twitter has continued to grow in both size and activity. The network is rapidly changing the way global audiences source information and is influencing the process of journalism [Newman, 2009]. Twitter is now perceived as an information network in addition to being a social network. This explains why traditional news media follow activity on Twitter to enhance their news reports and updates. Recognizing the significance of the network as an information dissemination platform, news media subscribe to Twitter accounts where they post their news headlines and include links to their online news sites, where the full story may be found. In some cases, Twitter users post breaking news on the network before it is published by traditional news media. This can be ascribed to Twitter users' proximity to the location of events. The use of Twitter as a network for information dissemination as well as for opinion expression by different entities is now common. This has also brought with it the computational challenge of extracting newsworthy content from noisy Twitter data. Given the enormous volume of data Twitter generates, users prefix keywords in tweets with the hashtag (#) symbol. Hashtag labels describe the content of tweets. The use of hashtags also makes it easy to search for and read tweets of interest. The volume of Twitter streaming data makes it imperative to derive Topic Detection and Tracking methods to extract newsworthy topics from tweets.
Since hashtags describe tweets and enhance their readability, this research shows how the appropriate use of hashtag keywords in tweets can reveal the temporal evolution of related real-life topics and consequently enhance Topic Detection and Tracking on the Twitter network. We chose to apply our method to Twitter because of the restricted number of characters per message and because it is a network that allows data to be shared publicly. More importantly, our choice was based on the fact that hashtags are an inherent component of Twitter. To this end, the aim of this research is to develop, implement and validate a new approach that uses Association Rule Mining to extract newsworthy topics from the hashtags of tweets about real-life topics over a specified period. We term our novel methodology Transaction-based Rule Change Mining (TRCM). TRCM is a system built on top of the Apriori method of Association Rule Mining to extract patterns of Association Rule changes in tweets' hashtag keywords at different periods of time and to map the extracted keywords to the related real-life topic or scenario. To the best of our knowledge, the dynamics of Association Rules of hashtag co-occurrences have not been explored as a Topic Detection and Tracking method on Twitter. Applying Apriori to the hashtags present in tweets at two consecutive periods, t and t + 1, produces two association rulesets, which represent rule evolution in the context of this research. A change in rules is discovered by matching every rule in the ruleset at time t with those in the ruleset at time t + 1. The changes are grouped under four identified rule types, namely 'New' rules, 'Unexpected Consequent' and 'Unexpected Conditional' rules, 'Emerging' rules and 'Dead' rules. The four rule types represent different levels of real-life topic evolution.
For example, an emerging rule represents a very important occurrence such as breaking news, while unexpected rules represent an unexpected twist in an ongoing topic. A new rule represents dissimilarity between rules in the rulesets at times t and t + 1. Finally, a dead rule represents a topic that is no longer present on the Twitter network. TRCM reveals the dynamics of Association Rules present in tweets and demonstrates the linkage between the different types of rule dynamics and targeted real-life topics and events. In this research, we conducted experimental studies on tweets from different domains, such as sports and politics, to test the effectiveness of our method. We validated TRCM against carefully chosen ground truth. The outcomes of our research experiments include: (1) identification of four rule dynamics in tweets' hashtags using Association Rule Mining, namely New rules, Emerging rules, Unexpected rules and Dead rules, which signify how news and events evolve in real-life scenarios; (2) identification of rule evolution on the Twitter network using Rule Trend Analysis and Rule Trace; (3) detection and tracking of topic evolution on Twitter using TRCM; and (4) identification of how the peculiar features of each TRCM rule type affect its effectiveness on real datasets
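
The step of matching every rule in the ruleset at time t against the ruleset at time t + 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the abstract does not give the matching criteria that separate the four TRCM rule types, so the sketch uses exact matching and reports just three coarse groups. The function name `rule_changes`, the group labels, and the representation of a rule as an `(antecedent, consequent)` pair of frozensets are all hypothetical.

```python
def rule_changes(rules_t, rules_t1):
    """Compare two association rulesets from consecutive periods.

    Each rule is an (antecedent, consequent) pair of frozensets of hashtags.
    Exact matching is an assumption; it is not TRCM's own matching procedure.
    """
    rules_t, rules_t1 = set(rules_t), set(rules_t1)
    return {
        "persisting": rules_t & rules_t1,    # present in both periods
        "appeared": rules_t1 - rules_t,      # only in the later period
        "disappeared": rules_t - rules_t1,   # only in the earlier period
    }


period_t = {
    (frozenset({"#election"}), frozenset({"#vote"})),
    (frozenset({"#final"}), frozenset({"#goal"})),
}
period_t1 = {
    (frozenset({"#election"}), frozenset({"#vote"})),
    (frozenset({"#election"}), frozenset({"#results"})),
}
changes = rule_changes(period_t, period_t1)
```

In this toy example the `#final` rule falls into the "disappeared" group, which loosely corresponds to the idea of a topic no longer present; distinguishing the Emerging and Unexpected types would need a similarity measure that the abstract does not describe.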

    Mining Time-Changing Data Streams

    Streaming data have gained considerable attention in the database and data mining communities because of the emergence of a class of applications, such as financial marketing, sensor networks, internet IP monitoring, and telecommunications, that produce these data. Data streams have some unique characteristics not exhibited by traditional data: they are unbounded, fast-arriving, and time-changing. Traditional data mining techniques that make multiple passes over data or that ignore distribution changes are not applicable to dynamic data streams. Mining data streams has been an active research area that addresses the requirements of streaming applications. This thesis focuses on developing techniques for distribution change detection and for mining time-changing data streams. Two techniques are proposed that can detect distribution changes in generic data streams. One approach for tackling one of the most popular stream mining tasks, frequent itemset mining, is also presented in this thesis. All the proposed techniques are implemented and empirically studied. Experimental results show that the proposed techniques can achieve promising performance for detecting changes and mining dynamic data streams

    Sampling Algorithms for Evolving Datasets

    Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up the processing of analytic queries and data-mining tasks, to enhance query optimization, and to facilitate information integration. Most of the existing work on database sampling focuses on how to create or exploit a random sample of a static database, that is, a database that does not change over time. The assumption of a static database, however, severely limits the applicability of these techniques in practice, where data is often not static but continuously evolving. In order to maintain the statistical validity of the sample, any changes to the database have to be appropriately reflected in the sample. In this thesis, we study efficient methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions, updates, and deletions. We consider instances of the maintenance problem that arise when sampling from an evolving set, from an evolving multiset, from the distinct items in an evolving multiset, or from a sliding window over a data stream. Our algorithms completely avoid any accesses to the base data and can be several orders of magnitude faster than algorithms that do rely on such expensive accesses. The improved efficiency of our algorithms comes at virtually no cost: the resulting samples are provably uniform and only a small amount of auxiliary information is associated with the sample. We show that the auxiliary information not only facilitates efficient maintenance, but it can also be exploited to derive unbiased, low-variance estimators for counts, sums, averages, and the number of distinct items in the underlying dataset. In addition to sample maintenance, we discuss methods that greatly improve the flexibility of random sampling from a system's point of view. 
More specifically, we initiate the study of algorithms that resize a random sample upwards or downwards. Our resizing algorithms can be exploited to dynamically control the size of the sample when the dataset grows or shrinks; they facilitate resource management and help to avoid under- or oversized samples. Furthermore, in large-scale databases with data being distributed across several remote locations, it is usually infeasible to reconstruct the entire dataset for the purpose of sampling. To address this problem, we provide efficient algorithms that directly combine the local samples maintained at each location into a sample of the global dataset. We also consider a more general problem, where the global dataset is defined as an arbitrary set or multiset expression involving the local datasets, and provide efficient solutions based on hashing
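
As background for maintaining a uniform random sample without access to the base data, here is a minimal Python sketch of classic reservoir sampling (often called Algorithm R). It is a swapped-in textbook technique, not one of the thesis's algorithms: it handles insertions only, whereas the abstract also covers updates, deletions, multisets, sliding windows and resizing. The function name `reservoir_sample` and its parameters are assumptions for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    Classic insertion-only reservoir sampling; each item is seen once
    and earlier items are never re-read from the source.
    """
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the n-th item with probability k / n by replacing
            # a uniformly chosen slot.
            j = rng.randrange(n)
            if j < k:
                sample[j] = item
    return sample


sample = reservoir_sample(range(1000), 10, random.Random(0))
```

Only the sample itself and a running item count are kept, which reflects the abstract's point that maintenance can avoid going back to the underlying dataset.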

    Multiple-Aspect Analysis of Semantic Trajectories

    This open access book constitutes the refereed post-conference proceedings of the First International Workshop on Multiple-Aspect Analysis of Semantic Trajectories, MASTER 2019, held in conjunction with the 19th European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2019, in Würzburg, Germany, in September 2019. The 8 full papers presented were carefully reviewed and selected from 12 submissions. They represent an interesting mix of techniques to solve recurrent as well as new problems in the semantic trajectory domain, such as data representation models, data management systems, machine learning approaches for anomaly detection, and common pathways identification