4,658 research outputs found

    Applying data mining techniques over big data

    Full text link
    Thesis (M.S.C.S.) PLEASE NOTE: Boston University Libraries did not receive an Authorization To Manage form for this thesis or dissertation. It is therefore not openly accessible, though it may be available by request. If you are the author or principal advisor of this work and would like to request open access for it, please contact us at [email protected]. Thank you.The rapid development of information technology in recent decades means that data appear in a wide variety of formats — sensor data, tweets, photographs, raw data, and unstructured data. Statistics show that there were 800,000 Petabytes stored in the world in 2000. Today’s internet has about 0.1 Zettabytes of data (ZB is about 1021 bytes), and this number will reach 35 ZB by 2020. With such an overwhelming flood of information, present data management systems are not able to scale to this huge amount of raw, unstructured data—in today’s parlance, Big Data. In the present study, we show the basic concepts and design of Big Data tools, algorithms, and techniques. We compare the classical data mining algorithms to the Big Data algorithms by using Hadoop/MapReduce as a core implementation of Big Data for scalable algorithms. We implemented the K-means algorithm and A-priori algorithm with Hadoop/MapReduce on a 5 nodes Hadoop cluster. We explore NoSQL databases for semi-structured, massively large-scaling of data by using MongoDB as an example. Finally, we show the performance between HDFS (Hadoop Distributed File System) and MongoDB data storage for these two algorithms

    A Parallel Space Saving Algorithm For Frequent Items and the Hurwitz zeta distribution

    Get PDF
    We present a message-passing based parallel version of the Space Saving algorithm designed to solve the kk--majority problem. The algorithm determines in parallel frequent items, i.e., those whose frequency is greater than a given threshold, and is therefore useful for iceberg queries and many other different contexts. We apply our algorithm to the detection of frequent items in both real and synthetic datasets whose probability distribution functions are a Hurwitz and a Zipf distribution respectively. Also, we compare its parallel performances and accuracy against a parallel algorithm recently proposed for merging summaries derived by the Space Saving or Frequent algorithms.Comment: Accepted for publication. To appear in Information Sciences, Elsevier. http://www.sciencedirect.com/science/article/pii/S002002551500657

    Sampling Algorithms for Evolving Datasets

    Get PDF
    Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up the processing of analytic queries and data-mining tasks, to enhance query optimization, and to facilitate information integration. Most of the existing work on database sampling focuses on how to create or exploit a random sample of a static database, that is, a database that does not change over time. The assumption of a static database, however, severely limits the applicability of these techniques in practice, where data is often not static but continuously evolving. In order to maintain the statistical validity of the sample, any changes to the database have to be appropriately reflected in the sample. In this thesis, we study efficient methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions, updates, and deletions. We consider instances of the maintenance problem that arise when sampling from an evolving set, from an evolving multiset, from the distinct items in an evolving multiset, or from a sliding window over a data stream. Our algorithms completely avoid any accesses to the base data and can be several orders of magnitude faster than algorithms that do rely on such expensive accesses. The improved efficiency of our algorithms comes at virtually no cost: the resulting samples are provably uniform and only a small amount of auxiliary information is associated with the sample. We show that the auxiliary information not only facilitates efficient maintenance, but it can also be exploited to derive unbiased, low-variance estimators for counts, sums, averages, and the number of distinct items in the underlying dataset. In addition to sample maintenance, we discuss methods that greatly improve the flexibility of random sampling from a system's point of view. More specifically, we initiate the study of algorithms that resize a random sample upwards or downwards. Our resizing algorithms can be exploited to dynamically control the size of the sample when the dataset grows or shrinks; they facilitate resource management and help to avoid under- or oversized samples. Furthermore, in large-scale databases with data being distributed across several remote locations, it is usually infeasible to reconstruct the entire dataset for the purpose of sampling. To address this problem, we provide efficient algorithms that directly combine the local samples maintained at each location into a sample of the global dataset. We also consider a more general problem, where the global dataset is defined as an arbitrary set or multiset expression involving the local datasets, and provide efficient solutions based on hashing

    An association rule dynamics and classification approach to event detection and tracking in Twitter.

    Get PDF
    Twitter is a microblogging application used for sending and retrieving instant on-line messages of not more than 140 characters. There has been a surge in Twitter activities since its launch in 2006 as well as steady increase in event detection research on Twitter data (tweets) in recent years. With 284 million monthly active users Twitter has continued to grow both in size and activity. The network is rapidly changing the way global audience source for information and influence the process of journalism [Newman, 2009]. Twitter is now perceived as an information network in addition to being a social network. This explains why traditional news media follow activities on Twitter to enhance their news reports and news updates. Knowing the significance of the network as an information dissemination platform, news media subscribe to Twitter accounts where they post their news headlines and include the link to their on-line news where the full story may be found. Twitter users in some cases, post breaking news on the network before such news are published by traditional news media. This can be ascribed to Twitter subscribers' nearness to location of events. The use of Twitter as a network for information dissemination as well as for opinion expression by different entities is now common. This has also brought with it the issue of computational challenges of extracting newsworthy contents from Twitter noisy data. Considering the enormous volume of data Twitter generates, users append the hashtag (#) symbol as prefix to keywords in tweets. Hashtag labels describe the content of tweets. The use of hashtags also makes it easy to search for and read tweets of interest. The volume of Twitter streaming data makes it imperative to derive Topic Detection and Tracking methods to extract newsworthy topics from tweets. Since hashtags describe and enhance the readability of tweets, this research is developed to show how the appropriate use of hashtags keywords in tweets can demonstrate temporal evolvements of related topic in real-life and consequently enhance Topic Detection and Tracking on Twitter network. We chose to apply our method on Twitter network because of the restricted number of characters per message and for being a network that allows sharing data publicly. More importantly, our choice was based on the fact that hashtags are an inherent component of Twitter. To this end, the aim of this research is to develop, implement and validate a new approach that extracts newsworthy topics from tweets' hashtags of real-life topics over a specified period using Association Rule Mining. We termed our novel methodology Transaction-based Rule Change Mining (TRCM). TRCM is a system built on top of the Apriori method of Association Rule Mining to extract patterns of Association Rules changes in tweets hashtag keywords at different periods of time and to map the extracted keywords to related real-life topic or scenario. To the best of our knowledge, the adoption of dynamics of Association Rules of hashtag co-occurrences has not been explored as a Topic Detection and Tracking method on Twitter. The application of Apriori to hashtags present in tweets at two consecutive period t and t + 1 produces two association rulesets, which represents rules evolvement in the context of this research. A change in rules is discovered by matching every rule in ruleset at time t with those in ruleset at time t + 1. The changes are grouped under four identified rules namely 'New' rules, 'Unexpected Consequent' and 'Unexpected Conditional' rules, 'Emerging' rules and 'Dead' rules. The four rules represent different levels of topic real-life evolvements. For example, the emerging rule represents very important occurrence such as breaking news, while unexpected rules represents unexpected twist of event in an on-going topic. The new rule represents dissimilarity in rules in rulesets at time t and t+1. Finally, the dead rule represents topic that is no longer present on the Twitter network. TRCM revealed the dynamics of Association Rules present in tweets and demonstrates the linkage between the different types of rule dynamics to targeted real-life topics/events. In this research, we conducted experimental studies on tweets from different domains such as sports and politics to test the performance effectiveness of our method. We validated our method, TRCM with carefully chosen ground truth. The outcome of our research experiments include: Identification of 4 rule dynamics in tweets' hashtags namely: New rules, Emerging rules, Unexpected rules and 'Dead' rules using Association Rule Mining. These rules signify how news and events evolved in real-life scenario. Identification of rule evolvements on Twitter network using Rule Trend Analysis and Rule Trace. Detection and tracking of topic evolvements on Twitter using Transaction-based Rule Change Mining TRCM. Identification of how the peculiar features of each TRCM rules affect their performance effectiveness on real datasets
    • …
    corecore