72 research outputs found

    The Minimum Description Length Principle for Pattern Mining: A Survey

    Full text link
    This is about the Minimum Description Length (MDL) principle applied to pattern mining. The length of this description is kept to the minimum. Mining patterns is a core task in data analysis and, beyond issues of efficient enumeration, the selection of patterns constitutes a major challenge. The MDL principle, a model selection method grounded in information theory, has been applied to pattern mining with the aim of obtaining compact, high-quality sets of patterns. After giving an outline of relevant concepts from information theory and coding, as well as of work on the theory behind the MDL and similar principles, we review MDL-based methods for mining various types of data and patterns. Finally, we open a discussion on some issues regarding these methods, and highlight currently active related data analysis problems.
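
    To make the two-part idea concrete, here is a minimal, hypothetical Python sketch of an MDL-style score for a pattern set: the model cost spells out the patterns themselves, and the data cost encodes each transaction through a greedy cover, in the spirit of (but far simpler than) code-table methods such as Krimp. All names and numbers are illustrative, not taken from the survey.

    ```python
    import math

    def two_part_mdl_score(patterns, transactions):
        """Toy two-part MDL score: L(model) + L(data | model), in bits.

        Model cost: each pattern is spelled out item by item under a
        uniform per-item code. Data cost: each transaction is covered
        greedily by the patterns (largest first), leftovers by
        singletons; each code-table entry costs -log2(usage / total).
        """
        items = sorted({i for t in transactions for i in t})
        item_bits = math.log2(len(items))            # uniform item code
        model_bits = sum(len(p) * item_bits for p in patterns)

        usage = {}                                   # pattern -> cover count
        for t in transactions:
            remaining = set(t)
            for p in sorted(patterns, key=len, reverse=True):
                if set(p) <= remaining:
                    usage[p] = usage.get(p, 0) + 1
                    remaining -= set(p)
            for i in remaining:                      # singleton fallback
                usage[(i,)] = usage.get((i,), 0) + 1

        total = sum(usage.values())
        data_bits = -sum(n * math.log2(n / total) for n in usage.values())
        return model_bits + data_bits

    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
    print(two_part_mdl_score([("a", "b")], db))      # compressing set: lower
    print(two_part_mdl_score([], db))                # singleton-only baseline
    ```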

    A Survey on Discovering High Utility Itemset Mining from Transactional Database

    Get PDF
    Data mining is the process of evaluating data from different perspectives and summarizing it into useful information; it can be defined as the process of extracting the information contained in very large databases. Traditional data mining methods have focused on finding correlations between items that appear frequently in the database, and the relative importance of each item is not considered in frequent pattern mining. High utility mining is a research area in which utility-based mining can be performed. Mining high utility itemsets from a transactional database refers to the discovery of itemsets with high utility in terms such as weight, unit profit, or value. In this paper we present a literature survey of currently used algorithms for high utility itemset mining. Keywords: High utility, Transactional Database, HUI_Miner, FH
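
    As a rough illustration of the utility measure described above, the sketch below computes itemset utilities (quantity times unit profit, summed over the transactions containing the itemset) with a naive exhaustive search; real miners such as HUI_Miner prune this search space with upper bounds instead. The data and threshold are made up.

    ```python
    from itertools import combinations

    def itemset_utility(itemset, transactions, profit):
        """Utility of `itemset`: over every transaction containing all
        of its items, sum quantity(item) * unit_profit(item)."""
        total = 0
        for t in transactions:                 # t maps item -> quantity
            if all(i in t for i in itemset):
                total += sum(t[i] * profit[i] for i in itemset)
        return total

    def naive_high_utility_itemsets(transactions, profit, minutil):
        """Exhaustive baseline: keep every itemset whose utility reaches
        `minutil`. Real algorithms avoid enumerating all combinations by
        pruning with utility upper bounds."""
        items = sorted({i for t in transactions for i in t})
        result = {}
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                u = itemset_utility(combo, transactions, profit)
                if u >= minutil:
                    result[combo] = u
        return result

    # Made-up quantities per transaction and unit profits per item.
    db = [{"a": 2, "b": 1}, {"a": 1, "c": 3}, {"a": 3, "b": 2, "c": 1}]
    profit = {"a": 4, "b": 10, "c": 1}
    print(naive_high_utility_itemsets(db, profit, minutil=30))
    ```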

    Efficient Temporal Synopsis of Social Media Streams

    Get PDF
    Search and summarization of streaming social media, such as Twitter, requires the ongoing analysis of large volumes of data with dynamically changing characteristics. Tweets are short and repetitious -- lacking context and structure -- making it difficult to generate a coherent synopsis of events within a given time period. Although some established algorithms for frequent itemset analysis might provide an efficient foundation for synopsis generation, the unmodified application of standard methods produces a complex mass of rules, dominated by common language constructs and many trivial variations on topically related results. Moreover, these results are not necessarily specific to events within the time period of interest. To address these problems, we build upon the Linear time Closed itemset Mining (LCM) algorithm, which is particularly suited to the large and sparse vocabulary of tweets. LCM generates only closed itemsets, providing an immediate reduction in the number of trivial results. To reduce the impact of function words and common language constructs, we apply a filtering step that preserves these terms only when they may form part of a relevant collocation. To further reduce trivial results, we propose a novel strengthening of the closure condition of LCM to retain only those results that exceed a threshold of distinctiveness. Finally, we perform temporal ranking, based on information gain, to identify results that are particularly relevant to the time period of interest. We evaluate our work over a collection of tweets gathered in late 2012, exploring the efficiency and filtering characteristics of each processing step, both individually and collectively. Based on our experience, the resulting synopses from various time periods provide understandable and meaningful pictures of events within those periods, with potential application to tasks such as temporal summarization and query expansion for search.
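
    The closure condition that LCM builds on can be stated in a few lines. The sketch below, over hypothetical tweet-like transactions, shows why keeping only closed itemsets immediately discards trivially redundant results; it is an illustration of the general definition, not LCM's actual enumeration procedure.

    ```python
    def closure(itemset, transactions):
        """Closure of `itemset`: the intersection of all transactions
        containing it. An itemset is closed iff it equals its closure,
        i.e. no superset has the same support -- the property that lets
        closed-itemset miners such as LCM drop redundant results."""
        supporting = [t for t in transactions if itemset <= t]
        if not supporting:
            return set(itemset)
        closed = set(supporting[0])
        for t in supporting[1:]:
            closed &= t
        return closed

    db = [{"storm", "power", "outage"},
          {"storm", "power", "nyc"},
          {"storm", "power", "outage", "nyc"}]
    print(closure({"storm"}, db))           # {'storm', 'power'}: not closed
    print(closure({"storm", "power"}, db))  # equals itself: closed
    ```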

    GENERIC FRAMEWORKS FOR INTERACTIVE PERSONALIZED INTERESTING PATTERN DISCOVERY

    Get PDF
    Traditional frequent pattern mining algorithms generate an exponentially large number of patterns, of which a substantial portion are not significant for many data analysis endeavours. Consequently, the discovery of a small number of interesting patterns, selected from the exponentially large number of frequent patterns according to a particular user's interest, is an important task. Existing works on patter…

    "CHARACTERIZATION OF SLAUGHTERED AND NON-SLAUGHTERED GOAT MEAT AT LOW FREQUENCIES"

    Get PDF
    Over the past two decades, the electrical stimulation of meat has shown high potential for use in the quality control of meat tissues. Dielectric spectroscopy is the most widely used technique to measure the electrical properties of tissues. Open-ended coaxial cables, or two parallel plates integrated with a network analyzer, impedance analyzer, or LCZ meter, have been used to measure the dielectric properties of meat for different purposes. The purpose of this research is to construct a capacitive device capable of differentiating slaughtered and non-slaughtered goat meat by determining the dielectric properties of goat meat at various frequencies and storage times. The detector cell has two circular platinum plates assembled on a micrometer barrel encased within a perspex box to form the capacitor. The test rig was validated to ensure it was working well. Two goats were slaughtered in the same environment: one was slaughtered properly (Islamic method) and the second was killed by garrote. The measurements were done on the hindlimb muscles. The samples were 2 cm in diameter and 5 mm thick. The slaughtered and non-slaughtered meat samples were separately placed between the capacitor plates. The capacitance and dissipation factor were measured across the capacitor device, which was connected to an LCR meter. The experiment was repeated at various frequencies (from 100 Hz to 2 kHz) and at different storage times (from 1 day to 10 days after slaughtering). The Maxwell Garnett mixing rule was applied to obtain the theoretical value of the effective permittivity using goat muscle and blood permittivities. The results show that the device is able to differentiate slaughtered and non-slaughtered goat meat. At all applied frequencies, the relative permittivity of the non-slaughtered meat was clearly higher than that of the slaughtered meat, which agrees with the simulation results. The dissipation factor of the non-slaughtered meat was less than that of the slaughtered meat.
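
    For reference, the two relations the measurement rests on can be sketched as follows: the parallel-plate formula eps_r = C·d / (eps0·A) recovers relative permittivity from the measured capacitance, and the Maxwell Garnett mixing rule gives an effective permittivity for inclusions in a host medium. The numeric values below are invented placeholders, not the thesis' measurements.

    ```python
    import math

    EPS0 = 8.854e-12  # vacuum permittivity, F/m

    def relative_permittivity(capacitance, plate_diameter, gap):
        """Relative permittivity of a sample between circular parallel
        plates: eps_r = C * d / (eps0 * A), fringing fields ignored."""
        area = math.pi * (plate_diameter / 2) ** 2
        return capacitance * gap / (EPS0 * area)

    def maxwell_garnett(eps_host, eps_incl, fraction):
        """Maxwell Garnett effective permittivity for spherical
        inclusions at volume fraction `fraction` in a host medium."""
        num = eps_incl + 2 * eps_host + 2 * fraction * (eps_incl - eps_host)
        den = eps_incl + 2 * eps_host - fraction * (eps_incl - eps_host)
        return eps_host * num / den

    # Hypothetical values: 2 cm plates, 5 mm gap, 50 pF measured, and a
    # muscle host with 10% blood inclusions (permittivities invented).
    print(relative_permittivity(50e-12, 0.02, 5e-3))
    print(maxwell_garnett(eps_host=80.0, eps_incl=5.8e3, fraction=0.10))
    ```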

    Frequent Itemset Mining for Big Data

    Get PDF
    Traditional data mining tools, developed to extract actionable knowledge from data, have proven inadequate for processing the huge amount of data produced nowadays. Even the most popular algorithms for Frequent Itemset Mining, an exploratory data analysis technique used to discover frequent item co-occurrences in a transactional dataset, are inefficient on larger and more complex data. As a consequence, many parallel algorithms have been developed, based on modern frameworks able to leverage distributed computation in commodity clusters of machines (e.g., Apache Hadoop, Apache Spark). However, parallelizing frequent itemset mining is far from trivial: the search-space exploration, on which all the techniques are based, is not easily partitionable. Hence, distributed frequent itemset mining is a challenging problem and an interesting research topic. In this context, our main contributions consist of (i) an exhaustive theoretical and experimental analysis of the best-in-class approaches, whose outcomes and open issues motivated (ii) the development of a distributed high-dimensional frequent itemset miner, and (iii) a data mining framework that relies heavily on distributed frequent itemset mining for the extraction of a specific type of itemset. The theoretical analysis highlights the challenges related to the distribution and the preliminary partitioning of the frequent itemset mining problem (i.e., the search-space exploration), describing the most widely adopted distribution strategies. The extensive experimental campaign, instead, compares the expectations arising from the algorithmic choices against the actual performance of the algorithms. We ran more than 300 experiments in order to evaluate and discuss the performance of the algorithms with respect to different real-life use cases and data distributions. The outcome of the review is that no algorithm is universally superior, and performance is heavily affected by the data distribution. Moreover, we identified a concrete gap regarding frequent pattern extraction in high-dimensional use cases. For this reason, we developed our own distributed high-dimensional frequent itemset miner based on Apache Hadoop. The algorithm splits the search-space exploration into independent sub-tasks. However, since the exploration strongly benefits from full knowledge of the problem, we introduced an interleaved synchronization phase. The result is a trade-off between the benefits of a centralized state and those of the additional computational power provided by parallelism. The experimental benchmarks, performed on real-life high-dimensional use cases, show the efficiency of the proposed approach in terms of execution time, load balancing, and robustness to memory issues. Finally, the dissertation introduces a data mining framework in which distributed itemset mining is a fundamental component of the processing pipeline. The aim of the framework is the extraction of a new type of itemset, called misleading generalized itemsets.
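
    The splitting of the search space into independent sub-tasks can be sketched on a single machine as follows: each prefix item defines a sub-task that a cluster would ship to a different worker. This is a generic toy version of prefix-based partitioning, not the thesis' Hadoop miner, and it omits the interleaved synchronization phase the thesis introduces.

    ```python
    def mine_prefix(prefix_item, transactions, minsup):
        """Mine frequent itemsets whose smallest item is `prefix_item`
        by depth-first extension of the projected database. Each prefix
        forms an independent sub-task, which is how the search space is
        typically split across workers."""
        projected = [t for t in transactions if prefix_item in t]
        results = []

        def extend(itemset, db):
            if len(db) >= minsup:
                results.append((tuple(sorted(itemset)), len(db)))
                bigger = sorted({i for t in db for i in t if i > max(itemset)})
                for c in bigger:
                    extend(itemset | {c}, [t for t in db if c in t])

        extend({prefix_item}, projected)
        return results

    db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
    # On a cluster, each prefix would be shipped to a different worker.
    for item in sorted({i for t in db for i in t}):
        print(item, mine_prefix(item, db, minsup=2))
    ```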

    Summarization Techniques for Pattern Collections in Data Mining

    Get PDF
    Discovering patterns from data is an important task in data mining. There exist techniques to find large collections of many kinds of patterns from data very efficiently. A collection of patterns can be regarded as a summary of the data. A major difficulty with patterns is that pattern collections summarizing the data well are often very large. In this dissertation we describe methods for summarizing pattern collections in order to also make them more understandable. More specifically, we focus on the following themes: 1) quality value simplifications; 2) pattern orderings; 3) pattern chains and antichains; 4) change profiles; 5) inverse pattern discovery. Comment: PhD Thesis, Department of Computer Science, University of Helsinki.
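
    As a taste of theme 3, a pattern collection can be shrunk to an antichain by keeping only its maximal elements. The sketch below is a hypothetical illustration of that general idea, not a method from the dissertation.

    ```python
    def maximal_antichain(patterns):
        """Keep only patterns not strictly contained in another one: the
        maximal elements form an antichain under set inclusion, a simple
        way to shrink a pattern collection."""
        psets = [frozenset(p) for p in patterns]
        return [p for p in psets if not any(p < q for q in psets)]

    coll = [{"a"}, {"a", "b"}, {"a", "b", "c"}, {"b", "d"}]
    print(maximal_antichain(coll))  # keeps {'a','b','c'} and {'b','d'}
    ```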

    Data Mining Algorithms for Internet Data: from Transport to Application Layer

    Get PDF
    Nowadays we live in a data-driven world. Advances in data generation, collection, and storage technology have enabled organizations to gather data sets of massive size. Data mining is a discipline that blends traditional data analysis methods with sophisticated algorithms to handle the challenges posed by these new types of data sets. The Internet is a complex and dynamic system in which new protocols and applications arise at a constant pace. All these characteristics make the Internet a valuable and challenging data source and application domain for research, both at the Transport layer, analyzing network traffic flows, and up at the Application layer, focusing on the ever-growing next generation of web services: blogs, micro-blogs, on-line social networks, photo sharing services, and many other applications (e.g., Twitter, Facebook, Flickr, etc.). In this thesis we focus on the study, design, and development of novel algorithms and frameworks to support large-scale data mining activities over huge and heterogeneous data volumes, with a particular focus on Internet data as the data source, targeting network traffic classification, on-line social network analysis, recommendation systems, cloud services, and Big Data.