5,566 research outputs found

    An efficient closed frequent itemset miner for the MOA stream mining system

    Mining itemsets is a central task in data mining, both in the batch and the streaming paradigms. While robust, efficient, and well-tested implementations exist for batch mining, hardly any publicly available equivalent exists for the streaming scenario. The lack of an efficient, usable tool for the task hinders its use by practitioners and makes it difficult to assess new research in the area. To alleviate this situation, we review the algorithms described in the literature, and implement and evaluate the IncMine algorithm by Cheng, Ke, and Ng (2008) for mining frequent closed itemsets from data streams. Our implementation works on top of the MOA (Massive Online Analysis) stream mining framework to ease its use and integration with other stream mining tasks. We provide a PAC-style rigorous analysis of the quality of the output of IncMine as a function of its parameters; this type of analysis is rare in pattern mining algorithms. As a by-product, the analysis shows how one of the user-provided parameters in the original description can be removed entirely while retaining the performance guarantees. Finally, we experimentally confirm both on synthetic and real data the excellent performance of the algorithm, as reported in the original paper, and its ability to handle concept drift.
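To make the notion of a frequent closed itemset concrete, here is a minimal brute-force sketch over one small window of transactions. It is not IncMine and does not use the MOA API; the function names and the toy window are assumptions made for illustration only. An itemset is treated as closed when no proper superset has the same support count.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every itemset with count >= min_support."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            count = sum(1 for t in transactions if cset <= t)
            if count >= min_support:
                result[cset] = count
                found = True
        if not found:
            # Support is anti-monotone: no larger itemset can be frequent.
            break
    return result

def closed_itemsets(frequent):
    """Keep only itemsets that have no proper superset with equal support."""
    return {s: c for s, c in frequent.items()
            if not any(s < other and c == frequent[other] for other in frequent)}

# Hypothetical window of four transactions.
window = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
closed = closed_itemsets(frequent_itemsets(window, min_support=2))
```

In this toy window, `{b}` and `{c}` are frequent but not closed, because `{a, b}` and `{a, c}` occur in exactly the same transactions. A streaming miner would have to maintain such a set incrementally as the window slides, which this sketch does not attempt.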

    An evolutionary model to mine high expected utility patterns from uncertain databases

    In recent decades, the number of mobile and Internet of Things (IoT) devices has increased dramatically across many domains and applications, generating a massive amount of data. The collected data carry many kinds of interesting information (e.g., interestingness, weight, frequency, or uncertainty), yet most existing generic pattern mining algorithms consider only a single objective and precise data when discovering the required information. Moreover, because the volume of collected information is huge, meaningful and up-to-date information must be discovered within a limited time. In this paper, we treat both utility and uncertainty as the main objectives in order to efficiently mine interesting high expected utility patterns (HEUPs) within a limited time, based on a multi-objective evolutionary framework. The designed model (called MOEA-HEUPM) can discover valuable HEUPs in an uncertain environment without predefined threshold values (i.e., minimum utility and minimum uncertainty). Two encoding methodologies are also considered in MOEA-HEUPM to show its effectiveness. With the developed model, the set of non-dominated HEUPs can be discovered within a limited time for decision-making. Experiments show the effectiveness and efficiency of MOEA-HEUPM in terms of convergence, hypervolume, and number of discovered patterns compared with generic approaches.
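The idea of a non-dominated set can be illustrated with a small Pareto filter over two objectives that are both maximized. This is only a sketch of the dominance relation, not the MOEA-HEUPM model; the pattern names and their (utility, probability) scores are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated(candidates):
    """Return the candidates that no other candidate dominates."""
    return {name: scores for name, scores in candidates.items()
            if not any(dominates(other, scores) for other in candidates.values())}

# Hypothetical patterns scored as (utility, existence probability).
patterns = {"p1": (10, 0.9), "p2": (30, 0.4), "p3": (8, 0.5), "p4": (30, 0.3)}
front = non_dominated(patterns)
```

Here `p3` is dominated by `p1` and `p4` by `p2`, so the front keeps `p1` and `p2`. No utility or uncertainty threshold is needed to obtain it, which is the property the abstract highlights.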

    MINING FREQUENT PATTERNS FROM PRECISE AND UNCERTAIN DATA

    Data mining has gained popularity over the past two decades and has been considered one of the most prominent areas of current database research. Common data mining tasks include finding frequent patterns, clustering and classifying objects, and detecting anomalies. To handle these tasks, techniques from different fields (such as database systems, machine learning, statistics, information retrieval, and data visualization) are applied to provide business intelligence (BI) solutions to various real-life problems. In this survey, we focus on the task of frequent pattern mining, which non-trivially extracts implicit, previously unknown, and potentially useful information in the form of frequently occurring sets of items. Mined frequent patterns can be considered building blocks for association rules, which help reveal associative relationships between items or events in the antecedent and the consequent of rules. Here, we describe some classical algorithms, as well as some recent innovative algorithms, for mining precise data (in which users are certain about the presence or absence of data items) and uncertain data (in which users are uncertain about the presence or absence of data items and know only that data items probably occur).
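The difference between precise and uncertain data can be shown with the expected-support measure commonly used for uncertain transactions. The sketch below assumes each item carries an independent existence probability in each transaction; the abstract does not specify a model, so this choice, the function name, and the toy databases are assumptions.

```python
from math import prod

def expected_support(itemset, uncertain_db):
    """Sum, over transactions, of the product of the item probabilities.

    Assumes items occur independently within a transaction; an item that is
    absent from a transaction has probability 0.
    """
    return sum(prod(t.get(item, 0.0) for item in itemset) for t in uncertain_db)

# Hypothetical uncertain database: item -> existence probability.
uncertain_db = [{"a": 0.9, "b": 0.5}, {"a": 0.4}, {"a": 1.0, "b": 1.0}]

# Precise data is the special case where every present item has probability 1.
precise_db = [{"a": 1.0, "b": 1.0}, {"a": 1.0}, {"a": 1.0, "b": 1.0}]

exp_ab = expected_support({"a", "b"}, uncertain_db)
count_ab = expected_support({"a", "b"}, precise_db)
```

With these values the expected support of `{a, b}` is 0.45 + 0 + 1.0 = 1.45 on the uncertain database, while on the precise database the same function reduces to the ordinary support count of 2.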

    Modern Approaches to Uncertain Database Exploration from Categorizing Data to Advanced Mining Solutions

    In today's digitized era, the ubiquity of data from diverse sources has introduced unique challenges in database management, notably the issue of data uncertainty. Uncertainty in databases can arise from various factors: sensor inaccuracies, human input errors, or inherent vagueness in data interpretation. Addressing these challenges, this research delves into modern approaches to uncertain database exploration. The paper begins by exploring methods for categorizing data based on certainty levels, emphasizing the importance of, and the mechanisms for, distinguishing between certain and uncertain data. The discussion then turns to mining solutions that enhance the utility of uncertain databases. By integrating state-of-the-art techniques with traditional database management principles, this study aims to bolster the reliability, efficiency, and versatility of data mining in uncertain contexts. The implications of these methods, both theoretically and in real-world applications, hold the potential to redefine how uncertain data is perceived and utilized in diverse sectors, from healthcare to finance.

    Mining Predictive Patterns and Extension to Multivariate Temporal Data

    An important goal of knowledge discovery is the search for patterns in the data that can help explain its underlying structure. To be practically useful, the discovered patterns should be novel (unexpected) and easy for humans to understand. In this thesis, we study the problem of mining patterns (defining subpopulations of data instances) that are important for predicting and explaining a specific outcome variable. An example is the task of identifying groups of patients that respond better to a certain treatment than the rest of the patients. We propose and present efficient methods for mining predictive patterns for both atemporal and temporal (time series) data. Our first method relies on frequent pattern mining to explore the search space. It applies a novel evaluation technique for extracting a small set of frequent patterns that are highly predictive and have low redundancy. We show the benefits of this method on several synthetic and public datasets. Our temporal pattern mining method works on complex multivariate temporal data, such as electronic health records, for the event detection task. It first converts time series into time-interval sequences of temporal abstractions and then mines temporal patterns backwards in time, starting from patterns related to the most recent observations. We show the benefits of our temporal pattern mining method on two real-world clinical tasks.
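The first step of the temporal method, converting a time series into a time-interval sequence of abstractions, might look roughly like the sketch below. The three value labels, the thresholds, and the merging rule are assumptions for illustration; the thesis may define its abstractions differently.

```python
def to_intervals(series, low, high):
    """Turn (time, value) points, sorted by time, into (label, start, end) intervals.

    Values below `low` are labelled "L", above `high` are "H", otherwise "N".
    Consecutive points with the same label are merged into one interval.
    """
    def label(value):
        if value < low:
            return "L"
        if value > high:
            return "H"
        return "N"

    intervals = []
    for time, value in series:
        current = label(value)
        if intervals and intervals[-1][0] == current:
            # Extend the open interval to the current time point.
            intervals[-1] = (current, intervals[-1][1], time)
        else:
            intervals.append((current, time, time))
    return intervals

# Hypothetical lab measurements as (time, value) pairs.
readings = [(0, 5.0), (1, 5.5), (2, 8.2), (3, 8.9), (4, 3.1)]
abstracted = to_intervals(readings, low=4.0, high=8.0)
```

For these readings the result is a normal interval over times 0 to 1, a high interval over times 2 to 3, and a low interval at time 4. Mining backwards in time would then start from the last interval in this list.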

    Doctor of Philosophy

    With the growing national dissemination of the electronic health record (EHR), there are expectations that the public will benefit from biomedical research and discovery enabled by electronic health data. Clinical data are needed for many diseases and conditions to meet the demands of rapidly advancing genomic and proteomic research. Many biomedical research advancements require rapid access to clinical data as well as broad population coverage. A fundamental issue in the secondary use of clinical data for scientific research is the identification of study cohorts of individuals with a disease or medical condition of interest. The problem addressed in this work is the need for generalized, efficient methods to identify cohorts in the EHR for use in biomedical research. To approach this problem, an associative classification framework was designed with the goal of accurate and rapid identification of cases for biomedical research: (1) a set of exemplars for a given medical condition is presented to the framework, (2) a predictive rule set composed of EHR attributes is generated by the framework, and (3) the rule set is applied to the EHR to identify additional patients that may have the specified condition. Based on this functionality, the approach was termed the 'cohort amplification' framework. The development and evaluation of the cohort amplification framework are the subject of this dissertation. An overview of the framework design is presented. Improvements to some standard associative classification methods are described and validated. A qualitative evaluation of predictive rules to identify diabetes cases and a study of the accuracy of identification of asthma cases in the EHR using framework-generated prediction rules are reported. The framework demonstrated accurate and reliable rules to identify diabetes and asthma cases in the EHR and contributed to methods for identification of biomedical research cohorts.
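The three steps described above (exemplars in, rule set generated, rules applied to find more patients) can be sketched in a few lines. This is a simplified illustration and not the dissertation's framework: rules here are plain attribute sets filtered by support and confidence, and the patient records, attribute names, and thresholds are all hypothetical.

```python
from itertools import combinations

def learn_rules(records, exemplars, min_support, min_confidence, max_len=2):
    """Step 2: keep attribute sets that are frequent among exemplars and mostly match exemplars."""
    attributes = sorted({a for pid in exemplars for a in records[pid]})
    rules = []
    for size in range(1, max_len + 1):
        for combo in combinations(attributes, size):
            antecedent = frozenset(combo)
            matching = {pid for pid, attrs in records.items() if antecedent <= attrs}
            hits = len(matching & exemplars)
            if hits and hits >= min_support and hits / len(matching) >= min_confidence:
                rules.append(antecedent)
    return rules

def amplify(records, exemplars, rules):
    """Step 3: return non-exemplar patients matched by at least one rule."""
    return {pid for pid, attrs in records.items()
            if pid not in exemplars and any(rule <= attrs for rule in rules)}

# Step 1: hypothetical EHR attributes per patient and a set of known exemplars.
records = {
    "p1": {"metformin", "hba1c_high"},
    "p2": {"metformin", "hba1c_high"},
    "p3": {"metformin", "hba1c_high"},
    "p4": {"inhaler"},
    "p5": {"hba1c_high", "inhaler"},
}
exemplars = {"p1", "p2"}

rules = learn_rules(records, exemplars, min_support=2, min_confidence=0.6)
candidates = amplify(records, exemplars, rules)
```

Because unlabeled patients count against a rule's confidence in this sketch, the confidence threshold has to stay below 1.0 for any new patient to be found. With the toy data, the rules involving `metformin` pass and `p3` is returned as an additional candidate.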