62 research outputs found

    A study on incremental mining of frequent patterns

    Data generated from both offline and online sources is incremental in nature, and this incremental data changes the underlying database. Mining frequent patterns in a changing database is costly, since it requires scanning the database from the start. Mining growing databases has therefore been a major concern. To address it, a data mining technique called incremental mining has emerged. Incremental mining reuses previous mining results to obtain the desired knowledge, reducing mining costs in both time and space. This state-of-the-art paper surveys incremental mining approaches and identifies those suited to real-world problems.
    Keywords: Data Mining, Frequent Pattern, Incremental Mining, Frequent Pattern Mining, High Utility Mining, Constraint Mining
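
    The idea of reusing earlier mining results can be illustrated with a minimal Python sketch. It is not the method of any surveyed paper: the `IncrementalMiner` class, its method names, and the toy transactions are all hypothetical, and it naively keeps counts for every itemset up to a fixed size, which published incremental algorithms avoid.

```python
from collections import Counter
from itertools import combinations


def count_itemsets(transactions, max_size=2):
    """Count every itemset of up to max_size items in the given transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return counts


class IncrementalMiner:
    """Hypothetical helper: keeps support counts so new batches need no rescan."""

    def __init__(self, max_size=2):
        self.max_size = max_size
        self.counts = Counter()
        self.n_transactions = 0

    def add_batch(self, transactions):
        # Only the new increment is scanned; earlier counts are reused as-is.
        transactions = list(transactions)
        self.counts.update(count_itemsets(transactions, self.max_size))
        self.n_transactions += len(transactions)

    def frequent(self, min_support):
        # Frequent = occurs in at least min_support (a fraction) of all transactions.
        threshold = min_support * self.n_transactions
        return {s: c for s, c in self.counts.items() if c >= threshold}


# Toy data (made up for illustration): two batches arriving over time.
miner = IncrementalMiner()
miner.add_batch([["bread", "milk"], ["bread", "butter"]])
miner.add_batch([["bread", "milk", "butter"], ["milk"]])
result = miner.frequent(min_support=0.5)
```

    After the second batch, the stored counts equal those of a full rescan of all four transactions, but only the two new transactions were read. The cost of this simplification is memory: the number of stored itemsets grows quickly with `max_size`.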

    Improved Visualization of Frequent Itemset Relationships Using the Minimal Spanning Tree Algorithm

    Descriptive data mining techniques offer a way of extracting useful information from large datasets and presenting it in an interpretable fashion as a basis for future decisions. Since users interpret information most easily through visual means, techniques that produce concise, visually attractive results are usually preferred. We define a method that converts transactional data into tree-like data structures depicting important relationships between the items in the data. The proposed approach mitigates the loss of information present in previously developed algorithms, which construct tree structures from mined frequent itemsets. We transfer the problem to the domain of graph theory and, through minimal spanning tree construction, achieve more informative visualizations. We compare the new approach with previous ones by applying it to two real-life datasets: one of market basket data and the other from the educational domain.
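
    A rough Python sketch of the general idea follows: build a weighted item graph from transactions and reduce it to a minimum spanning tree. The abstract does not state how edges are weighted, so the distance used here (1 minus the Jaccard similarity of the items' transaction sets) is an assumption, as are the function names and the toy baskets.

```python
from itertools import combinations


def item_distances(transactions):
    """Assumed weighting: 1 - Jaccard similarity of two items' transaction sets."""
    occurs = {}
    for tid, transaction in enumerate(transactions):
        for item in set(transaction):
            occurs.setdefault(item, set()).add(tid)
    edges = []
    for a, b in combinations(sorted(occurs), 2):
        shared = len(occurs[a] & occurs[b])
        if shared:  # only connect items that co-occur at least once
            union = len(occurs[a] | occurs[b])
            edges.append((1 - shared / union, a, b))
    return edges


def minimum_spanning_tree(edges):
    """Kruskal's algorithm with a simple union-find; returns the chosen edges."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for weight, a, b in sorted(edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # edge joins two components, so it creates no cycle
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree


# Toy market-basket data (made up for illustration).
baskets = [
    ["bread", "butter"],
    ["bread", "butter", "jam"],
    ["bread", "milk"],
    ["milk", "cereal"],
]
tree = minimum_spanning_tree(item_distances(baskets))
```

    For these five items the result has four edges; the weaker bread-jam link is dropped because jam is already reached through butter. If some items never co-occur with the rest, the same code returns a spanning forest instead of a single tree.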

    Fouille de séquences temporelles pour la maintenance prédictive : application aux données de véhicules traceurs ferroviaires

    To meet mounting social and economic demands, railway operators and manufacturers are striving for higher availability and better reliability of railway transportation systems. Commercial trains are being equipped with state-of-the-art onboard intelligent sensors that monitor various subsystems throughout the train. These sensors provide a real-time flow of data, called floating train data, consisting of georeferenced events along with their spatial and temporal coordinates. Once ordered with respect to time, these events can be treated as long temporal sequences that can be mined for possible relationships. This has created a need for sequential data mining techniques that derive meaningful association rules or classification models from these data. Once discovered, these rules and models can be used to analyze the incoming event stream online in order to predict the occurrence of target events, i.e., severe failures that require immediate corrective maintenance. This thesis tackles that data mining task. We investigate and develop methodologies for discovering association rules and classification models that can help predict rare tilt and traction failures in sequences from past, less critical events. The investigated techniques fall along two major axes: association analysis, which is temporal, and classification techniques, which are not. The main challenges that complicate the task are the rarity of the target events to be predicted, the heavy redundancy of some events, and the frequent occurrence of data bursts.
The results obtained on real datasets collected from a fleet of trains highlight the effectiveness of the approaches and methodologies used.
    Nowadays, to meet economic and social demands, railway transportation systems must be operated with a high level of safety and reliability. In particular, there is a growing need for monitoring and maintenance-support tools that anticipate failures of rolling stock components. To develop such tools, commercial trains are equipped with intelligent sensors that send real-time information on the state of various subsystems. This information takes the form of long temporal sequences made up of a succession of events. Developing tools for the automatic analysis of these sequences will make it possible to identify significant associations between events, with the aim of predicting events that signal the onset of a severe failure. This thesis addresses the problem of mining temporal sequences for the prediction of rare events and is part of a broader effort to develop decision-support tools. We aim to study and develop various methods for discovering association rules between events on the one hand, and for building classification models on the other. These rules and/or classifiers can then be used to analyze an incoming event stream online in order to predict the occurrence of target events corresponding to failures. Two methodologies are considered in this thesis: the first is based on the search for association rules, which is a temporal approach, and the second is based on pattern recognition.
The main challenges facing this work are the rarity of the target events to be predicted, the heavy redundancy of some events, and the very frequent presence of bursts. The results obtained on real data collected by sensors on board a fleet of commercial trains highlight the effectiveness of the proposed approaches.

    Big data analytics for preventive medicine

    © 2019, Springer-Verlag London Ltd., part of Springer Nature. Medical data is among the most rewarding and yet most complicated data to analyze. How can healthcare providers use modern data analytics tools and technologies to analyze and create value from complex data? Data analytics promises to efficiently discover valuable patterns by analyzing large amounts of unstructured, heterogeneous, non-standard and incomplete healthcare data. It not only forecasts but also helps in decision making, and it is increasingly seen as a breakthrough in the ongoing effort to improve the quality of patient care and reduce healthcare costs. The aim of this study is to provide a comprehensive and structured overview of extensive research on the advancement of data analytics methods for disease prevention. This review first introduces disease prevention and its challenges, followed by traditional prevention methodologies. We summarize state-of-the-art data analytics algorithms used for classification of disease, clustering (unusually high incidence of a particular disease), anomaly detection (detection of disease) and association, along with their respective advantages, drawbacks and guidelines for selecting a specific model, followed by a discussion of recent developments and successful applications of disease prevention methods. The article concludes with open research challenges and recommendations.

    Techniques in data mining: decision trees classification and constraint-based itemsets mining.

    Cheung, Yin-ling. Thesis (M.Phil.), Chinese University of Hong Kong, 2001. Includes bibliographical references (leaves 117-124). Abstracts in English and Chinese.

    Abstract
    Acknowledgement
    Chapter 1: Introduction
      1.1 Data Mining Techniques
        1.1.1 Classification
        1.1.2 Association Rules Mining
        1.1.3 Estimation
        1.1.4 Prediction
        1.1.5 Clustering
        1.1.6 Description
      1.2 Problem Definition
      1.3 Thesis Organization
    Chapter I: Decision Tree Classifiers
    Chapter 2: Background
      2.1 Introduction to Classification
      2.2 Classification Using Decision Trees
        2.2.1 Constructing a Decision Tree
        2.2.2 Related Work
    Chapter 3: Strategies to Enhance the Performance in Building Decision Trees
      3.1 Introduction
        3.1.1 Related Work
        3.1.2 Post-evaluation vs Pre-evaluation of Splitting Points
      3.2 Schemes to Construct Decision Trees
        3.2.1 One-to-many Hashing
        3.2.2 Many-to-one and Horizontal Hashing
        3.2.3 A Scheme using Paired Attribute Lists
        3.2.4 A Scheme using Database Replication
      3.3 Performance Analysis
      3.4 Experimental Results
        3.4.1 Performance
        3.4.2 Test 1: Smaller Decision Tree
        3.4.3 Test 2: Bigger Decision Tree
      3.5 Conclusion
    Chapter II: Mining Association Rules
    Chapter 4: Background
      4.1 Definition
      4.2 Association Algorithms
        4.2.1 Apriori-gen
        4.2.2 Partition
        4.2.3 DIC
        4.2.4 FP-tree
        4.2.5 Vertical Data Mining
      4.3 Taxonomies of Association Rules
        4.3.1 Multi-level Association Rules
        4.3.2 Multi-dimensional Association Rules
        4.3.3 Quantitative Association Rules
        4.3.4 Random Sampling
        4.3.5 Constraint-based Association Rules
    Chapter 5: Mining Association Rules without Support Thresholds
      5.1 Introduction
        5.1.1 Itemset-Loop
      5.2 New Approaches
        5.2.1 A Build-Once and Mine-Once Approach, BOMO
        5.2.2 A Loop-back Approach, LOOPBACK
        5.2.3 A Build-Once and Loop-Back Approach, BOLB
        5.2.4 Discussion
      5.3 Generalization: Varying Thresholds Nk for k-itemsets
      5.4 Performance Evaluation
        5.4.1 Generalization: Varying Nk for k-itemsets
        5.4.2 Non-optimal Thresholds
        5.4.3 Different Decrease Factors, f
      5.5 Conclusion
    Chapter 6: Mining Interesting Itemsets with Item Constraints
      6.1 Introduction
      6.2 Proposed Algorithms
        6.2.1 Single FP-tree Approach
        6.2.2 Double FP-trees Approaches
      6.3 Maximum Support Thresholds
      6.4 Performance Evaluation
      6.5 Conclusion
    Chapter 7: Conclusion
    Chapter A: Probabilistic Analysis of Hashing Schemes
    Chapter B: Hash Functions
    Bibliography

    Using Big Data Analytics and Statistical Methods for Improving Drug Safety

    This dissertation includes three studies, all focusing on the use of Big Data and statistical methods to improve one of the most important aspects of health care, namely drug safety. In these studies we develop data analytics methodologies to inspect, clean, and model data with the aim of fulfilling the three main goals of drug safety: detection, understanding, and prediction of adverse drug effects.
    In the first study, we develop a methodology that combines analytics and statistical methods to detect associations between drugs and adverse events in historical patient records. In particular, we show the applicability of the methodology by investigating the potential confounding role of common diabetes drugs in the development of acute renal failure in diabetic patients. While traditional methods of signal detection mostly consider one drug and one adverse event at a time, our proposed methodology takes into account the effect of drug-drug interactions by identifying groups of drugs frequently prescribed together.
    In the second study, two independent methodologies are developed to investigate the role of prescription sequence in the likelihood of developing adverse events. This study focuses on using data analytics to understand drug-event associations. Our analyses of the historical medication records of a group of diabetic patients using the proposed approaches revealed that the sequence in which drugs are prescribed and administered matters significantly in the development of adverse events associated with those drugs.
    The third study uses a chronological approach to develop a network of approved drugs and their known adverse events. It then uses a set of network metrics, both similarity- and centrality-based, to build and train machine learning models that predict the likely adverse events of newly discovered drugs before their approval and introduction to the market. For this purpose, data on known drug-event associations from a large biomedical publication database (i.e., PubMed) is used to construct the network. The results indicate significant improvements in the accuracy of predicting drug-event associations compared with similar approaches.

    Mining social mixing patterns for infectious disease models based on a two-day population survey in Belgium

    Abstract
    Background: Until recently, mathematical models of person-to-person infectious disease transmission had to make assumptions about transmissions enabled by personal contacts by estimating the so-called WAIFW-matrix. To better inform such estimates, a population-based contact survey was carried out in Belgium over the period March-May 2006. In contrast to other European surveys conducted simultaneously, each respondent recorded contacts over two days. Special attention was given to holiday periods and to respondents with large numbers of professional contacts.
    Methods: Participants kept a paper diary with information on their contacts over two different days. A contact was defined as a two-way conversation of at least three words in each other's proximity. The contact information included the age of the contact, gender, location, duration, frequency, and whether or not touching was involved. For data analysis, we used association rules and classification trees. Weighted generalized estimating equations were used to analyze contact frequency while accounting for the correlation between contacts reported on the two different days. A contact surface, expressing the average number of contacts between persons of different ages, was obtained by a bivariate smoothing approach, and the relation to the so-called next-generation matrix was established.
    Results: People mostly mixed with people of similar age, or with their offspring, their parents and their grandparents. By imputing professional contacts, the average number of daily contacts increased from 11.84 to 15.70. The number of reported contacts depended heavily on household size, on class size for children, and on the number of professional contacts for adults. Adults living with children had on average two more daily contacts than adults living without children. In the holiday period, the daily contact frequency for children and adolescents decreased by about 19%, and a similar decrease was observed for adults at the weekend. These findings can be used to estimate the impact of school closure.
    Conclusion: We conducted a diary-based contact survey in Belgium to gain insights into social interactions relevant to the spread of infectious diseases. The resulting contact patterns are useful for improving estimates of crucial parameters for infectious disease transmission models.
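
    As a rough illustration of what the contact surface summarizes, the Python sketch below computes an unsmoothed table of average daily contacts by participant and contact age band from diary records. It is not the study's method: the bivariate smoothing and survey weights are omitted, and the age bands, function names, and diary records are all hypothetical.

```python
AGE_GROUPS = [(0, 17), (18, 64), (65, 120)]  # hypothetical age bands


def age_group(age):
    """Return the index of the age band containing the given age."""
    for index, (low, high) in enumerate(AGE_GROUPS):
        if low <= age <= high:
            return index
    raise ValueError(f"age out of range: {age}")


def mean_contact_matrix(participants, contacts):
    """Average contacts per participant-day, by participant and contact age band.

    participants: {participant_id: (age, number_of_diary_days)}
    contacts: iterable of (participant_id, contact_age), one per reported contact
    """
    size = len(AGE_GROUPS)
    totals = [[0] * size for _ in range(size)]
    days = [0] * size
    for age, n_days in participants.values():
        days[age_group(age)] += n_days
    for participant_id, contact_age in contacts:
        row = age_group(participants[participant_id][0])
        totals[row][age_group(contact_age)] += 1
    # Rows with no diary days are reported as zeros rather than dividing by zero.
    return [
        [totals[i][j] / days[i] if days[i] else 0.0 for j in range(size)]
        for i in range(size)
    ]


# Made-up diaries: a 10-year-old and a 40-year-old, two diary days each.
participants = {"p1": (10, 2), "p2": (40, 2)}
contacts = [("p1", 9), ("p1", 11), ("p1", 42), ("p2", 10), ("p2", 38), ("p2", 70)]
matrix = mean_contact_matrix(participants, contacts)
```

    Each row is a participant age band and each column a contact age band. Dividing by diary days rather than by participants matters here because every respondent reported on two days.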