43 research outputs found

    Improved Methods for Extracting Frequent Itemsets from Interim-Support Trees

    Mining association rules in relational databases is a significant computational task with many applications. A fundamental ingredient of this task is the discovery of sets of attributes (itemsets) whose frequency in the data exceeds some threshold value. In previous work [9] we introduced an approach to this problem which begins by carrying out an efficient partial computation of the necessary totals, storing these interim results in a set-enumeration tree. This work demonstrated that making …
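
    The core computation the abstract describes - counting the support of candidate itemsets and keeping those above a threshold - can be pictured with a minimal level-wise sketch. This is a generic illustration, not the paper's interim-support (set-enumeration) tree method; the transactions and min_support value are invented examples.

        # Generic level-wise frequent-itemset sketch (not the paper's
        # interim-support tree): count candidates, keep those meeting the
        # support threshold, then join survivors into larger candidates.
        from itertools import combinations

        def frequent_itemsets(transactions, min_support):
            items = sorted({i for t in transactions for i in t})
            frequent = {}
            level = [frozenset([i]) for i in items]
            while level:
                counts = {c: 0 for c in level}
                for t in transactions:
                    for c in level:
                        if c <= t:               # candidate occurs in t
                            counts[c] += 1
                kept = {c: n for c, n in counts.items() if n >= min_support}
                frequent.update(kept)
                # Join frequent k-itemsets into (k+1)-item candidates.
                level = sorted({a | b for a, b in combinations(kept, 2)
                                if len(a | b) == len(a) + 1}, key=sorted)
            return frequent

        txs = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bc")]
        print(frequent_itemsets(txs, min_support=2))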

    High Utility Itemsets Identification in Big Data

    High utility itemset mining is an important data mining problem that considers profit factors, not just quantities, in a transactional database. It helps find the most valuable products/items, which are difficult to track using frequent itemset mining alone: an item with a high profit value might be rare in the transactional database despite its importance. Many existing algorithms generate comparatively large candidate sets while finding high utility itemsets, so a major focus is to reduce the computational time significantly through pruning strategies. Another aspect of high utility itemset mining is handling large datasets; very few algorithms can mine high utility itemsets on a parallel (distributed) system. This thesis proposes two methods: 1) high utility itemset mining using a pruning-strategies approach (HUI-PR), and 2) parallel EFIM (EFIM-Par). In method I, the proposed algorithm constructs the candidate sets in a tree structure and traverses the itemsets with high transaction-weighted utility (HTWUIs). It uses pruning strategies to reduce the computational time, refraining from visiting unnecessary nodes of the itemset tree to shrink the search space, and it significantly reduces the transaction database generated at each node. In method II, a distributed approach is proposed that divides the search space among worker nodes, whose high utility itemsets are aggregated to produce the final result. The experimental results for both methods show that they significantly improve the execution time for computing high utility itemsets.
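
    A standard pruning idea behind HTWUI-based mining is the transaction-weighted utility (TWU) upper bound: an item whose TWU falls below the utility threshold cannot belong to any high utility itemset. The sketch below illustrates that general property only; it is not HUI-PR or EFIM-Par, and the transactions and min_util value are made up.

        # TWU pruning sketch: sum the total utility of every transaction
        # containing an item; items below min_util can be discarded before
        # the (expensive) itemset search begins.

        def twu_prune(transactions, min_util):
            # transactions: list of {item: utility} dictionaries
            twu = {}
            for t in transactions:
                t_util = sum(t.values())        # utility of the whole transaction
                for item in t:
                    twu[item] = twu.get(item, 0) + t_util
            return {i for i, u in twu.items() if u >= min_util}

        txs = [{"a": 5, "b": 2}, {"a": 3, "c": 10}, {"b": 1, "c": 6}]
        print(twu_prune(txs, min_util=15))      # items worth exploring further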

    Doctor of Philosophy

    With the growing national dissemination of the electronic health record (EHR), there are expectations that the public will benefit from biomedical research and discovery enabled by electronic health data. Clinical data are needed for many diseases and conditions to meet the demands of rapidly advancing genomic and proteomic research. Many biomedical research advancements require rapid access to clinical data as well as broad population coverage. A fundamental issue in the secondary use of clinical data for scientific research is the identification of study cohorts of individuals with a disease or medical condition of interest. The problem addressed in this work is the need for generalized, efficient methods to identify cohorts in the EHR for use in biomedical research. To approach this problem, an associative classification framework was designed with the goal of accurate and rapid identification of cases for biomedical research: (1) a set of exemplars for a given medical condition is presented to the framework, (2) a predictive rule set comprised of EHR attributes is generated by the framework, and (3) the rule set is applied to the EHR to identify additional patients that may have the specified condition. Based on this functionality, the approach was termed the 'cohort amplification' framework. The development and evaluation of the cohort amplification framework are the subject of this dissertation. An overview of the framework design is presented. Improvements to some standard associative classification methods are described and validated. A qualitative evaluation of predictive rules to identify diabetes cases and a study of the accuracy of identification of asthma cases in the EHR using framework-generated prediction rules are reported. The framework produced accurate and reliable rules to identify diabetes and asthma cases in the EHR and contributed to methods for identification of biomedical research cohorts.
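
    Step (3) of the framework, applying a generated rule set to the EHR, can be sketched as simple set containment. The attribute names and rules below are hypothetical illustrations, not drawn from the dissertation.

        # Hypothetical rule-application sketch: each rule is a set of EHR
        # attributes; a patient whose record contains every attribute of
        # any rule is flagged as a candidate cohort member.

        def amplify_cohort(patients, rule_set):
            # patients: {patient_id: set of EHR attributes}
            return {pid for pid, attrs in patients.items()
                    if any(rule <= attrs for rule in rule_set)}

        rules = [{"hba1c_high", "metformin_rx"}, {"fasting_glucose_high"}]
        ehr = {"p1": {"hba1c_high", "metformin_rx", "statin_rx"},
               "p2": {"statin_rx"},
               "p3": {"fasting_glucose_high"}}
        print(amplify_cohort(ehr, rules))       # {'p1', 'p3'}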

    Tree algorithms for mining association rules

    With the increasing reliability of digital communication, the falling cost of hardware and increased computational power, the gathering and storage of data has become easier than at any other time in history. Commercial and public agencies are able to hold extensive records about all aspects of their operations: witness the proliferation of point of sale (POS) transaction recording within retailing, digital storage of census data and computerised hospital records. Whilst the gathering of such data has uses in terms of answering specific queries and allowing visualisation of certain trends, the volumes of data can hide significant patterns that would be impossible to locate manually. These patterns, once found, could provide insight into customer behaviour, demographic shifts and patient diagnosis hitherto unseen and unexpected. Remaining competitive in a modern business environment, or delivering services in a timely and cost-effective manner for public services, is a crucial part of modern economics. Analysis of the data held by an organisation, by a system that "learns", can allow predictions to be made based on historical evidence. Users may guide the process but essentially the software is exploring the data unaided. The research described within this thesis develops current ideas regarding the exploration of large data volumes. Particular areas of research are the reduction of the search space within the dataset and the generation of rules deduced from the patterns within the data. These issues are discussed within an experimental framework which extracts information from binary data.
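
    The rule-generation step mentioned here can be sketched in its standard generic form (this is an assumed illustration, not the thesis's specific tree algorithms): given support counts for itemsets, emit rules X -> Y whose confidence, support(X u Y) / support(X), meets a threshold.

        # Generic association-rule generation from itemset support counts.
        from itertools import combinations

        def rules_from_support(support, min_conf):
            # support: {frozenset(itemset): count}
            for itemset, s_xy in support.items():
                for k in range(1, len(itemset)):
                    for ante in map(frozenset, combinations(itemset, k)):
                        s_x = support.get(ante)
                        if s_x and s_xy / s_x >= min_conf:
                            yield set(ante), set(itemset - ante), s_xy / s_x

        sup = {frozenset("a"): 3, frozenset("b"): 3, frozenset("ab"): 2}
        for x, y, conf in rules_from_support(sup, min_conf=0.6):
            print(x, "->", y, round(conf, 2))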

    A new strategy for case-based reasoning retrieval using classification based on association

    Case-Based Reasoning (CBR) is an important area of research in the field of Artificial Intelligence. It aims to solve new problems by adapting solutions that were used to solve previous, similar ones. Among the four typical phases - retrieval, reuse, revise and retain - retrieval is a key phase in the CBR approach, as the retrieval of wrong cases can lead to wrong decisions. To accomplish the retrieval process, a CBR system exploits Similarity-Based Retrieval (SBR). However, SBR tends to depend strongly on similarity knowledge, ignoring other forms of knowledge that could further improve retrieval performance. The aim of this study is to integrate class association rules (CARs), a special case of association rules (ARs), to discover a set of rules that can form an accurate classifier in a database; this is an efficient method for building a classifier where the target is pre-determined. The proposition for this research is to answer the question of whether CARs can be integrated into a CBR system. A new strategy is proposed that mines class association rules from previous cases to strengthen similarity-based retrieval (SBR). The proposition question can be answered by adapting the pattern of CARs to be compared at the end of the retrieval phase. Previous experiments and their results to date show a link between CARs and CBR cases, and this link has been developed to achieve the aim and objectives. A novel strategy, Case-Based Reasoning using Association Rules (CBRAR), is proposed to improve the performance of SBR and to disambiguate wrongly retrieved cases in CBR. CBRAR uses CARs to generate an optimum frequent pattern tree (FP-tree) which holds a value for each node. The possible advantage offered is that more efficient results can be gained when SBR returns uncertain answers. In addition, CBRAR has been evaluated using two CBR frameworks - jCOLIBRI and FreeCBR - with the experimental evaluation on real datasets indicating that the proposed CBRAR is a better approach than the compared CBR systems, offering higher accuracy and a lower error rate.
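
    The intuition of strengthening SBR with CARs can be sketched as a tie-breaking step. The function and data layout below are illustrative assumptions, not the authors' CBRAR implementation or its FP-tree machinery.

        # Hedged sketch: when similarity-based retrieval returns nearest
        # cases with conflicting classes, a class association rule matched
        # by the query is consulted to break the tie.

        def retrieve(query, case_base, cars, top_k=3):
            # case_base: list of (feature_set, solution_class) pairs
            # cars: list of (antecedent_feature_set, predicted_class) rules
            def sim(a, b):                       # Jaccard similarity
                return len(a & b) / len(a | b)
            nearest = sorted(case_base, key=lambda c: sim(query, c[0]),
                             reverse=True)[:top_k]
            if len({cls for _, cls in nearest}) == 1:
                return nearest[0]                # SBR answer is unambiguous
            for ante, cls in cars:               # otherwise consult the CARs
                if ante <= query:
                    for case in nearest:
                        if case[1] == cls:
                            return case
            return nearest[0]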

    Language-independent pre-processing of large document bases for text classification

    Text classification is a well-known topic in the research of knowledge discovery in databases. Algorithms for text classification generally involve two stages. The first is concerned with identification of textual features (i.e. words and/or phrases) that may be relevant to the classification process. The second is concerned with classification rule mining and categorisation of "unseen" textual data. The first stage is the subject of this thesis and often involves an analysis of text that is both language-specific (and possibly domain-specific) and computationally costly, especially when dealing with large datasets. Existing approaches to this stage are not, therefore, generally applicable to all languages. In this thesis, we examine a number of alternative keyword selection methods and phrase generation strategies, coupled with two potential significant word list construction mechanisms and two final significant word selection mechanisms, to identify words and/or phrases in a given textual dataset that can be expected to distinguish between classes by simple, language-independent statistical properties. We present experimental results, using common (large) textual datasets presented in two distinct languages, to show that the proposed approaches can produce good performance with respect to both classification accuracy and processing efficiency. In other words, the study presented in this thesis demonstrates the possibility of efficiently solving the traditional text classification problem in a language-independent (and domain-independent) manner.
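
    One simple selection rule in the spirit of this statistical, language-independent approach (an assumed illustration; the thesis's actual mechanisms differ) is to keep words whose relative frequency in some class clearly exceeds their frequency in every other class.

        # Keep words whose smoothed relative frequency in one class is at
        # least `ratio` times their frequency in every other class.
        from collections import Counter

        def significant_words(docs_by_class, ratio=3.0, min_count=5):
            # docs_by_class: {class_label: list of document strings}
            freq = {c: Counter(w for d in docs for w in d.split())
                    for c, docs in docs_by_class.items()}
            totals = {c: sum(f.values()) for c, f in freq.items()}
            keep = set()
            for c, f in freq.items():
                for w, n in f.items():
                    if n < min_count:
                        continue
                    rel = (n + 1) / (totals[c] + 1)       # add-one smoothing
                    if all(rel >= ratio * (freq[o][w] + 1) / (totals[o] + 1)
                           for o in freq if o != c):
                        keep.add(w)
            return keep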

    Predictive Modelling of Retail Banking Transactions for Credit Scoring, Cross-Selling and Payment Pattern Discovery

    Evaluating transactional payment behaviour offers a competitive advantage in the modern payment ecosystem, not only for confirming the presence of good credit applicants or unlocking the cross-selling potential between the respective product and service portfolios of financial institutions, but also for ruling out bad credit applicants precisely in transactional payment streams. In a diagnostic test for analysing payment behaviour, I have used a hybrid approach comprising a combination of supervised and unsupervised learning algorithms to discover behavioural patterns. Supervised learning algorithms can compute a range of credit scores and cross-sell candidates, although the applied methods discover only limited behavioural patterns across the payment streams. Moreover, the performance of the applied supervised learning algorithms varies across the different data models, and their optimisation is inversely related to the pre-processed dataset. The research experiments conducted suggest that the Two-Class Decision Forest is an effective algorithm for determining both the cross-sell candidates and the creditworthiness of customers. In addition, a deep-learning model using a neural network has been considered, giving a meaningful interpretation of future payment behaviour through categorised payment transactions, in particular by providing additional deep insights through graph-based visualisations. However, the research shows that unsupervised learning algorithms play a central role in evaluating the transactional payment behaviour of customers: discovering associations using market basket analysis based on previous payment transactions, finding the frequent transaction categories, and developing interesting rules when each transaction category is performed on the same payment stream. The research also reveals that transactional payment behaviour analysis is multifaceted in the financial industry, useful for assessing the diagnostic ability of promotion candidates and for classifying bad credit applicants from among the entire customer base. The developed predictive models can also be used to estimate the credit risk of any credit applicant based on his/her transactional payment behaviour profile, combined with deep insights from the categorised payment transaction analysis. The study provides a full review of the performance characteristics of the different developed data models. Thus, the demonstrated data science approach is a possible proof of how machine learning models can be turned into cost-sensitive data models.
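
    The market basket analysis of categorised payment streams mentioned above can be pictured with a minimal lift computation. The category names and histories below are invented examples, not the thesis's data or models.

        # Find category pairs that co-occur across customers more often
        # than independence predicts, i.e. with lift > 1.
        from collections import Counter
        from itertools import combinations

        def category_lift(histories):
            # histories: one set of payment categories per customer
            n = len(histories)
            single, pair = Counter(), Counter()
            for h in histories:
                single.update(h)
                pair.update(frozenset(p) for p in combinations(sorted(h), 2))
            lifts = {}
            for p, c in pair.items():
                a, b = sorted(p)
                lifts[(a, b)] = (c / n) / ((single[a] / n) * (single[b] / n))
            return lifts

        hist = [{"groceries", "fuel"}, {"groceries", "fuel", "travel"}, {"travel"}]
        print(category_lift(hist))   # e.g. ('fuel', 'groceries') -> 1.5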

    New Fundamental Technologies in Data Mining

    The progress of data mining technology and its broad public popularity establish a need for a comprehensive text on the subject. The series of books entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. Beyond helping readers understand each section deeply, the two books present useful hints and strategies for solving the problems in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence lead to significant development in the field of data mining.

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher-dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped in describing their essential properties and how they vary in space. However, when the manifold is evolving through time, a joint spatio-temporal model is needed in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a supernova.
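
    The first-order Markov idea can be sketched with an off-the-shelf mixture model: fit a spatial model to each snapshot, seeding every fit with the previous snapshot's parameters so the description propagates through time. This is a deliberately simplified assumption-laden sketch using scikit-learn's GaussianMixture, not the paper's spatio-temporal model.

        # Fit a Gaussian mixture to the particle positions of each snapshot,
        # initialising from the previous model (first-order Markov step).
        import numpy as np
        from sklearn.mixture import GaussianMixture

        def track_manifold(snapshots, n_components=5, seed=0):
            # snapshots: time-ordered list of (n_particles, n_dims) arrays
            models, prev = [], None
            for points in snapshots:
                kwargs = {}
                if prev is not None:             # propagate previous model
                    kwargs = dict(weights_init=prev.weights_,
                                  means_init=prev.means_,
                                  precisions_init=np.linalg.inv(prev.covariances_))
                gm = GaussianMixture(n_components=n_components,
                                     random_state=seed, **kwargs).fit(points)
                models.append(gm)
                prev = gm
            return models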