60 research outputs found

    Data Mining and Hypothesis Refinement Using a Multi-Tiered Genetic Algorithm

    This is the published version. Copyright De Gruyter. This paper details a novel data mining technique that combines set objects with an enhanced genetic algorithm. By performing direct manipulation of sets, the encoding process used in genetic algorithms can be eliminated. The sets are used, manipulated, mutated, and combined until a solution is reached. The contributions of this paper are twofold: the development of a multi-tiered genetic algorithm technique, and its ability to perform not only data mining but also hypothesis refinement. The multi-tiered genetic algorithm is not only a closer approximation to genetics in the natural world, but also a method for combining the two main approaches for genetic algorithms in data mining, namely the Pittsburgh and Michigan approaches. These approaches were combined and implemented. The experimental results showed that the developed system can be a successful data mining tool. More importantly, testing the hypothesis refinement capability of this approach illustrated that it could take a data model generated by some other technique and improve upon the overall performance of the data model.

    A comparison of sixteen classification strategies of rule induction from incomplete data using the MLEM2 algorithm

    In data mining, rule induction is the process of extracting formal rules from decision tables, where the latter are tabulated observations that typically consist of a few attributes (independent variables) and a decision (a dependent variable). Each tuple in the table is considered a case, and a table may contain any number n of cases, one per observation. The efficiency of rule induction depends on how many cases are successfully characterized by the generated set of rules, i.e., the ruleset. There are different rule induction algorithms, such as LEM1, LEM2, and MLEM2. Real-world datasets are often imperfect, inconsistent, and incomplete. MLEM2 is an efficient algorithm for dealing with such data, but the quality of rule induction largely depends on the chosen classification strategy. We compare 16 classification strategies of rule induction using MLEM2 on incomplete data. For this, we implemented MLEM2 for inducing rulesets based on the selected type of approximation (singleton, subset, or concept) and the value of alpha used for calculating probabilistic approximations. A program called rule checker is used to calculate the error rate for the specified classification strategy. To reduce anomalies, we used ten-fold cross-validation to measure the error rate for each classification. Error rates for these strategies are calculated for different datasets, compared, and presented.
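    The abstract names three approximation types (singleton, subset, concept) governed by a threshold alpha but does not define them. The sketch below uses the definitions commonly given for probabilistic approximations over characteristic sets in the rough-set literature; it is an illustration under that assumption, not the authors' implementation. The function names and the toy characteristic sets are hypothetical, and computing characteristic sets from a table with missing values is left out.

```python
from fractions import Fraction


def conditional_probability(concept, block):
    """Pr(concept | block) = |concept ∩ block| / |block| for a non-empty block."""
    return Fraction(len(concept & block), len(block))


def probabilistic_approximations(universe, char_sets, concept, alpha):
    """Return the (singleton, subset, concept) approximations of `concept`.

    char_sets maps each case x to its characteristic set K(x), which is
    assumed to be precomputed from the incomplete decision table.
    """
    # Cases whose characteristic set supports the concept strongly enough.
    qualifying = {
        x for x in universe
        if conditional_probability(concept, char_sets[x]) >= alpha
    }
    # Singleton: the qualifying cases themselves.
    singleton = set(qualifying)
    # Subset: union of K(x) over all qualifying cases in the universe.
    subset = set().union(*(char_sets[x] for x in qualifying))
    # Concept: union of K(x) over qualifying cases that belong to the concept.
    concept_approx = set().union(*(char_sets[x] for x in qualifying & concept))
    return singleton, subset, concept_approx


# Hypothetical toy data: four cases, concept X = {1, 3}.
universe = {1, 2, 3, 4}
char_sets = {1: {1, 2}, 2: {1, 2}, 3: {3}, 4: {3, 4}}
concept = {1, 3}

lower = probabilistic_approximations(universe, char_sets, concept, Fraction(1))
middle = probabilistic_approximations(universe, char_sets, concept, Fraction(1, 2))
print(lower)
print(middle)
```

    With alpha = 1 all three approximations reduce to lower approximations; smaller values of alpha admit cases whose characteristic sets only partly overlap the concept, which is where the three types start to differ.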

    A Multi-Tiered Genetic Algorithm for Data Mining and Hypothesis Refinement

    While there are many approaches to data mining, it seems that there is a hole in the ability to make use of the advantages of multiple techniques. There are many methods that use rigid heuristics and guidelines in constructing rules for data, and are thus limited in their ability to describe patterns. Genetic algorithms provide a more flexible approach, and yet the genetic algorithms that have been employed don't capitalize on the fact that data models have two levels: individual rules and the overall data model. This dissertation introduces a multi-tiered genetic algorithm capable of evolving individual rules and the data model at the same time. The multi-tiered genetic algorithm also provides a means for taking advantage of the strengths of the more rigid methods by using their output as input to the genetic algorithm. Most genetic algorithms use a single "roulette wheel" approach. As such, they are only able to select either good data models or good rules, but are incapable of selecting for both simultaneously. With the additional roulette wheel of the multi-tiered genetic algorithm, the fitness of both rules and data models can be evaluated, enabling the algorithm to select good rules from good data models. This also more closely emulates how genes are passed from parents to children in actual biology. Consequently, this technique strengthens the "genetics" of genetic algorithms. For ease of discussion, the multi-tiered genetic algorithm has been named "Arcanum." This technique was tested on thirteen data sets obtained from The University of California Irvine Knowledge Discovery in Databases Archive. Results for these same data sets were gathered for GAssist, another genetic algorithm designed for data mining, and J4.8, the WEKA implementation of C4.5. 
While both of the other techniques outperformed Arcanum overall, it was able to provide comparable or better results for 5 of the 13 data sets, indicating that the algorithm can be used for data mining, although it needs improvement. The second stage of testing was on the ability to take results from a previous algorithm and perform refinement on the data model. Initially, Arcanum was used to refine its own data models. Of the six data models used for hypothesis refinement, Arcanum was able to improve upon three of them. Next, results from the LEM2 algorithm were used as input to Arcanum. Of the three data models used from LEM2, Arcanum was able to improve upon all three by sacrificing accuracy in order to improve coverage, resulting in a better data model overall. The last phase of hypothesis refinement was performed upon C4.5. It required several attempts, each using different parameters, but Arcanum was finally able to make a slight improvement to the C4.5 data model. From the experimental results, Arcanum was shown to yield results comparable to GAssist and C4.5 on some of the data sets. It was also able to take data models from three different techniques and improve upon them. While there is certainly room for improvement of the multi-tiered genetic algorithm described in this dissertation, the experimental evidence supports the claims that it can perform both data mining and hypothesis refinement of data models from other data mining techniques.
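    The two-wheel selection described in this abstract can be sketched as follows. This is a minimal illustration, not Arcanum itself: the abstract does not give the fitness functions, the rule representation, or the genetic operators, so the rule tuples, the sum-based model fitness, and all function names here are assumptions made for the example.

```python
import random


def roulette_pick(items, weights, rng):
    """Fitness-proportionate selection: pick an item with probability weight/total."""
    total = sum(weights)
    if total <= 0:
        # Degenerate wheel: fall back to a uniform choice.
        return rng.choice(items)
    threshold = rng.random() * total  # lies in [0, total)
    running = 0.0
    for item, weight in zip(items, weights):
        running += weight
        if threshold < running:
            return item
    return items[-1]  # guard against floating-point round-off


def select_rule(models, model_fitness, rule_fitness, rng):
    """Spin the first wheel over data models, then a second wheel over that model's rules."""
    model = roulette_pick(models, [model_fitness(m) for m in models], rng)
    rule = roulette_pick(model, [rule_fitness(r) for r in model], rng)
    return model, rule


# Hypothetical population: each data model is a list of (rule_name, fitness) pairs.
population = [
    [("r1", 0.9), ("r2", 0.1)],
    [("r3", 0.4), ("r4", 0.4)],
]
rng = random.Random(42)
model, rule = select_rule(
    population,
    model_fitness=lambda m: sum(fit for _name, fit in m),  # assumed model score
    rule_fitness=lambda r: r[1],
    rng=rng,
)
print(rule)
```

    Selecting the model first and the rule second means a rule's chance of being passed on depends on both its own fitness and the fitness of the data model that contains it, which is the property the abstract attributes to the additional roulette wheel.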

    Rough-set based learning methods: A case study to assess the relationship between the clinical delivery of cannabinoid medicine for anxiety, depression, sleep, patterns and predictability

    COVID-19 is an unprecedented health crisis causing a great deal of stress and mental health challenges in populations in Canada. Emerging research highlights the potential beneficial effects of cannabinoids on anxiety, mood, and sleep disorders and points to an increased use of medicinal cannabis since COVID-19 was declared a pandemic. Furthermore, evidence points to a correlation between mental health and sleep patterns. The objective of this research is threefold: i) to assess, by utilizing machine learning, the relationship of the clinical delivery of cannabinoid medicine to anxiety, depression, and sleep scores; ii) to discover patterns based on patient features such as specific cannabis recommendations, diagnosis information, and decreasing or increasing clinical assessment tool (GAD7, PHQ9, and PSQI) scores over a period of time (including during the COVID timeline); and iii) to predict whether new patients could potentially experience either an increase or a decrease in clinical assessment tool scores. The dataset for this thesis was derived from patient visits to Ekosi Health Centres in Manitoba, Canada and Ontario, Canada from January 2019 to April 2021. Extensive pre-processing and feature engineering were performed. To determine the outcome of a patient's treatment, a class feature (Worse, Better, or No Change) indicating their progress, or lack thereof, due to the treatment received was introduced. Three well-known supervised machine learning models (tree-based, rule-based, and nearest neighbour) were trained on the patient dataset. In addition, seven rough and rough-fuzzy hybrid methods were trained on the same dataset. All experiments were conducted using 10-fold cross-validation. Sensitivity and specificity measures were higher in all classes with rough and rough-fuzzy hybrid methods. 
The highest accuracy of 99.15% was obtained using the rule-based rough-set learning method. Ekosi Health Center, Mitacs. Master of Science in Applied Computer Science

    Finding Optimal Reduct for Rough Sets by Using a Decision Tree Learning Algorithm

    Rough Set theory is a mathematical theory for classification based on structural analysis of relational data. It can be used to find the minimal reduct, which is the minimal knowledge representation for the relational data. The theory has been successfully applied to various domains in data mining. However, a major limitation of Rough Set theory is that finding the minimal reduct is an NP-hard problem. C4.5 is a very popular decision-tree learning algorithm that is very efficient at generating a decision tree. This project uses the decision tree generated by C4.5 to find the optimal reduct for a relational table. This method does not guarantee finding a minimal reduct, but test results show that the optimal reduct generated by this approach is equivalent or very close to the minimal reduct.
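    The abstract does not say how the reduct is read off the C4.5 tree. One plausible reading, assumed here, is that the attributes tested at the tree's internal nodes form the candidate reduct, which can then be checked against the table. The sketch below works on a hand-written tree and a toy, consistent decision table; the nested-dict tree format, the function names, and the data are all hypothetical, and no C4.5 implementation is invoked.

```python
def tree_attributes(tree):
    """Collect the attributes tested at internal nodes; leaves are plain class labels."""
    if not isinstance(tree, dict):
        return set()
    attrs = {tree["attr"]}
    for child in tree["branches"].values():
        attrs |= tree_attributes(child)
    return attrs


def preserves_decisions(rows, attrs, decision):
    """True if cases that agree on every attribute in `attrs` also agree on the decision."""
    seen = {}
    ordered = sorted(attrs)
    for row in rows:
        key = tuple(row[a] for a in ordered)
        if seen.setdefault(key, row[decision]) != row[decision]:
            return False
    return True


def is_reduct(rows, attrs, decision):
    """A reduct preserves the decisions and loses that property if any attribute is dropped."""
    if not preserves_decisions(rows, attrs, decision):
        return False
    return all(not preserves_decisions(rows, attrs - {a}, decision) for a in attrs)


# Hypothetical decision table and a tree that might be induced from it.
table = [
    {"outlook": "sunny", "windy": "no", "temp": "hot", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "temp": "hot", "play": "no"},
    {"outlook": "rainy", "windy": "no", "temp": "cool", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "temp": "cool", "play": "no"},
]
tree = {
    "attr": "outlook",
    "branches": {
        "sunny": {"attr": "windy", "branches": {"no": "yes", "yes": "no"}},
        "rainy": "no",
    },
}

candidate = tree_attributes(tree)
print(sorted(candidate), is_reduct(table, candidate, "play"))
```

    The check confirms only that no single attribute can be removed from the candidate; it does not establish that the candidate has the fewest attributes among all reducts, which matches the abstract's caveat that a minimal reduct is not guaranteed.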