22 research outputs found

    A Comparison of the Quality of Rule Induction from Inconsistent Data Sets and Incomplete Data Sets

    In data mining, decision rules induced from known examples are used to classify unseen cases. There are various rule induction algorithms, such as LEM1 (Learning from Examples Module version 1), LEM2 (Learning from Examples Module version 2) and MLEM2 (Modified Learning from Examples Module version 2). In the real world, many data sets are imperfect: either inconsistent or incomplete. The idea of lower and upper approximations, or more generally the probabilistic approximation, provides an effective way to induce rules from inconsistent and incomplete data sets, but the accuracies of rule sets induced from imperfect data sets are expected to be lower. The objective of this project is to investigate which kind of imperfect data set (inconsistent or incomplete) is worse in terms of the quality of rule induction. Experiments were conducted on eight inconsistent data sets and eight incomplete data sets with lost values. We implemented the MLEM2 algorithm to induce certain and possible rules from inconsistent data sets, and the local probabilistic version of MLEM2 to induce certain and possible rules from incomplete data sets. A program called Rule Checker was also developed to classify unseen cases with the induced rules and measure the classification error rate. Ten-fold cross-validation was carried out and the average error rate was used as the criterion for comparison. Mann-Whitney nonparametric tests were performed to compare, separately for certain and possible rules, incompleteness with inconsistency. The results show that there is no significant difference between inconsistent and incomplete data sets in terms of the quality of rule induction.
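    The lower and upper approximations underlying certain and possible rules can be sketched as follows (a toy Python illustration with an invented five-case table, not the authors' MLEM2 implementation):

```python
# Toy inconsistent decision table: (attribute tuple, decision).
# Cases 2 and 3 share the same attribute values but disagree on
# the decision, which is exactly what makes a table inconsistent.
cases = [
    ((1, 0), "yes"),   # case 0
    ((1, 1), "yes"),   # case 1
    ((0, 1), "yes"),   # case 2
    ((0, 1), "no"),    # case 3
    ((0, 0), "no"),    # case 4
]

def indiscernibility_classes(cases):
    """Group case indices that share identical attribute vectors."""
    classes = {}
    for i, (attrs, _) in enumerate(cases):
        classes.setdefault(attrs, set()).add(i)
    return list(classes.values())

def approximations(cases, decision):
    """Lower and upper approximation of the concept
    {cases with the given decision}."""
    concept = {i for i, (_, d) in enumerate(cases) if d == decision}
    lower, upper = set(), set()
    for block in indiscernibility_classes(cases):
        if block <= concept:   # block entirely inside the concept
            lower |= block
        if block & concept:    # block overlaps the concept
            upper |= block
    return lower, upper

lower, upper = approximations(cases, "yes")
# lower == {0, 1}; upper == {0, 1, 2, 3}
```

    Certain rules would be induced from the lower approximation and possible rules from the upper approximation.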

    HANDLING MISSING ATTRIBUTE VALUES IN DECISION TABLES USING VALUED TOLERANCE APPROACH

    Rule induction is one of the key areas in data mining, as it is applied to a large amount of real-life data. However, in such real-life data, the information is often incompletely specified. To induce rules from these incomplete data, more powerful algorithms are necessary. This research work focuses on a probabilistic approach based on the valued tolerance relation. This thesis is divided into two parts. The first part describes the implementation of the valued tolerance relation. The induced rules are then evaluated based on the error rate due to incorrectly classified and unclassified examples. The second part compares the rules induced by the previously implemented MLEM2 algorithm with the rules induced by the valued tolerance based approach implemented as part of this research. Hence, through this thesis, the error rates for the MLEM2 algorithm and the valued tolerance based approach are compared and the results documented.
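    The valued tolerance relation at the core of this approach can be sketched as a product of per-attribute matching probabilities (a minimal illustration with an invented table and attribute domains, not the thesis implementation):

```python
MISSING = "?"

# Toy incomplete table: rows are cases, columns are attributes.
table = [
    ["high", "?",    "yes"],
    ["high", "low",  "?"],
    ["low",  "low",  "yes"],
]
# Assumed attribute domains, needed to estimate the chance that a
# missing value matches a given one; all names are illustrative.
domains = [{"high", "low"}, {"high", "low"}, {"yes", "no"}]

def valued_tolerance(x, y, domains):
    """Probability that cases x and y agree on every attribute,
    treating each missing value as uniformly distributed over
    the attribute's domain."""
    degree = 1.0
    for vx, vy, dom in zip(x, y, domains):
        if vx != MISSING and vy != MISSING:
            degree *= 1.0 if vx == vy else 0.0
        else:
            degree *= 1.0 / len(dom)  # chance a random fill-in matches
    return degree

# Cases 0 and 1 agree on attribute 0 and have one missing value in
# each of the other two attributes, so the degree is 1 * 1/2 * 1/2.
```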

    Combining genetic algorithm with machine learning strategies for designing potent antimicrobial peptides

    Background Current methods in machine learning provide approaches for solving challenging, multiple-constraint design problems. While deep learning and related neural network methods have state-of-the-art performance, their vulnerability to irrational outcomes in decision-making processes is a major concern for their implementation. With rising antibiotic resistance, antimicrobial peptides (AMPs) have increasingly gained attention as novel therapeutic agents. This challenging design problem requires peptides which meet the multiple constraints of limiting drug resistance in bacteria, preventing secondary infections from imbalanced microbial flora, and avoiding immune system suppression. AMPs offer a promising, bioinspired design space for targeting antimicrobial activity, but their versatility also requires curated selection from a combinatorial sequence space. This space is too large for brute-force methods or currently known rational design approaches outside of machine learning. While there has been progress in using the design space to more effectively target AMP activity, a widely applicable approach has been elusive. The lack of transparency in machine learning has limited the advancement of scientific knowledge of how AMPs are related to each other, and the lack of general applicability of fully rational approaches has limited a broader understanding of the design space. Methods Here we combined an evolutionary method with rough set theory, a transparent machine learning approach, for designing antimicrobial peptides (AMPs). Our method achieves the customization of AMPs using supervised learning boundaries. Our system employs in vitro bacterial assays to measure fitness, a codon representation of peptides to gain flexibility of sequence selection in DNA-space with a genetic algorithm, and machine learning to further accelerate the process.
Results We use supervised machine learning and a genetic algorithm to find a peptide active against S. epidermidis, a common bacterial strain in implant infections, with an improved average aggregation propensity for ease of synthesis. Conclusions Our results demonstrate that AMP design can be customized to maintain activity and simplify production. To our knowledge, this is the first time codon-based genetic algorithms have been combined with rough set theory methods for a computational search over peptide sequences.
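    The codon-level representation and the in vitro fitness evaluation are beyond a short sketch, but the overall genetic-algorithm loop can be illustrated as follows (all names and the stand-in fitness function are invented; a real system would score candidates with the trained rough set model or a bacterial assay):

```python
import random

random.seed(0)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_fitness(peptide):
    """Stand-in for a learned activity model: here it simply rewards
    lysine (K) and leucine (L) content; purely illustrative."""
    return sum(peptide.count(aa) for aa in "KL")

def mutate(peptide, rate=0.1):
    """Replace each residue with a random one at the given rate."""
    return "".join(random.choice(AMINO_ACIDS) if random.random() < rate
                   else aa for aa in peptide)

def crossover(a, b):
    """Single-point crossover of two parent sequences."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(pop_size=30, length=12, generations=40):
    pop = ["".join(random.choices(AMINO_ACIDS, k=length))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=toy_fitness, reverse=True)
        parents = pop[: pop_size // 2]           # truncation selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children                 # elitist replacement
    return max(pop, key=toy_fitness)

best = evolve()
```

    Keeping the top half of the population each generation makes the best fitness monotone non-decreasing, a common simplification in such sketches.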

    A comparison of sixteen classification strategies of rule induction from incomplete data using the MLEM2 algorithm

    In data mining, rule induction is the process of extracting formal rules from decision tables, where the latter are tabulated observations that typically consist of several attributes (independent variables) and a decision (a dependent variable). Each tuple in the table is considered a case, and a table may contain any number of cases, one per observation. The efficiency of rule induction depends on how many cases are successfully characterized by the generated set of rules, i.e., the ruleset. There are different rule induction algorithms, such as LEM1, LEM2, and MLEM2. In the real world, data sets are often imperfect, inconsistent, or incomplete. MLEM2 is an efficient algorithm for dealing with such data, but the quality of rule induction largely depends on the chosen classification strategy. We compared 16 classification strategies of rule induction using MLEM2 on incomplete data. For this, we implemented MLEM2 to induce rulesets based on the selected type of approximation (singleton, subset, or concept) and the value of alpha used for calculating probabilistic approximations. A program called rule checker is used to calculate the error rate based on the specified classification strategy. To reduce anomalies, we used ten-fold cross-validation to measure the error rate for each classification strategy. Error rates for the above strategies were calculated for different data sets, compared, and presented.
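    The alpha-parameterised probabilistic approximations that distinguish these strategies can be sketched as follows (blocks and concept are invented; the singleton, subset, and concept variants differ only in how the blocks are constructed from incomplete data):

```python
def probabilistic_approximation(blocks, concept, alpha):
    """Union of the blocks B with conditional probability
    P(concept | B) >= alpha.  alpha = 1 yields the lower (certain)
    approximation; any alpha just above 0 yields the upper (possible)
    approximation; intermediate alphas interpolate between them."""
    result = set()
    for block in blocks:
        if len(block & concept) / len(block) >= alpha:
            result |= block
    return result

# Invented characteristic blocks and concept for illustration.
blocks = [{0}, {1}, {2, 3, 5}, {4, 6}]
concept = {0, 1, 2, 5, 6}   # cases labelled with the target decision

certain  = probabilistic_approximation(blocks, concept, 1.0)
middle   = probabilistic_approximation(blocks, concept, 0.6)
possible = probabilistic_approximation(blocks, concept, 0.001)
# certain == {0, 1}; middle adds {2, 3, 5}; possible adds {4, 6} too
```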

    Indecision Trees: Learning Argument-Based Reasoning under Quantified Uncertainty

    Using machine learning systems in the real world can often be problematic, with inexplicable black-box models, the assumed certainty of imperfect measurements, or a single classification provided instead of a probability distribution. This paper introduces Indecision Trees, a modification of Decision Trees which learn under uncertainty, can perform inference under uncertainty, provide a robust distribution over the possible labels, and can be disassembled into a set of logical arguments for use in other reasoning systems.
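    The core idea of returning a label distribution rather than a single classification can be sketched as a smoothed leaf estimate (a minimal illustration of the general idea only; the paper's inference rule under quantified uncertainty is more involved):

```python
from collections import Counter

def leaf_distribution(labels, classes, smoothing=1.0):
    """A tree leaf that returns a smoothed probability distribution
    over all labels instead of a single hard classification, so no
    label ever receives exactly zero mass."""
    counts = Counter(labels)
    total = len(labels) + smoothing * len(classes)
    return {c: (counts[c] + smoothing) / total for c in classes}

dist = leaf_distribution(["a", "a", "b"], classes=["a", "b", "c"])
# dist["a"] == 0.5, and even the unseen label "c" keeps nonzero mass
```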

    Bionano-Interfaces through Peptide Design

    The clinical success of restoring bone and tooth function through implants critically depends on the maintenance of an infection-free, integrated interface between the host tissue and the biomaterial surface. Surgical site infections, i.e., infections occurring within one year of surgery, occur in approximately 160,000-300,000 cases in the US annually. Antibiotics are the conventional treatment for the prevention of infections, but they are becoming ineffective due to bacterial antibiotic resistance arising from their widespread use. There is an urgent need both to combat bacterial drug resistance through new antimicrobial agents and to limit the spread of drug resistance by limiting antibiotic delivery to the implant site. This work aims to reduce surgical site infections from implants by designing chimeric antimicrobial peptides that integrate a novel and effective delivery method. In recent years, antimicrobial peptides (AMPs) have attracted interest as natural sources for new antimicrobial agents. As part of the immune system in all life forms, they are examples of antibacterial agents whose efficacy has been maintained across evolutionary time. Both natural and synthetic AMPs show significant promise for solving the antibiotic resistance problem. In this work, AMP1 and AMP2 were shown to be active against three different strains of pathogens in Chapter 4. In the literature, these peptides have been shown to be effective against multi-drug-resistant bacteria. However, their effective delivery to the implantation site limits their clinical use. In recent years, different groups have adopted covalent chemistry-based or non-specific physical adsorption methods for antimicrobial peptide coatings on implant surfaces. Many of these procedures use harsh chemical conditions requiring multiple reaction steps. Furthermore, none of these methods allows control of the orientation of these molecules on the surfaces, which is an essential consideration for biomolecules.
In the last few decades, solid binding peptides have attracted high interest due to their material specificity and self-assembly properties. These peptides offer robust surface adsorption and assembly in diverse applications. In this work, a design method for chimeric antimicrobial peptides that can self-assemble and self-orient onto biomaterial surfaces was demonstrated. Three specific aims address this two-fold strategy of self-assembly and self-orientation: 1) Develop classification and design methods using rough set theory and genetic algorithm search to customize antibacterial peptides; 2) Develop chimeric peptides by designing spacer sequences to improve the activity of antimicrobial peptides on titanium surfaces; 3) Verify the approach as an enabling technology by expanding the chimeric design approach to other biomaterials. In Aim 1, a peptide classification tool was developed because selecting an antimicrobial peptide for an application is difficult among the thousands of peptide sequences available. A rule-based rough set theory classification algorithm was developed to group antimicrobial peptides by chemical properties. This is the first time that rough set theory has been applied to peptide activity analysis. The classification method resulted in low false discovery rates on benchmark data sets. The novel rough set theory method was combined with a novel genetic algorithm search, resulting in a method for customizing active antibacterial peptides using sequence-based relationships. Inspired by the fact that spacer sequences play critical roles between functional protein domains, in Aim 2 chimeric peptides were designed to combine solid binding functionality with antimicrobial functionality. To improve how these functions work together in the same peptide sequence, new spacer sequences were engineered.
The rough set theory method from Aim 1 was used to find structure-based relationships and discover new spacer sequences that improved the antimicrobial activity of the chimeric peptides. In Aim 3, the proposed approach was demonstrated as an enabling technology: the chimeric antimicrobial self-assembling peptide approach was tested on calcium phosphate, verifying its modularity, and further chimeric peptides were designed for the common biomaterials zirconia and urethane polymer. Finally, an antimicrobial peptide was engineered for a dental adhesive system, applying the spacer design concepts to optimize antimicrobial activity.

    A Comparison of Four Approaches to Discretization Based on Entropy

    We compare four discretization methods, all based on entropy: the original C4.5 approach to discretization; two globalized methods, known as equal interval width and equal frequency per interval; and a relatively new discretization method called multiple scanning, all used with the C4.5 decision tree generation system. The main objective of our research is to compare the quality of these four methods using two criteria: the error rate evaluated by ten-fold cross-validation and the size of the decision tree generated by C4.5. Our results show that multiple scanning is the best discretization method in terms of error rate, and that decision trees generated from data sets discretized by multiple scanning are simpler than decision trees generated directly by C4.5 or from data sets discretized by either globalized discretization method.
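    The entropy-based selection of a cut point shared by these methods can be sketched as follows (toy values, not the C4.5 implementation):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def best_cut_point(values, labels):
    """Pick the boundary between sorted values that minimizes the
    weighted average entropy of the two resulting intervals."""
    pairs = sorted(zip(values, labels))
    best, best_ent = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v < cut]
        right = [l for v, l in pairs if v >= cut]
        ent = (len(left) * entropy(left)
               + len(right) * entropy(right)) / len(pairs)
        if ent < best_ent:
            best, best_ent = cut, ent
    return best

values = [1.0, 1.2, 2.8, 3.0, 3.1]
labels = ["no", "no", "yes", "yes", "yes"]
cut = best_cut_point(values, labels)
# The classes separate cleanly at 2.0, which gives zero entropy.
```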

    Discretisation of conditions in decision rules induced for continuous

    Typically, discretisation procedures are implemented as part of the initial pre-processing of data, before knowledge mining is employed. This means that conclusions and observations are based on reduced data, as discretisation usually discards some information. The paper presents a different approach, taking advantage of discretisation executed after data mining. In the described study, decision rules were first induced from real-valued features. Secondly, the data sets were discretised. Using the categories found for the attributes, in the third step the conditions included in the inferred rules were translated into the discrete domain. The properties and performance of the rule classifiers were tested in the domain of stylometric analysis of texts, where writing styles are defined through quantitative attributes of a continuous nature. The performed experiments show that the proposed processing leads to sets of rules with significantly reduced sizes while maintaining the quality of predictions, and allows many data discretisation methods to be tested at acceptable computational cost.
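    The third step, translating a real-valued rule condition into the discrete domain, can be sketched as an interval lookup (cut points are invented for illustration; this is a simplified sketch of the translation, not the paper's procedure):

```python
def discretize_condition(threshold, cuts):
    """Locate the interval between consecutive cut points that
    contains a real-valued rule condition's threshold, so the
    condition can be rewritten over the discretised attribute."""
    bounds = [float("-inf")] + sorted(cuts) + [float("inf")]
    for lo, hi in zip(bounds, bounds[1:]):
        if lo <= threshold < hi:
            return (lo, hi)

# A condition such as "attr > 2.7" maps onto the interval of the
# discretised attribute that contains 2.7.
interval = discretize_condition(2.7, cuts=[1.5, 3.0, 4.2])
# interval == (1.5, 3.0)
```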

    Rule Induction on Data Sets with Set-Value Attributes

    Data sets may contain instances for which an attribute takes multiple values; such attributes are described as set-value attributes. The established LEM2 algorithm does not handle data sets with set-value attributes. To solve this problem, a parallel approach was used during LEM2’s execution to avoid preprocessing the data. Changing the creation of characteristic sets and attribute-value blocks to include all values for each case allows LEM2 to induce rules on data sets with set-value attributes. The ability to create a single local covering for set-value data sets increases the variety of data LEM2 can process.
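    The modified attribute-value blocks can be sketched as follows (a toy table; the key change is that a set-valued entry places the case in the block of every value it contains):

```python
def attribute_value_blocks(table):
    """Build LEM2-style attribute-value blocks: for each (attribute,
    value) pair, the set of cases carrying that value.  A set-valued
    entry puts the case into the block of every value it contains,
    which is how set-value data is handled without preprocessing."""
    blocks = {}
    for case, row in enumerate(table):
        for attr, value in enumerate(row):
            values = value if isinstance(value, (set, frozenset)) else {value}
            for v in values:
                blocks.setdefault((attr, v), set()).add(case)
    return blocks

table = [
    [{"red", "blue"}, "small"],   # case 0 has a set-value attribute
    ["red",           "large"],   # case 1
    ["blue",          "small"],   # case 2
]
blocks = attribute_value_blocks(table)
# blocks[(0, "red")] == {0, 1}; blocks[(0, "blue")] == {0, 2}
```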

    Granules for Association Rules and Decision Support in the getRNIA System

    This paper proposes granules for association rules in Deterministic Information Systems (DISs) and Non-deterministic Information Systems (NISs). Granules for an association rule are defined for every implication, and give us a new methodology for knowledge discovery and decision support. Decision support based on a table under the condition P amounts to fixing the decision Q by using the most appropriate association rule P ⇒ Q. We recently implemented a system, getRNIA, powered by granules for association rules. This paper describes how the getRNIA system deals with decision support under uncertainty, and shows some results of the experiment.
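    The granule of an implication, and the support and accuracy of the corresponding association rule, can be sketched for the deterministic case (table and rule are invented; non-deterministic tables require minimum and maximum variants of these measures):

```python
def granule(table, pairs):
    """Set of case indices satisfying every (attribute, value) pair."""
    return {i for i, row in enumerate(table)
            if all(row[a] == v for a, v in pairs)}

def rule_support_accuracy(table, cond, dec):
    """Support and accuracy of the association rule cond => dec,
    computed from the granules of the condition and the decision."""
    g_cond = granule(table, cond)
    g_both = g_cond & granule(table, dec)
    support = len(g_both) / len(table)
    accuracy = len(g_both) / len(g_cond) if g_cond else 0.0
    return support, accuracy

table = [
    ["sunny", "hot",  "no"],
    ["sunny", "mild", "yes"],
    ["rainy", "mild", "yes"],
    ["sunny", "hot",  "no"],
]
# Rule: (attribute 0 = sunny) => (attribute 2 = no)
support, accuracy = rule_support_accuracy(
    table, cond=[(0, "sunny")], dec=[(2, "no")])
# support == 0.5; accuracy == 2/3
```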