128 research outputs found

    New Archive-Based Ant Colony Optimization Algorithms for Learning Predictive Rules from Data

    Data mining is the process of extracting knowledge and patterns from data. Classification and regression are among the major data mining tasks, where the goal is to predict the value of an attribute of interest for each data instance, given the values of a set of predictive attributes. Most classification and regression problems involve continuous, ordinal and categorical attributes. Current Ant Colony Optimization (ACO) algorithms have focused on directly handling categorical attributes only; continuous attributes are transformed using a discretisation procedure, either in a preprocessing stage or dynamically during rule creation. The use of a discretisation procedure has several limitations: (i) it increases the computational runtime, since several candidate values need to be evaluated; (ii) it requires access to the entire attribute domain, which is not available in some applications; (iii) the values used to create discrete intervals are not optimised in combination with the values of other attributes. This thesis investigates the use of a solution archive pheromone model, based on the Ant Colony Optimization for mixed-variable (ACOMV) algorithm, to directly cope with all attribute types. Firstly, an archive-based ACO classification algorithm is presented, followed by an automatic design framework to generate new configurations of ACO algorithms. Then, we address the challenging problem of mining data streams, presenting a new ACO algorithm in combination with a hybrid pheromone model. Finally, the archive-based approach is extended to cope with regression problems. All algorithms presented are compared against well-known algorithms from the literature using publicly available data sets. Our results show improved computational time while maintaining competitive predictive performance.

    Improving the hierarchical classification of protein functions with swarm intelligence

    This thesis investigates methods to improve the performance of hierarchical classification. In this thesis, hierarchical classification is a form of supervised learning where the classes in a data set are arranged in a tree structure. As a base for our new methods we use the TDDC (top-down divide-and-conquer) approach for hierarchical classification, where each classifier is built only to discriminate between sibling classes. Firstly, we propose a swarm intelligence technique which varies the types of classifiers used at each divide within the TDDC tree. Our technique, PSO/ACO-CS (Particle Swarm Optimisation/Ant Colony Optimisation Classifier Selection), finds combinations of classifiers to be used in the TDDC tree using the global search ability of PSO/ACO. Secondly, we propose a technique that attempts to mitigate a major drawback of the TDDC approach: if an example is misclassified at any point in the TDDC tree, it can never be correctly classified further down the tree. Our approach, PSO/ACO-RO (PSO/ACO-Recovery Optimisation), decides whether to redirect examples at a given classifier node using, again, the global search ability of PSO/ACO. Thirdly, we propose an ensemble-based technique, HEHRS (Hierarchical Ensembles of Hierarchical Rule Sets), which attempts to boost the accuracy at each classifier node in the TDDC tree by using information from classifiers (rule sets) in the rest of that tree. We use Particle Swarm Optimisation to weight the individual rules within each ensemble. We evaluate these three new methods on hierarchical bioinformatics data sets that we have created for this research. These data sets represent the real-world problem of protein function prediction. We find through extensive experimentation that the three proposed methods improve upon the baseline TDDC method to varying degrees. Overall, the HEHRS and PSO/ACO-CS-RO approaches are the most effective, although they are associated with a higher computational cost.

    Evaluation of Existing Methods for High-Order Epistasis Detection

    Finding epistatic interactions among loci when expressing a phenotype is a widely employed strategy to understand the genetic architecture of complex traits in GWAS. The abundance of methods dedicated to the same purpose, however, makes it increasingly difficult for scientists to decide which method is more suitable for their studies. This work compares the different epistasis detection methods published during the last decade in terms of runtime, detection power and type I error rate, with a special emphasis on high-order interactions. Results show that in terms of detection power, the only methods that perform well across all experiments are the exhaustive methods, although their computational cost may be prohibitive in large-scale studies. Regarding non-exhaustive methods, not one could consistently find epistatic interactions when marginal effects are absent. If marginal effects are present, there are methods that perform well for high-order interactions, such as BADTrees, FDHE-IW, SingleMI or SNPHarvester. As for false-positive control, only SNPHarvester, FDHE-IW and DCHE show good results. The study concludes that there is no single epistasis detection method to recommend in all scenarios. Authors should prioritize exhaustive methods when sufficient computational resources are available considering the data set size, and resort to non-exhaustive methods when the analysis time is prohibitive. Funding: Xunta de Galicia (Grant Number: ED431C2016-037, ED431C2017/04 and ED431G2019/01); Ministerio de Educacion Cultura y Deporte (Grant Number: FPU16/01333); Ministerio de Economia y Competitividad (Grant Number: CGL2016-75482-P, PID2019-104184RB-I00, AEI/FEDER/EU, 10.13039/50110 and TIN2016-75845-P).

    Data-Mining Techniques for Call-Graph-Based Software-Defect Localisation

    Defect localisation is an important problem in software engineering. This dissertation investigates call-graph-mining-based software defect localisation, which supports software developers by providing hints about where defects might be located. It extends the state of the art by proposing new graph representations and mining techniques for weighted graphs. This leads to a broader range of detectable defects, to increased localisation precision and to enhanced scalability.

    Learning lost temporal fuzzy association rules

    Fuzzy association rule mining discovers patterns in transactions, such as shopping baskets in a supermarket, or Web page accesses by a visitor to a Web site. Temporal patterns can be present in fuzzy association rules because the underlying process generating the data can be dynamic. However, existing solutions may not discover all interesting patterns because of a previously unrecognised problem that is revealed in this thesis: the contextual meaning of fuzzy association rules changes because of the dynamic nature of the data, so a static fuzzy representation and a traditional search method are inadequate. The Genetic Iterative Temporal Fuzzy Association Rule Mining (GITFARM) framework solves the problem by utilising flexible fuzzy representations from a fuzzy rule-based system (FRBS). The combined temporal, fuzzy and itemset space is searched simultaneously with a genetic algorithm (GA) to overcome the problem. The framework transforms the dataset into a graph so that it can be searched efficiently. The choice of model in the fuzzy representation provides a trade-off between an approximate and a descriptive model. A method for verifying the solution to the hypothesised problem is presented. The proposed GA-based solution is compared with a traditional approach that uses an exhaustive search method, and it is shown that the GA-based solution discovers rules that the traditional approach does not. This shows that simultaneously searching for rules and membership functions with a GA is a suitable solution for mining temporal fuzzy association rules. So, in practice, more knowledge can be discovered for making well-informed decisions that would otherwise be lost with a traditional approach. EPSRC DT

    Knowledge discovery for moderating collaborative projects

    In today's global market environment, enterprises are increasingly turning towards collaboration in projects to leverage their resources, skills and expertise, and simultaneously address the challenges posed in diverse and competitive markets. Moderators, which are knowledge-based systems, have successfully been used to support collaborative teams by raising awareness of problems or conflicts. However, the functioning of a Moderator is limited to the knowledge it has about the team members. Knowledge acquisition, learning and updating of knowledge are the major challenges for a Moderator's implementation. To address these challenges, a Knowledge discOvery And daTa minINg inteGrated (KOATING) framework is presented for Moderators to enable them to continuously learn from the operational databases of the company and semi-automatically update the corresponding expert module. The architecture for the Universal Knowledge Moderator (UKM) shows how existing Moderators can be extended to support global manufacturing. A method for designing and developing the knowledge acquisition module of the Moderator for manual and semi-automatic update of knowledge is documented using the Unified Modelling Language (UML). UML has been used to explore the static structure and dynamic behaviour, and to describe the system analysis, system design and system development aspects of the proposed KOATING framework. The proof of design is presented using a case study of a collaborative project in the form of a construction project supply chain. It is shown that Moderators can "learn" by extracting various kinds of knowledge from Post Project Reports (PPRs) using different types of text mining techniques. Furthermore, it is proposed that knowledge-discovery-integrated Moderators can be used to support and enhance collaboration by identifying appropriate business opportunities and corresponding partners for the creation of a virtual organization. A case study is presented in the context of a UK-based SME. Finally, the thesis concludes by summarizing the work, outlining its novelties and contributions, and recommending future research.

    Comparative process mining: analyzing variability in process data


    Hierarchically organised genetic algorithm for fuzzy network synthesis


    Learning Bayesian network equivalence classes using ant colony optimisation

    Bayesian networks have become an indispensable tool in the modelling of uncertain knowledge. Conceptually, they consist of two parts: a directed acyclic graph called the structure, and conditional probability distributions attached to each node, known as the parameters. As a result of their expressiveness, understandability and rigorous mathematical basis, Bayesian networks have become one of the first methods investigated when faced with an uncertain problem domain. However, a recurring problem persists in specifying a Bayesian network: both the structure and the parameters can be difficult for experts to conceive, especially if their knowledge is tacit. To counteract these problems, research has been ongoing on learning both the structure and parameters of Bayesian networks from data. Whilst there are simple methods for learning the parameters, learning the structure has proved harder. Part of this stems from the NP-hardness of the problem and the super-exponential space of possible structures. To help solve this task, this thesis seeks to employ a relatively new technique that has had much success in tackling NP-hard problems. This technique is called ant colony optimisation. Ant colony optimisation is a metaheuristic based on the behaviour of ants acting together in a colony. It uses the stochastic activity of artificial ants to find good solutions to combinatorial optimisation problems. In the current work, this method is applied to the problem of searching through the space of equivalence classes of Bayesian networks in order to find a good match against a set of data. The system uses operators that evaluate potential modifications to a current state. Each of the modifications is scored and the results are used to inform the search. To facilitate these steps, other techniques are also devised to speed up the learning process. The techniques are tested by sampling data from gold standard networks and learning structures from this sampled data. These structures are analysed using various goodness-of-fit measures to see how well the algorithms perform. The measures include structural similarity metrics and Bayesian scoring metrics. The results are compared in depth against systems that also use ant colony optimisation and against other methods, including evolutionary programming and greedy heuristics. Comparisons are also made to well-known state-of-the-art algorithms, and a study is performed on a real-life data set. The results show favourable performance compared to the other methods and on modelling the real-life data.