
    On the role of pre and post-processing in environmental data mining

    The quality of discovered knowledge depends heavily on data quality. Unfortunately, real data tend to contain noise, uncertainty, errors, redundancies, and even irrelevant information. The more complex the reality to be analyzed, the higher the risk of obtaining low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework for preparing data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. This paper provides some details about how this can be achieved and discusses the role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems.
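
    To make the pre-processing role concrete, the following is a minimal sketch of the kind of data preparation the abstract describes, using pandas; the cleaning steps and threshold values are illustrative assumptions, not the paper's method.

        # Minimal data pre-processing sketch (illustrative, not the paper's method).
        import numpy as np
        import pandas as pd

        def preprocess(df: pd.DataFrame) -> pd.DataFrame:
            df = df.drop_duplicates()                      # remove redundant records
            df = df.dropna(thresh=df.shape[1] // 2)        # drop mostly-empty rows
            for col in df.select_dtypes(include=np.number).columns:
                lo, hi = df[col].quantile([0.01, 0.99])
                df[col] = df[col].clip(lo, hi)             # dampen extreme noise
                df[col] = df[col].fillna(df[col].median()) # impute remaining gaps
            return df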

    Lifted graphical models: a survey

    Lifted graphical models provide a language for expressing dependencies between different types of entities, their attributes, and their diverse relations, as well as techniques for probabilistic reasoning in such multi-relational domains. In this survey, we review a general form for a lifted graphical model, the par-factor graph, and show how a number of existing statistical relational representations map to this formalism. We discuss inference algorithms, including lifted inference algorithms, that efficiently compute the answers to probabilistic queries over such models. We also review work on learning lifted graphical models from data. There is a growing need for statistical relational models (whether they go by that name or another), as we are inundated with data that is a mix of structured and unstructured, with entities and relations extracted in a noisy manner from text, and with the need to reason effectively over this data. We hope that this synthesis of ideas from many different research groups will provide an accessible starting point for new researchers in this expanding field.
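
    As a rough illustration of the central representation, the sketch below encodes a par-factor as one potential table shared by all groundings of its logical variables; the class, the smokes/cancer example, and the string-based grounding are simplifying assumptions for exposition, not the survey's formal definition.

        # Toy par-factor: one potential table shared by every grounding.
        from itertools import product

        class Parfactor:
            def __init__(self, logvars, domains, atoms, potential):
                self.logvars = logvars      # e.g. ["X"]
                self.domains = domains      # e.g. {"X": ["alice", "bob"]}
                self.atoms = atoms          # e.g. ["smokes(X)", "cancer(X)"]
                self.potential = potential  # truth-value tuple -> weight

            def groundings(self):
                # Each substitution of constants for logical variables yields
                # one ordinary factor; all of them share self.potential, which
                # is what lifted inference exploits to avoid repeated work.
                for consts in product(*(self.domains[v] for v in self.logvars)):
                    atoms = self.atoms
                    for var, const in zip(self.logvars, consts):
                        atoms = [a.replace(var, const) for a in atoms]
                    yield atoms, self.potential

        pf = Parfactor(["X"], {"X": ["alice", "bob"]},
                       ["smokes(X)", "cancer(X)"],
                       {(True, True): 4.5, (True, False): 1.0,
                        (False, True): 1.0, (False, False): 1.0})
        for atoms, _ in pf.groundings():
            print(atoms)  # ['smokes(alice)', 'cancer(alice)'], then bob's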

    A modified multi-class association rule for text mining

    Classification and association rule mining are significant tasks in data mining. Integrating association rule discovery and classification brings us an approach known as associative classification. One common shortcoming of existing Association Classifiers is the huge number of rules produced in order to obtain high classification accuracy. This study proposes a Modified Multi-class Association Rule Mining (mMCAR) that consists of three procedures: rule discovery, rule pruning, and group-based class assignment. The rule discovery and rule pruning procedures are designed to reduce the number of classification rules. On the other hand, the group-based class assignment procedure contributes to improving the classification accuracy. Experiments on structured and unstructured text datasets obtained from the UCI and Reuters repositories were performed in order to evaluate the proposed Association Classifier. The proposed mMCAR classifier is benchmarked against traditional classifiers and existing Association Classifiers. Experimental results indicate that the proposed Association Classifier, mMCAR, produces high accuracy with a smaller number of classification rules. For the structured dataset, mMCAR produces an average accuracy of 84.24%, compared to 84.23% for MCAR. Even though the difference in classification accuracy is small, the proposed mMCAR uses only 50 rules for the classification while its benchmark method involves 60 rules. On the other hand, mMCAR is on par with MCAR when the unstructured dataset is utilized. Both classifiers produce 89% accuracy, but mMCAR uses fewer rules for the classification. This study contributes to the text mining domain, as automatic classification of huge and widely distributed textual data could facilitate the text representation and retrieval processes.
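
    The sketch below illustrates the general shape of the pruning and group-based class assignment stages of an associative classifier; the rule representation, the confidence threshold, and the group scoring are illustrative assumptions rather than the exact mMCAR procedures.

        # Simplified associative classification: prune weak rules, then let
        # groups of matching rules vote per class (illustrative only).
        from collections import defaultdict

        def prune(rules, min_conf=0.6):
            # rule = (frozenset of terms, class label, confidence)
            return [r for r in rules if r[2] >= min_conf]

        def classify(rules, doc_terms):
            # Group-based assignment: sum the confidence of every matching
            # rule per class instead of trusting one single best rule.
            scores = defaultdict(float)
            for terms, label, conf in rules:
                if terms <= doc_terms:
                    scores[label] += conf
            return max(scores, key=scores.get) if scores else None

        rules = prune([(frozenset({"oil", "barrel"}), "crude", 0.9),
                       (frozenset({"rate", "bank"}), "interest", 0.8),
                       (frozenset({"oil"}), "interest", 0.3)])
        print(classify(rules, {"oil", "barrel", "price"}))  # -> crude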

    Code smells detection and visualization: A systematic literature review

    Context: Code smells (CS) tend to compromise software quality and also demand more effort from developers to maintain and evolve the application throughout its life-cycle. They have long been catalogued with corresponding mitigating solutions called refactoring operations. Objective: This SLR has a twofold goal: the first is to identify the main code smell detection techniques and tools discussed in the literature, and the second is to analyze to what extent visual techniques have been applied to support the former. Method: Our search string identified 83 primary studies indexed in major scientific repositories. Then, following existing best practices for secondary studies, we applied inclusion/exclusion criteria to select the most relevant works, extract their features, and classify them. Results: We found that the most commonly used approaches to code smell detection are search-based (30.1%) and metric-based (24.1%). Most of the studies (83.1%) use open-source software, with the Java language occupying the first position (77.1%). In terms of code smells, God Class (51.8%), Feature Envy (33.7%), and Long Method (26.5%) are the most covered ones. Machine learning techniques are used in 35% of the studies. Around 80% of the studies only detect code smells, without providing visualization techniques. Visualization-based approaches use several methods, such as city metaphors and 3D visualization techniques. Conclusions: We confirm that the detection of CS is a non-trivial task, and there is still a lot of work to be done in terms of: reducing the subjectivity associated with the definition and detection of CS; increasing the diversity of detected CS and of supported programming languages; and constructing and sharing oracles and datasets to facilitate the replication of validation experiments for CS detection and visualization techniques.
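
    As a concrete example of the metric-based family the review finds among the most common, here is a God Class check in the spirit of well-known metric-based detection strategies; the choice of metrics and the threshold values are illustrative assumptions, not a rule taken from this review.

        # Metric-based God Class detection sketch (illustrative thresholds).
        def is_god_class(wmc: int, atfd: int, tcc: float) -> bool:
            """wmc: weighted methods per class (overall complexity);
            atfd: accesses to foreign data (uses many other classes' data);
            tcc: tight class cohesion in [0, 1] (low = incohesive)."""
            return wmc >= 47 and atfd > 5 and tcc < 1 / 3

        print(is_god_class(wmc=60, atfd=8, tcc=0.10))  # True: big, grabby, incohesive
        print(is_god_class(wmc=12, atfd=1, tcc=0.60))  # False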

    Learning programs by learning from failures

    We describe an inductive logic programming (ILP) approach called learning from failures. In this approach, an ILP system (the learner) decomposes the learning problem into three separate stages: generate, test, and constrain. In the generate stage, the learner generates a hypothesis (a logic program) that satisfies a set of hypothesis constraints (constraints on the syntactic form of hypotheses). In the test stage, the learner tests the hypothesis against training examples. A hypothesis fails when it does not entail all the positive examples or entails a negative example. If a hypothesis fails, then, in the constrain stage, the learner learns constraints from the failed hypothesis to prune the hypothesis space, i.e. to constrain subsequent hypothesis generation. For instance, if a hypothesis is too general (entails a negative example), the constraints prune generalisations of the hypothesis. If a hypothesis is too specific (does not entail all the positive examples), the constraints prune specialisations of the hypothesis. This loop repeats until either (i) the learner finds a hypothesis that entails all the positive examples and none of the negative examples, or (ii) there are no more hypotheses to test. We introduce Popper, an ILP system that implements this approach by combining answer set programming and Prolog. Popper supports infinite problem domains, reasoning about lists and numbers, learning textually minimal programs, and learning recursive programs. Our experimental results on three domains (toy game problems, robot strategies, and list transformations) show that (i) constraints drastically improve learning performance, and (ii) Popper can outperform existing ILP systems, both in terms of predictive accuracy and learning time.
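
    The generate-test-constrain loop can be shown on a toy hypothesis space where a hypothesis is a set of conditions and dropping a condition generalises it; this is a schematic sketch only, since Popper itself generates logic programs with answer set programming and tests entailment with Prolog.

        # Schematic learning-from-failures loop on a toy hypothesis space.
        from itertools import combinations

        def powerset(items):
            return (frozenset(c) for r in range(len(items) + 1)
                    for c in combinations(items, r))

        def learn(pos, neg, conditions, covers):
            too_general, too_specific = [], []
            for h in powerset(conditions):                   # generate
                if any(h <= g for g in too_general):         # pruned: generalises
                    continue                                 # a too-general fail
                if any(s <= h for s in too_specific):        # pruned: specialises
                    continue                                 # a too-specific fail
                covered = {e for e in pos | neg if covers(h, e)}   # test
                if pos <= covered and not covered & neg:
                    return h                                 # solution found
                if covered & neg:                            # constrain:
                    too_general.append(h)                    # too general
                if not pos <= covered:                       # constrain:
                    too_specific.append(h)                   # too specific
            return None                                      # space exhausted

        pos = {frozenset({"red", "round"})}
        neg = {frozenset({"red", "square"})}
        print(learn(pos, neg, ["red", "round", "square"],
                    lambda h, e: h <= e))   # -> frozenset({'round'})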

    Novel Hierarchical Feature Selection Methods for Classification and Their Application to Datasets of Ageing-Related Genes

    Hierarchical Feature Selection (HFS) is an under-explored subarea of data mining/machine learning. Unlike conventional (flat) feature selection algorithms, HFS algorithms work by exploiting hierarchical (generalisation-specialisation) relationships between features, in order to try to improve the predictive accuracy of classifiers. The basic idea is to remove hierarchical redundancy between features, where the presence of a feature in an instance implies the presence of all ancestors of that feature in that instance. By using an HFS algorithm to select a feature subset where the hierarchical redundancy among features is eliminated or reduced, and then giving only the selected feature subset to a classification algorithm, it is possible to improve the predictive accuracy of classification algorithms.

    In terms of applications, this thesis focuses on datasets of ageing-related genes. This type of dataset is an interesting application for data mining methods due to the technical difficulty and ethical issues associated with doing ageing experiments with humans, and due to the strategic importance of research on the biology of ageing, since age is the greatest risk factor for a number of diseases but is still a poorly understood biological process. This thesis offers contributions mainly to the area of data mining/machine learning, but also to bioinformatics and the biology of ageing, as discussed next.

    The first and main type of contribution consists of four novel HFS algorithms, namely: select Hierarchical Information Preserving (HIP) features, select Most Relevant (MR) features, the hybrid HIP–MR algorithm, and the Hierarchy-based Redundancy Eliminated Tree Augmented Naive Bayes (HRE–TAN) algorithm. These algorithms perform lazy learning-based feature selection, i.e. they postpone the learning process to the moment when testing instances are observed and select a specific feature subset for each testing instance. HIP, MR and HIP–MR select features in a data pre-processing phase, before running a classification algorithm, and the features they select can be used as input by any lazy classification algorithm. In contrast, HRE–TAN is a feature selection process embedded in the construction of a lazy TAN classifier.

    The second type of contribution, relevant to the areas of data mining and bioinformatics, consists of two novel algorithms that exploit the pre-defined structure of the Gene Ontology (GO) and the results of a flat or hierarchical feature selection algorithm to create the network topology of a Bayesian Network Augmented Naive Bayes (BAN) classifier. These are called GO–BAN algorithms.

    The proposed HFS algorithms were in general evaluated in combination with lazy versions of three Bayesian network classifiers, namely Naive Bayes, TAN and GO–BAN, except that HRE–TAN works only with TAN. The experiments compared the predictive accuracy obtained by these classifiers using the features selected by the proposed HFS algorithms against the accuracy obtained using the features selected by flat feature selection algorithms, as well as against the accuracy obtained using all original features (without feature selection) as a baseline. The experiments used a number of ageing-related datasets, where the instances being classified are genes, the predictive features are GO terms describing hierarchical gene functions, and the classes to be predicted indicate whether a gene has a pro-longevity or anti-longevity effect on the lifespan of a model organism (yeast, worm, fly or mouse).

    In general, with the exception of the hybrid HIP–MR, which did not obtain good results, the proposed HFS algorithms (HIP, MR, HRE–TAN) improved the predictive performance of the baseline Bayesian network classifiers, i.e. the classifiers generally obtained higher accuracies when using only the features selected by the HFS algorithm than when using all original features. Overall, the most successful of the four HFS algorithms was HIP, which outperformed all other (hierarchical or flat) feature selection algorithms when used in combination with each of the Naive Bayes, TAN and GO–BAN classifiers. The difference in predictive accuracy between HIP and the other feature selection algorithms was almost always statistically significant, except that the difference between HIP and MR was not significant with TAN. Comparing different combinations of an HFS algorithm and a Bayesian network classifier, HIP+Naive Bayes and HIP+GO–BAN were jointly the best combinations, with the same average rank across all datasets. They obtained predictive accuracies statistically significantly higher than those obtained by all other combinations of HFS algorithm and classifier.

    The third type of contribution of this thesis is a contribution to the biology of ageing. More precisely, the proposed HIP and MR algorithms were used to produce rankings of GO terms in decreasing order of their usefulness for predicting the pro-longevity or anti-longevity effect of a gene on a model organism; the top GO terms in these rankings were interpreted with the help of a biologist expert on ageing, leading to potentially relevant patterns about the biology of ageing.
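
    To illustrate the hierarchical-redundancy idea behind HIP, the sketch below drops, for a single test instance, every feature whose value is implied by another feature (ancestors of present terms, descendants of absent terms); the tiny GO-like hierarchy is a hypothetical example, and this is only the redundancy-removal core, not the full HIP algorithm.

        # Lazy, per-instance removal of hierarchically redundant features.
        def select_non_redundant(instance, ancestors):
            """instance: {feature: 0 or 1}; ancestors: {feature: set of ancestors}."""
            implied = set()
            for feat, value in instance.items():
                if value == 1:
                    # A present term implies all its ancestors are present.
                    implied |= ancestors[feat]
                else:
                    # An absent term implies all its descendants are absent.
                    implied |= {d for d, anc in ancestors.items() if feat in anc}
            return {f: v for f, v in instance.items() if f not in implied}

        ancestors = {"binding": set(),
                     "protein_binding": {"binding"},
                     "dna_binding": {"binding"}}
        print(select_non_redundant(
            {"binding": 1, "protein_binding": 1, "dna_binding": 0}, ancestors))
        # -> {'protein_binding': 1, 'dna_binding': 0}; 'binding' was implied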

    An Integer Programming approach to Bayesian Network Structure Learning

    We study the problem of learning a Bayesian Network structure from data using an Integer Programming approach. We review the existing approaches, and in particular some recent works that formulate the problem as an Integer Programming model. By discussing some weaknesses of the existing approaches, we propose an alternative solution based on a statistical sparsification of the search space. Results show that our approach can lead to promising results, especially for large networks.
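
    For context, a standard "family-variable" Integer Programming encoding of Bayesian network structure learning (the style used by GOBNILP-like systems) is sketched below; whether this paper uses exactly this model is not stated in the abstract, and sparsifying the search space amounts to restricting the candidate parent sets $P$:

        \max_{x} \sum_{v \in V} \sum_{P \subseteq V \setminus \{v\}} c(v, P)\, x_{v,P}
        \quad \text{s.t.} \quad \sum_{P} x_{v,P} = 1 \;\; \forall v \in V, \qquad
        \sum_{v \in C} \; \sum_{P :\, P \cap C = \emptyset} x_{v,P} \ge 1 \;\; \forall C \subseteq V,\ |C| \ge 2, \qquad
        x_{v,P} \in \{0, 1\},

    where $c(v, P)$ is the local score of giving node $v$ the parent set $P$, the first constraint picks exactly one parent set per node, and the "cluster" constraints rule out directed cycles.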

    Mining Fix Patterns for FindBugs Violations

    In this paper, we first collect and track a large number of fixed and unfixed violations across revisions of software. The empirical analyses reveal that there are discrepancies in the distributions of violations that are detected and those that are fixed, in terms of occurrences, spread, and categories, which can provide insights into prioritizing violations. To automatically identify patterns in violations and their fixes, we propose an approach that utilizes convolutional neural networks to learn features and clustering to regroup similar instances. We then evaluate the usefulness of the identified fix patterns by applying them to unfixed violations. The results show that developers will accept and merge a majority (69/116) of fixes generated from the inferred fix patterns. It is also noteworthy that the yielded patterns are applicable to four real bugs in Defects4J, a major benchmark for software testing and automated repair.
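
    As a rough, much-simplified stand-in for this mining pipeline, the sketch below replaces the CNN-learned features with TF-IDF over tokenized fix diffs and groups similar fixes with k-means from scikit-learn; the diffs and the cluster count are toy assumptions, not the paper's data or model.

        # Cluster similar violation fixes (TF-IDF + k-means stand-in for the
        # paper's CNN-learned features; illustrative data only).
        from sklearn.cluster import KMeans
        from sklearn.feature_extraction.text import TfidfVectorizer

        fix_diffs = [
            "- if (s == null) + if (s == null || s.isEmpty())",
            "- if (list == null) + if (list == null || list.isEmpty())",
            "- return a == b ; + return a.equals(b) ;",
            "- return x == y ; + return x.equals(y) ;",
        ]
        features = TfidfVectorizer(token_pattern=r"\S+").fit_transform(fix_diffs)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
        print(labels)  # the two null-check fixes share one cluster id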