Search CORE

26,413 research outputs found

Mining data quality rules based on T-dependence

Author: Boeckling Toon
Bronselaer Antoon
De Tré Guy
Publication venue: 'Atlantis Press'
Publication date: 01/01/2019
Field of study

Since their introduction in 1976, edit rules have been a standard tool in statistical analysis. Basically, edit rules are a compact representation of non-permitted combinations of values in a dataset. In this paper, we propose a technique to automatically find edit rules by use of the concept of T-dependence. We first generalize the traditional notion of lift, to that of T-lift, where stochastic independence is generalized to T-dependence. A combination of values is declared as an edit rule under a t-norm T if there is a strong negative correlation under T-dependence. We show several interesting properties of this approach. In particular, we show that under the minimum t-norm, edit rules can be computed efficiently by use of frequent pattern trees. Experimental results show that there is a weak to medium correlation in the rank order of edit rules obtained under T_M and T_P, indicating that the semantics of these kinds of dependencies are different

Crossref

Ghent University Academic Bibliography

Adapting a relation extraction pipeline for the BioCreAtIvE II task

Author: Grover Claire
Haddow Barry
Klein Ewan
Matthews Michael
Nielsen Leif Arda
Tobin Richard
Wang Xinglong
Publication venue
Publication date: 01/01/2007
Field of study

Edinburgh Research Explorer

Overview of Random Forest Methodology and Practical Guidance with Emphasis on Computational Biology and Bioinformatics

Author: Boulesteix Anne-Laure
Janitza Silke
Kruppa Jochen
König Inke R.
Publication venue
Publication date: 25/07/2012
Field of study

The Random Forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables and returns measures of variable importance. This paper synthesizes ten years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is given to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research

Crossref

Open Access LMU

Learning Interpretable Rules for Multi-label Classification

Author: A Gabriel
AA Freitas
AJ Knobbe
B Liu
B Minnaert
D Malerba
E Gibaja
E Gibaja
E Loza Mencía
E Montañés
F Charte
F Herrera
F Janssen
F Thabtah
G Bosc
G Tsoumakas
Grigorios Tsoumakas
H Allahyari
J Arunadevi
J Demšar
J Fürnkranz
J Han
J Hipp
J Read
JN Sulzmann
K Dembczyński
K Dembczyński
L Chekina
L Raedt De
LE Sucar
M Atzmüller
M Beckerle
M Friedman
M Zhang
Miltiadis Allamanis
MR Boutell
P Kralj Novak
PJ Hayes
R Senge
RM Cameron-Jones
Shantanu Godbole
W Duivesteijn
W Waegeman
WW Cohen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/11/2018
Field of study

Multi-label classification (MLC) is a supervised learning problem in which, contrary to standard multiclass classification, an instance can be associated with several class labels simultaneously. In this chapter, we advocate a rule-based approach to multi-label classification. Rule learning algorithms are often employed when one is not only interested in accurate predictions, but also requires an interpretable theory that can be understood, analyzed, and qualitatively evaluated by domain experts. Ideally, by revealing patterns and regularities contained in the data, a rule-based theory yields new insights in the application domain. Recently, several authors have started to investigate how rule-based models can be used for modeling multi-label data. Discussing this task in detail, we highlight some of the problems that make rule learning considerably more challenging for MLC than for conventional classification. While mainly focusing on our own previous work, we also provide a short overview of related work in this area.Comment: Preprint version. To appear in: Explainable and Interpretable Models in Computer Vision and Machine Learning. The Springer Series on Challenges in Machine Learning. Springer (2018). See http://www.ke.tu-darmstadt.de/bibtex/publications/show/3077 for further informatio

arXiv.org e-Print Archive

TUbiblio

Crossref

Mining Fix Patterns for FindBugs Violations

Author: Bissyandé Tegawendé F.
Kim Dongsun
Liu Kui
Traon Yves Le
Yoo Shin
Publication venue
Publication date: 01/01/2018
Field of study

In this paper, we first collect and track a large number of fixed and unfixed violations across revisions of software. The empirical analyses reveal that there are discrepancies in the distributions of violations that are detected and those that are fixed, in terms of occurrences, spread and categories, which can provide insights into prioritizing violations. To automatically identify patterns in violations and their fixes, we propose an approach that utilizes convolutional neural networks to learn features and clustering to regroup similar instances. We then evaluate the usefulness of the identified fix patterns by applying them to unfixed violations. The results show that developers will accept and merge a majority (69/116) of fixes generated from the inferred fix patterns. It is also noteworthy that the yielded patterns are applicable to four real bugs in the Defects4J major benchmark for software testing and automated repair.Comment: Accepted for IEEE Transactions on Software Engineerin

arXiv.org e-Print Archive

Open Repository and Bibliography - Luxembourg

FixMiner: Mining Relevant Fix Patterns for Automated Program Repair

Author: Bissyandé Tegawendé F.
Kim Dongsun
Klein Jacques
Koyuncu Anil
Liu Kui
Monperrus Martin
Traon Yves Le
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 30/09/2019
Field of study

Patching is a common activity in software development. It is generally performed on a source code base to address bugs or add new functionalities. In this context, given the recurrence of bugs across projects, the associated similar patches can be leveraged to extract generic fix actions. While the literature includes various approaches leveraging similarity among patches to guide program repair, these approaches often do not yield fix patterns that are tractable and reusable as actionable input to APR systems. In this paper, we propose a systematic and automated approach to mining relevant and actionable fix patterns based on an iterative clustering strategy applied to atomic changes within patches. The goal of FixMiner is thus to infer separate and reusable fix patterns that can be leveraged in other patch generation systems. Our technique, FixMiner, leverages Rich Edit Script which is a specialized tree structure of the edit scripts that captures the AST-level context of the code changes. FixMiner uses different tree representations of Rich Edit Scripts for each round of clustering to identify similar changes. These are abstract syntax trees, edit actions trees, and code context trees. We have evaluated FixMiner on thousands of software patches collected from open source projects. Preliminary results show that we are able to mine accurate patterns, efficiently exploiting change information in Rich Edit Scripts. We further integrated the mined patterns to an automated program repair prototype, PARFixMiner, with which we are able to correctly fix 26 bugs of the Defects4J benchmark. Beyond this quantitative performance, we show that the mined fix patterns are sufficiently relevant to produce patches with a high probability of correctness: 81% of PARFixMiner's generated plausible patches are correct.Comment: 31 pages, 11 figure

arXiv.org e-Print Archive

Open Repository and Bibliography - Luxembourg