On Using Active Learning and Self-Training when Mining Performance Discussions on Stack Overflow

Author: Allamanis M.
Chowdhury S.
Cicchetti A.
Lin Y.
Pedregosa F.
Settles B.
Soliman M.
Ying A.
Publication date: 01/01/2017

Abundant data is the key to successful machine learning. However, supervised learning requires annotated data, which are often hard to obtain. In a classification task with limited resources, Active Learning (AL) promises to guide annotators to the examples that bring the most value to a classifier. AL can be successfully combined with self-training, i.e., extending a training set with the unlabelled examples for which a classifier is the most certain. We report our experiences of using AL in a systematic manner to train an SVM classifier for Stack Overflow posts discussing the performance of software components. We show that the training examples deemed the most valuable to the classifier are also the most difficult for humans to annotate. Despite carefully evolved annotation criteria, we report low inter-rater agreement, but we also propose mitigation strategies. Finally, based on one annotator's work, we show that self-training can improve the classification accuracy. We conclude the paper by discussing implications for future text miners aspiring to use AL and self-training.

Comment: Preprint of paper accepted for the Proc. of the 21st International Conference on Evaluation and Assessment in Software Engineering, 201
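The selection step that the abstract describes (send the least certain examples to annotators, self-train on the most certain ones) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_by_certainty`, its parameters, and the use of the absolute decision value as a certainty measure (for an SVM, the signed distance from the separating hyperplane) are assumptions made here for illustration.

```python
def split_by_certainty(scores, n_query, n_self_train):
    """Split unlabelled examples by classifier certainty.

    scores: dict mapping example id -> signed decision value
            (assumed: |value| grows with the classifier's certainty).
    Returns (ids to send to a human annotator, pseudo-labels for self-training).
    """
    # Rank from least certain (closest to the boundary) to most certain.
    ranked = sorted(scores, key=lambda k: abs(scores[k]))

    # Active learning: the least certain examples go to the annotator.
    to_annotate = ranked[:n_query]

    # Self-training: the most certain remaining examples receive the
    # classifier's own prediction as a pseudo-label (1 = positive, 0 = negative).
    remaining = ranked[n_query:]
    confident = remaining[::-1][:n_self_train]
    pseudo_labels = {k: (1 if scores[k] > 0 else 0) for k in confident}
    return to_annotate, pseudo_labels


# Toy decision values for five hypothetical posts.
scores = {"a": 0.1, "b": -2.5, "c": 1.7, "d": -0.3, "e": 0.9}
to_annotate, pseudo_labels = split_by_certainty(scores, n_query=2, n_self_train=2)
print(to_annotate)    # ['a', 'd']
print(pseudo_labels)  # {'b': 0, 'c': 1}
```

In a full loop, the classifier would be retrained after each round on the human-labelled and pseudo-labelled examples, and the scores recomputed before the next split.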

arXiv.org e-Print Archive

Lund University Publications

Crossref

Swedish Institute of Computer Science Publications Database

Identifying Insects with Incomplete DNA Barcode Libraries, African Fruit Flies (Diptera: Tephritidae) as a Test Case

Author: A Johnsen
A Srivathsan
CP Meyer
DH Janzen
Dirk Steinke
Floris C. Breman
GS Lim
HA Ross
HS Yoo
IM White
JKJ Van Houdt
KF Armstrong
Kurt Jordaens
M De Meyer
M Kimura
M Virgilio
M Wiemers
Marc De Meyer
Massimiliano Virgilio
NB Barr
O Folmer
PDN Hebert
R Meier
RD Ward
S Ekesi
S Ratnasingham
T Lefébure
Thierry Backeljau
Tr Ekrem
Publication venue: Public Library of Science
Publication date: 01/01/2012

We propose a general working strategy to deal with incomplete reference libraries in the DNA barcoding identification of species. Considering that (1) queries with a large genetic distance from their best DNA barcode match are more likely to be misidentified and (2) imposing a distance threshold profitably reduces identification errors, we modelled the relationships between identification performance and distance thresholds in four DNA barcode libraries of Diptera (n = 4270), Lepidoptera (n = 7577), Hymenoptera (n = 2067) and Tephritidae (n = 602 DNA barcodes). In all cases, more restrictive distance thresholds produced a gradual increase in the proportion of true negatives, a gradual decrease in false positives and more abrupt variations in the proportions of true positives and false negatives. More restrictive distance thresholds improved precision, yet negatively affected accuracy due to the higher proportions of queries discarded (viz. those with a query-best match distance above the threshold). Using a simple linear regression, we calculated an ad hoc distance threshold for the tephritid library producing an estimated relative identification error <0.05. As expected, when we used this threshold for the identification of 188 independently collected tephritids, less than 5% of queries with a query-best match distance below the threshold were misidentified. Ad hoc thresholds can be calculated for each particular reference library of DNA barcodes and should be used as a cut-off mark defining whether we can proceed with identifying the query with a known estimated error probability (e.g. 5%) or whether we should discard the query and consider alternative/complementary identification methods.
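The effect of a distance threshold on the four outcome classes can be illustrated with a short sketch. The definitions used here are assumptions, not taken from the abstract: an accepted query (distance at or below the threshold) counts as a true positive if its best match is the correct species and as a false positive otherwise; a discarded query counts as a false negative if its best match was correct and as a true negative otherwise; precision is TP / (TP + FP) and accuracy is (TP + TN) / total. The function name and the toy data are hypothetical.

```python
def evaluate_threshold(queries, threshold):
    """Classify queries against a distance threshold.

    queries: list of (distance_to_best_match, best_match_is_correct) pairs.
    Queries with distance <= threshold are accepted, the rest discarded
    (treating equality as accepted is an assumption of this sketch).
    """
    tp = fp = tn = fn = 0
    for distance, correct in queries:
        accepted = distance <= threshold
        if accepted and correct:
            tp += 1      # accepted, correctly identified
        elif accepted:
            fp += 1      # accepted, but misidentified
        elif correct:
            fn += 1      # discarded although the best match was correct
        else:
            tn += 1      # discarded, and the best match was wrong
    total = len(queries)
    precision = tp / (tp + fp) if (tp + fp) else None
    accuracy = (tp + tn) / total if total else None
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "precision": precision, "accuracy": accuracy}


# Invented toy data: (distance, best match correct?)
queries = [(0.00, True), (0.01, True), (0.02, True), (0.025, False),
           (0.03, True), (0.04, True), (0.08, False)]

loose = evaluate_threshold(queries, 0.05)    # tp=5, fp=1, tn=1, fn=0
strict = evaluate_threshold(queries, 0.015)  # tp=2, fp=0, tn=2, fn=3
print(loose["precision"], loose["accuracy"])
print(strict["precision"], strict["accuracy"])
```

On this toy data the stricter threshold raises precision (no accepted query is misidentified) while lowering accuracy, because correct matches are discarded; the data were chosen to show that trade-off and carry no empirical weight.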

Public Library of Science (PLOS)

CiteSeerX

Crossref

Directory of Open Access Journals

PubMed Central

Institutional Repository Universiteit Antwerpen

FigShare

Reasoning & Querying – State of the Art

Author: Bry François
Furche Tim
Weiand Klara
Publication date: 31/08/2008

Various query languages for Web and Semantic Web data, both for practical use and as an area of research in the scientific community, have emerged in recent years. At the same time, the broad adoption of the internet, where keyword search is used in many applications such as search engines, has familiarized casual users with using keyword queries to retrieve information. Unlike this easy-to-use querying, traditional query languages require knowledge of the language itself as well as of the data to be queried. Keyword-based query languages for XML and RDF bridge the gap between the two, aiming to enable simple querying of semi-structured data, which is relevant e.g. in the context of the emerging Semantic Web. This article presents an overview of the field of keyword querying for XML and RDF.
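To make the idea of keyword querying over XML concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. It is not drawn from the surveyed article: the function `keyword_search` and the sample document are hypothetical, and the matching rule (return the deepest elements whose text content contains every keyword, by case-insensitive substring match) is one simple choice among the many semantics such languages can adopt.

```python
import xml.etree.ElementTree as ET


def keyword_search(root, keywords):
    """Return the deepest elements whose text content contains every keyword."""
    wanted = [k.lower() for k in keywords]
    hits = []

    def visit(elem):
        # Text of the element and all of its descendants.
        text = " ".join(elem.itertext()).lower()
        if not all(k in text for k in wanted):
            return False
        # Visit every child; keep this element only if no child already
        # contains all keywords on its own.
        child_matches = [visit(child) for child in elem]
        if not any(child_matches):
            hits.append(elem)
        return True

    visit(root)
    return hits


doc = ET.fromstring("""
<library>
  <book><title>Semantic Web Primer</title><author>Antoniou</author></book>
  <book><title>XML Querying</title><author>Bry</author></book>
</library>
""")

for elem in keyword_search(doc, ["xml", "bry"]):
    print(elem.tag, elem.find("title").text)  # book XML Querying
```

The user needs to know neither the query language nor the document schema: the two keywords alone select the second `book` element, since neither its `title` nor its `author` contains both terms by itself.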

Open Access LMU

Constrained Query Answering

Author: Bry François
Demolombe R.
Publication date: 01/01/1994

Traditional answering methods evaluate queries only against positive and definite knowledge expressed by means of facts and deduction rules. They do not make use of negative, disjunctive or existential information. Negative or indefinite knowledge is, however, often available in knowledge base systems, either as design requirements or as observed properties. Such knowledge can serve to rule out unproductive subexpressions during query answering. In this article, we propose an approach for constraining any conventional query answering procedure with general, possibly negative or indefinite formulas, so as to discard impossible cases and to avoid redundant evaluations. This approach does not impose additional conditions on the positive and definite knowledge, nor does it assume any particular semantics for negation: it adopts that of the conventional query answering procedure it constrains. This is achieved by relying on meta-interpretation for specifying the constraining process. The soundness, completeness, and termination of the underlying query answering procedure are not compromised. Constrained query answering can be applied for answering queries more efficiently as well as for generating more informative, intensional answers.

Open Access LMU