8 research outputs found

    k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)

    Perhaps the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbour Classifier -- classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because issues of poor run-time performance are not such a problem these days with the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification, focusing on: mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours, and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time-series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods. Comment: 22 pages, 15 figures: An updated edition of an older tutorial on kN
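    The procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under assumed toy data (the 2-D points and labels are invented), not the code from the paper's Appendix:

```python
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, examples, k=3):
    # examples: list of (feature_vector, label) pairs.
    # Rank training examples by distance to the query, then
    # take a majority vote among the k nearest labels.
    nearest = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented toy training set: two well-separated clusters.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
print(knn_classify((1.1, 0.9), train, k=3))  # "A"
```

    With k greater than 1, the vote smooths over a single mislabelled or noisy neighbour, which is the usual motivation for k-NN over 1-NN.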

    Towards a corpus for credibility assessment in software practitioner blog articles

    Blogs are a source of grey literature widely adopted by software practitioners for disseminating opinion and experience. Analysing such articles can provide useful insights into the state of practice for software engineering research. However, there are challenges in identifying higher-quality content among the large quantity of articles available. Credibility assessment can help in identifying quality content, though there is a lack of existing corpora. Credibility is typically measured through a series of conceptual criteria, with 'argumentation' and 'evidence' being two important criteria. We create a corpus labelled for argumentation and evidence that can aid the credibility community. The corpus consists of articles from the blog of a single software practitioner and is publicly available. Three annotators label the corpus with a series of conceptual credibility criteria, reaching an agreement of 0.82 (Fleiss' kappa). We present a preliminary analysis of the corpus by using it to investigate the identification of claim sentences (one of our ten labels). We train four systems (BERT, k-NN, Decision Tree and SVM) using three feature sets (Bag of Words, Topic Modelling and InferSent), achieving an F1 score of 0.64 using InferSent and a linear SVM. Our preliminary results are promising, indicating that the corpus can help future studies in detecting the credibility of grey literature. Future research will investigate the degree to which the sentence-level annotations can infer the credibility of the overall document.
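    The claim-sentence classification setup can be sketched roughly as follows. The paper's best system pairs InferSent features with a linear SVM; as a hypothetical stand-in, this sketch uses a simple perceptron over Bag-of-Words counts, and the labelled sentences are invented purely for illustration:

```python
from collections import Counter

def bag_of_words(sentence, vocab):
    # Map a sentence to term counts over a fixed vocabulary;
    # out-of-vocabulary words are ignored.
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocab]

# Invented toy labels: 1 = claim sentence, 0 = non-claim.
train = [("testing always improves quality", 1),
         ("we deployed the service on friday", 0),
         ("static typing prevents most bugs", 1),
         ("the meeting ran for an hour", 0)]

vocab = sorted({w for s, _ in train for w in s.lower().split()})
X = [bag_of_words(s, vocab) for s, _ in train]
y = [label for _, label in train]

# Minimal perceptron training: nudge the weights toward each
# misclassified example until the toy set is separated.
w = [0.0] * len(vocab)
b = 0.0
for _ in range(20):
    for xi, yi in zip(X, y):
        pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
        if pred != yi:
            step = 1 if yi == 1 else -1
            w = [wj + step * xj for wj, xj in zip(w, xi)]
            b += step

def predict(sentence):
    x = bag_of_words(sentence, vocab)
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0
```

    A real replication would swap in sentence embeddings and a proper SVM, but the pipeline shape (vectorize, train, predict per sentence) is the same.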

    Text Categorization and Machine Learning Methods: Current State Of The Art

    In this information age, many documents are available in digital form and need text classification. To address this problem, current researchers have focused on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of pre-classified documents, the characteristics of the categories. The main benefit of this approach over the manual definition of a classifier by domain experts is that good effectiveness, less use of expert work and straightforward portability to different domains are possible. The paper examines the main approaches to text categorization within the machine learning paradigm and presents the state of the art. Various issues pertaining to three different text similarity problems, namely semantic, conceptual and contextual, are also discussed.
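    A minimal sketch of similarity-based text categorization, assuming one invented exemplar document per category and plain term-frequency cosine similarity (an illustrative baseline, not a method taken from the paper):

```python
import math
from collections import Counter

def tf_vector(text):
    # Sparse term-frequency vector as a Counter.
    return Counter(text.lower().split())

def cosine(u, v):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def categorize(doc, labelled_docs):
    # Assign the category whose exemplar document is most
    # cosine-similar to the input document.
    return max(labelled_docs,
               key=lambda cat: cosine(tf_vector(doc),
                                      tf_vector(labelled_docs[cat])))

# Invented exemplars, one per category.
exemplars = {
    "sport": "the team won the match after a late goal",
    "finance": "the bank raised interest rates and markets fell",
}
print(categorize("rates fell after the bank statement", exemplars))
```

    This captures only surface (contextual) similarity; the semantic and conceptual similarity problems the paper discusses require richer representations than raw term counts.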

    k-Nearest Neighbour Classifiers - A Tutorial

    Perhaps the most straightforward classifier in the arsenal of Machine Learning techniques is the Nearest Neighbour Classifier – classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because issues of poor run-time performance are not such a problem these days with the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification, focusing on: mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours, and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time-series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods.
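    A widely used time-series similarity measure of the kind the added sections concern is dynamic time warping (DTW); whether the paper covers DTW specifically is not stated here, so this is only an illustrative sketch with invented sequences, not the paper's Python code:

```python
import math

def dtw_distance(a, b):
    # Dynamic time warping: align two sequences, allowing local
    # stretching in time, and sum the matched-point costs via
    # the standard dynamic-programming recurrence.
    n, m = len(a), len(b)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

# Two series with the same shape but shifted in time score far
# closer under DTW than under pointwise comparison.
s1 = [0, 1, 2, 3, 2, 1, 0]
s2 = [0, 0, 1, 2, 3, 2, 1]
print(dtw_distance(s1, s2))
```

    For nearest-neighbour classification of time series, DTW simply replaces Euclidean distance in the neighbour search.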

    Combining farmers' decision rules and landscape stochastic regularities for landscape modelling

    Landscape spatial organization (LSO) strongly impacts many environmental issues. Modelling agricultural landscapes and describing meaningful landscape patterns are thus regarded as key issues for designing sustainable landscapes. Agricultural landscapes are mostly designed by farmers. Their decisions about crop choices and crop allocation to land can be generic and result in landscape regularities, which determine LSO. This paper falls within the emerging discipline called "landscape agronomy", which studies the organization of farming practices at the landscape scale. Here we aim to articulate the farm and landscape scales for landscape modelling. To do so, we develop an original approach combining two methods used separately so far: the identification of explicit farmer decision rules through on-farm survey methods, and the identification of landscape stochastic regularities through data mining. We applied this approach to the Niort plain landscape in France. Results show that generic farmer decision rules concerning sunflower or maize area and location within landscapes are consistent with spatiotemporal regularities identified at the landscape scale. This yields a segmentation of the landscape, based on both its spatial and temporal organization and partly explained by generic farmer decision rules. This consistency between results shows that the two modelling methods aid one another for land-use modelling at the landscape scale and for understanding the driving forces of its spatial organization.
Despite some remaining challenges, our study in landscape agronomy accounts for both the spatial and temporal dimensions of crop allocation: it allows the drawing of new spatial patterns coherent with land-use dynamics at the landscape scale, which improves the links to the scale of ecological processes and therefore contributes to landscape ecology.
Landscape organization influences environmental problems. Modelling landscapes in order to describe them through meaningful patterns is a key step. Agricultural landscapes are mainly shaped by farmers, whose cropping decisions can be generic and determine regularities in landscape organization. This article contributes to landscape agronomy, an emerging discipline. We seek to articulate the landscape and farm scales by developing two methods: one identifies farmers' decisions through surveys, the other detects stochastic regularities in the landscape through data mining. We applied this approach to the Niort plain landscape in France. The results show that farmers' decisions concerning sunflower and maize are generic and have effects on the landscape that data-mining methods reveal and quantify.

    Feature Selection using Improved Mutual Information for Text Classification

    Abstract. A major characteristic of the text document classification problem is the extremely high dimensionality of text data. In this paper we present two algorithms for feature (word) selection for the purpose of text classification. We use sequential forward selection methods based on the improved mutual information measures introduced by Battiti [1] and by Kwak and Choi [6] for non-textual data. These feature evaluation functions take into consideration how features work together. Their performance is compared with that of information gain, which evaluates features individually. We present experimental results using a naive Bayes classifier based on the multinomial model on the Reuters data set. Finally, we analyze the experimental results from various perspectives, including F1-measure, precision and recall. Preliminary experimental results indicate the effectiveness of the proposed feature selection algorithms in a text classification problem.
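    The core quantity, mutual information between a term's presence and the class label, can be estimated from document counts as below. This is a plain sketch with invented toy documents; the improved criteria of Battiti and of Kwak and Choi additionally penalize redundancy among already-selected features, which this sketch omits:

```python
import math
from collections import Counter

def mutual_information(docs, labels, term):
    # I(T; C) over the joint distribution of term presence/absence
    # (T) and class label (C), estimated from document counts.
    n = len(docs)
    joint = Counter((term in doc.split(), c) for doc, c in zip(docs, labels))
    p_t = Counter(term in doc.split() for doc in docs)
    p_c = Counter(labels)
    mi = 0.0
    for (t, c), count in joint.items():
        p_tc = count / n
        # log2 of p(t,c) / (p(t) * p(c)), written with raw counts.
        mi += p_tc * math.log2(count * n / (p_t[t] * p_c[c]))
    return mi

# Invented toy corpus: "cheap" occurs only in spam documents,
# so it carries more information about the class than a term
# that is independent of the label.
docs = ["cheap pills online", "meeting moved to noon",
        "cheap offer inside", "lunch at noon today"]
labels = ["spam", "ham", "spam", "ham"]
print(mutual_information(docs, labels, "cheap"))
```

    Sequential forward selection then repeatedly adds the highest-scoring remaining term, with the improved criteria discounting terms that duplicate information already captured.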