2,606 research outputs found

k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)

Author: Cunningham Padraig
Delany Sarah Jane
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 29/04/2020

Perhaps the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbour Classifier -- classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because issues of poor run-time performance are not such a problem these days with the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification focusing on: mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours, and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time-series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods. Comment: 22 pages, 15 figures: An updated edition of an older tutorial on kN

arXiv.org e-Print Archive

Arrow@TUDublin
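
As an illustration of the nearest-neighbour procedure summarised in the abstract above (find the stored examples closest to a query and let them vote on its class), the following is a minimal sketch and not the paper's appendix code. The function name `knn_classify`, the choice of Euclidean distance, the toy data, and k=3 are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_classify(query, examples, labels, k=3):
    """Return the majority class among the k stored examples nearest to `query`.

    Hypothetical helper: Euclidean distance and simple majority voting are
    assumed here; other distance measures and voting schemes are possible.
    """
    # Distance from the query to every stored example, paired with its label.
    distances = [(math.dist(query, x), y) for x, y in zip(examples, labels)]
    # Keep the k closest examples.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy two-class data set (made up for this sketch).
examples = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
            (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify((1.1, 1.0), examples, labels, k=3))
```

This brute-force version compares the query with every stored example, which is the run-time cost the abstract refers to.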

A concept drift-tolerant case-base editing technique

Author: Lopez De Mantaras R
Lu J
Lu N
Zhang G
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016

© 2015 Elsevier B.V. All rights reserved. The evolving nature and accumulating volume of real-world data inevitably give rise to the so-called "concept drift" issue, causing many deployed Case-Based Reasoning (CBR) systems to require additional maintenance procedures. In Case-base Maintenance (CBM), case-base editing strategies to revise the case-base have proven to be effective instance selection approaches for handling concept drift. Motivated by current issues related to CBR techniques in handling concept drift, we present a two-stage case-base editing technique. In Stage 1, we propose a Noise-Enhanced Fast Context Switch (NEFCS) algorithm, which targets the removal of noise in a dynamic environment, and in Stage 2, we develop an innovative Stepwise Redundancy Removal (SRR) algorithm, which reduces the size of the case-base by eliminating redundancies while preserving the case-base coverage. Experimental evaluations on several public real-world datasets show that our case-base editing technique significantly improves accuracy compared to other case-base editing approaches on concept drift tasks, while preserving its effectiveness on static tasks

OPUS - University of Technology Sydney

Digital.CSIC

Textual Case-based Reasoning for Spam Filtering: a Comparison of Feature-based and Feature-free Approaches

Author: Bridge Derek
Delany Sarah Jane
Publication venue: Dublin Institute of Technology
Publication date: 01/10/2006

Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. Then, we describe an alternative way to compute the distances between cases in a feature-free fashion, using a distance measure based on text compression. This distance measure has the advantages of having no set-up costs and being resilient to concept drift. We report an empirical comparison, which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust over different compression algorithms in that we find that the accuracy when using a Lempel-Ziv compressor (GZip) is approximately the same as when using a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to classify emails than the feature-based system. Improvements in the classification time of both kinds of systems can be obtained by applying case base editing algorithms, which aim to remove noisy and redundant cases from a case base while maintaining, or even improving, generalisation accuracy. We report empirical results using the Competence-Based Editing (CBE) technique. We show that CBE removes more cases when we use the distance measure based on text compression (without significant changes in generalisation accuracy) than it does when we use the feature-based approach

Arrow@TUDublin
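
The feature-free, compression-based distance described in the abstract above can be illustrated roughly as follows. This sketch uses the Normalized Compression Distance with Python's `zlib` (a Lempel-Ziv style compressor). That choice is an assumption made for illustration and may differ from the exact measure and compressors evaluated in the paper. The helper names and the sample messages are hypothetical.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes after compressing `text` with zlib at its highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two strings.

    Intuition: if x and y share content, compressing their concatenation
    costs little more than compressing the larger of the two alone.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Made-up messages for this sketch.
spam = ("Cheap pills available now, limited offer, click here to buy "
        "and claim your free prize before midnight tonight.")
ham = ("Please find the agenda for Thursday attached; we will review "
       "the quarterly budget figures and the hiring plan.")

print("spam vs itself:", ncd(spam, spam))
print("spam vs ham:   ", ncd(spam, ham))
```

No features are extracted and nothing is trained, which matches the "no set-up costs" point in the abstract. Every comparison compresses the concatenated pair, which is one plausible reason such systems classify more slowly.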

A Comparison of Ensemble and Case-Base Maintenance Techniques for Handling Concept Drift in Spam Filtering

Author: Cunningham Padraig
Delany Sarah Jane
Tsymbal Alexey
Publication venue: Technological University Dublin
Publication date: 01/01/2006

The problem of concept drift has recently received considerable attention in machine learning research. One important practical problem where concept drift needs to be addressed is spam filtering. The literature on concept drift shows that among the most promising approaches are ensembles, and a variety of techniques for ensemble construction has been proposed. In this paper we compare the ensemble approach to an alternative lazy learning approach to concept drift whereby a single case-based classifier for spam filtering keeps itself up-to-date through a case-base maintenance protocol. We present an evaluation that shows that the case-base maintenance approach is more effective than a selection of ensemble techniques. The evaluation is complicated by the overriding importance of False Positives (FPs) in spam filtering. The ensemble approaches can have very good performance on FPs because it is possible to bias an ensemble more strongly away from FPs than it is to bias the single classifier. However, this comes at considerable cost to the overall accuracy

Arrow@TUDublin

Technical report and user guide: the 2010 EU kids online survey

Author
Publication venue: London School of Economics and Political Science
Publication date: 01/01/2011

This technical report describes the design and implementation of the EU Kids Online survey of 9-16-year-old internet-using children and their parents in 25 European countries

LSE Research Online

The Use of Firewalls in an Academic Environment

Author: Chown Tim
DeRoure Dave
Read Jon
Publication venue: AIC, ECS, USouthampton
Publication date: 01/01/2000

CiteSeerX

Southampton (e-Prints Soton)

Managing irrelevant knowledge in CBR models for unsolicited e-mail classification

Author: Corchado Rodríguez Juan Manuel
Díaz Gómez Fernando
Fernández Riverola Florentino
González Peña Daniel
Méndez Jose R.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2009

The problem of unsolicited e-mail has been increasing during recent years. Fortunately, some advanced technologies have been successfully applied to spam filtering, achieving promising results. Recently, we have introduced SpamHunting, a successful spam filter able to address the concept drift problem by combining a relevant term identification technique with an evolving sliding window strategy. Several successful spam filtering techniques use continuous learning strategies to achieve better adaptation capabilities and address concept drift issues. Nevertheless, due to the presence of concept drift and hidden changes in the environment, the presence of obsolete and irrelevant knowledge becomes a serious drawback. Soon after the launch of the filter, many decisions are made based on irrelevant and/or obsolete knowledge. Therefore, in such a situation, the use of forgetting strategies is as important as the implementation of continuous learning approaches. In this paper we introduce a novel technique designed for identifying and removing the obsolete and irrelevant knowledge that has accumulated with the passage of time. We have carried out several experiments to test the suitability of our proposal, showing the results obtained and its applicability

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Gestion del Repositorio Documental de la Universidad de Salamanca

Network-based detection of malicious activities - a corporate network perspective

Author: Kidmose Egon
Publication venue: Aalborg Universitetsforlag
Publication date: 01/01/2019

VBN

Damage Detection and Mitigation in Open Collaboration Applications

Author: West Andrew Granville
Publication venue: ScholarlyCommons
Publication date: 01/01/2013

Collaborative functionality is changing the way information is amassed, refined, and disseminated in online environments. A subclass of these systems characterized by open collaboration uniquely allows participants to *modify* content with low barriers-to-entry. A prominent example and our case study, English Wikipedia, exemplifies the vulnerabilities: 7%+ of its edits are blatantly unconstructive. Our measurement studies show this damage manifests in novel socio-technical forms, limiting the effectiveness of computational detection strategies from related domains. In turn this has made much mitigation the responsibility of a poorly organized and ill-routed human workforce. We aim to improve all facets of this incident response workflow. Complementing language based solutions we first develop content agnostic predictors of damage. We implicitly glean reputations for system entities and overcome sparse behavioral histories with a spatial reputation model combining evidence from multiple granularity. We also identify simple yet indicative metadata features that capture participatory dynamics and content maturation. When brought to bear over damage corpora our contributions: (1) advance benchmarks over a broad set of security issues (vandalism), (2) perform well in the first anti-spam specific approach, and (3) demonstrate their portability over diverse open collaboration use cases. Probabilities generated by our classifiers can also intelligently route human assets using prioritization schemes optimized for capture rate or impact minimization. Organizational primitives are introduced that improve workforce efficiency. The whole of these strategies is then implemented in a tool (STiki) that has been used to revert 350,000+ damaging instances from Wikipedia. These uses are analyzed to learn about human aspects of the edit review process, properties including scalability, motivation, and latency. Finally, we conclude by measuring practical impacts of work, discussing how to better integrate our solutions, and revealing outstanding vulnerabilities that speak to research challenges for open collaboration security

CiteSeerX

ScholarlyCommons@Penn