
    k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)

    Perhaps the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbour Classifier -- classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because issues of poor run-time performance are not such a problem these days with the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification, focusing on: mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours, and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time-series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods.
    Comment: 22 pages, 15 figures; an updated edition of an older tutorial on kNN
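
The classification scheme the abstract describes can be sketched in a few lines of Python. This is a minimal illustration with Euclidean distance and majority voting on toy data, not the paper's own code; all names and data here are hypothetical.

```python
# Minimal k-nearest-neighbour classifier: majority vote among the k closest
# training examples, using Euclidean distance. Toy data for illustration only.
from collections import Counter
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours.
    `train` is a list of (feature_vector, label) pairs."""
    neighbours = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Two toy classes: points near the origin are 'a', points near (5, 5) are 'b'.
train = [((0, 0), "a"), ((1, 0), "a"), ((0, 1), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
print(knn_predict(train, (0.5, 0.5)))  # all three nearest neighbours are 'a'
```

The brute-force sort here is exactly the run-time cost the paper's sections on retrieval speed-up address; index structures replace the O(n log n) scan.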

    Investigating Text Message Classification Using Case-based Reasoning

    Text classification is the categorization of text into a predefined set of categories. It is becoming increasingly important given the large volume of text stored electronically, e.g. email, digital libraries and the World Wide Web (WWW). These documents represent a massive amount of information that can be accessed easily, but gaining benefit from this information requires organisation, and one way of organising it automatically is text classification. A number of well-known machine learning techniques have been used in text classification, including Naïve Bayes, Support Vector Machines and Decision Trees; less commonly used are k-Nearest Neighbour, Neural Networks and Genetic Algorithms. One aspect of text classification is general message classification: the ability to correctly classify text messages of different lengths. Many applications would benefit from this, for example personal email filtering (sorting email into categories such as business, personal and spam) and email routing, e.g. routing email for a helpdesk so that it reaches the correct person. This thesis presents an investigation of applying a Case-based Reasoning (CBR) approach to general text message classification. CBR was chosen as it was found to perform well for a particular type of message classification, spam filtering, where it was found to have certain advantages over other machine learning techniques such as Naïve Bayes: it handled the dynamic nature of spam better and offered the ability for the training data to be easily and continuously updated, with new training data immediately available. The objective of this research is to extend previous work on spam filtering to general message classification, which includes classifying short and long text messages into multiple categories.
Short text message classification presents a particular challenge as the concept being learnt is weak. We investigated two types of similarity metrics used with CBR, feature-based and featureless similarity metrics, and compared CBR using each with two well-known machine learning techniques, Naïve Bayes (NB) and Support Vector Machines (SVM). These two techniques serve as baseline classifiers as they are currently the classifiers of choice in the text classification domain. The results of this research show that CBR using a featureless similarity metric achieves better performance than CBR using a feature-based similarity metric. The results also show that, when using CBR with a feature-based similarity metric, the classification task required different feature types and different feature representations depending on the domain. We also investigated whether a case-base editing technique developed for spam case-bases improves performance over unedited case-bases on different text domains. We found that the case-base editing technique used for spam filtering performs well for email-based case-bases but not for other text domains of either short or long text messages.
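
The distinction between the two metric families can be illustrated concretely. Below, a feature-based similarity (cosine over bag-of-words counts) is contrasted with a featureless one (normalized compression distance, one common compression-based metric). This is a sketch under that assumption, not the thesis's actual metrics, and the sample messages are invented.

```python
# Feature-based vs featureless similarity, side by side.
# cosine_sim needs an explicit word-count representation; ncd compares raw
# strings directly via a compressor, with no feature extraction at all.
import math
import zlib
from collections import Counter

def cosine_sim(a, b):
    """Feature-based: cosine similarity over bag-of-words counts (1 = identical)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def ncd(a, b):
    """Featureless: normalized compression distance (smaller = more similar)."""
    ca = len(zlib.compress(a.encode()))
    cb = len(zlib.compress(b.encode()))
    cab = len(zlib.compress((a + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)
```

Note that `ncd` never tokenises its input, which is why featureless metrics sidestep the per-domain feature-engineering the abstract reports for the feature-based case.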

    On the use of Locality for Improving SVM-Based Spam Filtering

    Recent growth in the use of email for communication and the corresponding growth in the volume of email received have made automatic processing of emails desirable. In tandem is the prevailing problem of Advance Fee fraud emails that pervades inboxes globally; these emails solicit financial transactions and funds transfers from unsuspecting users. Most modern mail-reading software packages provide some form of programmable automatic filtering, typically as sets of rules that file or otherwise dispose of mail based on keywords detected in the headers or message body. Unfortunately, programming these filters is an arcane and sometimes inefficient process; an adaptive mail system which can learn its users' mail-sorting preferences would therefore be more desirable. Premised on the work of Blanzieri & Bryl (2007), we propose a framework dedicated to the phenomenon of locality in email data analysis of advance fee fraud emails, which engages a Support Vector Machine (SVM) classifier to build local decision rules into the classification process of the spam filter designed for this genre of email.
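
The locality idea from Blanzieri & Bryl can be sketched as: rather than one global model, train a classifier on only the k training messages nearest the query, so the decision rule is local to that neighbourhood. To keep the sketch dependency-free, a nearest-centroid classifier stands in for the SVM; the data and names are hypothetical, not the paper's framework.

```python
# Locality sketch: restrict training to the query's k nearest examples, then
# fit a simple local model (here: nearest class centroid, an SVM stand-in).
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def local_classify(train, query, k=4):
    """`train`: list of (feature_vector, label) pairs. Select the k examples
    nearest `query`, then classify by the nearer local class centroid."""
    local = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    groups = {}
    for vec, label in local:
        groups.setdefault(label, []).append(vec)
    centroids = {lab: tuple(sum(col) / len(vecs) for col in zip(*vecs))
                 for lab, vecs in groups.items()}
    return min(centroids, key=lambda lab: dist(centroids[lab], query))
```

The point of the construction is that the local model only ever has to separate the classes in the query's neighbourhood, which is often easier than drawing one global boundary.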

    Spam classification for online discussions

    Traditionally, spam message filtering systems are built by integrating content-based analysis technologies developed from experience of dealing with email spam. Recently, a new style of information has appeared on the Internet, the social media platform, which also expands the space for Internet abusers. In this thesis, we not only evaluated the traditional content-based approaches to classifying spam messages, we also investigated the possibility of integrating context-based technology with content-based approaches. We built spam classifiers using a novelty detection approach combined with Naïve Bayes, k-Nearest Neighbour and Self-Organizing Map respectively, and tested each of them with a large amount of experimental data. We also took a further step beyond previous research by integrating a Self-Organizing Map with Naïve Bayes to carry out the spam classification. The results of this thesis show that combining context-based approaches with a content-based spam classifier wisely can improve the performance of the content-based classifier in various directions. In addition, the results from the Self-Organizing Map classifier combined with Naïve Bayes show a promising future for data clustering methods in spam filtering. We therefore believe this thesis presents a new insight into Natural Language Processing, and that the methods and techniques proposed provide researchers in the spam filtering field with a good tool for analysing context-based spam messages.
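
The content-based baseline named above can be sketched as a small multinomial Naïve Bayes classifier with Laplace smoothing. This is an illustrative sketch, not the thesis's pipeline; the toy messages and function names are invented.

```python
# Minimal multinomial Naïve Bayes for spam vs ham with Laplace smoothing.
from collections import Counter
import math

def train_nb(docs):
    """`docs`: list of (text, label). Returns (class priors, per-class word
    counts, vocabulary)."""
    priors = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in priors}
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return priors, word_counts, vocab

def classify_nb(model, text):
    """Pick the label maximising log P(label) + sum_w log P(w | label)."""
    priors, word_counts, vocab = model
    total_docs = sum(priors.values())
    best, best_score = None, float("-inf")
    for label in priors:
        total_words = sum(word_counts[label].values())
        score = math.log(priors[label] / total_docs)
        for word in text.lower().split():
            # Laplace smoothing: unseen words must not zero out the product.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [("win cash now", "spam"), ("free prize claim now", "spam"),
        ("meeting at noon", "ham"), ("project review notes", "ham")]
model = train_nb(docs)
```

A context-based component, in the thesis's framing, would add evidence from outside the message body; the combination then adjusts or overrides scores like the ones computed here.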

    One-Class Classification: Taxonomy of Study and Review of Techniques

    One-class classification (OCC) algorithms aim to build classification models when the negative class is either absent, poorly sampled or not well defined. This unique situation constrains the learning of efficient classifiers by defining the class boundary with knowledge of the positive class alone. The OCC problem has been considered and applied under many research themes, such as outlier/novelty detection and concept learning. In this paper we present a unified view of the general problem of OCC by presenting a taxonomy of study for OCC problems, based on the availability of training data, the algorithms used and the application domains. We further delve into each category of the proposed taxonomy and present a comprehensive literature review of OCC algorithms, techniques and methodologies with a focus on their significance, limitations and applications. We conclude by discussing some open research problems in the field of OCC and presenting our vision for future research.
    Comment: 24 pages + 11 pages of references, 8 figures
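
A deliberately crude instance of the setting the abstract describes: with only positive examples available, one of the simplest boundaries is a hypersphere around the class centroid with radius set by the farthest training point. This is an illustrative sketch, far simpler than the methods the survey reviews; all data is invented.

```python
# One-class sketch: describe the positive class by its centroid and accept any
# query within the largest training distance. No negative examples needed.
import math

def fit_occ(positives):
    """Return (centroid, radius) learned from positive examples only."""
    dims = len(positives[0])
    centroid = tuple(sum(v[i] for v in positives) / len(positives)
                     for i in range(dims))
    radius = max(math.dist(v, centroid) for v in positives)
    return centroid, radius

def is_positive(model, query):
    """Accept `query` iff it falls inside the learned hypersphere."""
    centroid, radius = model
    return math.dist(query, centroid) <= radius
```

The boundary here is fixed entirely by the positive class, which is exactly the constraint the abstract highlights: nothing in the data says how tight the boundary *should* be, so real OCC methods spend most of their effort choosing it.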

    Web Spam Detection Using Fuzzy Clustering

    The Internet is the most widespread medium for expressing our views and ideas and a lucrative platform for delivering products. For this purpose, the search engine plays a key role: information about web pages is stored in an index database of the search engine for use in later queries. Web spam refers to a host of techniques that challenge the ranking algorithms of web search engines and cause them to rank certain web pages higher, or serve some other purpose beneficial to the spammer. Web spam irritates web surfers, causes disruption and ruins the quality of the web search engine. In this paper, we present an efficient clustering method to detect spam web pages effectively and accurately. We also employ various validation measures to validate our work using the clustering methods. The comparisons between the obtained charts and the validation results show that the presented approach produces better results.
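
The paper does not specify its clustering algorithm in the abstract, but the defining computation of fuzzy clustering (as in fuzzy c-means) is the membership update: each page gets a degree of membership in every cluster rather than a hard label. A sketch of that update, with hypothetical centroids and feature vectors standing in for web-page features:

```python
# Fuzzy c-means membership update: u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)),
# where d_i is the distance to centroid i and m > 1 is the fuzzifier.
import math

def memberships(point, centroids, m=2.0):
    """Return the degree of membership of `point` in each cluster; the
    memberships are non-negative and sum to 1."""
    dists = [math.dist(point, c) for c in centroids]
    if any(d == 0 for d in dists):  # point coincides with a centroid
        return [1.0 if d == 0 else 0.0 for d in dists]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((dists[i] / dists[k]) ** exp for k in range(len(dists)))
            for i in range(len(dists))]
```

Soft memberships are what make the approach attractive for spam detection: a borderline page shows up as partially spam-like instead of being forced into one cluster.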

    A Comparative Study of Classification Techniques for Fraud Detection

    A large volume of data is generated each day, and handling such a volume is very cumbersome. The generated data is stored in huge databases and repositories from which it can be retrieved by the user. However, the retrieval of important data from such large databases is a major concern. Numerous tools have been presented which can help extract useful information from databases according to users' requirements; the mechanism through which data can be stored and extracted efficiently on demand is known as data mining. This review paper studies classification techniques based on different types of algorithms: Decision Tree, Naïve Bayes, rule-based, k-NN (k-Nearest Neighbour) and Artificial Neural Networks. It describes the use of various classification algorithms over the past few years to develop predictive models, with attention to accuracy, in fields such as software fault prediction, credit card fraud analytics, intrusion detection and medicine.

    Index ordering by query-independent measures

    Conventional approaches to information retrieval search through all applicable entries in an inverted file for a particular collection in order to find those documents with the highest scores. For particularly large collections this may be extremely time consuming. A solution to this problem is to only search a limited amount of the collection at query-time, in order to speed up the retrieval process. In doing this we can also limit the loss in retrieval efficacy (in terms of accuracy of results). The way we achieve this is to firstly identify the most “important” documents within the collection, and sort documents within inverted file lists in order of this “importance”. In this way we limit the amount of information to be searched at query time by eliminating documents of lesser importance, which not only makes the search more efficient, but also limits loss in retrieval accuracy. Our experiments, carried out on the TREC Terabyte collection, report significant savings, in terms of number of postings examined, without significant loss of effectiveness when based on several measures of importance used in isolation, and in combination. Our results point to several ways in which the computation cost of searching large collections of documents can be significantly reduced
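
The mechanism the abstract describes can be sketched directly: sort each term's postings by a query-independent importance score, then examine only a fixed budget of postings per term at query time. This is a toy sketch; the index layout, scoring and data are hypothetical, not the paper's system.

```python
# Importance-ordered inverted lists with early termination: postings are stored
# most-important-first, so truncating each list keeps the best documents.
def build_index(docs, importance):
    """`docs`: {doc_id: text}; `importance`: {doc_id: query-independent score}.
    Returns {term: [doc_ids sorted by descending importance]}."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    for term in index:
        index[term].sort(key=lambda d: importance[d], reverse=True)
    return index

def search(index, importance, query, budget=2):
    """Score only the first `budget` postings of each query term, so low-
    importance documents are never examined at all."""
    scores = {}
    for term in query.lower().split():
        for doc_id in index.get(term, [])[:budget]:
            scores[doc_id] = scores.get(doc_id, 0.0) + importance[doc_id]
    return sorted(scores, key=scores.get, reverse=True)
```

The saving the paper measures, postings examined, corresponds here to the `budget` cut; the ordering step is what keeps the truncated scan from discarding the high-scoring documents.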

    Active Learning for Text Classification

    Text classification approaches are used extensively to solve real-world challenges. The success or failure of text classification systems hangs on the datasets used to train them, without a good dataset it is impossible to build a quality system. This thesis examines the applicability of active learning in text classification for the rapid and economical creation of labelled training data. Four main contributions are made in this thesis. First, we present two novel selection strategies to choose the most informative examples for manually labelling. One is an approach using an advanced aggregated confidence measurement instead of the direct output of classifiers to measure the confidence of the prediction and choose the examples with least confidence for querying. The other is a simple but effective exploration guided active learning selection strategy which uses only the notions of density and diversity, based on similarity, in its selection strategy. Second, we propose new methods of using deterministic clustering algorithms to help bootstrap the active learning process. We first illustrate the problems of using non-deterministic clustering for selecting initial training sets, showing how non-deterministic clustering methods can result in inconsistent behaviour in the active learning process. We then compare various deterministic clustering techniques and commonly used non-deterministic ones, and show that deterministic clustering algorithms are as good as non-deterministic clustering algorithms at selecting initial training examples for the active learning process. More importantly, we show that the use of deterministic approaches stabilises the active learning process. Our third direction is in the area of visualising the active learning process. 
We demonstrate the use of an existing visualisation technique in understanding active learning selection strategies, showing that a better understanding of selection strategies can be achieved with the help of visualisation techniques. Finally, to evaluate the practicality and usefulness of active learning as a general dataset labelling methodology, it is desirable that actively labelled datasets can be reused more widely instead of being limited to one particular classifier. We compare the reusability of popular active learning methods for text classification and identify the best classifiers to use in active learning for text classification. This thesis is concerned with using active learning methods to label large unlabelled textual datasets. Our domain of interest is text classification, but most of the methods proposed are quite general and so are applicable to other domains having large collections of high-dimensional data.
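
The least-confidence selection idea described in the first contribution can be sketched with a toy margin-based variant: query the unlabelled example whose two nearest class centroids are closest together, i.e. the example the current model is least sure about. This is a generic uncertainty-sampling sketch, not the thesis's aggregated confidence measurement; the centroid classifier and data are invented.

```python
# Uncertainty sampling sketch: a small margin between the two nearest class
# centroids means the current model is uncertain, so that example is queried
# for a manual label first.
import math

def margin(point, centroids):
    """Difference between the two smallest centroid distances; small = uncertain."""
    dists = sorted(math.dist(point, c) for c in centroids.values())
    return dists[1] - dists[0]

def select_query(unlabelled, centroids):
    """Pick the unlabelled example with the smallest margin for labelling."""
    return min(unlabelled, key=lambda p: margin(p, centroids))
```

Each labelled answer then updates the model and the selection repeats, which is the loop whose bootstrapping and stability the thesis's clustering contributions address.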