1,148 research outputs found

Using online linear classifiers to filter spam Emails

Author: Jones Gareth J.F.
Wang Bin
Wenfeng Pan
Publication venue: Springer Verlag
Publication date: 01/11/2007

The performance of two online linear classifiers, the Perceptron and Littlestone’s Winnow, is explored on two anti-spam filtering benchmark corpora, PU1 and Ling-Spam. We study the performance for varying numbers of features, along with three different feature selection methods: Information Gain (IG), Document Frequency (DF) and Odds Ratio. The size of the training set and the number of training iterations are also investigated for both classifiers. The experimental results show that both the Perceptron and Winnow perform much better when using IG or DF than when using Odds Ratio. It is further demonstrated that when using IG or DF, the classifiers are insensitive to the number of features and the number of training iterations, and not greatly sensitive to the size of the training set. Winnow is shown to slightly outperform the Perceptron. It is also demonstrated that both of these online classifiers perform much better than a standard Naïve Bayes method. The theoretical and implementation computational complexity of these two classifiers is very low, and they are very easily adaptively updated. They outperform most of the published results, while being significantly easier to train and adapt. The analysis and promising experimental results indicate that the Perceptron and Winnow are two very competitive classifiers for anti-spam filtering.
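The mistake-driven updates that make these two classifiers cheap to train and adapt can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes binary term-presence features, labels of +1 (spam) and -1 (legitimate), the textbook Winnow variant with promotion factor `alpha` and threshold `n_features / 2`, and a made-up four-feature toy set (`TOY_DATA`). The paper may well use a different Winnow variant and different parameters.

```python
from typing import List, Sequence, Tuple

# (binary feature vector, label in {+1, -1}); +1 stands for spam here.
Example = Tuple[Sequence[int], int]


def dot(w: Sequence[float], x: Sequence[int]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(data: List[Example], n_features: int,
                     epochs: int = 10) -> Tuple[List[float], float]:
    """Additive updates: on a mistake, move w toward y * x."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if dot(w, x) + b > 0 else -1
            if pred != y:
                for i, xi in enumerate(x):
                    w[i] += y * xi
                b += y
    return w, b


def predict_perceptron(w: Sequence[float], b: float, x: Sequence[int]) -> int:
    return 1 if dot(w, x) + b > 0 else -1


def train_winnow(data: List[Example], n_features: int, epochs: int = 10,
                 alpha: float = 2.0) -> Tuple[List[float], float]:
    """Multiplicative updates: promote or demote weights of active features."""
    w = [1.0] * n_features
    theta = n_features / 2.0  # assumed threshold, a common textbook choice
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if dot(w, x) > theta else -1
            if pred != y:
                factor = alpha if y == 1 else 1.0 / alpha
                for i, xi in enumerate(x):
                    if xi:
                        w[i] *= factor
    return w, theta


def predict_winnow(w: Sequence[float], theta: float, x: Sequence[int]) -> int:
    return 1 if dot(w, x) > theta else -1


# Hypothetical toy data: feature 0 occurs only in the "spam" examples.
TOY_DATA: List[Example] = [
    ([1, 0, 1, 0], 1),
    ([1, 1, 0, 0], 1),
    ([0, 1, 1, 0], -1),
    ([0, 0, 1, 1], -1),
]
```

Both learners touch only the weights of a single message on each mistake, which is why they can be updated one message at a time as new labelled mail arrives.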

Irish Universities

DCU Online Research Access Service

Active Multi-Field Learning for Spam Filtering

Author: Liu Wuying
Wang Lin
Xie Nan
Yi Mianzhu
Publication venue: Institute of Informatics, Slovak Academy of Sciences
Publication date: 11/02/2015

Ubiquitous spam messages cause a serious waste of time and resources. This paper addresses the practical spam filtering problem and proposes a universal approach to combat various kinds of spam messages. The proposed active multi-field learning approach rests on two observations: 1) obtaining a label is costly for a real-world spam filter, which suggests an active learning idea; and 2) different messages often have a similar multi-field text structure, which suggests a multi-field learning idea. The multi-field learning framework combines the results predicted by multiple field classifiers using a novel compound weight, and each field classifier calculates the arithmetic average of multiple conditional probabilities predicted from feature strings according to a string-frequency index data structure. By comparing the current variance of the field classifying results with the historical variance, the active learner evaluates the classifying confidence and regards the more uncertain message as the more informative sample for which to request a label. The experimental results show that the proposed approach can achieve state-of-the-art performance at greatly reduced label requirements in both email spam filtering and short text spam filtering. The performance of our active multi-field learning, on the standard (1-ROCA)% measurement, even exceeds the full-feedback performance of some advanced individual classifying algorithms.
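The interplay between field classifiers and the variance-based label request can be sketched roughly as below. This is only an illustration under several assumptions that the abstract does not confirm: field scores are combined with a plain unweighted mean (the paper's compound weight is not specified here), the historical variance is taken as the mean of all earlier variances, and a label is requested whenever the current variance exceeds that history. The class name `ActiveMultiFieldSketch` and the field names are hypothetical.

```python
from statistics import mean, pvariance
from typing import Dict, List, Tuple


def field_score(cond_probs: List[float]) -> float:
    """Arithmetic average of per-feature spam probabilities for one field."""
    # 0.5 (undecided) is an assumed fallback for a field with no known features.
    return mean(cond_probs) if cond_probs else 0.5


class ActiveMultiFieldSketch:
    """Hypothetical helper; not the paper's actual learner."""

    def __init__(self) -> None:
        self.variance_history: List[float] = []

    def classify(self, fields: Dict[str, List[float]]) -> Tuple[float, bool]:
        """Return (combined spam score, whether to request a label).

        `fields` maps a field name to the conditional probabilities of its
        feature strings and must contain at least one field.
        """
        scores = [field_score(probs) for probs in fields.values()]
        combined = mean(scores)          # assumption: uniform field weights
        current_var = pvariance(scores)  # disagreement among field classifiers
        historical_var = (mean(self.variance_history)
                          if self.variance_history else 0.0)
        request_label = current_var > historical_var
        self.variance_history.append(current_var)
        return combined, request_label
```

In this sketch, a message whose fields agree (for example, subject and body both scoring near 0.9) produces zero variance and no label request, while a message whose fields disagree sharply is flagged as informative.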

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

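Students' computational thinking skills through an Internet of Things technology-based learning module [Kemahiran pemikiran komputasional pelajar melalui modul pembelajaran berasaskan teknologi internet pelbagai benda]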

Author: Krishnan Puganesri
Publication date: 01/08/2021

students' computational thinking skills, toward greater creativity and critical thinking, through the use of the Internet of Things Technology-Based Learning Module (MP-IoT) developed by the researcher. The development of MP-IoT followed the ADDIE model and involved Arduino technology applied in 5 hands-on learning activities. This quantitative, quasi-experimental study was conducted on 52 Form 4 students from 2 schools in the districts of Batu Pahat, Johor and Kuala Kangsar, Perak. The data were analysed descriptively and inferentially. A set of pre- and post-achievement tests was developed as the instrument. Item analysis using the Difficulty Index (IK) and the Discrimination Index, together with interpretation of Cronbach's Alpha scores, was used to ensure that the achievement test questions were suitable for use. Meanwhile, during development of the MP-IoT module, 6 Computer Science teachers were selected as experts to assess the suitability of the module in terms of format, content and usability. A five-point Likert scale was used in this study. Overall, the findings from a paired-sample t-test show a significant difference in achievement between the control group, exposed to the conventional method, and the treatment group, exposed to the MP-IoT module, with a p-value of .000, which is less than .05 (p<0.05). In addition, the students' level of computational thinking skills also increased after exposure to the MP-IoT module.

UTHM Institutional Repository

Efficient and Trustworthy Review/Opinion Spam Detection

Author: Sanketi P. Raut, Prof. Chitra Wasnik
Publication venue: Auricle Technologies, Pvt., Ltd.
Publication date: 30/04/2017

The most common way for consumers to express their level of satisfaction with their purchases is through online ratings, which we refer to as an Online Review System. Network analysis has recently gained a lot of attention because of the arrival and increasing attractiveness of social sites, such as blogs, social networking applications, microblogging, or customer review sites. Reviews are used by potential customers to find the opinions of existing users before purchasing products. Online review systems play an important part in affecting consumers' actions and decision making, and therefore attract many spammers who insert fake feedback or reviews in order to manipulate review content and ratings. Malicious users misuse review websites and post untrustworthy, low-quality, or sometimes fake opinions, which are referred to as Spam Reviews. In this study, we aim to provide an efficient method to identify spam reviews and to filter out the spam content, using a dataset from gsmarena.com. Experiments on the dataset collected from gsmarena.com show that the proposed system achieves higher accuracy than standard naïve Bayes.

International Journal on Recent and Innovation Trends in Computing and Communication

A Survey of Email Spam Filtering Methods

Author: Sharma Madhvi
Sharma Sumit
Publication venue: Control Theory and Informatics
Publication date: 30/08/2018

E-mail is one of the most secure media for online communication and for transferring data or messages over the web. With its ever-increasing popularity, the volume of unsolicited messages has also grown rapidly. Different approaches exist to filter this data by automatically detecting and removing these unwanted messages. There are several email spam filtering techniques, such as knowledge-based techniques, clustering techniques, learning-based techniques, heuristic processes and so on. This paper surveys existing email spam filtering systems that use Machine Learning Techniques (MLT) such as Naive Bayes, SVM, K-Nearest Neighbor, Bayes Additive Regression, KNN Tree, and rules. We present the classification, evaluation and comparison of different email spam filtering systems. Keywords: e-mail spam, spam filtering methods, machine learning technique, classification, SVM, AN

International Institute for Science, Technology and Education (IISTE): E-Journals

Computing with Granular Words

Author: Hou Hailong
Publication venue: ScholarWorks @ Georgia State University
Publication date: 07/05/2011

Computational linguistics is a sub-field of artificial intelligence; it is an interdisciplinary field dealing with statistical and/or rule-based modeling of natural language from a computational perspective. Traditionally, fuzzy logic is used to deal with fuzziness among single linguistic terms in documents. However, linguistic terms may be related to other types of uncertainty. For instance, when different users search for ‘cheap hotel’ in a search engine, they may need distinct pieces of relevant hidden information such as shopping, transportation, weather, etc. Therefore, this research work focuses on studying granular words and developing new algorithms to process them so as to deal with uncertainty globally. To precisely describe the granular words, a new structure called the Granular Information Hyper Tree (GIHT) is constructed. Furthermore, several techniques are developed to apply computing with granular words to spam filtering and query recommendation. Based on simulation results, the GIHT-Bayesian algorithm achieves more accurate spam filtering than the conventional Naive Bayesian and SVM methods; computing with granular words also generates better recommendation results, based on users’ assessment, when applied to a search engine.

ScholarWorks @ Georgia State University

A discrete hidden Markov model for SMS spam detection

Author: Chen Xuemin
Xia Tian
Publication venue: Digital Scholarship @ Texas Southern University
Publication date: 01/07/2020

Many machine learning methods have been applied to short messaging service (SMS) spam detection, including traditional methods such as naive Bayes (NB), the vector space model (VSM), and the support vector machine (SVM), and novel methods such as long short-term memory (LSTM) and the convolutional neural network (CNN). These methods are based on the well-known bag of words (BoW) model, which assumes documents are unordered collections of words. This assumption overlooks an important piece of information, i.e., word order. Moreover, the term frequency, which counts the number of occurrences of each word in an SMS, is unable to distinguish the importance of words, due to the length limitation of SMS. This paper proposes a new method based on the discrete hidden Markov model (HMM) to use the word order information and to solve the low term frequency issue in SMS spam detection. The popularly adopted SMS spam dataset from the UCI machine learning repository is used for performance analysis of the proposed HMM method. The overall performance is comparable to that of deep learning with CNN and LSTM models. A Chinese SMS spam dataset with 2000 messages is used for further performance evaluation. Experiments show that the proposed HMM method is not language-sensitive and can identify spam with high accuracy on both datasets.
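To illustrate why a discrete HMM can exploit word order where a bag-of-words model cannot, the following sketch scores a sequence of symbol indices with the scaled forward algorithm. It is a generic textbook routine, not the paper's model: the two-state parameters (`START`, `TRANS`, `EMIT`) are made-up values, and how the authors define states, observations, and the final spam decision is not stated in the abstract.

```python
import math
from typing import List, Sequence


def forward_log_likelihood(obs: Sequence[int],
                           start: Sequence[float],
                           trans: Sequence[Sequence[float]],
                           emit: Sequence[Sequence[float]]) -> float:
    """Log P(obs | HMM) using the forward algorithm with per-step scaling.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s emits symbol o
    """
    if not obs:
        return 0.0  # the empty sequence has probability 1
    n = len(start)
    alpha: List[float] = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik


# Hypothetical two-state model: state 0 mostly emits symbol 0, state 1 mostly
# emits symbol 1, and the chain tends to move into state 1.
START = [1.0, 0.0]
TRANS = [[0.1, 0.9], [0.1, 0.9]]
EMIT = [[0.9, 0.1], [0.1, 0.9]]
```

Under these toy parameters the sequences `[0, 1]` and `[1, 0]` contain the same symbols but receive different likelihoods, which is exactly the order information that term counts discard.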

Multidisciplinary Digital Publishing Institute

Texas Southern University, School of Public Affairs: Digital Scholarship