73 research outputs found

    SDRS: a new lossless dimensionality reduction for text corpora

    In recent years, most content-based spam filters have been implemented using Machine Learning (ML) approaches with token-based representations of textual content. Multiple performance enhancements have been introduced, but their impact has been virtually negligible. Recent studies have introduced synset-based content representations as a reliable way to improve classification, along with different ways of exploiting semantic information to address problems such as dimensionality reduction. These preliminary solutions have some limitations and impose simplifications that must be gradually revised in order to obtain significant improvements in spam content filtering. This study addresses the problem of feature reduction by introducing a new semantic-based proposal (SDRS) that avoids losing knowledge (lossless). Synset features can be semantically grouped by taking advantage of taxonomic relations (mainly hypernyms) provided by the BabelNet ontological dictionary (e.g. “Viagra” and “Cialis” can be summarized into the single feature “anti-impotence drug”, “drug” or “chemical substance”, depending on whether they are generalized by 1, 2 or 3 levels). In order to decide how many levels should be used to generalize each synset of a dataset, our proposal takes advantage of Multi-Objective Evolutionary Algorithms (MOEA), in particular the Non-dominated Sorting Genetic Algorithm (NSGA-II). We have compared the performance achieved by a Naïve Bayes classifier, using both token-based and synset-based dataset representations, with and without dimensionality reduction. As a result, our lossless semantic reduction strategy was able to find optimal semantic-based feature grouping strategies for the input texts, leading to better performance of Naïve Bayes classifiers.
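The hypernym-based grouping described in this abstract can be illustrated with a minimal sketch. The taxonomy below is a hand-written toy dictionary built from the abstract's own example, not BabelNet data or its API, and the names `HYPERNYM`, `generalize` and `reduce_features` are hypothetical. The per-synset levels are supplied by hand here; in the paper they are chosen by the evolutionary search.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> direct hypernym), taken from the
# abstract's example. This is NOT BabelNet data or the BabelNet API.
HYPERNYM = {
    "viagra": "anti-impotence drug",
    "cialis": "anti-impotence drug",
    "anti-impotence drug": "drug",
    "drug": "chemical substance",
}


def generalize(synset, levels):
    """Climb up to `levels` hypernym steps, stopping early at the root."""
    for _ in range(levels):
        parent = HYPERNYM.get(synset)
        if parent is None:
            break
        synset = parent
    return synset


def reduce_features(counts, levels_per_synset):
    """Merge the counts of synsets that generalize to the same hypernym.

    Counts are summed rather than dropped, so no occurrence is discarded.
    Synsets without an entry in `levels_per_synset` are left unchanged.
    """
    reduced = Counter()
    for synset, count in counts.items():
        target = generalize(synset, levels_per_synset.get(synset, 0))
        reduced[target] += count
    return reduced


# Made-up document: three features collapse to two after one level.
doc = Counter({"viagra": 2, "cialis": 1, "meeting": 3})
one_level = reduce_features(doc, {"viagra": 1, "cialis": 1})
print(dict(one_level))
```

Raising the level for both drug synsets to 2 or 3 would merge them under "drug" or "chemical substance" instead, which is the trade-off between compactness and specificity that the optimization has to balance.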

    A Review on mobile SMS Spam filtering techniques

    Short messaging service (SMS) spam refers to unsolicited or undesired messages received on mobile phones. SMS spam constitutes a veritable nuisance to mobile subscribers. This marketing practice also worries service providers because it upsets their clients or even causes them to lose subscribers. To mitigate this practice, researchers have proposed several solutions for the detection and filtering of SMS spam. In this paper, we present a review of the currently available methods, challenges, and future research directions on the detection, filtering, and mitigation of mobile SMS spam. The existing research literature is critically reviewed and analyzed. The most popular techniques for SMS spam detection, filtering, and mitigation are compared, including the data sets used, their findings, and their limitations, and future research directions are discussed. This review is designed to assist expert researchers in identifying open areas that need further improvement.

    SMS Spam Classification Using Support Vector Machine

    It is now common for a cellphone to receive spam messages. The large number of received messages makes it difficult for humans to classify them as spam or not spam. One way to overcome this problem is to use data mining for automatic classification. In this paper, we investigate several data mining techniques, namely Support Vector Machine, Multinomial Naïve Bayes and Decision Tree, for automatic spam detection. Our experimental results show that Support Vector Machine is the best of the three evaluated algorithms. Support Vector Machine achieves 98.33% accuracy, while Multinomial Naïve Bayes achieves 98.13% and Decision Tree 97.10%.

    A deep learning method for automatic SMS spam classification: Performance of learning algorithms on indigenous dataset

    SMS, one of the most popular and fast-growing GSM value-added services worldwide, has attracted unwanted SMS, also known as SMS spam. The effects of SMS spam are significant as it affects both users and service providers, causing a massive gap in trust between the two parties. This article presents a deep learning model based on BiLSTM. Further, it compares our results with some state-of-the-art machine learning (ML) algorithms on two datasets: our newly collected dataset and the popular UCI SMS dataset. This study aims to evaluate the performance of diverse learning models and compare the results on the new expanded dataset (ExAIS_SMS) using the following metrics: true positive (TP), false positive (FP), F-measure, recall, precision, and overall accuracy. The average accuracy of the BiLSTM model was moderately better than that of some of the ML classifiers. The experimental results improved significantly over the ground truth results after effective fine-tuning of some of the parameters. The BiLSTM model attained an accuracy of 93.4% on the ExAIS_SMS dataset and 98.6% on the UCI dataset. Further comparison with state-of-the-art ML classifiers gave accuracies of 89.64%, 91.11%, 88.24%, 75.76%, 80.24%, and 79.2% for Naive Bayes, BayesNet, SOM, decision tree, C4.5, and J48, respectively, on the ExAIS_SMS dataset. In conclusion, our proposed BiLSTM model showed significant improvement over traditional ML classifiers. To further validate the robustness of our model, we applied it to the UCI dataset, and our results showed optimal performance in classifying SMS spam messages according to accuracy, precision, recall, and F-measure.
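The metrics named in this abstract all derive from the four confusion-matrix counts. The following is a minimal sketch of the standard definitions, with spam as the positive class. The function name `classification_metrics` is hypothetical and the counts in the example are made up for illustration; they are not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are true/false positives/negatives with spam as the
    positive class. Ratios with an empty denominator are reported as 0.0.
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Illustrative counts only (assumed, not taken from the paper).
metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced spam data, accuracy alone can look high while recall on the spam class stays low, which is why the abstract reports precision, recall and F-measure alongside it.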

    Multi-objective evolutionary optimization for dimensionality reduction of texts represented by synsets

    Despite new developments in machine learning classification techniques, improving the accuracy of spam filtering is a difficult task due to linguistic phenomena that limit its effectiveness. In particular, we highlight polysemy, synonymy, the usage of hypernyms/hyponyms, and the presence of irrelevant/confusing words. These problems should be solved at the pre-processing stage to avoid using inconsistent information in the building of classification models. Previous studies have suggested that synset-based representation strategies could be successfully used to solve synonymy and polysemy problems. In addition, hyponymy/hypernymy relations can be exploited to implement dimensionality reduction strategies. These strategies could unify textual terms to model the intentions of the document without losing any information (e.g., bringing together the synsets “viagra”, “cialis”, “levitra” and others representing similar drugs by using “virility drug”, which is a hypernym of all of them). These feature reduction schemes are known as lossless strategies, as the information is not removed but only generalised. However, in some types of text classification problems (such as spam filtering) it may not be worthwhile to keep all the information, and it may be preferable to let dimensionality reduction algorithms discard information that may be irrelevant or confusing. In this work, we formulate feature reduction as a multi-objective optimisation problem to be solved using a Multi-Objective Evolutionary Algorithm (MOEA). With minor modifications, our algorithm can implement lossless (using only semantic-based synset grouping), low-loss (discarding irrelevant information and using semantic-based synset grouping) or lossy (discarding only irrelevant information) strategies.
The contribution of this study is two-fold: (i) to introduce different dimensionality reduction methods (lossless, low-loss and lossy) as an optimisation problem that can be solved using MOEA and (ii) to provide an experimental comparison of lossless and low-loss schemes for text representation. The results obtained support the usefulness of the low-loss method to improve the efficiency of classifiers.
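Treating feature reduction as a multi-objective problem means that candidate solutions are compared by Pareto dominance rather than by a single score. The sketch below shows only that dominance test and the resulting non-dominated set. It is not the authors' algorithm and omits everything else an MOEA does (selection, crossover, mutation). The two objectives, feature count and error rate, both minimized, are an assumption for illustration, and the candidate values are made up.

```python
def dominates(a, b):
    """True if `a` Pareto-dominates `b` when every objective is minimized.

    `a` must be no worse than `b` in all objectives and strictly better
    in at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [
        p
        for p in points
        if not any(dominates(q, p) for q in points if q != p)
    ]


# Hypothetical candidates as (number of features, error rate).
candidates = [
    (1000, 0.05),
    (400, 0.06),
    (400, 0.08),   # dominated by (400, 0.06)
    (150, 0.10),
    (900, 0.07),   # dominated by (400, 0.06)
]
front = pareto_front(candidates)
print(front)
```

Each point left on the front represents a different compromise between a smaller feature set and a lower error, so the final choice among them remains a design decision.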

    Hierarchical categorisation of web tags for Delicious

    In the scenario of social bookmarking, a user browsing the Web bookmarks web pages and assigns free-text labels (i.e., tags) to them according to their personal preferences. The benefits of social tagging are clear: tags enhance Web content browsing and search. However, since these tags may be publicly available to any Internet user, a privacy attacker may collect this information and extract an accurate snapshot of users’ interests or user profiles, containing sensitive information such as health-related information, political preferences, salary or religion. In order to hinder attackers in their efforts to profile users, this report focuses on the practical aspects of capturing user interests from their tagging activity. More specifically, we study how to categorise a collection of tags posted by users in one of the most popular bookmarking services, Delicious (http://delicious.com).

    MDLText Applied to the Automatic Filtering of SPIM and SMS Spam

    Spam filtering in online instant messages and SMS is a challenging problem because the messages are often very short and rife with slang, idioms, symbols, emoticons, and abbreviations, which hamper prediction and knowledge discovery. To address this problem, we evaluated a simple, fast, scalable, multiclass, and online text classification method based on the minimum description length principle. We conducted experiments using a real and public dataset, which demonstrate that our method is effective for instant messaging and SMS spam filtering in both online and offline learning contexts.

    Hierarchical categorisation of tags for delicious

    In the scenario of social bookmarking, a user browsing the Web bookmarks web pages and assigns free-text labels (i.e., tags) to them according to their personal preferences. In this technical report, we address one of the practical aspects of representing users' interests from their tagging activity, namely the categorisation of tags into high-level categories of interest. The reason is that representing user profiles on the basis of the myriad of tags available on the Web is infeasible from various practical perspectives, mainly because of the unavailability of data to reliably and accurately measure interests across such a fine-grained categorisation and, should the data be available, its overwhelming computational intractability. Motivated by this, our study presents the results of a categorisation process whereby a collection of tags posted at Delicious (http://delicious.com) are classified into 200 subcategories of interest.

    A systematic survey of online data mining technology intended for law enforcement

    As an increasing amount of crime takes on a digital aspect, law enforcement bodies must tackle an online environment generating huge volumes of data. With manual inspections becoming increasingly infeasible, law enforcement bodies are optimising online investigations through data-mining technologies. Such technologies must be well designed and rigorously grounded, yet no survey of the online data-mining literature exists which examines their techniques, applications and rigour. This article remedies this gap through a systematic mapping study describing online data-mining literature which visibly targets law enforcement applications, using evidence-based practices in survey making to produce a replicable analysis which can be methodologically examined for deficiencies