4 research outputs found

    Deobfuscating Leetspeak With Deep Learning to Improve Spam Filtering

    The evolution of anti-spam filters has forced spammers to make greater efforts to bypass them in order to distribute content over networks. Distributing content encoded in images and using Leetspeak are two clear examples of techniques currently used to bypass filters. Despite the importance of these problems, few studies have addressed them, and the reported performance is very limited. This study reviews the existing, still rudimentary, work on Leetspeak deobfuscation and proposes a new technique that uses neural networks for decoding. In addition, we distribute an image database specifically created for training Leetspeak decoding models. We have also created and made available four different corpora to analyse the performance of Leetspeak decoding schemes. Using these corpora, we have experimentally evaluated our neural network approach for decoding Leetspeak. The results show the usefulness of the proposed model for deobfuscating Leetspeak character sequences.
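
To make the decoding problem concrete, the sketch below shows a naive fixed-table substitution. It is not the neural network approach proposed in the study, and the table `LEET_TABLE` and function `naive_deleet` are hypothetical names with an illustrative subset of symbols. It mainly illustrates why a one-to-one table is insufficient: a symbol such as "1" can stand for either "l" or "i", and genuine digits get rewritten too.

```python
# Hypothetical substitution table: an illustrative subset, not taken from the paper.
LEET_TABLE = {
    "0": "o", "1": "l", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
}

def naive_deleet(text: str) -> str:
    """Lower-case the text and replace each known symbol with one fixed letter."""
    return "".join(LEET_TABLE.get(ch, ch) for ch in text.lower())

print(naive_deleet("fr33 m0n3y"))  # free money
print(naive_deleet("V14GR4"))      # vlagra ("1" was meant as "i" here)
```

A context-aware decoder is needed to resolve such ambiguities, which is the gap a learned model is meant to fill.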

    Multi-objective evolutionary optimization for dimensionality reduction of texts represented by synsets

    Despite new developments in machine learning classification techniques, improving the accuracy of spam filtering is a difficult task due to linguistic phenomena that limit its effectiveness. In particular, we highlight polysemy, synonymy, the usage of hypernyms/hyponyms, and the presence of irrelevant/confusing words. These problems should be solved at the pre-processing stage to avoid using inconsistent information when building classification models. Previous studies have suggested that synset-based representation strategies could be used successfully to solve synonymy and polysemy problems. In addition, hyponymy/hypernymy relations can be exploited to implement dimensionality reduction strategies. These strategies could unify textual terms to model the intentions of the document without losing any information (e.g., bringing together the synsets “viagra”, “cialis”, “levitra” and others representing similar drugs by using “virility drug”, which is a hypernym of all of them). These feature reduction schemes are known as lossless strategies, as the information is not removed but only generalised. However, in some types of text classification problems (such as spam filtering) it may not be worthwhile to keep all the information, and it may be better to let dimensionality reduction algorithms discard information that is irrelevant or confusing. In this work, we formulate feature reduction as a multi-objective optimisation problem to be solved using a Multi-Objective Evolutionary Algorithm (MOEA). With minor modifications, our algorithm can implement lossless (using only semantic-based synset grouping), low-loss (discarding irrelevant information and using semantic-based synset grouping) or lossy (only discarding irrelevant information) strategies. 
The contribution of this study is two-fold: (i) to introduce different dimensionality reduction methods (lossless, low-loss and lossy) as an optimization problem that can be solved using a MOEA and (ii) to provide an experimental comparison of lossless and low-loss schemes for text representation. The results obtained support the usefulness of the low-loss method to improve the efficiency of classifiers.
Funding: Agencia Estatal de Investigación | Ref. TIN2017-84658-C2-1-R; Agencia Estatal de Investigación | Ref. TIN2017-84658-C2-2-R; Xunta de Galicia | Ref. ED431C 2022/03-GRC; Eusko Jaurlaritza | Ref. IT1676-22; Fundação para a Ciência e a Tecnologia | Ref. UIDB/04466/2020; Fundação para a Ciência e a Tecnologia | Ref. UIDP/04466/202
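
As a rough illustration of what treating feature reduction as a multi-objective problem involves, the sketch below filters candidate reduction schemes down to the non-dominated (Pareto-optimal) ones. The two objectives used here (number of features and error rate, both minimised) and all values are assumptions made for illustration; the abstract does not state the objectives actually optimised, and this is not the study's MOEA, only the dominance test that such algorithms rely on.

```python
from typing import List, Tuple

# Assumed objectives: (number_of_features, error_rate), both to be minimised.
Candidate = Tuple[int, float]

def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on every objective and better on at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Made-up values for illustration only; not results from the study.
candidates = [(1000, 0.05), (400, 0.06), (400, 0.09), (150, 0.12), (900, 0.07)]
print(pareto_front(candidates))
# [(1000, 0.05), (400, 0.06), (150, 0.12)]
```

The surviving candidates represent different trade-offs between compactness and accuracy, so the final choice among them is left to the user rather than to a single weighted score.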

    Lossless dimensionality reduction in semantic representations of text

    Spam currently accounts for more than 50% of email traffic. It is the entry point for many of the data-hijacking (ransomware) attacks that companies suffer. This work proposes using semantic information in anti-spam filters, replacing and grouping words such as ‘Viagra’, ‘Cialis’ or ‘Tadalafil’ with their hypernym ‘anti_impotence_drug’ and using synsets (sets of synonyms) for their representation. A system for generalising concepts/words without loss of information has been designed and tested, combining semantic information with multi-objective genetic algorithms. The results obtained show that it is possible to improve the detection of legitimate messages and to increase classification speed.
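
A minimal sketch of the hypernym grouping idea follows, assuming a hand-written mapping. The dictionary `HYPERNYM` and the function `generalise` are hypothetical; a real system would derive the mapping from a lexical resource and operate on synsets rather than raw words. Term counts are merged rather than discarded, which is the sense in which the grouping loses no information.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical mapping from terms to a shared hypernym (hard-coded for illustration).
HYPERNYM: Dict[str, str] = {
    "viagra": "anti_impotence_drug",
    "cialis": "anti_impotence_drug",
    "tadalafil": "anti_impotence_drug",
}

def generalise(tokens: Iterable[str]) -> Counter:
    """Count features after replacing each mapped token with its hypernym."""
    return Counter(HYPERNYM.get(t.lower(), t.lower()) for t in tokens)

features = generalise(["Buy", "Viagra", "Cialis", "or", "Tadalafil", "now"])
print(features["anti_impotence_drug"])  # 3
print(len(features))                    # 4 distinct features instead of 6
```

Fewer distinct features means a smaller vector for the classifier, while the total number of counted occurrences stays the same.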