
2 research outputs found

Perceptual Hashing applied to Tor domains recognition

Author: Biswas Rubel
Fernandez Eduardo Fidalgo
Martino Francisco Jáñez
Medina Pablo Blanco
Vasco-Carofilis Roberto A.
Publication date: 21/05/2020

The Tor darknet hosts different types of illegal content, which are monitored by cybersecurity agencies. However, manually classifying Tor content can be slow and error-prone. To support this task, we introduce Frequency-Dominant Neighborhood Structure (F-DNS), a new perceptual hashing method for automatically classifying domains by their screenshots. First, we evaluated F-DNS using images subjected to various content-preserving operations. We compared them with their original images, achieving better correlation coefficients than other state-of-the-art methods, especially in the case of rotation. Then, we applied F-DNS to categorize Tor domains using the Darknet Usage Service Images-2K (DUSI-2K), a dataset with screenshots of active Tor service domains. Finally, we measured the performance of F-DNS against an image classification approach and a state-of-the-art hashing method. Our proposal obtained 98.75% accuracy on Tor images, surpassing all other methods compared.

Comment: To be published at the JNIC 2020 Conference. Already published research summar
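As a rough illustration of the general idea behind perceptual hashing, the sketch below implements a plain average hash on tiny grayscale grids and compares hashes by Hamming distance. This is not F-DNS, whose details are not given in the abstract; the function names, the 4x4 toy images, and the brightness and mirroring operations are all assumptions chosen for illustration.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels in the range 0-255 (toy representation)


def average_hash(img: Image) -> int:
    """Pack one bit per pixel: 1 if the pixel is at or above the image mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def brighten(img: Image, delta: int) -> Image:
    """A simple content-preserving operation: add a constant, clipped at 255."""
    return [[min(255, p + delta) for p in row] for row in img]


# Hypothetical 4x4 "screenshot": dark left half, bright right half.
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [10, 20, 200, 210],
    [15, 25, 205, 215],
]
brighter = brighten(original, 30)
mirrored = [list(reversed(row)) for row in original]

h_orig = average_hash(original)
print(hamming(h_orig, average_hash(brighter)))  # brightness shift keeps the hash
print(hamming(h_orig, average_hash(mirrored)))  # mirroring flips every bit here
```

In this toy case the brightened image hashes identically to the original, while the mirrored one differs in every bit. That sensitivity to geometric change suggests why robustness to operations such as rotation is a meaningful point of comparison between hashing methods.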

arXiv.org e-Print Archive

Classification of Spam Emails through Hierarchical Clustering and Supervised Learning

Author: Fidalgo Eduardo
González-Martínez Santiago
Jáñez-Martino Francisco
Velasco-Mata Javier
Publication date: 28/05/2020

Spammers take advantage of email popularity to indiscriminately send unsolicited emails. Although researchers and organizations continuously develop anti-spam filters based on binary classification, spammers bypass them through new strategies, like word obfuscation or image-based spam. For the first time in the literature, we propose to classify spam email into categories to improve the handling of already detected spam emails, instead of just using a binary model. First, we applied a hierarchical clustering algorithm to create SPEMC-11K (SPam EMail Classification), the first multi-class dataset, which contains three types of spam emails: Health and Technology, Personal Scams, and Sexual Content. Then, we used SPEMC-11K to evaluate the combination of TF-IDF and BOW encodings with Naïve Bayes, Decision Trees and SVM classifiers. Finally, we recommend for the task of multi-class spam classification the use of (i) TF-IDF combined with SVM for the best micro F1 score performance, 95.39%, and (ii) TF-IDF along with NB for the fastest spam classification, analyzing an email in 2.13 ms.

Comment: 4 pages, 2 figures, to be published in conference JNIC 202
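For readers unfamiliar with the micro F1 score reported above, the following minimal sketch shows how it can be computed for a single-label, multi-class task by pooling true positives, false positives, and false negatives over all classes. The label strings and the toy predictions are made up for illustration and are not taken from SPEMC-11K.

```python
from collections import Counter
from typing import Sequence


def micro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    denom = 2 * total_tp + total_fp + total_fn
    return 2 * total_tp / denom if denom else 0.0


# Hypothetical labels for the three categories named in the abstract.
y_true = ["health_tech", "health_tech", "scam", "scam", "sexual", "sexual"]
y_pred = ["health_tech", "scam", "scam", "scam", "sexual", "health_tech"]

score = micro_f1(y_true, y_pred)
print(round(score, 4))
```

When every email receives exactly one label, as in this sketch, each error counts once as a false positive and once as a false negative, so micro F1 coincides with plain accuracy (4 correct out of 6 in the toy data).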

arXiv.org e-Print Archive