An Evasion Attack against ML-based Phishing URL Detectors
Background: Over the years, Machine Learning Phishing URL classification
(MLPU) systems have gained tremendous popularity to detect phishing URLs
proactively. Despite this vogue, the security vulnerabilities of MLPUs remain
mostly unknown. Aim: To address this concern, we conduct a study to understand
the test time security vulnerabilities of the state-of-the-art MLPU systems,
aiming at providing guidelines for the future development of these systems.
Method: In this paper, we propose an evasion attack framework against MLPU
systems. To achieve this, we first develop an algorithm to generate adversarial
phishing URLs. We then reproduce 41 MLPU systems and record their baseline
performance. Finally, we simulate an evasion attack to evaluate these MLPU
systems against our generated adversarial URLs. Results: In comparison to
previous works, our attack is: (i) effective as it evades all the models with
an average success rate of 66% and 85% for famous (such as Netflix, Google) and
less popular phishing targets (e.g., Wish, JBHIFI, Officeworks) respectively;
(ii) realistic as it requires only 23ms to produce a new adversarial URL
variant that is available for registration with a median cost of only
$11.99/year. We also found that popular online services such as Google
SafeBrowsing and VirusTotal are unable to detect these URLs. (iii) We find that
adversarial training (a successful defence against evasion attacks) does not
significantly improve the robustness of these systems, as it decreases the
success rate of our attack by only 6% on average for all the models. (iv)
Further, we identify the security vulnerabilities of the considered MLPU
systems. Our findings lead to promising directions for future research.
Conclusion: Our study not only illustrates vulnerabilities in MLPU systems but
also highlights implications for future study towards assessing and improving
these systems.
Comment: Draft for ACM TOP
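The attack success rate quoted in the abstract above can be read as the fraction of adversarial phishing URLs that a detector fails to flag. The following is a minimal sketch of that measurement under stated assumptions: `toy_detector` and the sample URLs are hypothetical stand-ins invented for illustration, not the paper's models, data, or URL-generation algorithm.

```python
from typing import Callable, Iterable


def evasion_success_rate(
    classify: Callable[[str], bool], adversarial_urls: Iterable[str]
) -> float:
    """Return the fraction of adversarial URLs the detector labels as benign.

    `classify` is assumed to return True when a URL is flagged as phishing.
    """
    urls = list(adversarial_urls)
    if not urls:
        raise ValueError("need at least one adversarial URL")
    # A URL "evades" the detector when it is NOT flagged as phishing.
    evaded = sum(1 for url in urls if not classify(url))
    return evaded / len(urls)


def toy_detector(url: str) -> bool:
    """Hypothetical detector: flags URLs containing an exact brand keyword."""
    return any(brand in url.lower() for brand in ("netflix", "google"))


# Hypothetical adversarial variants, made up for this sketch.
samples = [
    "netfl1x-login.example",
    "netflix-billing.example",
    "g00gle-verify.example",
    "goggle-docs.example",
]

rate = evasion_success_rate(toy_detector, samples)
print(f"evasion success rate: {rate:.2%}")
```

Running the same function before and after retraining a detector would give the kind of before-and-after comparison the abstract reports for adversarial training.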
A pipeline and comparative study of 12 machine learning models for text classification
Text-based communication is highly favoured as a communication method,
especially in business environments. As a result, it is often abused by sending
malicious messages, e.g., spam emails, to deceive users into relaying personal
information, including online accounts credentials or banking details. For this
reason, many machine learning methods for text classification have been
proposed and incorporated into the services of most email providers. However,
optimising text classification algorithms and finding the right tradeoff on
their aggressiveness is still a major research problem.
We present an updated survey of 12 machine learning text classifiers applied
to a public spam corpus. A new pipeline is proposed to optimise hyperparameter
selection and improve the models' performance by applying specific methods
(based on natural language processing) in the preprocessing stage.
Our study aims to provide a new methodology to investigate and optimise the
effect of different feature sizes and hyperparameters in machine learning
classifiers that are widely used in text classification problems. The
classifiers are tested and evaluated on different metrics including F-score
(accuracy), precision, recall, and run time. By analysing all these aspects, we
show how the proposed pipeline can be used to achieve good accuracy in
spam filtering on the Enron dataset, a widely used public email corpus.
Statistical tests and explainability techniques are applied to provide a robust
analysis of the proposed pipeline and interpret the classification outcomes of
the 12 machine learning models, also identifying words that drive the
classification results. Our analysis shows that it is possible to identify an
effective machine learning model to classify the Enron dataset with an F-score
of 94%.
Comment: This article has been accepted for publication in Expert Systems with
Applications, April 2022. Published by Elsevier. All data, models, and code
used in this work are available on GitHub at
https://github.com/Angione-Lab/12-machine-learning-models-for-text-classificatio
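The evaluation metrics named in the abstract above (precision, recall, F-score) can all be derived from a classifier's confusion counts. The sketch below shows the standard definitions, with plain accuracy included for comparison, since in general it is a separate quantity from the F-score. The counts are invented for illustration and are not results from the paper or the Enron dataset.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive (spam) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all messages classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical confusion counts for a spam filter (illustrative only).
tp, fp, fn, tn = 90, 10, 10, 890

p, r, f1 = precision_recall_f1(tp, fp, fn)
acc = accuracy(tp, tn, fp, fn)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} accuracy={acc:.3f}")
```

With imbalanced classes, as in this made-up example, accuracy can sit well above the F-score, which is one reason to report both.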
Commencement August 9, 2014.
The PDF for the August 9, 2014, Texas Tech University commencement exercises is 36 pages long