19 research outputs found

HTMLPhish: Enabling Phishing Web Page Detection by Applying Deep Learning Techniques on HTML Analysis

Authors: Chen Yingke, Opara Chidimma, Wei Bo
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2020

Recently, developing and launching phishing attacks has come to require little technical skill and cost. This has led to an ever-growing number of phishing attacks on the World Wide Web. Consequently, proactive techniques to fight phishing attacks have become extremely necessary. In this paper, we propose HTMLPhish, a deep learning based, data-driven, end-to-end automatic phishing web page classification approach. Specifically, HTMLPhish receives the content of the HTML document of a web page and employs Convolutional Neural Networks (CNNs) to learn the semantic dependencies in the textual contents of the HTML. The CNNs learn appropriate feature representations from the HTML document embeddings without extensive manual feature engineering. Furthermore, our proposed approach of concatenating the word and character embeddings allows our model to manage new features and ensures easy extrapolation to test data. We conduct comprehensive experiments on a dataset of more than 50,000 HTML documents whose ratio of phishing to benign web pages reflects that found in the real world, achieving over 93% accuracy and true positive rate. Also, HTMLPhish is a completely language-independent, client-side strategy and can therefore conduct web page phishing detection regardless of the textual language.
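As a rough illustration of the word-plus-character input described in the abstract above, the following sketch encodes one HTML string at both granularities and stacks the two embedded sequences. It is not the HTMLPhish implementation: the tokenizer, the sequence lengths, the embedding size, the random (untrained) embedding tables and the choice to concatenate along the sequence axis are all assumptions made for the example, and the CNN itself is omitted.

```python
import random
import re

EMBED_DIM = 4          # assumed toy embedding size
WORD_LEN, CHAR_LEN = 32, 64  # assumed fixed sequence lengths

def tokenize_words(html):
    # Assumed tokenizer: alphanumeric runs and single punctuation marks.
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", html)

def build_vocab(tokens):
    # Index 0 is reserved for padding, index 1 for unseen tokens.
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab, length):
    # Map tokens to indices, truncate to `length`, then pad with zeros.
    ids = [vocab.get(tok, 1) for tok in tokens[:length]]
    return ids + [0] * (length - len(ids))

def make_embedding_table(vocab_size, dim, seed):
    # Random vectors stand in for embeddings a model would learn.
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

html = '<form action="login.php"><input type="password"></form>'
words, chars = tokenize_words(html), list(html)
word_vocab, char_vocab = build_vocab(words), build_vocab(chars)

word_ids = encode(words, word_vocab, WORD_LEN)
char_ids = encode(chars, char_vocab, CHAR_LEN)

word_table = make_embedding_table(len(word_vocab), EMBED_DIM, seed=0)
char_table = make_embedding_table(len(char_vocab), EMBED_DIM, seed=1)

# Look up each index, then stack the word rows on top of the character rows.
document_matrix = [word_table[i] for i in word_ids] + [char_table[i] for i in char_ids]

# A word never seen when the vocabulary was built falls back to <unk>.
unseen_ids = encode(tokenize_words("<form>paypa1</form>"), word_vocab, 8)
```

In this toy setup the unseen word `paypa1` collapses to the `<unk>` index at word level, while at character level most of its letters still map to known indices. That is one plausible reading of why the combined representation helps with new features.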

arXiv.org e-Print Archive

Northumbria University Research Portal

Crossref

Lancaster E-Prints

Categorization of Phishing Detection Features And Using the Feature Vectors to Classify Phishing Websites

Publication date: 01/01/2017

Phishing is a form of online fraud where a spoofed website tries to gain access to users' sensitive information by tricking them into believing that it is a benign website. There are several solutions to detect phishing attacks, such as educating users, using blacklists, or extracting characteristics found to exist in phishing attacks. In this thesis, we analyze approaches that extract features from phishing websites and train classification models on the extracted feature set to classify phishing websites. We create an exhaustive list of all features used in these approaches and categorize them into 6 broader categories and 33 finer categories. We extract 59 features from the URL, URL redirects, hosting domain (WHOIS and DNS records) and popularity of the website, and analyze their robustness in classifying a phishing website. Our emphasis is on determining the predictive performance of robust features. We evaluate the classification accuracy when using the entire feature set and when URL features or site popularity features are excluded from the feature set, and show how our approach can be used to effectively predict specific types of phishing attacks such as shortened URLs and randomized URLs. Using both decision table classifiers and neural network classifiers, our results indicate that robust features seem to have enough predictive power to be used in practice.
Dissertation/Thesis: Masters Thesis, Computer Science 201
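To make the idea of URL-derived features concrete, here is a minimal sketch that turns a URL into a small feature dictionary. The specific features, their names, and the shortener list are illustrative assumptions, not the 59 features used in the thesis; the WHOIS, DNS, redirect and popularity features would need network lookups and are left out.

```python
import re
from urllib.parse import urlparse

# Illustrative shortener list (an assumption, not taken from the thesis).
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}

def url_features(url):
    """Extract a few lexical features from a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        # Host written as a dotted IPv4 address instead of a domain name.
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "has_at_symbol": "@" in url,
        "num_digits": sum(ch.isdigit() for ch in url),
        "uses_https": parsed.scheme == "https",
        "is_shortened": host in SHORTENERS,
    }

ip_features = url_features("http://192.168.0.1/login?user=admin")
short_features = url_features("https://bit.ly/abc")
```

A feature vector for a classifier could then be built by fixing the key order and converting the booleans to 0 or 1.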

ASU Digital Repository

An Evasion Attack against ML-based Phishing URL Detectors

Authors: Babar M. Ali, Gaire Raj, Sabir Bushra
Publication date: 18/05/2020

Background: Over the years, Machine Learning Phishing URL classification (MLPU) systems have gained tremendous popularity to detect phishing URLs proactively. Despite this vogue, the security vulnerabilities of MLPUs remain mostly unknown. Aim: To address this concern, we conduct a study to understand the test-time security vulnerabilities of state-of-the-art MLPU systems, aiming at providing guidelines for the future development of these systems. Method: In this paper, we propose an evasion attack framework against MLPU systems. To achieve this, we first develop an algorithm to generate adversarial phishing URLs. We then reproduce 41 MLPU systems and record their baseline performance. Finally, we simulate an evasion attack to evaluate these MLPU systems against our generated adversarial URLs. Results: In comparison to previous works, our attack is: (i) effective, as it evades all the models with an average success rate of 66% and 85% for famous (such as Netflix, Google) and less popular phishing targets (e.g., Wish, JBHIFI, Officeworks) respectively; (ii) realistic, as it requires only 23ms to produce a new adversarial URL variant that is available for registration with a median cost of only $11.99/year. We also found that popular online services such as Google SafeBrowsing and VirusTotal are unable to detect these URLs. (iii) We find that adversarial training (a successful defence against evasion attacks) does not significantly improve the robustness of these systems, as it decreases the success rate of our attack by only 6% on average for all the models. (iv) Further, we identify the security vulnerabilities of the considered MLPU systems. Our findings lead to promising directions for future research. Conclusion: Our study not only illustrates vulnerabilities in MLPU systems but also highlights implications for future study towards assessing and improving these systems.
Comment: Draft for ACM TOP
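The success rates quoted in the abstract above can be read as the share of adversarial URLs that a detector fails to flag. The sketch below computes that share under this assumed definition. The detector is a deliberately naive stand-in and the URLs use the reserved `.example` domain; neither comes from the paper.

```python
def evasion_success_rate(classify, adversarial_urls):
    """Fraction of adversarial URLs the detector labels as benign.

    `classify` is any callable that returns True when a URL is flagged
    as phishing (a hypothetical interface chosen for this sketch).
    """
    urls = list(adversarial_urls)
    if not urls:
        raise ValueError("need at least one URL")
    evaded = sum(1 for url in urls if not classify(url))
    return evaded / len(urls)

def toy_detector(url):
    # Naive placeholder rule, standing in for a trained MLPU model.
    return "login" in url

sample_urls = [
    "http://a.example/login",
    "http://b.example/home",
    "http://c.example/signin",
]
rate = evasion_success_rate(toy_detector, sample_urls)
```

Running the same function before and after retraining a detector would give the kind of before/after comparison the abstract reports for adversarial training.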

arXiv.org e-Print Archive

Performance Evaluation of Machine Learning Techniques for Identifying Forged and Phony Uniform Resource Locators (URLs)

Authors: Ajayi A. A., Azeez N. A.
Publication venue: African Journals Online (AJOL)
Publication date: 22/11/2019

Since the invention of Information and Communication Technology (ICT), there has been a great shift from the traditional approach of handling information across the globe to the usage of this innovation. Its application cuts across almost all areas of human endeavour. ICT is widely utilized in the education and production sectors as well as in various financial institutions. It is of note that many people use it genuinely to carry out their day-to-day activities, while others use it to perform nefarious activities to the detriment of other cyber users. According to several reports discussed in the introductory part of this work, millions of people have become victims of fake Uniform Resource Locators (URLs) sent to their mailboxes by spammers. Financial institutions are not left out of the monumental losses recorded through this illicit act over the years. It is worth mentioning that, despite the several approaches currently in place, none could confidently be confirmed to provide the best and most reliable solution. According to several research findings reported in the literature, researchers have demonstrated how machine learning algorithms could be employed to verify and confirm compromised and fake URLs in cyberspace. Inconsistencies have, however, been noticed in the researchers' findings, and their corresponding results are not dependable based on the values obtained and the conclusions drawn from them. Against this backdrop, the authors carried out a comparative analysis of three learning algorithms (Naïve Bayes, Decision Tree and Logistic Regression) for verification of compromised, suspicious and fake URLs, and determined which performs best based on the evaluation metrics used (F-measure, precision and recall). Based on the confusion matrix measurements, the results show that the Decision Tree (ID3) algorithm achieves the highest values for recall, precision and F-measure. It unarguably provides an efficient and credible means of maximizing the detection of compromised and malicious URLs. Finally, for future work, the authors are of the opinion that two or more supervised learning algorithms can be hybridized to form a single, more effective and efficient algorithm for fake URL verification.
Keywords: Learning-algorithms, Forged-URL, Phoney-URL, performance-compariso
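For reference, the three metrics named in the abstract above follow directly from confusion-matrix counts: precision = TP / (TP + FP), recall = TP / (TP + FN), and F-measure is their harmonic mean. The sketch below computes them; the counts are made-up illustration values, not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 90 fake URLs caught, 10 false alarms, 30 fake URLs missed.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

With these counts, precision is 0.90, recall is 0.75, and the F-measure is about 0.818.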

AJOL - African Journals Online

Large-Scale Lexical Classification of Phishing Websites

Author: Medzinskii David
Publication venue: Department of Computer Science, University of Bath
Publication date: 01/05/2017

The prominence of phishing has risen over the past years, with the number of unique attacks reaching an all time high in 2016. Attacks can be deployed with minimal cost and effort, enabling attackers to launch large volumes of attacks in short spaces of time. The fast-paced nature of phishing makes automated detection processes critical for the safe-guarding of Internet users. This study investigates the use of machine learning for phishing detection, with features extracted from the URL only. Through experimentation, a set of 87 effective features were identified, including a significant number of novel features not found in existing research. An evaluation of classification algorithms identified that a Random Forest model with 150 trees maximized classification performance, obtaining an F1 score of 0.92 and ROC AUC of 0.97 when testing on a noisy data set of URLs obtained from spam email - a major communication channel where phishing attacks are found. A comparison against existing research indicated that the model built in this study outperforms state-of-the-art lexical classifiers, and often outperforms classifiers that use external features too. The obtained results were used to build a large-scale lexical classifier, Poseidon, that is able to accelerate the classification of phishing sites, reducing the load on a more expensive classification process by 99%. It is shown that Poseidon outperforms existing systems of this nature with respect to various evaluation metrics. Testing on a live feed of 2 million unlabelled URLs/day, Poseidon is able to detect 6000 phishing attacks/month, costing $0.01 per true positive when using a mainstream cloud services provider. This study is one of the few to evaluate classification in a real-life scenario, using phishing and benign URLs retrieved from an environment in which a large proportion of phishing attacks operate.
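The deployment figures in the abstract above can be combined in a short back-of-the-envelope calculation. The sketch assumes a 30-day month, treats all 6000 monthly detections as true positives, and reads the 99% load reduction as 99% of URLs never reaching the expensive classifier. The abstract states none of these explicitly.

```python
URLS_PER_DAY = 2_000_000        # from the abstract
DETECTIONS_PER_MONTH = 6_000    # from the abstract
COST_PER_TRUE_POSITIVE = 0.01   # USD, from the abstract
LOAD_REDUCTION = 0.99           # from the abstract
DAYS_PER_MONTH = 30             # assumption

# Implied monthly spend if every detection is a true positive (assumption).
monthly_cost = DETECTIONS_PER_MONTH * COST_PER_TRUE_POSITIVE

# Share of the live feed that ends up flagged as phishing.
urls_per_month = URLS_PER_DAY * DAYS_PER_MONTH
detection_share = DETECTIONS_PER_MONTH / urls_per_month

# URLs per day still passed on to the more expensive classifier.
forwarded_per_day = round(URLS_PER_DAY * (1 - LOAD_REDUCTION))
```

Under these assumptions the implied spend is roughly $60 per month, about 1 in 10,000 feed URLs is flagged, and around 20,000 URLs per day reach the expensive classifier.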

OPUS

Protect sensitive sites from phishing attacks using features extractable from inaccessible phishing URLs

Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref

Understanding the human behavioural factors behind online learners’ susceptibility to phishing attacks

Author: Shargawi Ayman
Publication venue: Lancaster University
Publication date: 01/01/2017

Phishing is an act of fraudulence to lure victims to respond to an illegitimate request for the sake of a financial or informational gain (Huang, Qian, and Wang, 2012). Phishing can jeopardize the security of online learning (e-Learning) systems. Phishing cannot be prevented by depending on technical controls alone (Proctor, Schultz and Vu, 2009). Effective Information Security Awareness is key to protecting against Phishing (Chen, Shaw and Yang, 2006). However, most information security awareness programs overlook human behavioural factors as a root cause of exploitation in Phishing (Proctor et al 2009, Anttila et al 2007). This research aims to better understand the human behavioural factors behind online learners’ susceptibility to Phishing attacks (Luo et al, 2013). Thus, a literature review was conducted to identify and analyse the human behavioural factors exploited in Phishing attacks in relation to online learners’ awareness needs. A conceptual framework called ‘Security Awareness Model for Phishing’ (SAMFP) has been developed based on the integration of Endsley’s Situation Awareness model (Endsley, 2015), the awareness delivery guidelines by Chen, Shaw and Yang (2006) and Poepjes’ (2012) Information Security Awareness and Capability Model (ISACM). SAMFP aims to improve information security awareness for online learners. Hence, data was gathered from 100 participants, experienced in learning online, who completed 5 activities: a pre-awareness (1st) assessment test, participating in the 1st awareness session and group discussions, an assessment (2nd) test, participating in the 2nd awareness session and group discussions and finally a post-awareness (3rd) assessment test. Data was analysed quantitatively with 18 hypotheses to validate the effectiveness of the SAMFP model. Following a design based research approach, the researcher was heavily engaged in the design, development and testing of the SAMFP model, which included development of training materials, tutoring and assessment of learning outcomes against the research questions and objectives.

Lancaster E-Prints

Recommended from our members

FeSAD: Ransomware Detection with Machine learning using Adaption to Concept Drift

Author: Fernando D. W. A.

Ransomware classification is crucial, and the main issue with ransomware is that misclassification can have devastating effects, compromising valuable data and causing significant monetary loss to organisations. In addition to damaging businesses and individuals, ransomware is a malware type that is evolving rapidly, with new families and variants constantly appearing; this can lead to misclassifications by detection systems. Modern detection systems have moved away from heuristic detection methods and use more flexible approaches such as machine learning. Machine learning is an effective way to detect malware, and it therefore works well when detecting ransomware. Concept drift occurs in machine learning systems; this implies that the statistical properties of the target variable have changed either suddenly or over time. Concept drift is a notable weakness of a machine-learning malware detection system. Concept drift suggests the machine learning algorithm’s rules and principles have become outdated; this phenomenon can represent ransomware evolution. The concept drift phenomenon represented by ransomware evolution presents a significant challenge for machine learning intrusion detection systems because of the inevitable degradation of classification models. This thesis proposes FeSAD, a ransomware detection framework designed to counteract the concept drift in ransomware; this is achieved by combining statistical properties of ransomware and benign data with feedback from the classifier to make a reliable classification under concept drift. The FeSAD framework has a feature selection algorithm for systems expected to have concept drift. In addition, there are drift detection and adaptation components to deal with concept drift. The feature selection layer is a proactive solution that generates feature sets that will remain robust over time, and the drift layer is a reactive measure that allows a detection system to reliably and accurately classify samples that show concept drift. The FeSAD framework is designed to work with most machine learning algorithms and is tested under various concept drift scenarios with ransomware and benign files from different distributions; each distribution is defined by the year of release. The FeSAD framework was tested with random forests, multi-layer perceptrons and a Bayesian network and achieved strong results by maintaining a detection rate close to 90% in all concept drift scenarios. The FeSAD framework’s strong detection results under concept drift also show that it prolongs the lifespan of a machine learning classifier by maintaining a high detection rate across different ransomware distributions.
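One simple way to picture a reactive drift check is a sliding window over recent classifier mistakes that raises a flag once the windowed error rate climbs well above an expected baseline. The sketch below illustrates that generic idea only. It is not the FeSAD drift layer, and the class name, window size, baseline and tolerance are assumptions.

```python
from collections import deque

class WindowDriftDetector:
    """Flag drift when the recent error rate exceeds baseline + tolerance."""

    def __init__(self, window=50, baseline_error=0.10, tolerance=0.10):
        self.errors = deque(maxlen=window)  # 1 = misclassified, 0 = correct
        self.threshold = baseline_error + tolerance

    def update(self, was_error):
        """Record one outcome and return True if drift is suspected."""
        self.errors.append(1 if was_error else 0)
        if len(self.errors) < self.errors.maxlen:
            return False  # not enough evidence yet
        return sum(self.errors) / len(self.errors) > self.threshold

# Synthetic outcome stream: 10% errors for 100 samples, then 40% errors.
stream = [i % 10 == 0 for i in range(100)] + [i % 5 < 2 for i in range(100)]

detector = WindowDriftDetector()
flags = [detector.update(outcome) for outcome in stream]
first_flag = flags.index(True) if True in flags else None
```

In a deployed system a raised flag would typically trigger some form of adaptation, such as retraining on recent samples; how FeSAD itself adapts is described in the thesis, not here.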

City Research Online