DeltaPhish: Detecting Phishing Webpages in Compromised Websites
The large-scale deployment of modern phishing attacks relies on the automatic
exploitation of vulnerable websites in the wild, to maximize profit while
hindering attack traceability, detection and blacklisting. To the best of our
knowledge, this is the first work that specifically leverages this adversarial
behavior for detection purposes. We show that phishing webpages can be
accurately detected by highlighting HTML code and visual differences with
respect to other (legitimate) pages hosted within a compromised website. Our
system, named DeltaPhish, can be installed as part of a web application
firewall, to detect the presence of anomalous content on a website after
compromise, and eventually prevent access to it. DeltaPhish is also robust
against adversarial attempts in which the HTML code of the phishing page is
carefully manipulated to evade detection. We empirically evaluate it on more
than 5,500 webpages collected in the wild from compromised websites, showing
that it is capable of detecting more than 99% of phishing webpages, while only
misclassifying less than 1% of legitimate pages. We further show that the
detection rate remains higher than 70% even under very sophisticated attacks
carefully designed to evade our system.
Comment: Preprint version of the work accepted at ESORICS 201
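The idea of scoring a page by how much its HTML differs from other pages on the same site can be illustrated with a minimal sketch. This is not the DeltaPhish feature set, which the abstract does not spell out; it assumes, purely for illustration, that each page is summarised by its set of HTML tag names and that the difference is measured with a Jaccard distance. The names `TagCollector`, `tag_set`, and `html_delta` are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the set of tag names that open in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def tag_set(html):
    """Return the set of tag names found in the given HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def html_delta(candidate_html, reference_html):
    """Illustrative difference score (an assumption, not the paper's metric)."""
    return jaccard_distance(tag_set(candidate_html), tag_set(reference_html))
```

A score near 1 means the candidate page shares almost no tag vocabulary with the reference page from the same site; a real detector would combine many such differences, including visual ones, and feed them to a trained classifier.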
HTMLPhish: Enabling Phishing Web Page Detection by Applying Deep Learning Techniques on HTML Analysis
Developing and deploying phishing attacks now requires little technical skill and cost. This has led to an ever-growing number of phishing attacks on the World Wide Web. Consequently, proactive techniques to fight phishing attacks have become necessary. In this paper, we propose HTMLPhish, a deep learning based, data-driven, end-to-end automatic phishing web page classification approach. Specifically, HTMLPhish receives the content of the HTML document of a web page and employs Convolutional Neural Networks (CNNs) to learn the semantic dependencies in the textual contents of the HTML. The CNNs learn appropriate feature representations from the HTML document embeddings without extensive manual feature engineering. Furthermore, our proposed approach of concatenating the word and character embeddings allows our model to manage new features and ensures easy extrapolation to test data. We conduct comprehensive experiments on a dataset of more than 50,000 HTML documents whose distribution of phishing to benign web pages reflects what is obtainable in the real world; HTMLPhish achieves over 93% accuracy and true positive rate. Also, HTMLPhish is a completely language-independent, client-side strategy that can therefore detect phishing web pages regardless of their textual language.
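As a rough illustration of the input side of such a model, the sketch below turns an HTML string into a word-level and a character-level index sequence, the kind of input that word and character embedding layers would consume. The CNN and the embedding layers themselves are omitted. The tokenisation rule, the padding and unknown-token indices, and all function names are assumptions made for this example, not details taken from the paper.

```python
import re

PAD, UNK = 0, 1  # assumed reserved indices for padding and unseen tokens


def word_tokens(html):
    """Split into alphanumeric runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", html)


def char_tokens(html):
    """Every character is its own token."""
    return list(html)


def build_vocab(token_lists):
    """Assign indices (starting at 2) in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 2
    return vocab


def encode(tokens, vocab, length):
    """Map tokens to indices, then truncate or pad to a fixed length."""
    ids = [vocab.get(tok, UNK) for tok in tokens[:length]]
    return ids + [PAD] * (length - len(ids))


def model_inputs(html, word_vocab, char_vocab, word_len, char_len):
    """Return the (word_ids, char_ids) pair for one HTML document."""
    return (
        encode(word_tokens(html), word_vocab, word_len),
        encode(char_tokens(html), char_vocab, char_len),
    )
```

Characters unseen at training time map to the unknown index at the word level but are usually still covered at the character level, which is one plausible reading of why combining the two views helps with new tokens in test data.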
PhishDef: URL Names Say It All
Phishing is an increasingly sophisticated method to steal personal user
information using sites that pretend to be legitimate. In this paper, we take
the following steps to identify phishing URLs. First, we carefully select
lexical features of the URLs that are resistant to obfuscation techniques used
by attackers. Second, we evaluate the classification accuracy when using only
lexical features, both automatically and hand-selected, vs. when using
additional features. We show that lexical features are sufficient for all
practical purposes. Third, we thoroughly compare several classification
algorithms, and we propose to use an online method (AROW) that is able to
overcome noisy training data. Based on the insights gained from our analysis,
we propose PhishDef, a phishing detection system that uses only URL names and
combines the above three elements. PhishDef is a highly accurate method (when
compared to state-of-the-art approaches over real datasets), lightweight (thus
appropriate for online and client-side deployment), proactive (based on online
classification rather than blacklists), and resilient to training data
inaccuracies (thus enabling the use of large noisy training data).
Comment: 9 pages, submitted to IEEE INFOCOM 201
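To make "lexical features of the URL" concrete, here is a minimal sketch that derives a few features from the URL string alone, without fetching the page. The specific features chosen (lengths, dot count, IPv4 host, path depth, token statistics) are illustrative assumptions; the abstract does not list PhishDef's actual feature set, and the function name `lexical_features` is hypothetical.

```python
import re
from urllib.parse import urlsplit


def lexical_features(url):
    """Compute a handful of illustrative string-only URL features."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Tokens are maximal alphanumeric runs anywhere in the URL.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", url) if t]
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "host_is_ipv4": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "path_depth": len([s for s in parts.path.split("/") if s]),
        "num_tokens": len(tokens),
        "longest_token": max((len(t) for t in tokens), default=0),
    }
```

A feature dictionary like this could be vectorised and passed to an online classifier such as the AROW method the abstract mentions; that learning step is not shown here.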
Analyzing Social and Stylometric Features to Identify Spear phishing Emails
Spear phishing is a complex targeted attack in which an attacker harvests
information about the victim prior to the attack. This information is then used
to create sophisticated, genuine-looking attack vectors, drawing the victim to
compromise confidential information. What makes spear phishing different, and
more powerful than normal phishing, is this contextual information about the
victim. Online social media services can be one such source for gathering vital
information about an individual. In this paper, we characterize and examine a
true positive dataset of spear phishing, spam, and normal phishing emails from
Symantec's enterprise email scanning service. We then present a model to detect
spear phishing emails sent to employees of 14 international organizations, by
using social features extracted from LinkedIn. Our dataset consists of 4,742
targeted attack emails sent to 2,434 victims, and 9,353 non-targeted attack
emails sent to 5,912 non-victims; and publicly available information from their
LinkedIn profiles. We applied various machine learning algorithms to this
labeled data, and achieved an overall maximum accuracy of 97.76% in identifying
spear phishing emails. We used a combination of social features from LinkedIn
profiles, and stylometric features extracted from email subjects, bodies, and
attachments. However, we achieved a slightly better accuracy of 98.28% without
the social features. Our analysis revealed that social features extracted from
LinkedIn do not help in identifying spear phishing emails. To the best of our
knowledge, this is one of the first attempts to make use of a combination of
stylometric features extracted from emails, and social features extracted from
an online social network to detect targeted spear phishing emails.
Comment: Detection of spear phishing using social media feature