Cybercriminals use phishing as a simple and cost-effective means of
perpetrating cyber-attacks on today's Internet. Recent studies in phishing
detection are increasingly adopting automated feature selection over
traditional manually engineered features. This shift is driven by the
inability of traditional methods to generalize what they learn to new
data. To this end, we propose WebPhish, a deep learning
technique that automatically extracts features from the raw URL and HTML
of a web page. This approach is the first of its kind to use the
concatenation of URL and HTML embedding feature vectors as input to a
Convolutional Neural Network model to detect phishing attacks on web pages.
Extensive experiments on a real-world dataset yielded an accuracy of 98
percent, outperforming other state-of-the-art techniques. WebPhish is also a
client-side strategy that is completely language-independent and can conduct
lightweight phishing detection regardless of the web page's textual language.
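
The language independence described above follows from working on raw characters rather than words. A minimal sketch of character-level encoding is given below; it is an illustration only, not the authors' preprocessing. The function names `build_vocab` and `encode`, the padding id 0, and the unknown-character id are all assumptions introduced here.

```python
def build_vocab(texts):
    """Map every distinct character to a positive integer id (0 is reserved for padding)."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, vocab, max_len):
    """Turn raw text into a fixed-length list of character ids.

    Characters missing from the vocabulary share one 'unknown' id
    (len(vocab) + 1). Longer inputs are truncated and shorter ones
    are right-padded with 0.
    """
    unk_id = len(vocab) + 1
    ids = [vocab.get(ch, unk_id) for ch in text[:max_len]]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    # Hypothetical training strings: one URL and one HTML fragment.
    vocab = build_vocab(["http://example.com/login", "<html><body></body></html>"])
    print(encode("http://example.com/", vocab, max_len=32))
```

Because the vocabulary is built from characters, the same routine applies to a URL or an HTML page in any natural language; no tokenizer or word list is needed.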

Chen, Yingke

Opara, Chidimma

Wei, Bo

English

arXiv

Phishing websites distribute unsolicited content and are frequently used to commit email and internet fraud. Detecting them before any user information is submitted is critical. Several efforts have been made to detect these phishing websites in recent years. Most existing approaches train classification models on hand-crafted lexical and statistical features drawn from a website's textual content. However, these approaches have limitations, including (1) the tedium of extracting hand-crafted features, which requires specialized domain knowledge to determine which features are useful for a particular platform; and (2) the difficulty that models built on hand-crafted features have in capturing the semantic patterns in the words and characters of URL and HTML content. To address these challenges, this paper proposes WebPhish, an end-to-end deep neural network trained on embedded raw URLs and HTML content to detect website phishing attacks. First, the proposed model uses an embedding technique to map the characters to corresponding dense vectors automatically. Then, a concatenation layer merges the URL and HTML embedding matrices. Following that, convolutional layers model their semantic dependencies. Extensive experiments on real-world phishing data yielded an accuracy of 98.1%, showing that WebPhish outperforms baseline detection approaches in identifying phishing pages.
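
The three stages named in the abstract (embedding, concatenation, convolution) can be illustrated with a small forward pass. This is a sketch under stated assumptions, not the WebPhish implementation. The abstract does not give layer sizes, the concatenation axis, or the pooling and output layers, so the separate embedding tables, the sequence-axis concatenation, the global max pooling, the sigmoid output, and every name below (`embed`, `conv1d_relu`, `forward`, `init_params`) are hypothetical choices made for illustration. The weights are random and untrained.

```python
import math
import random


def embed(ids, table):
    """Look up one dense vector per character id."""
    return [table[i] for i in ids]


def conv1d_relu(seq, kernel, bias):
    """Slide one filter over a sequence of vectors (no padding), then apply ReLU."""
    width = len(kernel)
    out = []
    for t in range(len(seq) - width + 1):
        s = bias
        for j in range(width):
            s += sum(x * w for x, w in zip(seq[t + j], kernel[j]))
        out.append(max(0.0, s))
    return out


def forward(url_ids, html_ids, params):
    """Embed both inputs, concatenate, convolve, pool, and return a score in (0, 1)."""
    url_emb = embed(url_ids, params["url_table"])
    html_emb = embed(html_ids, params["html_table"])
    # Assumption: the two embedding matrices are joined along the sequence axis.
    merged = url_emb + html_emb
    # Global max pooling: keep the strongest response of each filter.
    pooled = [max(conv1d_relu(merged, k, b)) for k, b in params["filters"]]
    logit = sum(p * w for p, w in zip(pooled, params["out_w"])) + params["out_b"]
    return 1.0 / (1.0 + math.exp(-logit))


def init_params(vocab_size, dim=4, n_filters=3, width=3, seed=0):
    """Create small random (untrained) parameters for the sketch."""
    rng = random.Random(seed)

    def rand_vec(n):
        return [rng.uniform(-0.5, 0.5) for _ in range(n)]

    return {
        "url_table": [rand_vec(dim) for _ in range(vocab_size)],
        "html_table": [rand_vec(dim) for _ in range(vocab_size)],
        "filters": [([rand_vec(dim) for _ in range(width)], 0.0)
                    for _ in range(n_filters)],
        "out_w": rand_vec(n_filters),
        "out_b": 0.0,
    }


if __name__ == "__main__":
    params = init_params(vocab_size=10)
    # Hypothetical character ids for a short URL and a short HTML fragment.
    print(forward([1, 2, 3, 0], [4, 5, 6, 7, 0], params))
```

In a trained model the embedding tables, filters, and output weights would be learned from labelled pages; here they only show how data flows through the stages.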

Wei, Bo

Teesside University's Research Repository

Look before you leap: Detecting phishing web pages by exploiting raw URL and HTML characteristics
