3 research outputs found
GlyphNet: Homoglyph domains dataset and detection using attention-based Convolutional Neural Networks
Cyber attacks deceive machines into believing something that does not exist
in the first place. However, there are some to which even humans fall prey. One
such famous attack that attackers have used over the years to exploit the
vulnerability of vision is known to be a Homoglyph attack. It employs a primary
yet effective mechanism to create illegitimate domains that are hard to
differentiate from legit ones. Moreover, as the difference is pretty
indistinguishable for a user to notice, they cannot stop themselves from
clicking on these homoglyph domain names. In many cases, that results in either
information theft or malware attack on their systems. Existing approaches use
simple, string-based comparison techniques applied in primary language-based
tasks. Although they are impactful to some extent, they usually fail because
they are not robust to different types of homoglyphs and are computationally
not feasible because of their time requirement proportional to the string
length. Similarly, neural network-based approaches are employed to determine
real domain strings from fake ones. Nevertheless, the problem with both methods
is that they require paired sequences of real and fake domain strings to work
with, which is often not the case in the real world, as the attacker only sends
the illegitimate or homoglyph domain to the vulnerable user. Therefore,
existing approaches are not suitable for practical scenarios in the real world.
In our work, we created GlyphNet, an image dataset that contains 4M domains,
both real and homoglyphs. Additionally, we introduce a baseline method for a
homoglyph attack detection system using an attention-based convolutional Neural
Network. We show that our model can reach state-of-the-art accuracy in
detecting homoglyph attacks with a 0.93 AUC on our dataset
Attribution and Obfuscation of Neural Text Authorship: A Data Mining Perspective
Two interlocking research questions of growing interest and importance in
privacy research are Authorship Attribution (AA) and Authorship Obfuscation
(AO). Given an artifact, especially a text t in question, an AA solution aims
to accurately attribute t to its true author out of many candidate authors
while an AO solution aims to modify t to hide its true authorship.
Traditionally, the notion of authorship and its accompanying privacy concern is
only toward human authors. However, in recent years, due to the explosive
advancements in Neural Text Generation (NTG) techniques in NLP, capable of
synthesizing human-quality open-ended texts (so-called "neural texts"), one has
to now consider authorships by humans, machines, or their combination. Due to
the implications and potential threats of neural texts when used maliciously,
it has become critical to understand the limitations of traditional AA/AO
solutions and develop novel AA/AO solutions in dealing with neural texts. In
this survey, therefore, we make a comprehensive review of recent literature on
the attribution and obfuscation of neural text authorship from a Data Mining
perspective, and share our view on their limitations and promising research
directions.Comment: Accepted at ACM SIGKDD Explorations, Vol. 25, June 202