PhishDef: URL Names Say It All
Phishing is an increasingly sophisticated method to steal personal user
information using sites that pretend to be legitimate. In this paper, we take
the following steps to identify phishing URLs. First, we carefully select
lexical features of the URLs that are resistant to obfuscation techniques used
by attackers. Second, we evaluate the classification accuracy when using only
lexical features, both automatically and hand-selected, vs. when using
additional features. We show that lexical features are sufficient for all
practical purposes. Third, we thoroughly compare several classification
algorithms, and we propose to use an online method (AROW) that is able to
overcome noisy training data. Based on the insights gained from our analysis,
we propose PhishDef, a phishing detection system that uses only URL names and
combines the above three elements. PhishDef is a highly accurate method (when
compared to state-of-the-art approaches over real datasets), lightweight (thus
appropriate for online and client-side deployment), proactive (based on online
classification rather than blacklists), and resilient to training data
inaccuracies (thus enabling the use of large noisy training data).
Comment: 9 pages, submitted to IEEE INFOCOM 201
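To make the pipeline described in the abstract concrete, here is a minimal sketch of classifying URLs from lexical features with an AROW-style online learner. It is not the PhishDef implementation: the feature set in `lexical_features` is hypothetical, and `DiagonalAROW` keeps only a per-feature variance (a common diagonal approximation) instead of the full covariance matrix. The regularization parameter `r` and all names are assumptions chosen for illustration.

```python
import re
from urllib.parse import urlsplit


def lexical_features(url):
    """Map a URL to a sparse feature dict (hypothetical feature set)."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    path = parts.path or ""
    feats = {
        "bias": 1.0,
        "url_len": len(url) / 100.0,
        "host_dots": float(host.count(".")),
        "host_is_ip": 1.0 if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) else 0.0,
        "path_depth": float(path.count("/")),
        "has_at": 1.0 if "@" in url else 0.0,
    }
    # Bag-of-words tokens taken from the host name and the path only.
    for tok in re.split(r"[^A-Za-z0-9]+", host + " " + path):
        if tok:
            feats["tok=" + tok.lower()] = 1.0
    return feats


class DiagonalAROW:
    """AROW-style online classifier with a diagonal covariance (assumption)."""

    def __init__(self, r=1.0):
        self.r = r        # regularization: larger r means smaller updates
        self.w = {}       # mean weight per feature, default 0.0
        self.sigma = {}   # variance per feature, default 1.0

    def margin(self, x):
        return sum(self.w.get(k, 0.0) * v for k, v in x.items())

    def predict(self, x):
        return 1 if self.margin(x) >= 0 else -1

    def update(self, x, y):
        """One online step; y is +1 (phishing) or -1 (benign)."""
        loss = max(0.0, 1.0 - y * self.margin(x))
        if loss == 0.0:
            return  # margin already at least 1: no update
        conf = sum(self.sigma.get(k, 1.0) * v * v for k, v in x.items())
        beta = 1.0 / (conf + self.r)
        alpha = loss * beta
        for k, v in x.items():
            s = self.sigma.get(k, 1.0)
            self.w[k] = self.w.get(k, 0.0) + alpha * y * s * v
            self.sigma[k] = s - beta * s * s * v * v  # variance only shrinks
```

A typical use would stream labeled URLs through `update` one at a time, for example `clf.update(lexical_features(url), label)`, and call `clf.predict` on unseen URLs. Because each step touches only the features present in one URL, the learner stays lightweight, which matches the client-side deployment the abstract has in mind.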
REST: A Thread Embedding Approach for Identifying and Classifying User-specified Information in Security Forums
How can we extract useful information from a security forum? We focus on
identifying threads of interest to a security professional: (a) alerts of
worrisome events, such as attacks, (b) offering of malicious services and
products, (c) hacking information to perform malicious acts, and (d) useful
security-related experiences. The analysis of security forums is in its infancy
despite several promising recent works. Novel approaches are needed to address
the challenges in this domain: (a) the difficulty in specifying the "topics" of
interest efficiently, and (b) the unstructured and informal nature of the text.
We propose REST, a systematic methodology to: (a) identify threads of interest
based on a, possibly incomplete, bag of words, and (b) classify them into one
of the four classes above. The key novelty of the work is a multi-step weighted
embedding approach: we project words, threads and classes in appropriate
embedding spaces and establish relevance and similarity there. We evaluate our
method with real data from three security forums with a total of 164K posts and
21K threads. First, REST is robust to the initial keyword selection: it can
extend the user-provided keyword set and thus recover from missing keywords.
Second, REST categorizes the threads into the classes of interest with superior
accuracy compared to five other methods: REST exhibits an accuracy between
63.3% and 76.9%. We see our approach as a first step for harnessing the wealth of
information of online forums in a user-friendly way, since the user can loosely
specify her keywords of interest.
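The idea of extending an incomplete keyword set and scoring threads in an embedding space can be sketched as follows. This is a deliberately simplified illustration, not REST's multi-step weighted embedding: it uses unweighted averages of word vectors, a single cosine-similarity threshold, and tiny hand-made vectors. The function names, the `threshold` parameter, and the toy vocabulary are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(words, vectors):
    """Average the vectors of the known words; None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def expand_keywords(seed, vectors, threshold=0.9):
    """Add every vocabulary word close to the centroid of the seed keywords."""
    centroid = mean_vector(seed, vectors)
    if centroid is None:
        return set(seed)
    near = {w for w, v in vectors.items() if cosine(v, centroid) >= threshold}
    return set(seed) | near


def rank_threads(threads, keywords, vectors):
    """Score each thread by similarity of its embedding to the keyword centroid."""
    query = mean_vector(keywords, vectors)
    scored = []
    for thread_id, text in threads.items():
        emb = mean_vector(text.lower().split(), vectors)
        score = cosine(emb, query) if emb and query else 0.0
        scored.append((score, thread_id))
    return sorted(scored, reverse=True)


# Toy 3-dimensional word vectors, made up for this example.
TOY_VECTORS = {
    "ddos": [0.9, 0.1, 0.0],
    "botnet": [0.85, 0.2, 0.05],
    "attack": [0.8, 0.3, 0.1],
    "recipe": [0.0, 0.1, 0.95],
    "pasta": [0.05, 0.0, 0.9],
}
```

Starting from the single seed keyword `ddos`, `expand_keywords({"ddos"}, TOY_VECTORS)` also picks up `botnet` and `attack`, so a thread that never mentions the seed word can still rank highly. In practice the vectors would come from an embedding model trained on forum text rather than from a hand-written table.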
Unveiling A Hidden Risk: Exposing Educational but Malicious Repositories in GitHub
Are malicious repositories hiding under the educational label in GitHub?
Recent studies have identified collections of GitHub repositories hosting
malware source code with notable collaboration among the developers. GitHub
repositories therefore deserve close attention, since the platform's
open-source nature provides easy access to malicious software code and
artifacts. Here we leverage the capabilities of ChatGPT in a qualitative study
to annotate educational GitHub repositories based on the maliciousness of their
metadata contents. Our contribution is twofold. First, we demonstrate the
employment of ChatGPT to understand and annotate the content published in
software repositories. Second, we provide evidence of hidden risk in
educational repositories, which create opportunities for potential threats
and malicious use. We carry out a systematic study on a collection of 35.2K
GitHub repositories claimed to be created for educational purposes only. First,
our study finds an increasing trend in the number of such repositories
published every year. Second, 9,294 of them are labeled by ChatGPT as malicious,
and further categorization of the malicious ones detects 14 different malware
families, including DDoS, keylogger, and ransomware. Overall, this
exploratory study serves as a wake-up call for the community to better
understand and analyze software platforms.