11 research outputs found
REST: A Thread Embedding Approach for Identifying and Classifying User-specified Information in Security Forums
How can we extract useful information from a security forum? We focus on
identifying threads of interest to a security professional: (a) alerts of
worrisome events, such as attacks, (b) offering of malicious services and
products, (c) hacking information to perform malicious acts, and (d) useful
security-related experiences. The analysis of security forums is in its infancy
despite several promising recent works. Novel approaches are needed to address
the challenges in this domain: (a) the difficulty in specifying the "topics" of
interest efficiently, and (b) the unstructured and informal nature of the text.
We propose REST, a systematic methodology to: (a) identify threads of interest
based on a possibly incomplete bag of words, and (b) classify them into one
of the four classes above. The key novelty of the work is a multi-step weighted
embedding approach: we project words, threads and classes in appropriate
embedding spaces and establish relevance and similarity there. We evaluate our
method with real data from three security forums with a total of 164K posts and
21K threads. First, REST is robust to the initial keyword selection: it can extend
the user-provided keyword set and thus recover from missing keywords.
Second, REST categorizes the threads into the classes of interest with superior
accuracy compared to five other methods: REST exhibits an accuracy between
63.3% and 76.9%. We see our approach as a first step for harnessing the wealth of
information of online forums in a user-friendly way, since the user can loosely
specify her keywords of interest.
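The embedding idea in the abstract above can be illustrated with a minimal sketch. It is not the REST implementation: the toy word vectors, the `boost` weighting, the seed words, and the function names `embed`, `cosine`, and `classify` are all assumptions made for illustration, and only three of the four classes are shown. It only demonstrates the general pattern of projecting threads and classes into one space and comparing them there.

```python
import math
from collections import Counter

# Toy 3-dimensional word vectors (hypothetical values, illustration only).
WORD_VECTORS = {
    "ddos": [0.9, 0.1, 0.0],
    "attack": [0.8, 0.2, 0.1],
    "sell": [0.1, 0.9, 0.0],
    "botnet": [0.3, 0.8, 0.1],
    "tutorial": [0.0, 0.1, 0.9],
}

# Hypothetical seed words describing each class of interest.
CLASS_SEEDS = {
    "alerts": ["ddos", "attack"],
    "services": ["sell", "botnet"],
    "hacks": ["tutorial"],
}


def embed(tokens, keywords=frozenset(), boost=2.0):
    """Weighted average of word vectors; out-of-vocabulary tokens are skipped.

    Each word is weighted by its count, multiplied by `boost` when the word
    is one of the user-supplied keywords (an assumed weighting scheme).
    """
    counts = Counter(t for t in tokens if t in WORD_VECTORS)
    if not counts:
        return None
    dim = len(next(iter(WORD_VECTORS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token, count in counts.items():
        weight = count * (boost if token in keywords else 1.0)
        weight_sum += weight
        for i, value in enumerate(WORD_VECTORS[token]):
            total[i] += weight * value
    return [x / weight_sum for x in total]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, class_seeds, keywords=frozenset()):
    """Return the class whose seed embedding is closest to the thread."""
    thread_vec = embed(tokens, keywords)
    if thread_vec is None:
        return None  # no known words, so nothing to compare
    scores = {
        label: cosine(thread_vec, embed(seeds))
        for label, seeds in class_seeds.items()
    }
    return max(scores, key=scores.get)


print(classify("ddos attack on server".split(), CLASS_SEEDS))
```

In practice the word vectors would come from a model trained on forum text rather than a hand-written table, and the weighting would follow whatever scheme the paper defines.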
Extracting Actionable Information From Security Forums
The goal of this thesis is to systematically extract information from security forums, whose information would in general be described as unstructured: the text of a post does not necessarily follow any writing rules. By contrast, many security initiatives and commercial entities are harnessing the readily public information, but they seem to focus on structured sources of information. Here, we focus on analyzing text content in security forums to extract actionable information. Specifically, we search and find IP addresses reported in the text, study keyword-based queries, and identify and classify threads that are of interest to the security analysts. The power of our study lies in the following key novelties. First, we use a matrix decomposition method to extract latent features of the user behavioral information, which we combine with textual information from related posts. Second, we address the labeling difficulties by utilizing a cross-forum learning method that helps to transfer knowledge between models. Third, we develop a multi-step weighted embedding approach: more specifically, we project words, threads, and classes in appropriate embedding spaces and establish relevance and similarity there. These novel approaches enable us to extract and refine information which could not be obtained from security forums if only trivial analyses were used. We collected a wealth of data from six different security forums. The contribution of our work is threefold: (a) we develop a method to automatically identify malicious IP addresses observed in the forums; (b) we propose a systematic method to identify and classify user-specified threads of interest into four different categories; and (c) we present an iterative approach to expand the initial keywords of interest, which are essential feeds in searching and retrieving information. We see our approaches as essential building blocks in developing useful methods for harnessing the wealth of information available in online forums.
RIPEx: Extracting Malicious IP Addresses from Security Forums Using Cross-Forum Learning
Is it possible to extract malicious IP addresses reported in security forums
in an automatic way? This is the question at the heart of our work. We focus on
security forums, where security professionals and hackers share knowledge and
information, and often report misbehaving IP addresses. So far, there have only
been a few efforts to extract information from such security forums. We propose
RIPEx, a systematic approach to identify and label IP addresses in security
forums by utilizing a cross-forum learning method. In more detail, the
challenge is twofold: (a) identifying IP addresses from other numerical
entities, such as software version numbers, and (b) classifying the IP address
as benign or malicious. We propose an integrated solution that tackles both
these problems. A novelty of our approach is that it does not require training
data for each new forum. Our approach transfers knowledge across forums: we
use a classifier from our source forums to identify seed information for
training a classifier on the target forum. We evaluate our method using data
collected from five security forums with a total of 31K users and 542K posts.
First, RIPEx can distinguish IP addresses from other numeric expressions with 95%
precision and above 93% recall on average. Second, RIPEx identifies malicious
IP addresses with an average precision of 88% and over 78% recall, using our
cross-forum learning. Our work is a first step towards harnessing the wealth of
useful information that can be found in security forums.
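The first challenge named in the abstract above, telling IP addresses apart from look-alikes such as software version numbers, can be illustrated with a minimal sketch. It is a simple rule-based stand-in and not the RIPEx classifier: the regular expression, the `VERSION_CUES` word list, and the function name `extract_ip_candidates` are assumptions made for illustration.

```python
import ipaddress
import re

# Dotted-quad candidates; this also matches version strings such as 10.0.2.1.
CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

# Hypothetical cue words suggesting the number is a software version.
VERSION_CUES = {"version", "ver", "v", "release", "build"}


def extract_ip_candidates(text):
    """Return dotted quads that parse as IPv4 and lack a version cue word."""
    results = []
    for match in CANDIDATE.finditer(text):
        token = match.group()
        try:
            ipaddress.IPv4Address(token)  # rejects octets above 255
        except ValueError:
            continue
        preceding = text[:match.start()].split()
        previous = preceding[-1].lower().rstrip(":") if preceding else ""
        if previous in VERSION_CUES:
            continue  # e.g. "version 10.0.2.1" is treated as a version number
        results.append(token)
    return results


post = "Scans from 203.0.113.7 hit us after upgrading to version 10.0.2.1"
print(extract_ip_candidates(post))
```

A fixed cue list like this is brittle, which is one reason a learned classifier over the surrounding context, as the abstract describes, is the more plausible design for real forum text.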
RAFFMAN: Measuring and Analyzing Sentiment in Online Political Forum Discussions with an Application to the Trump Impeachment
Given an online forum, how can we quantify changes in user
affect towards a person or an idea over time? We argue that
online political forums constitute an untapped opportunity
for understanding sentiment toward aspects under discussion.
However, the analysis of such forums has received little attention
from the research community. In this paper, we develop
RAFFMAN, a systematic approach to quantify the impact of
external events on the affect of forum users towards a concept,
such as a person or an entity. First, we develop an approach
to capture and quantify the observed activity: we identify related
keywords, filter threads, and establish correlations between
events and spikes in the activity. Second, we modify
and evaluate state-of-the-art NLP techniques to achieve high
accuracy (74%) in a three-class sentiment classification problem.
As a case study, we deploy our method to quantify the
effect of President Trump’s impeachment on several concepts
including: President Trump, Speaker Pelosi, and QAnon. Our
data consists of 32M posts from Reddit and 4chan over a span
of 6 months from September 2019 to February 2020. This initial
analysis hints at an increase in political polarization, especially
for people’s affect towards the President. Overall, our
work is a building block towards mining the affect of online
forum users towards a concept, which constitutes an untapped,
massive, and publicly-available source of information.
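The step of relating external events to spikes in forum activity, mentioned in the abstract above, can be illustrated with a minimal sketch. It is not the RAFFMAN method: the trailing-window z-score rule, the parameter defaults, the toy data, and the function names `find_spikes` and `match_events` are assumptions made for illustration.

```python
from statistics import mean, pstdev


def find_spikes(counts, window=7, threshold=3.0, min_sigma=1.0):
    """Return indices whose count exceeds the trailing mean by more than
    `threshold` standard deviations (floored at `min_sigma`)."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = max(pstdev(history), min_sigma)  # avoid division by ~0
        if (counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


def match_events(spikes, event_days, tolerance=1):
    """Pair each spike with events occurring within `tolerance` days of it."""
    return [
        (day, name)
        for day in spikes
        for name, event_day in event_days.items()
        if abs(day - event_day) <= tolerance
    ]


# Hypothetical daily post counts and event days (indices into the list).
daily_posts = [10, 12, 11, 9, 10, 11, 10, 50, 10]
events = {"hearing": 6, "unrelated": 20}

spikes = find_spikes(daily_posts)
print(spikes, match_events(spikes, events))
```

Matching by proximity in time only suggests an association; establishing that an event caused a spike would need further analysis.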