Sustainable growth in complex networks
Based on the empirical analysis of the dependency network in 18 Java
projects, we develop a novel model of network growth which considers both: an
attachment mechanism and the addition of new nodes with a heterogeneous
distribution of their initial degree, . Empirically we find that the
cumulative degree distributions of initial degrees and of the final network,
follow power-law behaviors: , and
, respectively. For the total number of links as a
function of the network size, we find empirically ,
where is (at the beginning of the network evolution) between 1.25 and
2, while converging to for large . This indicates a transition from
a growth regime with increasing network density towards a sustainable regime,
which prevents a collapse because of ever increasing dependencies. Our
theoretical framework is able to predict relations between the exponents
, , , which also link issues of software engineering and
developer activity. These relations are verified by means of computer
simulations and empirical investigations. They indicate that the growth of real
Open Source Software networks occurs on the edge between two regimes, which are
either dominated by the initial degree distribution of added nodes, or by the
preferential attachment mechanism. Hence, the heterogeneous degree distribution
of newly added nodes, found empirically, is essential to describe the laws of
sustainable growth in networks. Comment: 5 pages, 2 figures, 1 table
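The growth mechanism described in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption and not taken from the paper: the network is treated as undirected, the initial degree of each new node is drawn from a discrete power law with a hypothetical exponent `alpha` on the range 1 to `k0_max`, and targets are chosen with probability proportional to degree plus one. The function name `grow_network` is illustrative.

```python
import random


def grow_network(n_nodes, alpha=2.5, k0_max=50, seed=0):
    """Grow a network whose new nodes arrive with a heavy-tailed initial degree.

    Assumptions (not from the paper): p(k0) ~ k0**(-alpha) on 1..k0_max,
    and attachment probability proportional to (degree + 1).
    Returns the final degree list and the total link count after each step.
    """
    rng = random.Random(seed)
    k0_values = list(range(1, k0_max + 1))
    k0_weights = [k ** (-alpha) for k in k0_values]

    degree = [0]           # start from a single isolated node
    links_over_time = []   # total number of links after each added node
    total_links = 0

    for new in range(1, n_nodes):
        # A new node cannot link to more nodes than currently exist.
        k0 = min(rng.choices(k0_values, weights=k0_weights)[0], new)

        # Preferential attachment: pick k0 distinct targets by rejection.
        weights = [d + 1 for d in degree]
        targets = set()
        while len(targets) < k0:
            targets.add(rng.choices(range(new), weights=weights)[0])

        for t in targets:
            degree[t] += 1
        degree.append(k0)
        total_links += k0
        links_over_time.append(total_links)

    return degree, links_over_time


if __name__ == "__main__":
    degree, links = grow_network(500)
    print("nodes:", len(degree), "links:", links[-1], "max degree:", max(degree))
```

Plotting `links_over_time` against the number of nodes on log-log axes is one way to inspect how the total number of links scales with network size under these assumed parameters.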
PhishDef: URL Names Say It All
Phishing is an increasingly sophisticated method to steal personal user
information using sites that pretend to be legitimate. In this paper, we take
the following steps to identify phishing URLs. First, we carefully select
lexical features of the URLs that are resistant to obfuscation techniques used
by attackers. Second, we evaluate the classification accuracy when using only
lexical features, both automatically and hand-selected, vs. when using
additional features. We show that lexical features are sufficient for all
practical purposes. Third, we thoroughly compare several classification
algorithms, and we propose to use an online method (AROW) that is able to
overcome noisy training data. Based on the insights gained from our analysis,
we propose PhishDef, a phishing detection system that uses only URL names and
combines the above three elements. PhishDef is a highly accurate method (when
compared to state-of-the-art approaches over real datasets), lightweight (thus
appropriate for online and client-side deployment), proactive (based on online
classification rather than blacklists), and resilient to training data
inaccuracies (thus enabling the use of large noisy training data). Comment: 9 pages, submitted to IEEE INFOCOM 201
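As an illustration of the pipeline this abstract outlines (lexical URL features fed to an online classifier), here is a minimal sketch. The feature set in `lexical_features` is hypothetical and is not the one selected in the paper. `DiagonalArow` is a simplified AROW-style learner that keeps only a diagonal covariance, which is an assumption made for brevity. Labels are assumed to be +1 for phishing and -1 for legitimate.

```python
import re


def lexical_features(url):
    """Return a small, hypothetical lexical feature vector for a URL."""
    return [
        1.0,                                 # bias term
        float(len(url)),                     # total length
        float(url.count(".")),               # number of dots
        float(url.count("-")),               # number of hyphens
        float(len(re.findall(r"\d", url))),  # number of digits
        1.0 if "@" in url else 0.0,          # '@' present
    ]


class DiagonalArow:
    """AROW-style online linear classifier with a diagonal covariance."""

    def __init__(self, n_features, r=1.0):
        self.w = [0.0] * n_features      # mean weight vector
        self.sigma = [1.0] * n_features  # per-feature variance (confidence)
        self.r = r                       # regularization parameter

    def margin(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, x):
        return 1 if self.margin(x) >= 0 else -1

    def update(self, x, y):
        """Process one labeled example (y is +1 or -1)."""
        m = self.margin(x)
        if y * m >= 1.0:
            return  # margin already satisfied, no update
        v = sum(s * xi * xi for s, xi in zip(self.sigma, x))
        beta = 1.0 / (v + self.r)
        alpha = (1.0 - y * m) * beta
        for i, xi in enumerate(x):
            s = self.sigma[i]
            self.w[i] += alpha * y * s * xi
            self.sigma[i] = s - beta * s * s * xi * xi  # shrink variance


if __name__ == "__main__":
    # Toy stream of (URL, label) pairs; the URLs are made-up examples.
    stream = [
        ("http://example.com/index.html", -1),
        ("http://192.0.2.10/secure-login-update/account-1234", 1),
    ]
    model = DiagonalArow(n_features=6)
    for url, label in stream:
        model.update(lexical_features(url), label)
    print(model.predict(lexical_features("http://example.org/")))
```

Because each update touches only the weights and variances of one example, the model can be trained incrementally as new labeled URLs arrive, which is the property that makes online methods attractive for this setting.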
REST: A Thread Embedding Approach for Identifying and Classifying User-specified Information in Security Forums
How can we extract useful information from a security forum? We focus on
identifying threads of interest to a security professional: (a) alerts of
worrisome events, such as attacks, (b) offering of malicious services and
products, (c) hacking information to perform malicious acts, and (d) useful
security-related experiences. The analysis of security forums is in its infancy
despite several promising recent works. Novel approaches are needed to address
the challenges in this domain: (a) the difficulty in specifying the "topics" of
interest efficiently, and (b) the unstructured and informal nature of the text.
We propose REST, a systematic methodology to: (a) identify threads of interest
based on a possibly incomplete bag of words, and (b) classify them into one
of the four classes above. The key novelty of the work is a multi-step weighted
embedding approach: we project words, threads and classes in appropriate
embedding spaces and establish relevance and similarity there. We evaluate our
method with real data from three security forums with a total of 164K posts and
21K threads. First, REST is robust to the initial keyword selection: it can extend
the user-provided keyword set and thus recover from missing keywords.
Second, REST categorizes the threads into the classes of interest with superior
accuracy compared to five other methods: REST exhibits an accuracy between
63.3% and 76.9%. We see our approach as a first step for harnessing the wealth of
information of online forums in a user-friendly way, since the user can loosely
specify her keywords of interest.
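The idea of recovering from missing keywords through an embedding space can be sketched as follows. This is a simplified illustration, not the REST method itself: the two-dimensional vectors in `TOY_VECTORS` are made up, the similarity threshold is arbitrary, and `thread_relevance` uses a plain keyword fraction where the paper describes a multi-step weighted embedding. All function names are hypothetical.

```python
import math

# Made-up 2-D word vectors, for illustration only.
TOY_VECTORS = {
    "attack": [1.0, 0.0],
    "exploit": [0.9, 0.1],
    "recipe": [0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_keywords(seeds, vectors, threshold=0.8):
    """Add every vocabulary word that lies close to at least one seed keyword."""
    expanded = set(seeds)
    for word, vec in vectors.items():
        for seed in seeds:
            if seed in vectors and cosine(vec, vectors[seed]) >= threshold:
                expanded.add(word)
                break
    return expanded


def thread_relevance(words, keywords):
    """Fraction of a thread's words that fall in the keyword set."""
    if not words:
        return 0.0
    return sum(1 for w in words if w in keywords) / len(words)


if __name__ == "__main__":
    keywords = expand_keywords(["attack"], TOY_VECTORS)
    print(sorted(keywords))
    print(thread_relevance(["new", "exploit", "found"], keywords))
```

In this toy setup a thread that mentions only "exploit" is still matched even though the user supplied only "attack", which is the kind of recovery from an incomplete keyword list that the abstract refers to.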