2,536 research outputs found
PhishDef: URL Names Say It All
Phishing is an increasingly sophisticated method to steal personal user
information using sites that pretend to be legitimate. In this paper, we take
the following steps to identify phishing URLs. First, we carefully select
lexical features of the URLs that are resistant to obfuscation techniques used
by attackers. Second, we evaluate the classification accuracy when using only
lexical features, both automatically and hand-selected, vs. when using
additional features. We show that lexical features are sufficient for all
practical purposes. Third, we thoroughly compare several classification
algorithms, and we propose to use an online method (AROW) that is able to
overcome noisy training data. Based on the insights gained from our analysis,
we propose PhishDef, a phishing detection system that uses only URL names and
combines the above three elements. PhishDef is a highly accurate method (when
compared to state-of-the-art approaches over real datasets), lightweight (thus
appropriate for online and client-side deployment), proactive (based on online
classification rather than blacklists), and resilient to training data
inaccuracies (thus enabling the use of large noisy training data).Comment: 9 pages, submitted to IEEE INFOCOM 201
Recommended from our members
REST: A thread embedding approach for identifying and classifying user-specified information in security forums
VoG: Summarizing and Understanding Large Graphs
How can we succinctly describe a million-node graph with a few simple
sentences? How can we measure the "importance" of a set of discovered subgraphs
in a large graph? These are exactly the problems we focus on. Our main ideas
are to construct a "vocabulary" of subgraph-types that often occur in real
graphs (e.g., stars, cliques, chains), and from a set of subgraphs, find the
most succinct description of a graph in terms of this vocabulary. We measure
success in a well-founded way by means of the Minimum Description Length (MDL)
principle: a subgraph is included in the summary if it decreases the total
description length of the graph.
  Our contributions are three-fold: (a) formulation: we provide a principled
encoding scheme to choose vocabulary subgraphs; (b) algorithm: we develop
\method, an efficient method to minimize the description cost, and (c)
applicability: we report experimental results on multi-million-edge real
graphs, including Flickr and the Notre Dame web graph.Comment: SIAM International Conference on Data Mining (SDM) 201
- …
