CEAI: CCM based Email Authorship Identification Model
In this paper we present a model for email authorship identification (EAI) by
employing a Cluster-based Classification (CCM) technique. Traditionally,
stylometric features have been successfully employed in various authorship
analysis tasks; we extend the traditional feature-set to include some more
interesting and effective features for email authorship identification (e.g.
the last punctuation mark used in an email, the tendency of an author to use
capitalization at the start of an email, or the punctuation after a greeting or
farewell). We also include content features selected by information gain.
It is observed that the use of such features in the authorship identification
process has a positive impact on the accuracy of the authorship identification
task. We performed experiments to justify our arguments and compared the
results with other baseline models. Experimental results reveal that the
proposed CCM-based email authorship identification model, along with the
proposed feature set, outperforms the state-of-the-art support vector machine
(SVM)-based models, as well as the models proposed by Iqbal et al. [1, 2]. The
proposed model attains an accuracy rate of 94% for 10 authors, 89% for 25
authors, and 81% for 50 authors on the Enron dataset, while 89.5% accuracy is
achieved on a real email dataset constructed by the authors. The Enron results
were obtained with considerably more authors than were used for the models
proposed by Iqbal et al. [1, 2].
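The email-specific cues named in this abstract (last punctuation mark, capitalization at the start, punctuation after a greeting) can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `email_style_features` and the greeting word list are assumptions made here for illustration, and the clustering-based classifier itself is not shown.

```python
import re
import string

# Hypothetical greeting words; the abstract does not list the ones used.
GREETING_RE = re.compile(r"(hi|hello|dear|hey)\b", re.IGNORECASE)


def email_style_features(body: str) -> dict:
    """Extract three email-specific style cues from a message body."""
    text = body.strip()
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Last punctuation mark used anywhere in the email ("" if there is none).
    marks = [ch for ch in text if ch in string.punctuation]
    last_punct = marks[-1] if marks else ""

    # Whether the author capitalizes the very first character of the email.
    starts_capitalized = bool(text) and text[0].isupper()

    # Punctuation that closes the greeting line, e.g. "Hi Bob," gives ",".
    greeting_punct = ""
    if lines:
        first = lines[0]
        if GREETING_RE.match(first) and first[-1] in string.punctuation:
            greeting_punct = first[-1]

    return {
        "last_punct": last_punct,
        "starts_capitalized": starts_capitalized,
        "greeting_punct": greeting_punct,
    }


features = email_style_features("hi Bob,\nsee you tomorrow!\nThanks")
```

Categorical cues like these would normally be encoded (for example one-hot) and combined with the usual stylometric counts before classification.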
DeepAPT: Nation-State APT Attribution Using End-to-End Deep Neural Networks
In recent years, numerous pieces of advanced malware, also known as advanced
persistent threats (APTs), have allegedly been developed by nation-states. The
task of attributing an APT to a specific nation-state is extremely challenging
for several reasons. Each nation-state usually has more than a single cyber
unit that develops such advanced malware, rendering traditional authorship
attribution algorithms useless. Furthermore, those APTs use state-of-the-art
evasion techniques, making feature extraction challenging. Finally, the dataset
of available APTs of this kind is extremely small.
In this paper we describe how deep neural networks (DNN) could be
successfully employed for nation-state APT attribution. We use sandbox reports
(recording the behavior of the APT when run dynamically) as raw input for the
neural network, allowing the DNN to learn high-level feature abstractions of
the APTs by itself. Using a test set of 1,000 Chinese- and Russian-developed
APTs, we achieved an accuracy rate of 94.6%.
Probing the topological properties of complex networks modeling short written texts
In recent years, graph theory has been widely employed to probe several
language properties. More specifically, the so-called word adjacency model has
been proven useful for tackling several practical problems, especially those
relying on textual stylistic analysis. The most common approach to treat texts
as networks has simply considered either large pieces of texts or entire books.
This approach has certainly worked well -- many informative discoveries have
been made this way -- but it raises an uncomfortable question: could there be
important topological patterns in small pieces of texts? To address this
problem, the topological properties of subtexts sampled from entire books were
probed. Statistical analyses performed on a dataset comprising 50 novels
revealed that most of the traditional topological measurements are stable for
short subtexts. When the performance of the authorship recognition task was
analyzed, it was found that a proper sampling yields a discriminability similar
to the one found with full texts. Surprisingly, the support vector machine
classification based on the characterization of short texts outperformed the
one performed with entire books. These findings suggest that a local
topological analysis of large documents might improve their global
characterization. Most importantly, it was verified, as a proof of principle,
that short texts can be analyzed with the methods and concepts of complex
networks. As a consequence, the techniques described here can be extended in a
straightforward fashion to analyze texts as time-varying complex networks.
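As a rough illustration of the word adjacency model mentioned above, the sketch below links each word to the word that follows it and computes two common topological measurements (mean degree and local clustering coefficient). It is a simplified sketch under stated assumptions: the tokenizer is a plain regular expression, no stopword removal or lemmatization is applied, and the function names are hypothetical rather than taken from the paper.

```python
import re
from collections import defaultdict


def word_adjacency_network(text: str) -> dict:
    """Build an undirected graph linking every pair of adjacent words."""
    words = re.findall(r"[a-z']+", text.lower())
    graph = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:  # ignore self-loops from immediate repetitions
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def mean_degree(graph: dict) -> float:
    """Average number of neighbours per word (0.0 for an empty graph)."""
    if not graph:
        return 0.0
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def clustering(graph: dict, node: str) -> float:
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nbrs = graph[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in graph[u])
    return 2.0 * links / (k * (k - 1))


# A short subtext: three words forming a closed triangle a-b-c-a.
g = word_adjacency_network("a b c a")
```

Applying the same functions to fixed-size samples of a book, rather than to the whole book, gives the kind of short-subtext measurements the abstract discusses.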
Text authorship identified using the dynamics of word co-occurrence networks
The identification of authorship in disputed documents still requires human
expertise, which is now infeasible for many tasks owing to the large volumes of
text and the number of authors in practical applications. In this study, we
introduce a
methodology based on the dynamics of word co-occurrence networks representing
written texts to classify a corpus of 80 texts by 8 authors. The texts were
divided into sections with an equal number of linguistic tokens, from which
time series were created for 12 topological metrics. The series were found to
be stationary (p-value>0.05), which permits the use of distribution moments as
learning attributes. With an optimized supervised learning procedure using a
Radial Basis Function Network, 68 out of 80 texts were correctly classified,
i.e. a remarkable 85% author matching success rate. Therefore, fluctuations in
purely dynamic network metrics were found to characterize authorship, thus
opening the way for the description of texts in terms of small evolving
networks. Moreover, the approach introduced allows for comparison of texts with
diverse characteristics in a simple, fast fashion.
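The pipeline described in this abstract (equal-sized sections, one metric value per section, distribution moments as learning attributes) can be sketched as follows. This is an illustrative sketch only: it uses a single stand-in metric (mean degree of each section's co-occurrence network) instead of the 12 metrics of the study, the helper names are hypothetical, and the stationarity test and the Radial Basis Function Network classifier are omitted.

```python
from statistics import fmean, pvariance


def sections(tokens: list, size: int) -> list:
    """Split tokens into consecutive sections of exactly `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def section_mean_degree(tokens: list) -> float:
    """Mean degree of the co-occurrence network built from one section."""
    edges = {frozenset(p) for p in zip(tokens, tokens[1:]) if p[0] != p[1]}
    nodes = set(tokens)
    return 2 * len(edges) / len(nodes) if nodes else 0.0


def moments(series: list) -> tuple:
    """Mean, population variance and skewness of a metric time series."""
    mu = fmean(series)
    var = pvariance(series, mu)
    if var == 0:
        return mu, var, 0.0
    skew = fmean([(x - mu) ** 3 for x in series]) / var ** 1.5
    return mu, var, skew


tokens = "a b c a b c a b".split()
series = [section_mean_degree(s) for s in sections(tokens, 4)]
attributes = moments(series)
```

With several metrics, the moments of each series would be concatenated into one attribute vector per text and passed to a supervised classifier.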
Fighting Authorship Linkability with Crowdsourcing
Massive amounts of contributed content -- including traditional literature,
blogs, music, videos, reviews and tweets -- are available on the Internet
today, with authors numbering in many millions. Textual information, such as
product or service reviews, is an important and increasingly popular type of
content that is being used as a foundation of many trendy community-based
reviewing sites, such as TripAdvisor and Yelp. Some recent results have shown
that, due partly to their specialized/topical nature, sets of reviews authored
by the same person are readily linkable based on simple stylometric features.
In practice, this means that individuals who author more than a few reviews
under different accounts (whether within one site or across multiple sites) can
be linked, which represents a significant loss of privacy.
In this paper, we start by showing that the problem is actually worse than
previously believed. We then explore ways to mitigate authorship linkability in
community-based reviewing. We first attempt to harness the global power of
crowdsourcing by engaging random strangers in the process of re-writing
reviews. As our empirical results (obtained from Amazon Mechanical Turk)
clearly demonstrate, crowdsourcing yields impressively sensible reviews that
reflect sufficiently different stylometric characteristics such that prior
stylometric linkability techniques become largely ineffective. We also consider
using machine translation to automatically re-write reviews. Contrary to what
was previously believed, our results show that translation decreases authorship
linkability as the number of intermediate languages grows. Finally, we explore
the combination of crowdsourcing and machine translation and report on the
results.