Characterizing Phishing Threats with Natural Language Processing
Spear phishing is a widespread concern in the modern network security
landscape, but there are few metrics that measure the extent to which
reconnaissance is performed on phishing targets. Spear phishing emails closely
match the expectations of the recipient, based on details of their experiences
and interests, making them a popular propagation vector for harmful malware. In
this work we use Natural Language Processing techniques to investigate a
specific real-world phishing campaign and quantify attributes that indicate a
targeted spear phishing attack. Our phishing campaign data sample comprises 596
emails - all containing a web bug and a Curriculum Vitae (CV) PDF attachment -
sent to our institution by a foreign IP space. The campaign was found to
exclusively target specific demographics within our institution. Performing a
semantic similarity analysis between the senders' CV attachments and the
recipients' LinkedIn profiles, we conclude with high statistical certainty (p
) that the attachments contain targeted rather than randomly
selected material. Latent Semantic Analysis further demonstrates that
individuals who were a primary focus of the campaign received CVs that are
highly topically clustered. These findings differentiate this campaign from one
that leverages random spam.
Comment: This paper has been accepted for publication by the IEEE Conference
on Communications and Network Security in September 2015 at Florence, Italy.
Copyright may be transferred without notice, after which this version may no
longer be accessible.
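As a rough illustration of the similarity comparison described in this abstract, the sketch below scores two texts by the cosine of their term-count vectors. The abstract does not say which similarity measure the authors used, so the bag-of-words representation, the function names, and the sample strings are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for a CV attachment and two recipient profiles.
cv = term_counts("network security engineer with firewall experience")
profile = term_counts("security engineer focused on network firewall design")
unrelated = term_counts("pastry chef specialising in wedding cakes")
```

A targeted pairing would be expected to score higher than a random one, which is the kind of difference a significance test could then be run on.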
The linguistic patterns and rhetorical structure of citation context : an approach using n-grams
Using the full-text corpus of more than 75,000 research articles published by seven PLOS journals, this paper
proposes a natural language processing approach for identifying the function of citations. Citation contexts are
assigned based on the frequency of n-gram co-occurrences located near the citations. Results show that the most
frequent linguistic patterns found in the citation contexts of papers vary according to their location in the IMRaD
structure of scientific articles. The presence of negative citations is also dependent on this structure. This
methodology offers new perspectives to locate these discursive forms according to the rhetorical structure of
scientific articles, and will lead to a better understanding of the use of citations in scientific articles.
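The idea of counting n-grams located near citations can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the bracketed-number citation style, the window size, and the function name are assumptions.

```python
import re
from collections import Counter

# Assumed citation marker style, e.g. "[12]"; real corpora vary.
CITATION = re.compile(r"\[\d+\]")


def ngrams_near_citations(text, n=3, window=5):
    """Count word n-grams among the `window` tokens preceding each citation."""
    counts = Counter()
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if not CITATION.search(tok):
            continue
        # Take the tokens just before the citation and strip punctuation.
        context = [re.sub(r"\W+", "", w).lower()
                   for w in tokens[max(0, i - window):i]]
        context = [w for w in context if w]
        for j in range(len(context) - n + 1):
            counts[tuple(context[j:j + n])] += 1
    return counts


sample = ("As shown in previous work [1], results differ. "
          "This was shown in previous work [2] too.")
counts = ngrams_near_citations(sample)
```

Grouping such counts by the section of the article in which each citation occurs would give per-section pattern frequencies of the kind the abstract reports.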
Neural Embeddings of Graphs in Hyperbolic Space
Neural embeddings have been used with great success in Natural Language
Processing (NLP). They provide compact representations that encapsulate word
similarity and attain state-of-the-art performance in a range of linguistic
tasks. The success of neural embeddings has prompted significant amounts of
research into applications in domains other than language. One such domain is
graph-structured data, where embeddings of vertices can be learned that
encapsulate vertex similarity and improve performance on tasks including edge
prediction and vertex labelling. For both NLP and graph based tasks, embeddings
have been learned in high-dimensional Euclidean spaces. However, recent work
has shown that the appropriate isometric space for embedding complex networks
is not the flat Euclidean space, but negatively curved, hyperbolic space. We
present a new concept that exploits these recent insights and propose learning
neural embeddings of graphs in hyperbolic space. We provide experimental
evidence that embedding graphs in their natural geometry significantly improves
performance on downstream tasks for several real-world public datasets.
Comment: 7 pages, 5 figures
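To make the contrast with flat Euclidean space concrete, the sketch below computes the distance between two points in the Poincaré ball, one common model of hyperbolic space. The abstract does not state which model the authors embed into, so the choice of model and the function name are assumptions.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points inside the unit ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 |u - v|^2 / ((1 - |u|^2) (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))
```

The same Euclidean gap corresponds to a larger hyperbolic distance near the boundary of the ball than near the origin, which leaves room for the many leaf-like vertices of a hierarchical graph.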
Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution
It is tempting to treat frequency trends from the Google Books data sets as indicators of the true popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical to academic articles but less common in general, such as references to time in the form of citations. We use information theoretic methods to highlight these dynamics by examining and comparing major contributions via a divergence measure of English data sets between decades in the period 1800-2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts. Overall, our findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.
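One way to compare word distributions between two decades and see which words contribute most is sketched below using the Jensen-Shannon divergence. The abstract only says "a divergence measure", so the choice of Jensen-Shannon, the function names, and the toy counts are assumptions.

```python
import math


def jsd_contributions(counts_p, counts_q):
    """Per-word contributions (in bits) to the Jensen-Shannon divergence.

    Both arguments map words to raw counts and must be non-empty.
    """
    total_p = sum(counts_p.values())
    total_q = sum(counts_q.values())
    contributions = {}
    for word in counts_p.keys() | counts_q.keys():
        p = counts_p.get(word, 0) / total_p
        q = counts_q.get(word, 0) / total_q
        m = 0.5 * (p + q)  # mixture of the two distributions
        term = 0.0
        if p > 0:
            term += 0.5 * p * math.log2(p / m)
        if q > 0:
            term += 0.5 * q * math.log2(q / m)
        contributions[word] = term
    return contributions


def jensen_shannon_divergence(counts_p, counts_q):
    """Total divergence: 0 for identical distributions, 1 bit for disjoint ones."""
    return sum(jsd_contributions(counts_p, counts_q).values())


# Hypothetical word counts for two decades.
early = {"the": 50, "thou": 10, "figure": 1}
late = {"the": 50, "thou": 1, "figure": 10}
top_word = max(jsd_contributions(early, late).items(), key=lambda kv: kv[1])[0]
```

Sorting the per-word contributions surfaces the words that shift most between the two periods, which is one way to spot a surge of citation-style vocabulary.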