REST: A Thread Embedding Approach for Identifying and Classifying User-specified Information in Security Forums
How can we extract useful information from a security forum? We focus on
identifying threads of interest to a security professional: (a) alerts of
worrisome events, such as attacks, (b) offering of malicious services and
products, (c) hacking information to perform malicious acts, and (d) useful
security-related experiences. The analysis of security forums is in its infancy
despite several promising recent works. Novel approaches are needed to address
the challenges in this domain: (a) the difficulty in specifying the "topics" of
interest efficiently, and (b) the unstructured and informal nature of the text.
We propose REST, a systematic methodology to: (a) identify threads of interest
based on a, possibly incomplete, bag of words, and (b) classify them into one
of the four classes above. The key novelty of the work is a multi-step weighted
embedding approach: we project words, threads and classes in appropriate
embedding spaces and establish relevance and similarity there. We evaluate our
method with real data from three security forums with a total of 164K posts and
21K threads. First, REST is robust to the initial keyword selection: it can extend the
user-provided keyword set and thus recover from missing keywords.
Second, REST categorizes the threads into the classes of interest with superior
accuracy compared to five other methods: REST exhibits an accuracy between
63.3% and 76.9%. We see our approach as a first step toward harnessing the wealth of
information in online forums in a user-friendly way, since the user can loosely
specify her keywords of interest.
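The keyword-recovery idea described above can be illustrated with a minimal sketch. It is not the REST implementation: the toy vectors, the `expand_keywords` helper, and the similarity threshold are all hypothetical, and a real system would learn word embeddings from forum text.

```python
import numpy as np

# Hypothetical toy word vectors (assumption: real embeddings would be learned).
TOY_VECTORS = {
    "attack":  np.array([0.9, 0.1, 0.0]),
    "ddos":    np.array([0.8, 0.2, 0.1]),
    "exploit": np.array([0.7, 0.3, 0.0]),
    "recipe":  np.array([0.0, 0.1, 0.9]),
    "garden":  np.array([0.1, 0.0, 0.8]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand_keywords(seed, vectors, threshold=0.9):
    """Return the seed set plus every vocabulary word whose cosine
    similarity to at least one known seed word reaches the threshold."""
    known = [s for s in seed if s in vectors]
    expanded = set(seed)
    for word, vec in vectors.items():
        if any(cosine(vec, vectors[s]) >= threshold for s in known):
            expanded.add(word)
    return expanded

print(sorted(expand_keywords({"attack"}, TOY_VECTORS)))
# ['attack', 'ddos', 'exploit']
```

Starting from the single keyword "attack", the sketch pulls in nearby words in the embedding space, which is one simple way an incomplete keyword set could be extended.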
SamBaTen: Sampling-based Batch Incremental Tensor Decomposition
Tensor decompositions are invaluable tools in analyzing multimodal datasets.
In many real-world scenarios, such datasets are far from static; on the
contrary, they tend to grow over time. For instance, in an online social network
setting, as we observe new interactions over time, our dataset gets updated in
its "time" mode. How can we maintain a valid and accurate tensor decomposition
of such a dynamically evolving multimodal dataset, without having to re-compute
the entire decomposition after every single update? In this paper we introduce
SamBaTen, a Sampling-based Batch Incremental Tensor Decomposition algorithm,
which incrementally maintains the decomposition given new updates to the tensor
dataset. SamBaTen is able to scale to datasets that the state-of-the-art in
incremental tensor decomposition is unable to operate on, due to its ability to
effectively summarize the existing tensor and the incoming updates, and perform
all computations in the reduced summary space. We extensively evaluate SamBaTen
using synthetic and real datasets. Indicatively, SamBaTen achieves comparable
accuracy to state-of-the-art incremental and non-incremental techniques, while
being 25-30 times faster. Furthermore, SamBaTen scales to very large sparse and
dense dynamically evolving tensors of dimensions up to 100K x 100K x 100K where
state-of-the-art incremental approaches were not able to operate.
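As a rough illustration of working in a reduced summary space, the sketch below samples a sub-tensor from a larger one. It covers only the sampling step, not the decomposition or the incremental update, and the weighting scheme (sampling indices in proportion to slice mass) and the `sample_subtensor` name are assumptions, not details taken from the abstract.

```python
import numpy as np

def sample_subtensor(X, factor, rng):
    """Draw a smaller sub-tensor by sampling indices along every mode.

    Assumption: indices are drawn without replacement, with probability
    proportional to the total absolute mass of the corresponding slice.
    """
    picked = []
    for mode in range(X.ndim):
        other_axes = tuple(a for a in range(X.ndim) if a != mode)
        mass = np.abs(X).sum(axis=other_axes)      # one weight per index of this mode
        size = max(1, X.shape[mode] // factor)     # keep 1/factor of the indices
        chosen = rng.choice(X.shape[mode], size=size, replace=False,
                            p=mass / mass.sum())
        picked.append(np.sort(chosen))
    return X[np.ix_(*picked)], picked

rng = np.random.default_rng(0)
X = rng.random((10, 10, 10))
summary, picked = sample_subtensor(X, factor=2, rng=rng)
print(summary.shape)  # (5, 5, 5)
```

Any decomposition routine could then be run on `summary` instead of `X`; the `picked` index lists record where the sampled entries sit in the full tensor.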
Community detection in multiplex networks using locally adaptive random walks
Multiplex networks, a special type of multilayer network, are increasingly
applied in many domains ranging from social media analytics to biology. A
common task in these applications concerns the detection of community
structures. Many existing algorithms for community detection in multiplexes
attempt to detect communities which are shared by all layers. In this article
we propose a community detection algorithm, LART (Locally Adaptive Random
Transitions), for the detection of communities that are shared by either some
or all the layers in the multiplex. The algorithm is based on a random walk on
the multiplex, and the transition probabilities defining the random walk are
allowed to depend on the local topological similarity between layers at any
given node so as to facilitate the exploration of communities across layers.
Based on this random walk, a node dissimilarity measure is derived and nodes
are clustered based on this distance in a hierarchical fashion. We present
experimental results using networks simulated under various scenarios to
showcase the performance of LART in comparison to related community detection
algorithms.
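A random walk whose layer-switching probability depends on local similarity between layers can be sketched as follows. This is not the LART definition: the use of Jaccard similarity of a node's neighborhoods, the self-loop weight `eps`, and the `supra_transition` name are all assumptions made for illustration.

```python
import numpy as np

def supra_transition(layers, eps=1.0):
    """Row-stochastic transition matrix for a walk on a multiplex.

    layers: list of (n x n) adjacency matrices over the same node set.
    Assumption: the weight for jumping from layer a to layer b at node i
    is the Jaccard similarity of i's neighborhoods in the two layers.
    """
    layers = [np.asarray(A, dtype=float) for A in layers]
    L, n = len(layers), layers[0].shape[0]
    S = np.zeros((L * n, L * n))
    for a in range(L):
        # Intra-layer moves, plus a self-loop so that no row is empty.
        S[a*n:(a+1)*n, a*n:(a+1)*n] = layers[a] + eps * np.eye(n)
        for b in range(L):
            if a == b:
                continue
            for i in range(n):
                na, nb = layers[a][i] > 0, layers[b][i] > 0
                union = np.logical_or(na, nb).sum()
                sim = np.logical_and(na, nb).sum() / union if union else 0.0
                S[a*n + i, b*n + i] = sim          # inter-layer jump at node i
    return S / S.sum(axis=1, keepdims=True)

A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # layer A: path 0-1-2
B = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]   # layer B: single edge 0-1
P = supra_transition([A, B])
```

In this toy multiplex, node 0 has the same neighborhood in both layers, so the walk may switch layers there, while node 2 has no shared neighbors and the walk stays within its layer. Distances between rows of powers of `P` could then feed a hierarchical clustering step.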
Ensemble Node Embeddings using Tensor Decomposition: A Case-Study on DeepWalk
Node embeddings have attracted increasing attention in recent
years. In this context, we propose a new ensemble node embedding approach,
called TenSemble2Vec, which first generates multiple embeddings using
existing techniques and then feeds them, as multiview data, to the state-of-the-art
tensor decomposition model PARAFAC2 to learn shared
lower-dimensional representations of the nodes. Contrary to other embedding
methods, TenSemble2Vec takes advantage of the complementary information
from different methods, or from the same method with different hyper-parameters,
which bypasses the challenge of choosing models. Extensive tests using
real-world data validate the efficiency of the proposed method.
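The ensemble idea can be illustrated with a much simpler stand-in. The sketch below does not use PARAFAC2; it swaps in a truncated SVD of the concatenated views to obtain one shared low-dimensional representation per node. The `ensemble_embedding` name and the random input views are hypothetical.

```python
import numpy as np

def ensemble_embedding(views, dim):
    """Combine several embeddings of the same nodes into one shared embedding.

    views: list of (n_nodes x d_k) arrays; d_k may differ between views.
    Stand-in technique (assumption): truncated SVD of the column-wise
    concatenation of the centered views, not PARAFAC2.
    """
    centered = [np.asarray(v, dtype=float) - np.mean(v, axis=0) for v in views]
    stacked = np.hstack(centered)                  # n_nodes x sum(d_k)
    U, s, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :dim] * s[:dim]                    # n_nodes x dim

rng = np.random.default_rng(0)
view_a = rng.normal(size=(6, 4))   # e.g. one embedding run
view_b = rng.normal(size=(6, 3))   # e.g. a run with other hyper-parameters
shared = ensemble_embedding([view_a, view_b], dim=2)
print(shared.shape)  # (6, 2)
```

Because the views may have different widths, the stand-in only requires that all views describe the same nodes in the same row order.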