node2bits: Compact Time- and Attribute-aware Node Representations for User Stitching
Identity stitching, the task of identifying and matching various online
references (e.g., sessions over different devices and timespans) to the same
user in real-world web services, is crucial for personalization and
recommendations. However, traditional user stitching approaches, such as
grouping or blocking, require quadratic pairwise comparisons between a massive
number of user activities, thus posing both computational and storage
challenges. Recent works, which are often application-specific, heuristically
seek to reduce the amount of comparisons, but they suffer from low precision
and recall. To solve the problem in an application-independent way, we take a
heterogeneous network-based approach in which users (nodes) interact with
content (e.g., sessions, websites), and may have attributes (e.g., location).
We propose node2bits, an efficient framework that represents multi-dimensional
features of node contexts with binary hashcodes. node2bits leverages
feature-based temporal walks to encapsulate short- and long-term interactions
between nodes in heterogeneous web networks, and adopts SimHash to obtain
compact, binary representations and avoid the quadratic complexity for
similarity search. Extensive experiments on large-scale real networks show that
node2bits outperforms traditional techniques and existing works that generate
real-valued embeddings by up to 5.16% in F1 score on user stitching, while
taking only up to 1.56% as much storage.
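The SimHash step can be sketched in a few lines (an illustrative simplification, not the authors' code: we assume each node already has a real-valued context-feature vector, e.g. aggregated over temporal walks, and show how random hyperplane projections turn it into a compact binary code compared via Hamming distance):

```python
import numpy as np

def simhash(features: np.ndarray, n_bits: int = 64, seed: int = 0) -> np.ndarray:
    """Project real-valued feature vectors onto random hyperplanes and
    keep only the signs, yielding compact binary codes (SimHash)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((features.shape[1], n_bits))
    return (features @ planes >= 0).astype(np.uint8)  # (n_nodes, n_bits)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two binary codes; similar feature vectors
    collide on most bits with high probability."""
    return int(np.count_nonzero(a != b))
```

Because similarity search now operates on short bit strings (e.g. via hash buckets) rather than all-pairs real-valued comparisons, the quadratic blow-up is avoided.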
Multi-Level Network Embedding with Boosted Low-Rank Matrix Approximation
As opposed to manual feature engineering which is tedious and difficult to
scale, network representation learning has attracted a surge of research
interest as it automates the process of feature learning on graphs. The
learned low-dimensional node vector representation is generalizable and eases
the knowledge discovery process on graphs by enabling various off-the-shelf
machine learning tools to be directly applied. Recent research has shown that
the past decade of network embedding approaches either explicitly factorize a
carefully designed matrix to obtain the low-dimensional node vector
representation or are closely related to implicit matrix factorization, with
the fundamental assumption that the factorized node connectivity matrix is
low-rank. Nonetheless, the global low-rank assumption does not necessarily hold,
especially when the factorized matrix encodes complex node interactions, and
the resultant single low-rank embedding matrix is insufficient to capture all
the observed connectivity patterns. In this regard, we propose a novel
multi-level network embedding framework BoostNE, which can learn multiple
network embedding representations of different granularity from coarse to fine
without imposing the prevalent global low-rank assumption. The proposed BoostNE
method is also in line with the successful gradient boosting method in ensemble
learning as multiple weak embeddings lead to a stronger and more effective one.
We assess the effectiveness of the proposed BoostNE framework by comparing it
with existing state-of-the-art network embedding methods on various datasets,
and the experimental results corroborate the superiority of the proposed
BoostNE network embedding framework.
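The coarse-to-fine residual idea can be sketched as follows (a simplified illustration in the spirit of gradient boosting, using truncated SVD for each level; the actual BoostNE factorization details may differ):

```python
import numpy as np

def boosted_embedding(M: np.ndarray, n_levels: int = 4, rank: int = 8) -> np.ndarray:
    """Learn multiple coarse-to-fine embeddings by repeatedly factorizing
    the residual left by the previous low-rank approximation, then
    concatenating all levels into one stronger embedding."""
    residual = M.astype(float)
    levels = []
    for _ in range(n_levels):
        U, s, Vt = np.linalg.svd(residual, full_matrices=False)
        U_k, s_k, Vt_k = U[:, :rank], s[:rank], Vt[:rank]
        levels.append(U_k * s_k)                   # per-level node embedding
        residual = residual - (U_k * s_k) @ Vt_k   # what this level missed
    return np.hstack(levels)                       # (n_nodes, n_levels * rank)
```

No single level needs to be globally low-rank: each level only has to explain what the previous, coarser levels left behind.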
Is a Single Vector Enough? Exploring Node Polysemy for Network Embedding
Networks have been widely used as the data structure for abstracting
real-world systems as well as organizing the relations among entities. Network
embedding models are powerful tools in mapping nodes in a network into
continuous vector-space representations in order to facilitate subsequent tasks
such as classification and link prediction. Existing network embedding models
comprehensively integrate all information of each node, such as links and
attributes, towards a single embedding vector to represent the node's general
role in the network. However, a real-world entity could be multifaceted, where
it connects to different neighborhoods due to different motives or
self-characteristics that are not necessarily correlated. For example, in a
movie recommender system, a user may love both comedies and horror movies,
but these two genres are unlikely to lie close to each other in the embedding
space, nor could a single user embedding vector be sufficiently close to both
at the same time. In this paper, we propose a
polysemous embedding approach for modeling multiple facets of nodes, as
motivated by the phenomenon of word polysemy in language modeling. Each facet
of a node is mapped as an embedding vector, while we also maintain association
degree between each pair of node and facet. The proposed method is adaptive to
various existing embedding models, without significantly complicating the
optimization process. We also discuss how to engage embedding vectors of
different facets for inference tasks including classification and link
prediction. Experiments on real-world datasets help comprehensively evaluate
the performance of the proposed method.
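A hypothetical facet-aware scoring rule illustrates how multiple facet vectors plus association degrees could drive link prediction (the names and the weighted max-over-facets rule here are illustrative assumptions, not the paper's exact inference procedure):

```python
import numpy as np

def facet_link_score(f_u: np.ndarray, w_u: np.ndarray,
                     f_v: np.ndarray, w_v: np.ndarray) -> float:
    """Score a candidate link between nodes u and v from their facet
    embeddings (K, d) and facet association degrees (K,): take the
    association-weighted best-matching facet pair instead of a single
    dot product between two monolithic node vectors."""
    sims = f_u @ f_v.T            # (K_u, K_v) facet-to-facet similarity
    weights = np.outer(w_u, w_v)  # joint association degrees
    return float(np.max(weights * sims))
```

A user with distinct comedy and horror facets can then score highly against movies from either genre, which a single averaged vector cannot do.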
A Comprehensive Survey on Graph Neural Networks
Deep learning has revolutionized many machine learning tasks in recent years,
ranging from image classification and video processing to speech recognition
and natural language understanding. The data in these tasks are typically
represented in the Euclidean space. However, there is an increasing number of
applications where data are generated from non-Euclidean domains and are
represented as graphs with complex relationships and interdependency between
objects. The complexity of graph data has imposed significant challenges on
existing machine learning algorithms. Recently, many studies on extending deep
learning approaches for graph data have emerged. In this survey, we provide a
comprehensive overview of graph neural networks (GNNs) in data mining and
machine learning fields. We propose a new taxonomy to divide the
state-of-the-art graph neural networks into four categories, namely recurrent
graph neural networks, convolutional graph neural networks, graph autoencoders,
and spatial-temporal graph neural networks. We further discuss the applications
of graph neural networks across various domains and summarize the open source
codes, benchmark data sets, and model evaluation of graph neural networks.
Finally, we propose potential research directions in this rapidly growing
field.
Comment: Minor revision (updated tables and references).
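As a concrete instance of the convolutional category in the taxonomy above, a single graph convolution layer in the widely used symmetric-normalization form can be written as:

```python
import numpy as np

def gcn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One convolutional GNN layer: add self-loops, average each node's
    neighborhood with symmetric degree normalization, apply a linear map
    and a ReLU nonlinearity."""
    A_hat = A + np.eye(A.shape[0])             # adjacency with self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)     # ReLU(normalized propagation)
```

Stacking such layers lets each node aggregate information from progressively larger neighborhoods.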
Basic tasks of sentiment analysis
Subjectivity detection is the task of identifying objective and subjective
sentences. Objective sentences are those which do not exhibit any sentiment.
So it is desirable for a sentiment analysis engine to find and separate out the
objective sentences before further analysis, e.g., polarity detection. In
subjective sentences, opinions can often be expressed on one or multiple
topics. Aspect extraction is a subtask of sentiment analysis that consists in
identifying opinion targets in opinionated text, i.e., in detecting the
specific aspects of a product or service the opinion holder is either praising
or complaining about.
AiDroid: When Heterogeneous Information Network Marries Deep Neural Network for Real-time Android Malware Detection
The explosive growth and increasing sophistication of Android malware call
for new defensive techniques that are capable of protecting mobile users
against novel threats. In this paper, we first extract the runtime Application
Programming Interface (API) call sequences from Android apps, and then analyze
higher-level semantic relations within the ecosystem to comprehensively
characterize the apps. To model different types of entities (i.e., app, API,
IMEI, signature, affiliation) and the rich semantic relations among them, we
then construct a structural heterogeneous information network (HIN) and present
a meta-path-based approach to depict the relatedness among apps. To efficiently
classify nodes (e.g., apps) in the constructed HIN, we propose the HinLearning
method to first obtain in-sample node embeddings and then learn representations
of out-of-sample nodes without rerunning or adjusting the HIN embeddings at the first
attempt. Afterwards, we design a deep neural network (DNN) classifier taking
the learned HIN representations as inputs for Android malware detection. A
comprehensive experimental study on the large-scale real sample collections
from Tencent Security Lab is performed to compare various baselines. Promising
experimental results demonstrate that our developed system AiDroid which
integrates our proposed method outperforms others in real-time Android malware
detection. AiDroid has already been incorporated into Tencent Mobile Security
product that serves millions of users worldwide.
Comment: The revised version will be published in IJCAI'2019, entitled
"Out-of-sample Node Representation Learning for Heterogeneous Graph in
Real-time Android Malware Detection".
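The meta-path relatedness idea can be illustrated on a toy app-API incidence matrix (a generic commuting-matrix sketch with PathSim-style normalization, not AiDroid's implementation; the matrix shapes here are assumptions for illustration):

```python
import numpy as np

def metapath_relatedness(R: np.ndarray) -> np.ndarray:
    """Commuting matrix for the meta-path app -> API -> app: entry (i, j)
    counts the APIs shared by apps i and j, a simple HIN relatedness measure."""
    return R @ R.T

def pathsim(M: np.ndarray) -> np.ndarray:
    """PathSim-style normalization of a symmetric commuting matrix, so
    relatedness is balanced against each node's own meta-path count."""
    d = np.diag(M)
    return 2.0 * M / (d[:, None] + d[None, :])
```

Different meta-paths (e.g. through signatures or affiliations instead of APIs) yield different commuting matrices, each capturing one semantic notion of app relatedness.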
Learning Dynamic Embeddings from Temporal Interactions
Modeling a sequence of interactions between users and items (e.g., products,
posts, or courses) is crucial in domains such as e-commerce, social networking,
and education to predict future interactions. Representation learning presents
an attractive solution to model the dynamic evolution of user and item
properties, where each user/item can be embedded in a Euclidean space and its
evolution can be modeled by dynamic changes in embedding. However, existing
embedding methods either generate static embeddings, treat users and items
independently, or are not scalable.
Here we present JODIE, a coupled recurrent model to jointly learn the dynamic
embeddings of users and items from a sequence of user-item interactions. JODIE
has three components. First, the update component updates the user and item
embeddings after each interaction, using their previous embeddings, via two
mutually recursive recurrent neural networks. Second, a novel projection
component is trained to forecast the embedding of users at any future time.
Finally, the prediction component directly predicts the embedding of the item
in a future interaction. For models that learn from a sequence of interactions,
traditional training data batching cannot be done due to complex user-user
dependencies. Therefore, we present a novel batching algorithm called t-Batch
that generates time-consistent batches of training data that can run in
parallel, giving massive speed-up.
We conduct six experiments on two prediction tasks---future interaction
prediction and state change prediction---using four real-world datasets. We
show that JODIE outperforms six state-of-the-art algorithms in these tasks by
up to 22.4%. Moreover, we show that JODIE is highly scalable and up to 9.2x
faster than comparable models. As an additional experiment, we illustrate that
JODIE can predict student drop-out from courses five interactions in advance.
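The time-consistent batching idea behind t-Batch can be sketched greedily (an illustrative reconstruction from the description above, assuming interactions arrive sorted by timestamp; the paper's algorithm may differ in detail):

```python
def t_batch(interactions):
    """Greedy time-consistent batching: scan interactions in time order and
    place each one in the earliest batch after the last batch touching either
    its user or its item. Every batch is then free of shared users and items,
    so its interactions can be processed in parallel without breaking the
    mutually recursive embedding updates."""
    last_batch = {}   # entity key -> index of last batch containing it
    batches = []
    for user, item in interactions:  # assumed sorted by timestamp
        b = max(last_batch.get(('u', user), -1),
                last_batch.get(('i', item), -1)) + 1
        if b == len(batches):
            batches.append([])
        batches[b].append((user, item))
        last_batch[('u', user)] = b
        last_batch[('i', item)] = b
    return batches
```

Within a batch no user or item appears twice, and across batches temporal order is preserved per entity, which is exactly what the recurrent updates require.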
CONE: Community Oriented Network Embedding
Detecting communities has long been popular in the research on networks. It
is usually modeled as an unsupervised clustering problem on graphs, based on
heuristic assumptions about community characteristics, such as edge density and
node homogeneity. In this work, we doubt the universality of these widely
adopted assumptions and compare human labeled communities with machine
predicted ones obtained via various mainstream algorithms. Based on supportive
results, we argue that communities are defined by various social patterns and
unsupervised learning based on heuristics is incapable of capturing all of
them. Therefore, we propose to inject supervision into community detection
through Community Oriented Network Embedding (CONE), which leverages limited
ground-truth communities as examples to learn an embedding model aware of the
social patterns underlying them. Specifically, a deep architecture is developed
by combining recurrent neural networks with random-walks on graphs towards
capturing social patterns directed by ground-truth communities. Generic
clustering algorithms applied to the embeddings that the learned model produces
for other nodes then effectively reveal more communities that share similar
social patterns with the ground-truth ones.
Comment: 10 pages, accepted by IJCNN 201
A Comprehensive Survey of Graph Embedding: Problems, Techniques and Applications
Graphs are an important data representation appearing in a wide diversity
of real-world scenarios. Effective graph analytics provides users with a deeper
understanding of what lies behind the data, and can thus benefit many useful
applications such as node classification, node recommendation, link prediction,
etc. However, most graph analytics methods suffer from high computation and
space costs. Graph embedding is an effective yet efficient way to solve the
graph analytics problem. It converts the graph data into a low dimensional
space in which the graph structural information and graph properties are
maximally preserved. In this survey, we conduct a comprehensive review of the
literature in graph embedding. We first introduce the formal definition of
graph embedding as well as the related concepts. After that, we propose two
taxonomies of graph embedding which correspond to what challenges exist in
different graph embedding problem settings and how existing work addresses
these challenges in its solutions. Finally, we summarize the applications
that graph embedding enables and suggest four promising future research
directions in terms of computation efficiency, problem settings, techniques and
application scenarios.
Comment: A 20-page comprehensive survey of graph/network embedding covering
over 150 papers up to 2018. It provides a systematic categorization of
problems, techniques and applications. Accepted by IEEE Transactions on
Knowledge and Data Engineering (TKDE). Comments and suggestions are welcomed
for continuously improving this survey.
Higher-order Spectral Clustering for Heterogeneous Graphs
Higher-order connectivity patterns such as small induced sub-graphs called
graphlets (network motifs) are vital to understand the important components
(modules/functional units) governing the configuration and behavior of complex
networks. Existing work in higher-order clustering has focused on simple
homogeneous graphs with a single node/edge type. However, heterogeneous graphs
consisting of nodes and edges of different types are seemingly ubiquitous in
the real world. In this work, we introduce the notion of the typed-graphlet, which
explicitly captures the rich (typed) connectivity patterns in heterogeneous
networks. Using typed-graphlets as a basis, we develop a general principled
framework for higher-order clustering in heterogeneous networks. The framework
provides mathematical guarantees on the optimality of the higher-order
clustering obtained. The experiments demonstrate the effectiveness of the
framework quantitatively for three important applications including (i)
clustering, (ii) link prediction, and (iii) graph compression. In particular,
the approach achieves a mean improvement of 43x over all methods and graphs for
clustering, along with 18.7% and 20.8% improvements for link prediction and
graph compression, respectively.
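For the homogeneous special case with a single (untyped) triangle motif, the higher-order clustering pipeline can be sketched as follows; typed-graphlets generalize the motif-weighting step to typed connectivity patterns in heterogeneous graphs:

```python
import numpy as np

def triangle_motif_fiedler(A: np.ndarray) -> np.ndarray:
    """Higher-order spectral clustering for the triangle motif: re-weight
    each edge by the number of triangles it participates in, build the
    motif Laplacian, and return its Fiedler vector. Thresholding this
    vector (e.g. at its median) yields a two-way higher-order clustering."""
    W = (A @ A) * A                   # W[i, j] = triangles through edge (i, j)
    L = np.diag(W.sum(axis=1)) - W    # unnormalized motif Laplacian
    _, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    return vecs[:, 1]                 # eigenvector of 2nd-smallest eigenvalue
```

Edges that close no triangles get weight zero, so the motif cut avoids splitting densely clustered (triangle-rich) regions.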