Malware Classification using Graph Neural Networks
Word embeddings are widely recognized as important in natural language processing for capturing semantic relationships between words. In this study, we conduct experiments to explore the effectiveness of word embedding techniques in classifying malware. Specifically, we evaluate the performance of Graph Neural Networks (GNNs) applied to knowledge graphs constructed from opcode sequences of malware files. In the first set of experiments, a Graph Convolutional Network (GCN) is applied to knowledge graphs built with different word embedding techniques such as Bag-of-Words, TF-IDF, and Word2Vec. Our results indicate that Word2Vec produces the most effective word embeddings, serving as a baseline for the comparison of three GNN models: the Graph Convolutional Network (GCN), the Graph Attention Network (GAT), and GraphSAGE. For the next set of experiments, we generate vector embeddings of various lengths using Word2Vec and construct knowledge graphs with these embeddings as node features. Through a performance comparison of the GNN models, we show that larger vector embeddings improve the models' performance in classifying the malware files into their respective families. Our experiments demonstrate that word embedding techniques can enhance feature engineering in malware analysis.
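The pipeline sketched in this abstract starts by turning opcode sequences into a graph before any GNN is applied. As a minimal illustrative sketch of that first step, assuming a simple sliding-window co-occurrence construction (the function name and window size are hypothetical, not the paper's exact method):

```python
from collections import defaultdict

def opcode_cooccurrence_graph(opcode_seq, window=2):
    """Build an undirected co-occurrence graph over an opcode sequence.

    Nodes are distinct opcodes; the weight of an edge counts how often the
    two opcodes appear within `window` positions of each other.
    """
    edges = defaultdict(int)
    for i, op in enumerate(opcode_seq):
        for j in range(i + 1, min(i + 1 + window, len(opcode_seq))):
            if op != opcode_seq[j]:
                key = tuple(sorted((op, opcode_seq[j])))
                edges[key] += 1
    return dict(edges)

# Toy opcode trace from a (hypothetical) disassembled binary.
graph = opcode_cooccurrence_graph(["mov", "push", "call", "mov", "ret"], window=2)
```

In the full approach described above, each node would additionally carry a Word2Vec (or Bag-of-Words / TF-IDF) embedding of its opcode as a feature vector before being fed to the GNN.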
Word Embeddings for Entity-annotated Texts
Learned vector representations of words are useful tools for many information
retrieval and natural language processing tasks due to their ability to capture
lexical semantics. However, while many such tasks involve or even rely on named
entities as central components, popular word embedding models have so far
failed to include entities as first-class citizens. While it seems intuitive
that annotating named entities in the training corpus should result in more
intelligent word features for downstream tasks, performance issues arise when
popular embedding approaches are naively applied to entity annotated corpora.
Not only are the resulting entity embeddings less useful than expected, but one
also finds that the performance of the non-entity word embeddings degrades in
comparison to those trained on the raw, unannotated corpus. In this paper, we
investigate approaches to jointly train word and entity embeddings on a large
corpus with automatically annotated and linked entities. We discuss two
distinct approaches to the generation of such embeddings, namely the training
of state-of-the-art embeddings on raw-text and annotated versions of the
corpus, as well as node embeddings of a co-occurrence graph representation of
the annotated corpus. We compare the performance of annotated embeddings and
classical word embeddings on a variety of word similarity, analogy, and
clustering evaluation tasks, and investigate their performance in
entity-specific tasks. Our findings show that it takes more than training
popular word embedding models on an annotated corpus to create entity
embeddings with acceptable performance on common test cases. Based on these
results, we discuss how and when node embeddings of the co-occurrence graph
representation of the text can restore the performance.Comment: This paper is accepted in 41st European Conference on Information
Retrieva
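The co-occurrence-graph representation discussed above can be illustrated with a toy sketch, assuming entity mentions have already been linked and collapsed into single tokens (the `ENT:` prefix and sentence-level co-occurrence window are hypothetical simplifications, not the paper's construction):

```python
from collections import defaultdict

def annotated_cooccurrence(sentences):
    """Count word/entity co-occurrences within each sentence.

    Entity mentions are assumed to be pre-linked and collapsed into single
    tokens (marked here with an 'ENT:' prefix), so an entity and a plain
    word become distinct nodes in the resulting graph.
    """
    counts = defaultdict(int)
    for tokens in sentences:
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                if tokens[i] != tokens[j]:
                    counts[tuple(sorted((tokens[i], tokens[j])))] += 1
    return dict(counts)

g = annotated_cooccurrence([
    ["ENT:Berlin", "is", "a", "capital"],
    ["ENT:Paris", "is", "a", "capital"],
])
```

Node embeddings trained on such a graph treat entities as first-class nodes by construction, which is the property the paper argues naive word-embedding training on annotated text fails to deliver.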
Inducing Language Networks from Continuous Space Word Representations
Recent advancements in unsupervised feature learning have developed powerful
latent representations of words. However, it is still not clear what makes one
representation better than another and how we can learn the ideal
representation. Understanding the structure of latent spaces attained is key to
any future advancement in unsupervised learning. In this work, we introduce a
new view of continuous space word representations as language networks. We
explore two techniques to create language networks from learned features by
inducing them for two popular word representation methods and examining the
properties of their resulting networks. We find that the induced networks
differ from other methods of creating language networks, and that they contain
meaningful community structure.
Comment: 14 pages.
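One common way to induce a language network from learned word vectors is a k-nearest-neighbour graph under cosine similarity. The following is a minimal illustrative sketch under that assumption (the paper's two induction techniques may differ in detail):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_language_network(vectors, k=1):
    """Induce a directed k-NN graph: each word points to the k words
    whose embeddings have the highest cosine similarity to its own."""
    edges = {}
    for w, vec in vectors.items():
        sims = [(cosine(vec, ov), other)
                for other, ov in vectors.items() if other != w]
        sims.sort(reverse=True)
        edges[w] = [other for _, other in sims[:k]]
    return edges

# Tiny hypothetical embedding table.
net = knn_language_network({
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.12],
    "apple": [0.10, 0.20, 0.95],
}, k=1)
```

Community-detection algorithms can then be run on such a network to probe whether the latent space groups words meaningfully, as the abstract reports.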
Improving Natural Language Inference Using External Knowledge in the Science Questions Domain
Natural Language Inference (NLI) is fundamental to many Natural Language
Processing (NLP) applications including semantic search and question answering.
The NLI problem has gained significant attention thanks to the release of large
scale, challenging datasets. Present approaches to the problem largely focus on
learning-based methods that use only textual information in order to classify
whether a given premise entails, contradicts, or is neutral with respect to a
given hypothesis. Surprisingly, the use of methods based on structured
knowledge -- a central topic in artificial intelligence -- has not received
much attention vis-a-vis the NLI problem. While there are many open knowledge
bases that contain various types of reasoning information, their use for NLI
has not been well explored. To address this, we present a combination of
techniques that harness knowledge graphs to improve performance on the NLI
problem in the science questions domain. We present the results of applying our
techniques on text, graph, and text-to-graph based models, and discuss
implications for the use of external knowledge in solving the NLI problem. Our
model achieves new state-of-the-art performance on the NLI problem over the
SciTail science questions dataset.
Comment: 9 pages, 3 figures, 5 tables.
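As a toy illustration of how structured knowledge might feed an NLI model (a hypothetical feature for exposition, not the paper's actual technique), one can expand the premise's terms with their knowledge-graph neighbours and measure how much of the hypothesis they cover:

```python
def kg_overlap_feature(premise_terms, hypothesis_terms, kg_neighbors):
    """Toy knowledge-based feature: fraction of hypothesis terms connected
    (directly, or via one KG hop) to some premise term."""
    expanded = set(premise_terms)
    for t in premise_terms:
        expanded.update(kg_neighbors.get(t, ()))
    if not hypothesis_terms:
        return 0.0
    hits = sum(1 for t in hypothesis_terms if t in expanded)
    return hits / len(hypothesis_terms)

# Hypothetical mini knowledge base of is-a neighbours.
kg = {"dog": {"animal", "mammal"}, "cat": {"animal", "mammal"}}
score = kg_overlap_feature(["dog"], ["animal", "rock"], kg)
```

A real system would combine such graph-derived signals with learned text representations rather than use them in isolation.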
Entity Type Prediction in Knowledge Graphs using Embeddings
Open Knowledge Graphs (such as DBpedia, Wikidata, YAGO) have been recognized
as the backbone of diverse applications in the field of data mining and
information retrieval. Hence, the completeness and correctness of the Knowledge
Graphs (KGs) are vital. Most of these KGs are created either via automated
information extraction from Wikipedia snapshots, via information accumulated
by users, or via heuristics. However, it has been
observed that the type information of these KGs is often noisy, incomplete, and
incorrect. To deal with this problem a multi-label classification approach is
proposed in this work for entity typing using KG embeddings. We compare our
approach with the current state-of-the-art type prediction method and report on
experiments with the KGs
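A minimal sketch of multi-label entity typing from KG embeddings, assuming each candidate type is represented by a centroid vector and types are assigned by a cosine-similarity threshold (the centroid/threshold scheme is an illustrative baseline, not the paper's exact classifier):

```python
import math

def predict_types(entity_vec, type_centroids, threshold=0.9):
    """Multi-label type prediction sketch: assign every type whose centroid
    has cosine similarity >= threshold with the entity embedding."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    return sorted(t for t, c in type_centroids.items()
                  if cos(entity_vec, c) >= threshold)

# Hypothetical type centroids in a 2-d embedding space.
centroids = {"Person": [1.0, 0.0], "Scientist": [0.9, 0.1], "City": [0.0, 1.0]}
labels = predict_types([0.95, 0.05], centroids, threshold=0.9)
```

Because an entity may legitimately carry several types (a scientist is also a person), the thresholded multi-label formulation fits the noisy, incomplete type assertions the abstract describes better than single-label classification would.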
Sampled in Pairs and Driven by Text: A New Graph Embedding Framework
In graphs with rich texts, incorporating textual information with structural
information would benefit constructing expressive graph embeddings. Among
various graph embedding models, random walk (RW)-based is one of the most
popular and successful groups. However, it faces two issues when applied to
graphs with rich texts: (i) sampling efficiency: by analyzing the training
objective of RW-based models (e.g., DeepWalk and node2vec), we show that they
are likely to generate large amounts of redundant training samples, owing to
three main drawbacks; (ii) text utilization: these models have
difficulty in dealing with zero-shot scenarios where graph embedding models
have to infer graph structures directly from texts. To solve these problems, we
propose a novel framework, namely Text-driven Graph Embedding with Pairs
Sampling (TGE-PS). TGE-PS uses Pairs Sampling (PS) to improve the sampling
strategy of RW, reducing training samples by ~99% while preserving
competitive performance. It also uses Text-driven Graph Embedding (TGE), an
inductive graph embedding approach, to generate node embeddings from texts.
Since each node contains rich texts, TGE is able to generate high-quality
embeddings and provide reasonable predictions on existence of links to unseen
nodes. We evaluate TGE-PS on several real-world datasets, and experiment
results demonstrate that TGE-PS produces state-of-the-art results on both
traditional and zero-shot link prediction tasks.
Comment: Accepted by WWW 2019 (The World Wide Web Conference, ACM, 2019).
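The DeepWalk-style sampling that TGE-PS aims to improve can be sketched as follows; on dense graphs, the (center, context) pairs extracted from such walks are heavily redundant, which is the inefficiency Pairs Sampling targets (the adjacency format and parameters here are illustrative):

```python
import random

def random_walks(adj, walk_len=4, walks_per_node=2, seed=0):
    """Generate fixed-length random walks, DeepWalk-style.

    Each walk later yields (center, context) training pairs for a
    skip-gram-like objective; many of those pairs repeat across walks.
    """
    rng = random.Random(seed)
    walks = []
    for node in adj:
        for _ in range(walks_per_node):
            walk = [node]
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break  # dead end: stop the walk early
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

# Tiny star-shaped toy graph.
adj = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
walks = random_walks(adj)
```

Pairs Sampling, by contrast, draws node pairs directly rather than materializing full walks, which is how the framework avoids most of the redundant samples.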