Learning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts
Word embedding is a Natural Language Processing (NLP) technique that
automatically maps words from a vocabulary to vectors of real numbers in an
embedding space. It has been widely used in recent years to boost the
performance of a variety of NLP tasks such as Named Entity Recognition,
Syntactic Parsing, and Sentiment Analysis. Classic word embedding methods such
as Word2Vec and GloVe work well when they are given a large text corpus. When
the input texts are sparse, as in many specialized domains (e.g.,
cybersecurity), these methods often fail to produce high-quality vectors. In
this paper, we describe a novel method to train domain-specific word embeddings
from sparse texts. In addition to domain texts, our method also leverages
diverse types of domain knowledge such as domain vocabulary and semantic
relations. Specifically, we first propose a general framework to encode
diverse types of domain knowledge as text annotations. Then we develop a novel
Word Annotation Embedding (WAE) algorithm to incorporate diverse types of text
annotations in word embedding. We have evaluated our method on two
cybersecurity text corpora: a malware description corpus and a Common
Vulnerability and Exposure (CVE) corpus. Our evaluation results have
demonstrated the effectiveness of our method in learning domain-specific word
embeddings.
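The core idea in this abstract (encode domain knowledge as text annotations, then learn embeddings over the annotated corpus) can be sketched minimally in Python. The code below is not the authors' WAE algorithm: it simply interleaves hypothetical annotation tokens with the words they describe and trains a standard gensim skip-gram model, so words and their domain annotations share contexts. The DOMAIN_LABELS table and annotate helper are invented for illustration.

```python
# Minimal sketch (not the authors' WAE algorithm): inject hypothetical
# domain-knowledge annotations as extra tokens, then train skip-gram
# over the annotated corpus so words co-occur with their annotations.
from gensim.models import Word2Vec

# Hypothetical domain knowledge: term -> annotation label.
DOMAIN_LABELS = {
    "zeus": "ANNOT_MALWARE",
    "trojan": "ANNOT_MALWARE_TYPE",
    "cve-2017-0144": "ANNOT_VULNERABILITY",
}

def annotate(sentence):
    """Insert an annotation token after each word it applies to."""
    out = []
    for tok in sentence:
        out.append(tok)
        label = DOMAIN_LABELS.get(tok)
        if label:
            out.append(label)  # annotation now shares the word's context
    return out

corpus = [
    ["zeus", "is", "a", "banking", "trojan"],
    ["the", "exploit", "targets", "cve-2017-0144"],
]
annotated = [annotate(s) for s in corpus]

# Skip-gram (sg=1) tends to cope better with tiny corpora than CBOW.
model = Word2Vec(annotated, vector_size=50, window=5, min_count=1,
                 sg=1, epochs=200)
print(model.wv.most_similar("zeus", topn=3))
```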
BattRAE: Bidimensional Attention-Based Recursive Autoencoders for Learning Bilingual Phrase Embeddings
In this paper, we propose a bidimensional attention-based recursive
autoencoder (BattRAE) to integrate clues and source-target interactions at
multiple levels of granularity into bilingual phrase representations. We employ
recursive autoencoders to generate tree structures of phrases with embeddings
at different levels of granularity (e.g., words, sub-phrases and phrases). Over
these embeddings on the source and target side, we introduce a bidimensional
attention network to learn their interactions encoded in a bidimensional
attention matrix, from which we extract two soft attention weight distributions
simultaneously. These weight distributions enable BattRAE to generate
compositive phrase representations via convolution. Based on the learned phrase
representations, we further use a bilinear neural model, trained via a
max-margin method, to measure bilingual semantic similarity. To evaluate the
effectiveness of BattRAE, we incorporate this semantic similarity as an
additional feature into a state-of-the-art SMT system. Extensive experiments on
NIST Chinese-English test sets show that our model achieves a substantial
improvement of up to 1.63 BLEU points on average over the baseline.
Comment: 7 pages, accepted by AAAI 2017
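As a rough illustration of the bidimensional attention described above, the numpy sketch below scores one source/target pair of embedding matrices: an attention matrix with one cell per (source item, target item) is pooled along each axis into two soft weight distributions, the weighted sums give phrase vectors, and a bilinear form scores them. The parameters W and M are random stand-ins for learned weights, and the paper's convolution step and max-margin training are omitted.

```python
# Simplified numpy sketch in the spirit of BattRAE's bidimensional
# attention; pooling choices here are assumptions, not the paper's exact model.
import numpy as np

rng = np.random.default_rng(0)
d = 8                        # embedding dimension
S = rng.normal(size=(5, d))  # source-side embeddings (words/sub-phrases/phrase)
T = rng.normal(size=(4, d))  # target-side embeddings
W = rng.normal(size=(d, d))  # attention parameter (stand-in for learned weights)
M = rng.normal(size=(d, d))  # bilinear similarity parameter (stand-in)

# Bidimensional attention matrix: one cell per (source item, target item).
A = np.tanh(S @ W @ T.T)     # shape (5, 4)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Two soft attention weight distributions, extracted simultaneously by
# pooling the same matrix along its two axes.
a_src = softmax(A.mean(axis=1))  # weights over source items
a_tgt = softmax(A.mean(axis=0))  # weights over target items

# Attention-weighted phrase representations (convolution step omitted).
s_vec = a_src @ S
t_vec = a_tgt @ T

# Bilinear semantic similarity between the two phrase representations.
score = s_vec @ M @ t_vec
print(f"bilingual similarity score: {score:.4f}")
```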
From Paraphrase Database to Compositional Paraphrase Model and Back
The Paraphrase Database (PPDB; Ganitkevitch et al., 2013) is an extensive
semantic resource, consisting of a list of phrase pairs with (heuristic)
confidence estimates. However, it is still unclear how it can best be used, due
to the heuristic nature of the confidences and its necessarily incomplete
coverage. We propose models to leverage the phrase pairs from the PPDB to build
parametric paraphrase models that score paraphrase pairs more accurately than
the PPDB's internal scores while simultaneously improving its coverage. They
allow for learning phrase embeddings as well as improved word embeddings.
Moreover, we introduce two new, manually annotated datasets to evaluate
short-phrase paraphrasing models. Using our paraphrase model trained on
PPDB, we achieve state-of-the-art results on standard word and bigram
similarity tasks and beat strong baselines on our new short-phrase paraphrase
tasks.
Comment: 2015 TACL paper updated with an appendix describing new 300-dimensional
embeddings. Submitted 1/2015. Accepted 2/2015. Published 6/2015.
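A toy sketch of a parametric paraphrase scorer in this spirit: compose a phrase embedding from its word vectors (word averaging is one simple composition choice, not necessarily the authors' model) and score a pair by cosine similarity. The embedding table below is random for illustration; in the paper it would be fit to PPDB phrase pairs with a margin-based objective, which is omitted here.

```python
# Toy phrase-paraphrase scorer: average word vectors, compare by cosine.
# The embedding table E is a random stand-in for trained parameters.
import numpy as np

rng = np.random.default_rng(1)
VOCAB = ["a", "large", "big", "amount", "quantity", "of"]
E = {w: rng.normal(size=16) for w in VOCAB}  # word embedding table (assumed trained)

def phrase_vec(phrase):
    """Compose a phrase embedding by averaging its word vectors."""
    return np.mean([E[w] for w in phrase.split()], axis=0)

def paraphrase_score(p1, p2):
    """Cosine similarity between composed phrase embeddings."""
    v1, v2 = phrase_vec(p1), phrase_vec(p2)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(paraphrase_score("a large amount of", "a big quantity of"))
```

With a trained table, a margin loss such as max(0, delta - score(pair) + score(negative)) would push true paraphrase pairs above sampled non-paraphrases, which is the general shape of the parametric training the abstract alludes to.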