Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity
Semantic similarity measures are an important part of Natural Language
Processing tasks. However, semantic similarity measures built for general use
do not perform well within specific domains. Therefore, in this study we
introduce a domain-specific semantic similarity measure created by the
synergistic union of word2vec, a word embedding method used for semantic
similarity calculation, and lexicon-based (lexical) semantic similarity
methods. We show that the proposed methodology outperforms word embedding
methods trained on a generic corpus, as well as methods trained on a
domain-specific corpus that do not use lexical semantic similarity methods to
augment the results. Further, we show that text lemmatization can improve the
performance of word embedding methods.
Comment: 6 pages, 3 figures
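As a rough illustration of how such a fusion might look, the sketch below blends word2vec cosine similarity with a lexical score via a weighted average; the blending rule, the `alpha` weight, and the `lexicon_similarity` lookup are illustrative assumptions, not the paper's published method.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def lexicon_similarity(w1: str, w2: str, lexicon: dict) -> float:
    # Hypothetical lexical score: 1.0 for listed synonym pairs, else 0.0.
    return 1.0 if w2 in lexicon.get(w1, set()) else 0.0

def combined_similarity(w1, w2, embeddings, lexicon, alpha=0.7):
    """Blend embedding cosine similarity with a lexical score.
    The weighted-average combination and alpha=0.7 are assumptions."""
    emb = cosine(embeddings[w1], embeddings[w2])
    lex = lexicon_similarity(w1, w2, lexicon)
    return alpha * emb + (1 - alpha) * lex
```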
A New Approach for Measuring Sentiment Orientation based on Multi-Dimensional Vector Space
This study implements a vector space model approach to measure the sentiment
orientations of words. Two representative vectors for positive/negative
polarity are constructed using a high-dimensional vector space in both an
unsupervised and a semi-supervised manner. A sentiment orientation value per
word is determined by taking the difference between the cosine distances
against the two reference vectors. These two conditions (unsupervised and
semi-supervised) are compared against an existing unsupervised method (Turney,
2002). As a result of our experiment, we demonstrate that this novel approach
significantly outperforms the previous unsupervised approach and is also more
practical and data efficient.
Comment: 8 pages
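The scoring step described above reduces to a difference of cosine similarities against the two reference vectors. A minimal sketch, assuming the reference vectors are centroids of seed-word embeddings (one plausible reading of the unsupervised variant):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_reference(seed_words, embeddings):
    # Assumed: a polarity reference vector is the mean of seed-word vectors.
    return np.mean([embeddings[w] for w in seed_words], axis=0)

def sentiment_orientation(word_vec, pos_ref, neg_ref):
    """Positive score => the word sits closer to the positive reference."""
    return cosine(word_vec, pos_ref) - cosine(word_vec, neg_ref)
```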
Semantic Document Distance Measures and Unsupervised Document Revision Detection
In this paper, we model the document revision detection problem as a minimum
cost branching problem that relies on computing document distances.
Furthermore, we propose two new document distance measures, word vector-based
Dynamic Time Warping (wDTW) and word vector-based Tree Edit Distance (wTED).
Our revision detection system is designed for a large-scale corpus and
implemented in Apache Spark. We demonstrate that our system detects revisions
more precisely than state-of-the-art methods, using the Wikipedia revision
dumps https://snap.stanford.edu/data/wiki-meta.html and simulated data sets.
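A compact sketch of a wDTW-style distance, aligning two documents as sequences of word vectors under a cosine cost. The cost function and the word-level granularity are assumptions; the paper's actual measure and its Spark implementation may differ.

```python
import numpy as np

def cosine_dist(u, v):
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def wdtw(doc_a, doc_b):
    """doc_a, doc_b: lists of word vectors. Returns the DTW alignment cost."""
    n, m = len(doc_a), len(doc_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = cosine_dist(doc_a[i - 1], doc_b[j - 1])
            # Standard DTW recurrence: extend the cheapest partial alignment.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```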
Deriving a Representative Vector for Ontology Classes with Instance Word Vector Embeddings
Selecting a representative vector for a set of vectors is a very common
requirement in many algorithmic tasks. Traditionally, the mean or median vector
is selected. Ontology classes are sets of homogeneous instance objects that can
be converted to a vector space by word vector embeddings. This study proposes a
methodology to derive a representative vector for ontology classes whose
instances were converted to the vector space. We start by deriving five
candidate vectors, which are then used to train a machine learning model that
calculates a representative vector for the class. We show that our methodology
outperforms the traditional mean and median vector representations.
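The abstract names only the mean and median baselines; the sketch below adds a medoid as a third illustrative candidate to show the general shape of the candidate-derivation step. The specific set of five candidates used to train the model is not reproduced here.

```python
import numpy as np

def candidate_vectors(instances: np.ndarray) -> dict:
    """instances: (n_instances, dim) matrix of instance word vectors.
    Returns illustrative candidate representative vectors."""
    mean = instances.mean(axis=0)
    median = np.median(instances, axis=0)
    # Medoid: the actual instance closest to the mean (an assumed candidate).
    medoid = instances[np.argmin(np.linalg.norm(instances - mean, axis=1))]
    return {"mean": mean, "median": median, "medoid": medoid}
```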
Predicting Relevance Scores for Triples from Type-Like Relations using Neural Embedding - The Cabbage Triple Scorer at WSDM Cup 2017
The WSDM Cup 2017 Triple Scoring challenge is aimed at calculating and
assigning relevance scores to triples from type-like relations. Such scores are
a fundamental ingredient for ranking results in entity search. In this paper,
we propose a method that uses neural embedding techniques to accurately
calculate an entity score for a triple based on its nearest neighbor. We strive
to develop a new latent semantic model with a deep structure that captures the
semantic and syntactic relations between words. Our method ranked among the top
performers, with an accuracy of 0.74, an average score difference of 1.74, and
an average Kendall's Tau of 0.35.
Comment: Triple Scorer at WSDM Cup 2017, see arXiv:1712.0808
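A speculative sketch of nearest-neighbour triple scoring, assuming the relevance of (subject, relation, object) is read off the similarity between the subject's embedding and entities already known to hold that object; the `holders` set and the max-similarity rule are assumptions, not the authors' model.

```python
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def triple_score(subject_vec, holders, entity_vecs):
    """holders: entities known to hold the candidate object for this relation.
    Score = similarity of the subject to its nearest such holder (assumed)."""
    return max(cos(subject_vec, entity_vecs[h]) for h in holders)
```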
Time-sync Video Tag Extraction Using Semantic Association Graph
Time-sync comments reveal a new way of extracting online video tags. However,
such time-sync comments contain considerable noise due to users' diverse
comments, posing great challenges for accurate and fast video tag extraction.
In this paper, we propose an unsupervised video tag extraction algorithm named
Semantic Weight-Inverse Document Frequency (SW-IDF). Specifically, we first
generate a semantic association graph (SAG) using the semantic similarities and
timestamps of the time-sync comments. Second, we propose two graph clustering
algorithms, a dialogue-based algorithm and a topic center-based algorithm, to
handle videos with different comment densities. Third, we design a graph
iteration algorithm that assigns a weight to each comment based on the degrees
of the clustered subgraphs, which differentiates meaningful comments from
noise. Finally, we obtain the weight of each word by combining Semantic Weight
(SW) and Inverse Document Frequency (IDF). In this way, video tags are
extracted automatically in an unsupervised way. Extensive experiments show that
SW-IDF (dialogue-based algorithm) achieves 0.4210 F1-score and 0.4932 MAP (Mean
Average Precision) on high-density comments, and 0.4267 F1-score and 0.3623 MAP
on low-density comments, while SW-IDF (topic center-based algorithm) achieves
0.4444 F1-score and 0.5122 MAP on high-density comments, and 0.4207 F1-score
and 0.3522 MAP on low-density comments. It outperforms the state-of-the-art
unsupervised algorithms in both F1-score and MAP.
Comment: Accepted by ACM TKDD 201
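The final weighting step pairs a per-word semantic weight with inverse document frequency. A minimal sketch, assuming a tf-idf-style multiplicative combination; the graph construction and clustering that produce the SW values are omitted.

```python
import math

def sw_idf(semantic_weight, doc_freq, total_docs):
    """semantic_weight: {word: SW derived from the clustered graph};
    doc_freq: {word: number of comment-documents containing it}."""
    scores = {}
    for word, sw in semantic_weight.items():
        idf = math.log(total_docs / (1 + doc_freq.get(word, 0)))
        scores[word] = sw * idf  # assumed multiplicative SW x IDF combination
    return scores

# Tags would then be the highest-scoring words, e.g.:
# sorted(scores, key=scores.get, reverse=True)[:10]
```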
A novel recommendation system to match college events and groups to students
With the recent increase in data online, discovering meaningful opportunities
can be time-consuming and complicated for many individuals. To overcome this
data overload challenge, we present a novel text-content-based recommender
system as a valuable tool to predict user interests. To that end, we develop a
specific procedure to create user models and item feature-vectors, where items
are described in free text. The user model is generated by soliciting a few
keywords from a user and expanding those keywords into a list of weighted
near-synonyms. The item feature-vectors are generated from the textual
descriptions of the items, using modified tf-idf values of the users' keywords
and their near-synonyms. Once the users are modeled and the items are
abstracted into feature vectors, the system returns the maximum-similarity
items as recommendations to that user. Our experimental evaluation shows that
our method of creating the user models and item feature-vectors results in
higher precision and accuracy in comparison to well-known
feature-vector-generating methods like GloVe and Word2Vec. It also shows that
stemming and the use of a modified version of tf-idf increase the accuracy and
precision by 2% and 3%, respectively, compared to non-stemming and the standard
tf-idf definition. Moreover, the evaluation results show that updating the user
model from usage histories improves the precision and accuracy of the system.
This recommender system has been developed as part of the Agnes application,
which runs on iOS and Android platforms and is accessible through the Agnes
website.
Comment: 10 pages, AIAAT 2017, Hawaii, US
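A minimal sketch of the matching step, assuming the "modified" tf-idf simply restricts term weights to the user's expanded keyword space; the `expand` helper and its near-synonym weights are hypothetical stand-ins for whatever thesaurus lookup the system actually uses.

```python
import math

def expand(keyword):
    """Hypothetical near-synonym expansion with weights."""
    return {keyword: 1.0}  # e.g. {"music": 1.0, "concert": 0.8, ...}

def user_model(keywords):
    """Merge the weighted near-synonym lists of all solicited keywords."""
    model = {}
    for kw in keywords:
        for term, w in expand(kw).items():
            model[term] = max(model.get(term, 0.0), w)
    return model

def item_vector(tokens, user_terms, doc_freq, n_docs):
    """tf-idf restricted to the user's expanded keywords (assumed reading
    of the paper's 'modified tf-idf')."""
    vec = {}
    for term, weight in user_terms.items():
        tf = tokens.count(term)
        idf = math.log(n_docs / (1 + doc_freq.get(term, 0)))
        vec[term] = weight * tf * idf
    return vec

def score(user_terms, item_vec):
    """Dot-product similarity over the shared keyword space."""
    return sum(user_terms.get(t, 0.0) * v for t, v in item_vec.items())
```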
Model Comparison for Semantic Grouping
We introduce a probabilistic framework for quantifying the semantic
similarity between two groups of embeddings. We formulate the task of semantic
similarity as a model comparison task in which we contrast a generative model
that jointly models two sentences against one that does not. We illustrate how
this framework can be used for Semantic Textual Similarity tasks using clear
assumptions about how the embeddings of words are generated. We apply model
comparison that utilises information criteria to address some of the
shortcomings of Bayesian model comparison, whilst still penalising model
complexity. We achieve competitive results by applying the proposed framework
with an appropriate choice of likelihood on the STS datasets.
Comment: Proceedings of the 36th International Conference on Machine Learning
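One way to read the framework is as a BIC-style comparison between a joint Gaussian fit to both sentences' embeddings and two separately fitted Gaussians. The spherical-Gaussian likelihood below is an assumed choice of likelihood, not necessarily the paper's.

```python
import numpy as np

def gaussian_loglik(X):
    """Log-likelihood of the rows of X under an MLE-fitted spherical Gaussian."""
    mu = X.mean(axis=0)
    var = max(float(((X - mu) ** 2).mean()), 1e-8)
    n, d = X.shape
    return (-0.5 * n * d * np.log(2 * np.pi * var)
            - ((X - mu) ** 2).sum() / (2 * var))

def bic(loglik, n_params, n_obs):
    return n_params * np.log(n_obs) - 2 * loglik

def similarity_score(A, B):
    """Positive score => the joint model wins => sentences judged similar."""
    d = A.shape[1]
    k = d + 1  # mean plus one shared variance per Gaussian
    joint = bic(gaussian_loglik(np.vstack([A, B])), k, len(A) + len(B))
    separate = (bic(gaussian_loglik(A), k, len(A)) +
                bic(gaussian_loglik(B), k, len(B)))
    return separate - joint
```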
Semantic Word Clusters Using Signed Normalized Graph Cuts
Vector space representations of words capture many aspects of word
similarity, but such methods tend to produce vector spaces in which antonyms
(as well as synonyms) are close to each other. We present a new signed spectral
normalized graph cut algorithm, signed clustering, that overlays existing
thesauri upon distributionally derived vector representations of words, so that
antonym relationships between word pairs are represented by negative weights.
Our signed clustering algorithm produces clusters of words that simultaneously
capture distributional and synonym relations. We evaluate these clusters
against the SimLex-999 dataset (Hill et al., 2014) of human judgments of word
pair similarities, and also show the benefit of using our clusters to predict
the sentiment of a given text.
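A simplified stand-in for the algorithm: build a signed affinity matrix from embedding similarities, overlay thesaurus synonyms (+1) and antonyms (-1), and cluster eigenvectors of the signed Laplacian. The paper's normalized-cut objective is approximated here by plain spectral clustering on `L = D_abs - W`.

```python
import numpy as np
from sklearn.cluster import KMeans

def signed_cluster(emb, synonyms, antonyms, k):
    """emb: (n, d) word vectors; synonyms/antonyms: lists of (i, j) pairs."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    W = normed @ normed.T                   # distributional similarity
    for i, j in synonyms:
        W[i, j] = W[j, i] = 1.0             # thesaurus overlay
    for i, j in antonyms:
        W[i, j] = W[j, i] = -1.0            # antonyms get negative weights
    D = np.diag(np.abs(W).sum(axis=1))      # degrees of |W|
    L = D - W                               # signed Laplacian
    vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    features = vecs[:, :k]                  # smallest-eigenvalue eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(features)
```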
Constructing Financial Sentimental Factors in Chinese Market Using Natural Language Processing
In this paper, we design an integrated algorithm to evaluate the sentiment of
the Chinese market. First, with the help of web browser automation, we
automatically crawl a large number of news articles and comments from several
influential financial websites. Second, we use Natural Language Processing
(NLP) techniques for Chinese text, including tokenization, Word2vec word
embeddings, and the semantic database WordNet, to compute senti-scores for
these news articles and comments, and then construct the sentimental factor.
Here, we build a finance-specific sentimental lexicon so that the factor
reflects the sentiment of the financial market rather than general sentiments
such as happiness or sadness. Third, we implement an adjustment of the standard
sentimental factor. Our experimental results show a significant correlation
between our standard sentimental factor and the Chinese market, and the
adjusted factor is even more informative, having a stronger correlation with
the Chinese market. Therefore, our sentimental factors can serve as important
references when making investment decisions, especially in periods when the
market is influenced heavily by public sentiment. During the Chinese market
crash in 2015, for example, the Pearson correlation coefficient of the adjusted
sentimental factor with the SSE index is 0.5844, which suggests that our model
can provide solid guidance.
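A minimal sketch of the senti-score and correlation steps, assuming a net-count scoring rule over a positive/negative lexicon; the paper's actual scoring and factor-adjustment procedures are not reproduced.

```python
import numpy as np

def senti_score(tokens, pos_words, neg_words):
    """Net lexicon sentiment of one tokenized document, normalized by length."""
    pos = sum(t in pos_words for t in tokens)
    neg = sum(t in neg_words for t in tokens)
    return (pos - neg) / max(len(tokens), 1)

def factor_correlation(daily_factor, index_returns):
    """Pearson correlation between the sentiment factor and market returns."""
    return float(np.corrcoef(daily_factor, index_returns)[0, 1])
```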