POIReviewQA: A Semantically Enriched POI Retrieval and Question Answering Dataset
Many services that perform information retrieval for Points of Interest (POI)
utilize a Lucene-based setup with spatial filtering. While this type of system
is easy to implement, it does not make use of semantics but relies on direct
word matches between a query and reviews, leading to a loss in both precision
and recall. To study the challenging task of semantically enriching POIs from
unstructured data in order to support open-domain search and question answering
(QA), we introduce a new dataset, POIReviewQA. It consists of 20k questions
(e.g. "is this restaurant dog friendly?") for 1,022 Yelp business types. For
each question we sampled 10 reviews and annotated each sentence in the reviews
as to whether it answers the question and, if so, what the corresponding answer
is. To test a system's ability to understand the text, we adopt an information
retrieval evaluation: all review sentences for a question are ranked by the
likelihood that they answer it. We build a Lucene-based baseline model, which
achieves 77.0% AUC and 48.8% MAP. A sentence embedding-based model achieves
79.2% AUC and 41.8% MAP, indicating that the dataset presents a challenging
problem for future research by the GIR community. The resulting technology can
help exploit the thematic content of web documents and social media for the
characterisation of locations.
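The ranking evaluation described in this abstract can be sketched as follows. The labels and scores below are illustrative stand-ins, not data from POIReviewQA; `auc` and `average_precision` are minimal reference implementations of the two reported metrics:

```python
def auc(labels, scores):
    """Probability that a random positive sentence outranks a random negative one."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # Count pairwise wins; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of precision@k taken at the rank of each relevant sentence."""
    ranked = [l for _, l in sorted(zip(scores, labels), key=lambda t: -t[0])]
    hits, total = 0, 0.0
    for k, label in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / k
    return total / max(hits, 1)

labels = [1, 0, 0, 1, 0]             # 1 = sentence answers the question
scores = [0.9, 0.7, 0.35, 0.6, 0.1]  # hypothetical model scores
print(round(auc(labels, scores), 3),
      round(average_precision(labels, scores), 3))  # → 0.833 0.833
```

MAP in the abstract is this average precision taken per question and then averaged over all questions in the dataset.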
Investigating the Effects of Word Substitution Errors on Sentence Embeddings
A key initial step in several natural language processing (NLP) tasks
involves embedding phrases of text into vectors of real numbers that preserve
semantic meaning. To that end, several methods have been recently proposed with
impressive results on semantic similarity tasks. However, all of these
approaches assume that perfect transcripts are available when generating the
embeddings. While this is a reasonable assumption for analysis of written text,
it is limiting for analysis of transcribed text. In this paper we investigate
the effects of word substitution errors, such as those arising from automatic
speech recognition (ASR), on several state-of-the-art sentence embedding
methods. To do this, we propose a new simulator that allows the experimenter to
induce ASR-plausible word substitution errors in a corpus at a desired word
error rate. We use this simulator to evaluate the robustness of several
sentence embedding methods. Our results show that pre-trained neural sentence
encoders are both robust to ASR errors and perform well on textual similarity
tasks after errors are introduced. Meanwhile, unweighted averages of word
vectors perform well with perfect transcriptions, but their performance
degrades rapidly on textual similarity tasks for text with word substitution
errors.
Comment: 4 pages, 2 figures. Copyright IEEE 2019. Accepted and to appear in
the Proceedings of the 44th International Conference on Acoustics, Speech,
and Signal Processing 2019 (IEEE-ICASSP-2019), May 12-17 in Brighton, U.K.
Personal use of this material is permitted. However, permission to
reprint/republish this material must be obtained from the IEEE.
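A simulator of the kind described in this abstract could look roughly like the sketch below. The confusion table and the `corrupt` function are invented for illustration; the paper's simulator derives ASR-plausible confusions rather than using a hand-written list:

```python
import random

# Toy table of acoustically confusable words (illustrative, not from the paper).
CONFUSIONS = {
    "their": ["there", "they're"],
    "see": ["sea"],
    "to": ["two", "too"],
}

def corrupt(tokens, wer, rng=random):
    """Substitute each confusable token with probability `wer`,
    approximating a target word error rate on confusable vocabulary."""
    out = []
    for tok in tokens:
        if tok in CONFUSIONS and rng.random() < wer:
            out.append(rng.choice(CONFUSIONS[tok]))
        else:
            out.append(tok)
    return out

rng = random.Random(0)  # seeded for reproducible corpora
print(corrupt("i want to see their new house".split(), wer=0.5, rng=rng))
```

Running the same corpus through `corrupt` at increasing `wer` values gives the controlled degradation needed to compare embedding methods' robustness.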
Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology Based Representations
We investigate the pertinence of methods from algebraic topology for text
data analysis. These methods enable the development of
mathematically-principled isometric-invariant mappings from a set of vectors to
a document embedding, which is stable with respect to the geometry of the
document in the selected metric space. In this work, we evaluate the utility of
these topology-based document representations in traditional NLP tasks,
specifically document clustering and sentiment classification. We find that the
embeddings do not benefit text analysis. In fact, performance is worse than
that of simple baseline techniques, indicating that the geometry of the
document does not provide enough variability for classification on the basis of
topic or sentiment in the chosen datasets.
Comment: 5 pages, 3 figures. Rep4NLP workshop at ACL 201
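The "unweighted averages of word vectors" baseline mentioned in the abstracts above can be sketched as follows. The 3-dimensional toy vectors are made up for illustration; real systems use pretrained embeddings such as GloVe or word2vec:

```python
import numpy as np

# Toy word vectors, illustrative only.
VECS = {
    "good": np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "bad": np.array([-0.7, 0.1, 0.2]),
    "movie": np.array([0.0, 0.9, 0.3]),
}

def doc_embedding(tokens):
    """Unweighted mean of the vectors of in-vocabulary tokens."""
    vecs = [VECS[t] for t in tokens if t in VECS]
    if not vecs:
        return np.zeros(3)
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

e1 = doc_embedding("good movie".split())
e2 = doc_embedding("great movie".split())
e3 = doc_embedding("bad movie".split())
print(cosine(e1, e2) > cosine(e1, e3))  # prints True
```

Despite its simplicity, this averaging baseline is the one that the ASR-robustness paper finds fragile under word substitution errors, and the one the topology paper finds hard to beat.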