17,310 research outputs found
Term-Specific Eigenvector-Centrality in Multi-Relation Networks
Fuzzy matching and ranking are two information retrieval techniques widely used in web search. Their application to structured data, however, remains an open problem. This article investigates how eigenvector-centrality can be used for approximate matching in multi-relation graphs, that is, graphs where connections of many different types may exist. Based on an extension of the PageRank matrix, eigenvectors representing the distribution of a term after propagating term weights between related data items are computed. The result is an index which takes the document structure into account and can be used with standard document retrieval techniques. As the scheme takes the shape of an index transformation, all necessary calculations are performed during index tim
Soft Seeded SSL Graphs for Unsupervised Semantic Similarity-based Retrieval
Semantic similarity based retrieval is playing an increasingly important role
in many IR systems such as modern web search, question-answering, similar
document retrieval etc. Improvements in retrieval of semantically similar
content are very significant to applications like Quora, Stack Overflow, Siri
etc. We propose a novel unsupervised model for semantic similarity based
content retrieval, where we construct semantic flow graphs for each query, and
introduce the concept of "soft seeding" in graph based semi-supervised learning
(SSL) to convert this into an unsupervised model.
We demonstrate the effectiveness of our model on an equivalent question
retrieval problem on the Stack Exchange QA dataset, where our unsupervised
approach significantly outperforms the state-of-the-art unsupervised models,
and produces comparable results to the best supervised models. Our research
provides a method to tackle semantic similarity based retrieval without any
training data, and allows seamless extension to different domain QA
communities, as well as to other semantic equivalence tasks.Comment: Published in Proceedings of the 2017 ACM Conference on Information
and Knowledge Management (CIKM '17
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This
profoundly limits our ability to give instructions to computers, the ability of
computers to explain their actions to us, and the ability of computers to
analyse and process text. Vector space models (VSMs) of semantics are beginning
to address these limits. This paper surveys the use of VSMs for semantic
processing of text. We organize the literature on VSMs according to the
structure of the matrix in a VSM. There are currently three broad classes of
VSMs, based on term-document, word-context, and pair-pattern matrices, yielding
three classes of applications. We survey a broad range of applications in these
three categories and we take a detailed look at a specific open source project
in each category. Our goal in this survey is to show the breadth of
applications of VSMs for semantics, to provide a new perspective on VSMs for
those who are already familiar with the area, and to provide pointers into the
literature for those who are less familiar with the field
Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search
Retrieval pipelines commonly rely on a term-based search to obtain candidate
records, which are subsequently re-ranked. Some candidates are missed by this
approach, e.g., due to a vocabulary mismatch. We address this issue by
replacing the term-based search with a generic k-NN retrieval algorithm, where
a similarity function can take into account subtle term associations. While an
exact brute-force k-NN search using this similarity function is slow, we
demonstrate that an approximate algorithm can be nearly two orders of magnitude
faster at the expense of only a small loss in accuracy. A retrieval pipeline
using an approximate k-NN search can be more effective and efficient than the
term-based pipeline. This opens up new possibilities for designing effective
retrieval pipelines. Our software (including data-generating code) and
derivative data based on the Stack Overflow collection is available online
Formal models, usability and related work in IR (editorial for special edition)
The Glasgow IR group has carried out both theoretical and empirical work, aimed at giving end users efficient and effective access to large collections of multimedia data
- …