11,005 research outputs found
Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search
Retrieval pipelines commonly rely on a term-based search to obtain candidate
records, which are subsequently re-ranked. Some candidates are missed by this
approach, e.g., due to a vocabulary mismatch. We address this issue by
replacing the term-based search with a generic k-NN retrieval algorithm, where
a similarity function can take into account subtle term associations. While an
exact brute-force k-NN search using this similarity function is slow, we
demonstrate that an approximate algorithm can be nearly two orders of magnitude
faster at the expense of only a small loss in accuracy. A retrieval pipeline
using an approximate k-NN search can be more effective and efficient than the
term-based pipeline. This opens up new possibilities for designing effective
retrieval pipelines. Our software (including data-generating code) and
derivative data based on the Stack Overflow collection is available online
Pytrec_eval: An Extremely Fast Python Interface to trec_eval
We introduce pytrec_eval, a Python interface to the tree_eval information
retrieval evaluation toolkit. pytrec_eval exposes the reference implementations
of trec_eval within Python as a native extension. We show that pytrec_eval is
around one order of magnitude faster than invoking trec_eval as a sub process
from within Python. Compared to a native Python implementation of NDCG,
pytrec_eval is twice as fast for practically-sized rankings. Finally, we
demonstrate its effectiveness in an application where pytrec_eval is combined
with Pyndri and the OpenAI Gym where query expansion is learned using
Q-learning.Comment: SIGIR '18. The 41st International ACM SIGIR Conference on Research &
Development in Information Retrieva
Supporting collocation learning with a digital library
Extensive knowledge of collocations is a key factor that distinguishes learners from fluent native speakers. Such knowledge is difficult to acquire simply because there is so much of it. This paper describes a system that exploits the facilities offered by digital libraries to provide a rich collocation-learning environment. The design is based on three processes that have been identified as leading to lexical acquisition: noticing, retrieval and generation. Collocations are automatically identified in input documents using natural language processing techniques and used to enhance the presentation of the documents and also as the basis of exercises, produced under teacher control, that amplify students' collocation knowledge. The system uses a corpus of 1.3 B short phrases drawn from the web, from which 29 M collocations have been automatically identified. It also connects to examples garnered from the live web and the British National Corpus
Query Expansion for Survey Question Retrieval in the Social Sciences
In recent years, the importance of research data and the need to archive and
to share it in the scientific community have increased enormously. This
introduces a whole new set of challenges for digital libraries. In the social
sciences typical research data sets consist of surveys and questionnaires. In
this paper we focus on the use case of social science survey question reuse and
on mechanisms to support users in the query formulation for data sets. We
describe and evaluate thesaurus- and co-occurrence-based approaches for query
expansion to improve retrieval quality in digital libraries and research data
archives. The challenge here is to translate the information need and the
underlying sociological phenomena into proper queries. As we can show retrieval
quality can be improved by adding related terms to the queries. In a direct
comparison automatically expanded queries using extracted co-occurring terms
can provide better results than queries manually reformulated by a domain
expert and better results than a keyword-based BM25 baseline.Comment: to appear in Proceedings of 19th International Conference on Theory
and Practice of Digital Libraries 2015 (TPDL 2015
Phonetic Searching
An improved method and apparatus is disclosed which uses probabilistic techniques to map an input search string with a prestored audio file, and recognize certain portions of a search string phonetically. An improved interface is disclosed which permits users to input search strings, linguistics, phonetics, or a combination of both, and also allows logic functions to be specified by indicating how far separated specific phonemes are in time.Georgia Tech Research Corporatio
Languages cool as they expand: Allometric scaling and the decreasing need for new words
We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use which has a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This ‘‘cooling pattern’’ forms the basis of a third statistical regularity, which unlike the Zipf and the Heaps law, is dynamical in nature
- …