The detection of Markovian sequences of signals, technical report no. 23
Detection of pure tone embedded in noise - Markov chains of signal
The Potential of Learned Index Structures for Index Compression
Inverted indexes are vital in providing fast keyword-based search. For every
term in the document collection, a list of identifiers of documents in which
the term appears is stored, along with auxiliary information such as term
frequencies and position offsets. While very effective, inverted indexes have
large memory requirements for web-sized collections. Recently, the concept of
learned index structures was introduced, where machine-learned models replace
common index structures such as B-tree indexes, hash indexes, and
Bloom filters. These learned index structures require less memory, and can be
computationally much faster than their traditional counterparts. In this paper,
we consider whether such models may be applied to conjunctive Boolean querying.
First, we investigate how a learned model can replace document postings of an
inverted index, and then evaluate the compromises such an approach might have.
Second, we evaluate the potential gains that can be achieved in terms of memory
requirements. Our work shows that learned models have great potential in
inverted indexing, and this direction seems to be a promising area for future
research.
Comment: Will appear in the proceedings of ADCS'1
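The conjunctive Boolean querying this abstract targets can be made concrete with a plain (non-learned) inverted index; the sketch below is a minimal baseline, not the learned model the paper studies, and the toy documents are invented for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of document identifiers."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)
    return index

def conjunctive_query(index, terms):
    """Conjunctive Boolean query: documents containing *all* query terms,
    found by intersecting the terms' postings lists."""
    postings = [index.get(t.lower(), []) for t in terms]
    if not postings or any(not p for p in postings):
        return []
    result = set(postings[0])
    for p in postings[1:]:
        result &= set(p)
    return sorted(result)

# Toy collection (invented for illustration).
docs = ["the quick brown fox", "the lazy dog", "quick dog tricks"]
idx = build_inverted_index(docs)
print(conjunctive_query(idx, ["quick", "dog"]))  # -> [2]
```

A learned variant would replace the stored postings lists with a model that predicts (or reconstructs) them, trading exactness guarantees for memory.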
Compressed Bit-sliced Signature Files An Index Structure for Large Lexicons
We use the signature file method to search for partially specified terms in large lexicons. To optimize efficiency, we use the concepts of the partially evaluated bit-sliced signature file method and memory resident data structures. Our system employs signature partitioning, compression, and term blocking. We derive equations to obtain system design parameters, and measure indexing efficiency in terms of time and space. The resulting approach provides good response time and is storage-efficient. In the experiments we use four different lexicons, and show that the signature file approach outperforms the inverted file approach in certain efficiency aspects.
KEYWORDS: Lexicon search, n-grams, signature files
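As a rough illustration of the superimposed-coding idea behind signature files, the sketch below hashes a term's n-grams into a fixed-width bit signature and tests candidate terms by bit coverage; the signature width, hash choice, and bits-per-gram are assumptions, not the paper's derived design parameters, and partitioning, bit-slicing, and compression are omitted.

```python
import hashlib

SIG_BITS = 64  # signature width (an assumption, not the paper's parameter)

def ngrams(term, n=3):
    """Character n-grams of a term."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]

def signature(term, bits_per_gram=2):
    """Superimposed coding: OR together hashed bit positions of each n-gram."""
    sig = 0
    for g in ngrams(term):
        h = int(hashlib.md5(g.encode()).hexdigest(), 16)
        for k in range(bits_per_gram):
            sig |= 1 << ((h >> (k * 8)) % SIG_BITS)
    return sig

def matches(query_sig, term_sig):
    """A term is a candidate if its signature covers every query bit.
    False drops are possible; a verification step would filter them."""
    return query_sig & term_sig == query_sig

lexicon = ["retrieval", "retrieve", "index"]
sigs = {t: signature(t) for t in lexicon}
q = signature("retrie")  # partially specified term
print([t for t in lexicon if matches(q, sigs[t])])
```

Because a term's n-grams are a superset of any prefix's n-grams, prefix queries like `retrie` match by construction; unrelated terms match only through hash collisions (false drops).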
Incremental Test Collections
Corpora and topics are readily available for information retrieval research. Relevance judgments, which are necessary for system evaluation, are expensive; the cost of obtaining them prohibits in-house evaluation of retrieval systems on new corpora or new topics. We present an algorithm for cheaply constructing sets of relevance judgments. Our method intelligently selects documents to be judged and decides when to stop in such a way that with very little work there can be a high degree of confidence in the result of the evaluation. We demonstrate the algorithm's effectiveness by showing that it produces small sets of relevance judgments that reliably discriminate between two systems. The algorithm can be used to incrementally design retrieval systems by simultaneously comparing sets of systems. The number of additional judgments needed after each incremental design change decreases at a rate reciprocal to the number of systems being compared. To demonstrate the effectiveness of our method, we evaluate TREC ad hoc submissions, showing that with 95% fewer relevance judgments we can reach a Kendall's tau rank correlation of at least 0.9
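The select-then-stop idea can be sketched as follows; this is a simplified stand-in for the paper's algorithm, using an invented rank-discounted weighting and a hard sign-certainty stopping rule rather than the authors' confidence criterion.

```python
def judgment_value(rank):
    """Weight a judgment by its maximum effect on a rank-discounted metric
    (an invented discount, for illustration only)."""
    return 1.0 / (rank + 1)

def incremental_compare(run_a, run_b, judge, depth=10):
    """Judge documents one at a time, most influential first, stopping as
    soon as the unjudged documents cannot flip which system scores higher."""
    docs = {}
    for rank, d in enumerate(run_a[:depth]):
        docs[d] = docs.get(d, 0) + judgment_value(rank)
    for rank, d in enumerate(run_b[:depth]):
        docs[d] = docs.get(d, 0) - judgment_value(rank)
    # Candidates ordered by |influence| on the score difference.
    order = sorted(docs, key=lambda d: -abs(docs[d]))
    diff, pending, judged = 0.0, sum(abs(v) for v in docs.values()), 0
    for d in order:
        rel = judge(d)           # 0/1 relevance judgment (the expensive step)
        judged += 1
        diff += docs[d] * rel
        pending -= abs(docs[d])
        if abs(diff) > pending:  # remaining judgments cannot change the sign
            break
    return ("A" if diff > 0 else "B" if diff < 0 else "tie"), judged

# Invented example: system A retrieves the relevant documents, B does not.
qrels = {"d1": 1, "d2": 1}
winner, judged = incremental_compare(
    ["d1", "d2", "d3"], ["d4", "d5", "d6"], lambda d: qrels.get(d, 0))
print(winner, judged)  # decides for A without judging every document
```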
Query Resolution for Conversational Search with Limited Supervision
In this work we focus on multi-turn passage retrieval as a crucial component
of conversational search. One of the key challenges in multi-turn passage
retrieval comes from the fact that the current turn query is often
underspecified due to zero anaphora, topic change, or topic return. Context
from the conversational history can be used to arrive at a better expression of
the current turn query, defined as the task of query resolution. In this paper,
we model the query resolution task as a binary term classification problem: for
each term appearing in the previous turns of the conversation, decide whether to
add it to the current turn query or not. We propose QuReTeC (Query Resolution
by Term Classification), a neural query resolution model based on bidirectional
transformers. We propose a distant supervision method to automatically generate
training data by using query-passage relevance labels. Such labels are often
readily available in a collection either as human annotations or inferred from
user interactions. We show that QuReTeC outperforms state-of-the-art models,
and furthermore, that our distant supervision method can be used to
substantially reduce the amount of human-curated data required to train
QuReTeC. We incorporate QuReTeC in a multi-turn, multi-stage passage retrieval
architecture and demonstrate its effectiveness on the TREC CAsT dataset.
Comment: SIGIR 2020 full conference paper
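QuReTeC itself is a bidirectional-transformer classifier, but the distant-supervision labelling it trains on can be sketched with plain set operations; the stopword list and example conversation below are invented for illustration, and real label generation works over relevant passages from the collection.

```python
import re

# Tiny invented stopword list; a real system would use a proper one.
STOP = {"the", "a", "an", "of", "to", "is", "and", "in", "what",
        "about", "it", "me", "tell", "was", "by"}

def terms(text):
    """Lowercased content terms of a text."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP

def distant_labels(history, current_query, relevant_passage):
    """Distant supervision in the spirit of the paper: a history term gets a
    positive label if it occurs in the relevant passage for the current turn
    and is not already in the current-turn query."""
    candidates = terms(" ".join(history)) - terms(current_query)
    gold = terms(relevant_passage)
    return {t: (t in gold) for t in sorted(candidates)}

def resolve(current_query, labels):
    """Query resolution as binary term classification: append the terms
    classified as positive to the current-turn query."""
    added = [t for t, keep in labels.items() if keep]
    return current_query + " " + " ".join(added) if added else current_query

history = ["Tell me about the Bronze Age collapse"]
query = "What caused it"
passage = ("The Bronze Age collapse was caused by invasions, "
           "drought and systems failure.")
labels = distant_labels(history, query, passage)
print(resolve(query, labels))  # the underspecified query gains context terms
```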
A Topological Method for Comparing Document Semantics
Comparing document semantics is one of the toughest tasks in both Natural
Language Processing and Information Retrieval. To date, tools for this task
remain rare, and most relevant methods are devised from statistical or
vector-space-model perspectives, with nearly none taking a topological
perspective. In this paper, we take a different approach: we propose a novel
algorithm based on topological persistence for comparing the semantic
similarity of two documents. Our experiments are conducted on a document
dataset with human judges' results, and a collection of state-of-the-art
methods is selected for comparison. The experimental results show that our
algorithm produces highly human-consistent results and beats most
state-of-the-art methods, though it ties with NLTK.
Comment: 9 pages, 3 tables, 9th International Conference on Natural Language Processing (NLP 2020)
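Topological persistence can be made concrete in its simplest form: 0-dimensional persistence of a point cloud, i.e. the scales at which connected components merge under single linkage. How documents are embedded as point clouds and compared is the paper's contribution and is not shown here; `diagram_distance` below is a crude invented stand-in for a proper bottleneck or Wasserstein distance.

```python
from itertools import combinations

def zero_dim_persistence(points):
    """0-dimensional persistence: component merge scales, computed with a
    small union-find over edges sorted by Euclidean length."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # a component dies (merges) at this scale
    return sorted(deaths)

def diagram_distance(d1, d2):
    """Crude comparison of two sorted death vectors (a stand-in for a real
    distance between persistence diagrams)."""
    n = max(len(d1), len(d2))
    a = d1 + [0.0] * (n - len(d1))
    b = d2 + [0.0] * (n - len(d2))
    return max(abs(x - y) for x, y in zip(a, b))

# Two tight pairs of points, far apart: merges at scale 1, 1, then 5.
print(zero_dim_persistence([(0, 0), (0, 1), (5, 0), (5, 1)]))
```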
How Am I Doing?: Evaluating Conversational Search Systems Offline
As conversational agents like Siri and Alexa gain in popularity and use, conversation is becoming a more and more important mode of interaction for search. Conversational search shares some features with traditional search, but differs in some important respects: conversational search systems are less likely to return ranked lists of results (a SERP), more likely to involve iterated interactions, and more likely to feature longer, well-formed user queries in the form of natural language questions. Because of these differences, traditional methods for search evaluation (such as the Cranfield paradigm) do not translate easily to conversational search. In this work, we propose a framework for offline evaluation of conversational search, which includes a methodology for creating test collections with relevance judgments, an evaluation measure based on a user interaction model, and an approach to collecting user interaction data to train the model. The framework is based on the idea of “subtopics”, often used to model novelty and diversity in search and recommendation, and the user model is similar to the geometric browsing model introduced by RBP and used in ERR. As far as we know, this is the first work to combine these ideas into a comprehensive framework for offline evaluation of conversational search
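The geometric browsing model referenced here is the one behind RBP: after each result, the user continues to the next with fixed persistence p, so rank k is inspected with probability p^(k-1). A minimal sketch of the standard RBP formula (the framework's subtopic-based measure itself is not reproduced here):

```python
def rbp(relevances, p=0.8):
    """Rank-biased precision: expected rate of relevant documents seen by a
    user who moves to the next rank with persistence p."""
    return (1 - p) * sum(rel * p ** k for k, rel in enumerate(relevances))

# Binary relevance of a ranked list, top to bottom.
print(round(rbp([1, 1, 0, 1], p=0.8), 4))  # -> 0.4624
```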
On rank correlation and the distance between rankings
Rank correlation statistics are useful for determining whether there is a correspondence between two measurements, particularly when the measures themselves are of less interest than their relative ordering. Kendall's τ in particular has found use in Information Retrieval as a "meta-evaluation" measure: it has been used to compare evaluation measures, evaluate system rankings, and evaluate predicted performance. In the meta-evaluation domain, however, correlations between systems confound relationships between measurements, practically guaranteeing a positive and significant estimate of τ regardless of any actual correlation between the measurements. We introduce an alternative measure of distance between rankings that corrects this by explicitly accounting for correlations between systems over a sample of topics, and moreover has a probabilistic interpretation for use in a test of statistical significance. We validate our measure with theory, simulated data, and experiment
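For reference, the standard (uncorrected) Kendall's τ that this abstract critiques can be computed directly over item pairs; the paper's proposed system-correlation-aware distance is not reproduced here.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same items (no ties):
    (concordant pairs - discordant pairs) / total pairs."""
    pos_a = {x: i for i, x in enumerate(rank_a)}
    pos_b = {x: i for i, x in enumerate(rank_b)}
    conc = disc = 0
    for x, y in combinations(rank_a, 2):
        s = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if s > 0:
            conc += 1  # pair ordered the same way in both rankings
        else:
            disc += 1  # pair ordered oppositely
    n = len(rank_a)
    return (conc - disc) / (n * (n - 1) / 2)

print(kendall_tau(["s1", "s2", "s3"], ["s1", "s3", "s2"]))
```

τ is +1 for identical rankings and -1 for reversed ones; the abstract's point is that correlated systems inflate this value in meta-evaluation even when the measurements themselves barely agree.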