Towards Query Logs for Privacy Studies: On Deriving Search Queries from Questions
Translating verbose information needs into crisp search queries is a
phenomenon that is ubiquitous but hardly understood. Insights into this process
could be valuable in several applications, including synthesizing large
privacy-friendly query logs from public Web sources which are readily available
to the academic research community. In this work, we take a step towards
understanding query formulation by tapping into the rich potential of community
question answering (CQA) forums. Specifically, we sample natural language (NL)
questions spanning diverse themes from the Stack Exchange platform, and conduct
a large-scale conversion experiment where crowdworkers submit search queries
they would use when looking for equivalent information. We provide a careful
analysis of this data, accounting for possible sources of bias during
conversion, along with insights into user-specific linguistic patterns and
search behaviors. We release a dataset of 7,000 question-query pairs from this
study to facilitate further research on query understanding. Comment: ECIR 2020 Short Paper
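As an illustration of the question-to-query conversion this study crowdsources, a naive automatic baseline might simply keep content words and drop function words. The sketch below is purely hypothetical (the stopword list and the `question_to_query` helper are assumptions, not the paper's method, which relies on human conversions):

```python
# A naive question-to-query baseline: keep content words, drop function words.
# Illustrative only; the paper collects human conversions, not automatic ones.
STOPWORDS = {
    "a", "an", "the", "is", "are", "was", "were", "do", "does", "did",
    "how", "what", "why", "when", "where", "which", "who", "can", "i",
    "my", "to", "of", "in", "on", "for", "and", "or", "it", "that",
}

def question_to_query(question: str) -> str:
    """Reduce a verbose natural language question to a crisp keyword query."""
    tokens = [t.strip("?.,!").lower() for t in question.split()]
    return " ".join(t for t in tokens if t and t not in STOPWORDS)

print(question_to_query("How do I recover a deleted branch in git?"))
# -> "recover deleted branch git"
```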
Retrieval Models based on Linguistic Features of Verbose Queries
Natural language expressions are more familiar to users than carefully chosen keywords, and they let users express sophisticated information needs. Instead of listing keywords, verbose queries are grammatically well-formed phrases or sentences in which terms are used together to convey the more specific meaning of a concept, and the relationships between these concepts are expressed by function words.
The goal of this thesis is to investigate methods of using the semantic and syntactic features of natural language queries to maximize the effectiveness of search. For this purpose, we propose the synchronous framework in which we use syntactic parsing techniques for modeling term dependencies. We use the Generative Relevance Hypothesis (GRH) to evaluate valid variations in dependence relationships between queries and documents. This is one of the first results demonstrating that dependency parsing can be used to improve retrieval effectiveness.
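To make the term-dependency idea concrete, here is a minimal sketch (not the thesis's synchronous framework) that extracts head-dependent relations from a verbose query with spaCy; the model name and filtering choices are assumptions:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def head_dependent_pairs(query: str):
    """Extract (head, dependent, relation) triples usable as dependency features."""
    doc = nlp(query)
    return [
        (tok.head.text, tok.text, tok.dep_)
        for tok in doc
        if tok.dep_ != "ROOT" and not tok.is_punct
    ]

pairs = head_dependent_pairs(
    "identify any efforts to improve water quality in developing countries"
)
for head, dep, rel in pairs:
    print(f"{rel}({head}, {dep})")
```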
We propose a method for classifying concepts in verbose queries as key concepts and secondary concepts, which are then used in a statistical translation model for query term expansion. Key concepts are the most important terms of a query, and we use them as the context for translating terms. Although secondary concepts are less important than key concepts, they still provide clues about what kind of information users are looking for; we use them to selectively apply the translation model to query terms.
We also define the important new task of focused retrieval of answer passages, which aims to immediately satisfy users' information needs with passages short enough for restricted search environments such as mobile devices and voice-based search systems.
Towards Better Text Understanding and Retrieval through Kernel Entity Salience Modeling
This paper presents a Kernel Entity Salience Model (KESM) that improves text
understanding and retrieval by better estimating entity salience (importance)
in documents. KESM represents entities by knowledge enriched distributed
representations, models the interactions between entities and words by kernels,
and combines the kernel scores to estimate entity salience. The whole model is
learned end-to-end using entity salience labels. The salience model also
improves ad hoc search accuracy, providing effective ranking features by
modeling the salience of query entities in candidate documents. Our experiments
on two entity salience corpora and two TREC ad hoc search datasets demonstrate
the effectiveness of KESM over frequency-based and feature-based methods. We
also provide examples showing how KESM conveys its text understanding ability
learned from entity salience to search.
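The kernel scoring at the heart of KESM resembles Gaussian kernel pooling over entity-word similarities. Below is a minimal numpy sketch under assumed hyperparameters (the kernel means, width, and toy embeddings are illustrative, not the paper's settings):

```python
import numpy as np

def kernel_scores(entity_vec, word_vecs, mus=(-0.5, 0.0, 0.5, 1.0), sigma=0.1):
    """Gaussian kernel pooling over entity-word cosine similarities.

    Each kernel softly counts how many words fall near a similarity level mu;
    the resulting vector of kernel scores would feed a salience classifier.
    """
    e = entity_vec / np.linalg.norm(entity_vec)
    W = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    sims = W @ e                                      # cosine similarities
    return np.array([np.exp(-(sims - mu) ** 2 / (2 * sigma ** 2)).sum()
                     for mu in mus])

rng = np.random.default_rng(0)
entity = rng.normal(size=64)         # toy entity embedding
words = rng.normal(size=(30, 64))    # toy document word embeddings
print(kernel_scores(entity, words))  # one soft-match count per kernel
```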
Term selection in information retrieval
Systems trained on linguistically annotated data achieve strong performance for many
language processing tasks. This encourages the idea that annotations can improve any
language processing task if applied in the right way. However, despite widespread
acceptance and availability of highly accurate parsing software, it is not clear that ad
hoc information retrieval (IR) techniques using annotated documents and requests consistently
improve search performance compared to techniques that use no linguistic
knowledge. In many cases, retrieval gains made using language processing components,
such as part-of-speech tagging and head-dependent relations, are offset by significant
negative effects. This results in a minimal positive, or even negative, overall
impact for linguistically motivated approaches compared to approaches that do not use
any syntactic or domain knowledge.
In some cases, it may be that syntax does not reveal anything of practical importance
about document relevance. Yet without a convincing explanation for why linguistic
annotations fail in IR, the intuitive appeal of search systems that ‘understand’ text
can result in the repeated application, and mis-application, of language processing to
enhance search performance. This dissertation investigates whether linguistics can improve
the selection of query terms by better modelling the alignment process between
natural language requests and search queries. It is the most comprehensive work on
the utility of linguistic methods in IR to date.
Term selection in this work focuses on identification of informative query terms of
1-3 words that both represent the semantics of a request and discriminate between relevant
and non-relevant documents. Approaches to word association are discussed with
respect to linguistic principles, and evaluated with respect to semantic characterization
and discriminative ability. Analysis is organised around three theories of language that
emphasize different structures for the identification of terms: phrase structure theory,
dependency theory and lexicalism. The structures identified by these theories play
distinctive roles in the organisation of language. Evidence is presented regarding the
value of different methods of word association based on these structures, and the effect
of method and term combinations.
Two highly effective, novel methods for the selection of terms from verbose queries
are also proposed and evaluated. The first method focuses on the semantic phenomenon
of ellipsis with a discriminative filter that leverages diverse text features. The second
method exploits a term ranking algorithm, PhRank, that uses no linguistic information
and relies on a network model of query context. The latter focuses queries so that 1-5
terms in an unweighted model achieve better retrieval effectiveness than weighted IR
models that use up to 30 terms. In addition, unlike models that use a weighted distribution
of terms or subqueries, the concise terms identified by PhRank are interpretable by
users. Evaluation with newswire and web collections demonstrates that PhRank-based
query reformulation significantly improves the performance of verbose queries by up
to 14% compared to highly competitive IR models, and is at least as good for short,
keyword queries with the same models.
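PhRank's details are not given in this abstract, but a network model of query context suggests a PageRank-style ranking over a term co-occurrence graph. The sketch below is a generic illustration under that assumption, not the dissertation's algorithm (the windowing and weighting choices are invented for the example):

```python
import itertools
import networkx as nx

def rank_terms(query_terms, context_windows, top_k=5):
    """Rank query terms by PageRank over a term co-occurrence graph.

    A generic stand-in for network-based term ranking: nodes are terms,
    edges connect terms that co-occur in the same context window.
    """
    g = nx.Graph()
    for window in context_windows:
        for a, b in itertools.combinations(set(window), 2):
            weight = g.get_edge_data(a, b, {"weight": 0})["weight"]
            g.add_edge(a, b, weight=weight + 1)
    scores = nx.pagerank(g, weight="weight")
    ranked = sorted(query_terms, key=lambda t: scores.get(t, 0.0), reverse=True)
    return ranked[:top_k]

windows = [["ocean", "remote", "sensing", "satellite"],
           ["remote", "sensing", "image", "classification"],
           ["ocean", "temperature", "satellite"]]
print(rank_terms(["ocean", "remote", "sensing", "data"], windows, top_k=3))
```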
Results illustrate that linguistic processing may help with the selection of word associations
but does not necessarily translate into improved IR performance. Statistical
methods are necessary to overcome the limits of syntactic parsing and word adjacency
measures for ad hoc IR. As a result, probabilistic frameworks that discover, and make
use of, many forms of linguistic evidence may deliver small improvements in IR effectiveness,
but methods that use simple features can be substantially more efficient
and equally, or more, effective. Various explanations for this finding are suggested,
including the probabilistic nature of grammatical categories, a lack of homomorphism
between syntax and semantics, the impact of lexical relations, variability in collection
data, and systemic effects in language systems.
Feature-Based Selection of Dependency Paths in Ad Hoc Information Retrieval
Techniques that compare short text segments using dependency paths (or simply, paths) appear in a wide range of automated language processing applications including question answering (QA). However, few models in ad hoc information retrieval (IR) use paths for document ranking due to the prohibitive cost of parsing a retrieval collection. In this paper, we introduce a flexible notion of paths that describe chains of words on a dependency path. These chains, or catenae, are readily applied in standard IR models. Informative catenae are selected using supervised machine learning with linguistically informed features and compared to both non-linguistic terms and catenae selected heuristically with filters derived from work on paths. Automatically selected catenae of 1-2 words deliver significant performance gains on three TREC collections.
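For illustration, short catenae can be approximated as single content words plus head-dependent word pairs along dependency edges. This spaCy sketch uses simple heuristic filters as a stand-in; the paper instead selects informative catenae with a supervised, feature-based model:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def catenae(text, max_len=2):
    """Enumerate short catenae: content words and head-dependent word pairs.

    A simplification of dependency-path chains for illustration only.
    """
    doc = nlp(text)
    units = [tok.lemma_ for tok in doc
             if tok.pos_ in {"NOUN", "PROPN", "VERB", "ADJ"}]
    if max_len >= 2:
        units += [
            f"{tok.head.lemma_} {tok.lemma_}"
            for tok in doc
            if tok.head is not tok and not tok.is_stop and not tok.head.is_stop
        ]
    return units

print(catenae("What countries import Cuban sugar?"))
```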
Leveraging Formulae and Text for Improved Math Retrieval
Large collections containing millions of math formulas are available online, but retrieving math expressions from them is challenging. Users can express mathematical information needs with a formula, a formula plus text, or a math question, and the structural complexity of formulas requires specialized processing. Despite the existence of math search systems and online community question-answering websites for math, little is known about mathematical information needs. This research first explores the characteristics of math searches submitted to a general search engine, showing how math searches differ from general searches.
Test collections for math-aware search are then introduced. The ARQMath test collections have two main tasks: (1) finding answers to math questions and (2) contextual formula search. Each test collection (ARQMath-1 to -3) uses the same corpus, Math Stack Exchange posts from 2010 to 2018, with different topics for each task. Compared to previous test collections, ARQMath has a much larger number of diverse topics and an improved evaluation protocol.
Another key contribution of this research is leveraging text and math information together for improved math information retrieval. Three formula search models that use only the formula, without context, are introduced. The first is an n-gram embedding model using both symbol layout tree and operator tree representations. The second uses tree-edit distance to re-rank the results of the first. Finally, a learning-to-rank model that leverages full-tree, sub-tree, and vector similarity scores is introduced. To use context, Math Abstract Meaning Representation (MathAMR) is introduced, which generalizes AMR trees to include math formula operations and arguments. MathAMR is then used for contextualized formula search with a fine-tuned Sentence-BERT model. The experiments show that tree-edit distance ranking achieves the current state-of-the-art results on the contextual formula search task, and that the MathAMR model can be beneficial for re-ranking. This research also addresses the answer retrieval task, introducing a two-step retrieval model in which similar questions are first found and answers previously given to those questions are then ranked. The proposed model fine-tunes two Sentence-BERT models, one for finding similar questions and another for ranking answers; both raw text and MathAMR are used as input.
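To illustrate tree-edit-distance re-ranking over operator trees, here is a minimal sketch using the zss (Zhang-Shasha) library; the toy formulas and the `opt` helper are assumptions for illustration, not the thesis's exact model:

```python
# pip install zss
from zss import Node, simple_distance

def opt(label, *kids):
    """Build an operator-tree node (toy stand-in for parsed formula trees)."""
    n = Node(label)
    for k in kids:
        n.addkid(k)
    return n

# Query formula: x^2 + y.  Candidates: x^2 + z  and  x * y.
query = opt("+", opt("^", opt("x"), opt("2")), opt("y"))
cand1 = opt("+", opt("^", opt("x"), opt("2")), opt("z"))
cand2 = opt("*", opt("x"), opt("y"))

# Re-rank candidates by edit distance to the query tree (smaller is better).
for name, tree in [("cand1", cand1), ("cand2", cand2)]:
    print(name, simple_distance(query, tree))
```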