5 research outputs found
PAQ: 65 million probably-asked questions and what you can do with them
Open-domain Question Answering models that directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared with conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models fall short of the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very large resource of 65M automatically generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models, whilst being significantly faster. Using PAQ, we train CBQA models which outperform comparable baselines by 5%, but trail RePAQ by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be configured for size (under 500MB) or speed (over 1K questions per second) while retaining high accuracy. Lastly, we demonstrate RePAQ's strength at selective QA, abstaining from answering when it is likely to be incorrect. This enables RePAQ to "back-off" to a more expensive state-of-the-art model, leading to a combined system which is both more accurate and 2x faster than the state-of-the-art model alone.
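As an illustration of the QA-pair retrieval idea described above, the following minimal sketch caches embeddings of stored questions and answers a new question by returning the answer attached to its nearest cached question. The embedding model and toy QA pairs are placeholders for illustration, not the PAQ data or the RePAQ implementation.

```python
# Minimal sketch of QA-pair retrieval (illustrative only, not RePAQ itself).
# Assumes the sentence-transformers package; model name and QA pairs are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

qa_pairs = [
    ("who wrote the origin of species", "Charles Darwin"),
    ("what is the capital of france", "Paris"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
question_index = model.encode([q for q, _ in qa_pairs], normalize_embeddings=True)

def answer(query: str) -> str:
    """Return the answer attached to the most similar cached question."""
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    scores = question_index @ q_emb  # cosine similarity (embeddings are unit-normalised)
    return qa_pairs[int(np.argmax(scores))][1]

print(answer("which city is the capital of France?"))  # -> "Paris"
```

Because answering reduces to a nearest-neighbour lookup over precomputed question embeddings, this kind of system can be made very fast and its index can be updated simply by adding new QA pairs.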
Domain-matched Pre-training Tasks for Dense Retrieval
Pre-training on larger datasets with ever-increasing model size is now a proven recipe for increased performance across almost all NLP tasks. A notable exception is information retrieval, where additional pre-training has so far failed to produce convincing results. We show that, with the right pre-training setup, this barrier can be overcome. We demonstrate this by pre-training large bi-encoder models on 1) a recently released set of 65 million synthetically generated questions, and 2) 200 million post-comment pairs from a preexisting dataset of Reddit conversations. We evaluate on a set of information retrieval and dialogue retrieval benchmarks, showing substantial improvements over supervised baselines.
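The bi-encoder models mentioned above are typically trained with a contrastive objective over paired examples, where the other passages in a batch act as negatives. The sketch below shows that standard in-batch-negative loss in PyTorch; the random "embeddings" and batch size are placeholders, not the paper's setup.

```python
# Sketch of bi-encoder training with in-batch negatives (the common dense-retrieval recipe;
# encoders and data are placeholders, not the paper's exact configuration).
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """q_emb, p_emb: (batch, dim) embeddings of queries and their positive passages.
    Every other passage in the batch serves as a negative for a given query."""
    scores = q_emb @ p_emb.T                  # (batch, batch) similarity matrix
    targets = torch.arange(q_emb.size(0))     # the i-th passage is the positive for the i-th query
    return F.cross_entropy(scores, targets)

# Toy usage with random vectors; in practice these come from two (shared or separate) encoders.
q = F.normalize(torch.randn(8, 128), dim=-1)
p = F.normalize(torch.randn(8, 128), dim=-1)
print(in_batch_negative_loss(q, p))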
KILT: a Benchmark for Knowledge Intensive Language Tasks
Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research on models that condition on specific information in large textual resources, we present a benchmark for knowledge-intensive language tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia, reducing engineering turnaround through the re-use of components, as well as accelerating research into task-agnostic memory architectures. We test both task-specific and general baselines, evaluating downstream performance in addition to the ability of the models to provide provenance. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline, outperforming more tailor-made approaches for fact checking, open-domain question answering and dialogue, and yielding competitive results on entity linking and slot filling, by generating disambiguated text. KILT data and code are available at https://github.com/facebookresearch/KILT.
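The "shared dense vector index coupled with a seq2seq model" baseline mentioned above follows a retrieve-then-generate pattern. The sketch below illustrates that pattern in miniature: retrieve the closest passage by embedding similarity, then let a seq2seq model generate an answer conditioned on the query and the passage. The model names and toy passages are generic placeholders, not the KILT baseline checkpoints or index.

```python
# Retrieve-then-generate sketch (illustrative only; models and passages are placeholders).
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

passages = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, completed in 1889.",
    "Mount Everest is Earth's highest mountain above sea level.",
]

retriever = SentenceTransformer("all-MiniLM-L6-v2")
index = retriever.encode(passages, normalize_embeddings=True)

tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
reader = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

def answer(query: str) -> str:
    scores = index @ retriever.encode([query], normalize_embeddings=True)[0]
    context = passages[int(np.argmax(scores))]           # top-1 retrieved passage
    prompt = f"question: {query} context: {context}"
    ids = reader.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=20)
    return tok.decode(ids[0], skip_special_tokens=True)

print(answer("When was the Eiffel Tower completed?"))
```

Grounding every task in the same Wikipedia snapshot means a single index like the one above can, in principle, serve all KILT tasks, which is the engineering re-use the benchmark is designed to encourage.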
Generating fact checking briefs
Fact checking at scale is difficult -- while the number of active fact checking websites is growing, it remains too small for the needs of the contemporary media ecosystem. However, despite good intentions, contributions from volunteers are often error-prone, and thus in practice restricted to claim detection. We investigate how to increase the accuracy and efficiency of fact checking by providing information about the claim before performing the check, in the form of natural language briefs. We investigate passage-based briefs, containing a relevant passage from Wikipedia, entity-centric ones consisting of Wikipedia pages of mentioned entities, and Question-Answering Briefs, with questions decomposing the claim, and their answers. To produce QABriefs, we develop QABriefer, a model that generates a set of questions conditioned on the claim, searches the web for evidence, and generates answers. To train its components, we introduce QABriefDataset, which we collected via crowdsourcing. We show that fact checking with briefs -- in particular QABriefs -- increases the accuracy of crowdworkers by 10% while slightly decreasing the time taken. For volunteer (unpaid) fact checkers, QABriefs slightly increase accuracy and reduce the time required by around 20%.
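The QABriefer pipeline described above has three stages: decompose the claim into questions, search for evidence for each question, and answer it. The skeleton below shows that flow with stub components; every function here is a placeholder for illustration, not the released QABriefer models or data.

```python
# Schematic of the QABrief pipeline (question generation -> evidence search -> answering).
# All components are placeholder stubs; the real system uses trained models and web search.
from dataclasses import dataclass
from typing import List

@dataclass
class QABrief:
    question: str
    evidence: str
    answer: str

def generate_questions(claim: str) -> List[str]:
    # Stub: a trained question-generation model would decompose the claim here.
    return [f"Is it true that {claim.rstrip('.').lower()}?"]

def search_evidence(question: str) -> str:
    # Stub: a web or Wikipedia search component would return a relevant passage here.
    return "(retrieved passage)"

def answer_question(question: str, evidence: str) -> str:
    # Stub: a reading-comprehension model would extract or generate the answer here.
    return "(answer grounded in the evidence)"

def build_brief(claim: str) -> List[QABrief]:
    brief = []
    for question in generate_questions(claim):
        evidence = search_evidence(question)
        brief.append(QABrief(question, evidence, answer_question(question, evidence)))
    return brief

for item in build_brief("The Eiffel Tower was completed in 1889."):
    print(item)
```

The resulting list of question-evidence-answer triples is what gets handed to the human fact checker alongside the claim.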
How Context Affects Language Models' Factual Predictions
When pre-trained on large unsupervised textual corpora, language models are able to store and retrieve factual knowledge to some extent, making it possible to use them directly for zero-shot cloze-style question answering. However, storing factual knowledge in a fixed number of weights of a language model clearly has limitations. Previous approaches have successfully provided access to information outside the model weights using supervised architectures that combine an information retrieval system with a machine reading component. In this paper, we go a step further and integrate information from a retrieval system with a pre-trained language model in a purely unsupervised way. We report that augmenting pre-trained language models in this way dramatically improves performance and that the resulting system, despite being unsupervised, is competitive with a supervised machine reading baseline. Furthermore, processing query and context with different segment tokens allows BERT to utilize its Next Sentence Prediction pre-trained classifier to determine whether the context is relevant or not, substantially improving BERT's zero-shot cloze-style question-answering performance and making its predictions robust to noisy contexts.
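The relevance-filtering idea described above can be illustrated with BERT's off-the-shelf Next Sentence Prediction head: encode the query as segment A and the retrieved context as segment B, and read off the probability that B follows A as a relevance signal. The sketch below shows this mechanism in isolation; the checkpoint name and example strings are placeholders, and this is not the paper's full unsupervised QA pipeline.

```python
# Sketch: using BERT's Next Sentence Prediction head as a query-context relevance score.
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

def relevance_score(query: str, context: str) -> float:
    """Probability that `context` is a plausible continuation of `query`
    according to the NSP head, used here as a relevance signal."""
    inputs = tokenizer(query, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits       # index 0 = "is next", index 1 = "is not next"
    return torch.softmax(logits, dim=-1)[0, 0].item()

print(relevance_score("The theory of relativity was developed by [MASK].",
                      "Albert Einstein developed the theory of relativity."))
```

A low score can then be used to discard a noisy retrieved passage before it is allowed to influence the cloze prediction.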