Search CORE

257 research outputs found

Generating Synthetic Data for Neural Keyword-to-Question Models

Author: Bogdanova Dasha
Lin Chin-Yew
Mikolov Tomas
Ros German
Zheng Zhicheng
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 13/07/2018
Field of study

Search typically relies on keyword queries, but these are often semantically ambiguous. We propose to overcome this by offering users natural language questions, based on their keyword queries, to disambiguate their intent. This keyword-to-question task may be addressed using neural machine translation techniques. Neural translation models, however, require massive amounts of training data (keyword-question pairs), which is unavailable for this task. The main idea of this paper is to generate large amounts of synthetic training data from a small seed set of hand-labeled keyword-question pairs. Since natural language questions are available in large quantities, we develop models to automatically generate the corresponding keyword queries. Further, we introduce various filtering mechanisms to ensure that synthetic training data is of high quality. We demonstrate the feasibility of our approach using both automatic and manual evaluation. This is an extended version of the article published with the same title in the Proceedings of ICTIR'18.Comment: Extended version of ICTIR'18 full paper, 11 page

arXiv.org e-Print Archive

Crossref

How to Ask Better Questions? A Large-Scale Multi-Domain Dataset for Rewriting Ill-Formed Questions

Author: Chen Jing
Chen Mingda
Chu Zewei
Faruqui Manaal
Gimpel Kevin
Si Xiance
Wang Miaosen
Publication venue
Publication date: 20/11/2019
Field of study

We present a large-scale dataset for the task of rewriting an ill-formed natural language question to a well-formed one. Our multi-domain question rewriting MQR dataset is constructed from human contributed Stack Exchange question edit histories. The dataset contains 427,719 question pairs which come from 303 domains. We provide human annotations for a subset of the dataset as a quality estimate. When moving from ill-formed to well-formed questions, the question quality improves by an average of 45 points across three aspects. We train sequence-to-sequence neural models on the constructed dataset and obtain an improvement of 13.2% in BLEU-4 over baseline methods built from other data resources. We release the MQR dataset to encourage research on the problem of question rewriting.Comment: AAAI 202

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Question Paraphrase Generation for Question Answering System

Author: Qin Haocheng
Publication venue: 'University of Waterloo'
Publication date: 18/08/2015
Field of study

The queries to a practical Question Answering (QA) system range from keywords, phrases, badly written questions, and occasionally grammatically perfect questions. Among different kinds of question analysis approaches, the pattern matching works well in analyzing such queries. It is costly to build this pattern matching module because tremendous manual labor is needed to expand its coverage to so many variations in natural language questions. This thesis proposes that the costly manual labor should be saved by the technique of paraphrase generation which can automatically generate semantically similar paraphrases of a natural language question. Previous approaches of paraphrase generation either require large scale of corpus and the dependency parser, or only deal with the relation-entity type of simple question queries. By introducing a method of inferring transformation operations between paraphrases, and a description of sentence structure, this thesis develops a paraphrase generation method and its implementation in Chinese with very limited amount of corpus. The evaluation results of this implementation show its ability to aid humans to efficiently create a pattern matching module for QA systems as it greatly outperforms the human editors in the coverage of natural language questions, with an acceptable precision in generated paraphrases

University of Waterloo's Institutional Repository

Look before you Hop: Conversational Question Answering over Knowledge Graphs Using Judicious Context Expansion

Author: Abujabal A.
Christmann P.
Saha Roy R.
Singh J.
Weikum G.
Publication venue
Publication date: 01/01/2019
Field of study

Fact-centric information needs are rarely one-shot; users typically ask follow-up questions to explore a topic. In such a conversational setting, the user's inputs are often incomplete, with entities or predicates left out, and ungrammatical phrases. This poses a huge challenge to question answering (QA) systems that typically rely on cues in full-fledged interrogative sentences. As a solution, we develop CONVEX: an unsupervised method that can answer incomplete questions over a knowledge graph (KG) by maintaining conversation context using entities and predicates seen so far and automatically inferring missing or ambiguous pieces for follow-up questions. The core of our method is a graph exploration algorithm that judiciously expands a frontier to find candidate answers for the current question. To evaluate CONVEX, we release ConvQuestions, a crowdsourced benchmark with 11,200 distinct conversations from five different domains. We show that CONVEX: (i) adds conversational support to any stand-alone QA system, and (ii) outperforms state-of-the-art baselines and question completion strategies

arXiv.org e-Print Archive

Crossref

MPG.PuRe

Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

Author: Andoni A.
Beyer K.
Broder A. Z.
Brown P. F.
Fried D.
Le Q.
Mikolov T.
Mu Y.
Muja M.
Petrović S.
Riezler S.
Salton G.
Wang J.
Weber R.
Yang L.
Yao X.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 30/10/2016
Field of study

Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective and efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection is available online

arXiv.org e-Print Archive

Crossref

Scipedia

A Deep Relevance Matching Model for Ad-hoc Retrieval

Author: Giles R. C. S. L. L.
Hu B.
Lu Z.
Mikolov T.
Qiu X.
Socher R.
Wan S.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 23/11/2017
Field of study

In recent years, deep neural networks have led to exciting breakthroughs in speech recognition, computer vision, and natural language processing (NLP) tasks. However, there have been few positive results of deep models on ad-hoc retrieval tasks. This is partially due to the fact that many important characteristics of the ad-hoc retrieval task have not been well addressed in deep models yet. Typically, the ad-hoc retrieval task is formalized as a matching problem between two pieces of text in existing work using deep models, and treated equivalent to many NLP tasks such as paraphrase identification, question answering and automatic conversation. However, we argue that the ad-hoc retrieval task is mainly about relevance matching while most NLP matching tasks concern semantic matching, and there are some fundamental differences between these two matching tasks. Successful relevance matching requires proper handling of the exact matching signals, query term importance, and diverse matching requirements. In this paper, we propose a novel deep relevance matching model (DRMM) for ad-hoc retrieval. Specifically, our model employs a joint deep architecture at the query term level for relevance matching. By using matching histogram mapping, a feed forward matching network, and a term gating network, we can effectively deal with the three relevance matching factors mentioned above. Experimental results on two representative benchmark collections show that our model can significantly outperform some well-known retrieval models as well as state-of-the-art deep matching models.Comment: CIKM 2016, long pape

arXiv.org e-Print Archive

Crossref