    MuSiQue: Multihop Questions via Single-hop Question Composition

    Multihop reasoning remains an elusive goal as existing multihop benchmarks are known to be largely solvable via shortcuts. Can we create a question answering (QA) dataset that, by construction, requires proper multihop reasoning? To this end, we introduce a bottom-up approach that systematically selects composable pairs of single-hop questions that are connected, i.e., where one reasoning step critically relies on information from another. This bottom-up methodology lets us explore a vast space of questions and add stringent filters as well as other mechanisms targeting connected reasoning. It provides fine-grained control over the construction process and the properties of the resulting k-hop questions. We use this methodology to create MuSiQue-Ans, a new multihop QA dataset with 25K 2-4 hop questions. Relative to existing datasets, MuSiQue-Ans is more difficult overall (3x increase in human-machine gap) and harder to cheat via disconnected reasoning (e.g., a single-hop model has a 30-point drop in F1). We further add unanswerable contrast questions to produce a more stringent dataset, MuSiQue-Full. We hope our datasets will help the NLP community develop models that perform genuine multihop reasoning.
    Comment: Accepted for publication in Transactions of the Association for Computational Linguistics (TACL), 2022.
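    A minimal, illustrative sketch of the bottom-up composition idea described in the abstract: two single-hop questions are treated as connected when the answer to the first (the bridge entity) appears in the text of the second, and the bridge is then replaced by a reference to the first question. The class and function names below are hypothetical and are not taken from the MuSiQue codebase.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class SingleHopQuestion:
        question: str
        answer: str

    def is_connected(q1: SingleHopQuestion, q2: SingleHopQuestion) -> bool:
        # q2 critically relies on q1 if q1's answer (the bridge entity)
        # appears in q2's question text.
        return q1.answer.lower() in q2.question.lower()

    def compose_two_hop(q1: SingleHopQuestion, q2: SingleHopQuestion) -> Optional[str]:
        # Replace the bridge entity in q2 with a reference to q1, so that
        # answering the composed question requires both reasoning steps.
        if not is_connected(q1, q2):
            return None
        paraphrase = "the answer to '" + q1.question.rstrip("?") + "'"
        return q2.question.replace(q1.answer, paraphrase)

    q1 = SingleHopQuestion("Who founded Google?", "Larry Page")
    q2 = SingleHopQuestion("Where was Larry Page born?", "Lansing, Michigan")
    print(compose_two_hop(q1, q2))
    # -> Where was the answer to 'Who founded Google' born?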

    Investigating the Gap Between Single-Hop and Multi-Hop Questions in Closed-Book Question Answering via Question Decomposition

    Distributed Computing and Artificial Intelligence, Special Sessions I, 20th International Conference, DCAI 2023, 12-14 July, Guimaraes, Portugal.
    Transformer-based language models (LMs) have been shown to perform question answering (QA) competitively even when the context is removed and only the question is given as input (called closed-book QA). Previous work on closed-book QA has mainly used simple questions that require a single reasoning step (i.e., single-hop questions). In this study, we find that using multi-hop questions, which require multiple reasoning steps, drastically drops performance. We investigate how to close this gap using two methods: fine-tuning with explicit question decomposition using three decomposition systems, or few-shot learning with chain-of-thought (CoT) prompting for implicit question decomposition. We experiment on three multi-hop datasets, covering different multi-hop question types (e.g., compositional, comparison). We demonstrate when the methods fail and identify the future directions that are most promising for closing the gap between single-hop and multi-hop closed-book QA. We release the code: https://github.com/talkhaldi/mh_cbqa
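    A minimal sketch of the few-shot chain-of-thought (CoT) setup mentioned in the abstract: the multi-hop question is posed closed-book (no context passages), preceded by a worked demonstration whose reasoning implicitly decomposes the question into single-hop steps. The demonstration text and the helper name are illustrative assumptions, not taken from the released code.

    COT_DEMONSTRATION = (
        "Q: Where was the founder of SpaceX born?\n"
        "A: The founder of SpaceX is Elon Musk. Elon Musk was born in Pretoria. "
        "So the answer is Pretoria.\n\n"
    )

    def build_cot_prompt(question: str) -> str:
        # Prepend a worked decomposition so the model writes its own reasoning
        # chain before the final answer; only the question is given (closed-book).
        return COT_DEMONSTRATION + "Q: " + question + "\nA:"

    print(build_cot_prompt("Who directed the film that won the Best Picture Oscar in 1998?"))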