2 research outputs found
The TechQA Dataset
We introduce TechQA, a domain-adaptation question answering dataset for the
technical support domain. The TechQA corpus highlights two real-world issues
from the automated customer support domain. First, it contains actual questions
posed by users on a technical forum, rather than questions generated
specifically for a competition or a task. Second, it has a real-world size --
600 training, 310 dev, and 490 evaluation question/answer pairs -- thus
reflecting the cost of creating large labeled datasets with actual data.
Consequently, TechQA is meant to stimulate research in domain adaptation rather
than being a resource to build QA systems from scratch. The dataset was
obtained by crawling the IBM Developer and IBM DeveloperWorks forums for
questions with accepted answers that appear in a published IBM Technote---a
technical document that addresses a specific technical issue. We also release a
collection of the 801,998 publicly available Technotes as of April 4, 2019 as a
companion resource that might be used for pretraining, to learn representations
of the IT domain language.Comment: Long version of conference paper to be submitte