Domain adaptation, the process of training a model in one domain and applying
it to another, has been extensively explored in machine learning. While
training a domain-specific foundation model (FM) from scratch is an option,
recent methods have focused on adapting pre-trained FMs for domain-specific
tasks. However, our experiments reveal that neither approach consistently
achieves state-of-the-art (SOTA) results in the target domain. In
this work, we study extractive question answering within closed domains and
introduce the concept of targeted pre-training. This involves identifying and
generating relevant data to further pre-train our models, in contrast to the
conventional approach of using domain-specific FMs trained on a wide
range of data. Our proposed framework uses Galactica to generate synthetic,
``targeted'' corpora that align with specific writing styles and topics, such
as research papers and radiology reports. This process can be viewed as a form
of knowledge distillation. We apply our method to two biomedical extractive
question answering datasets, COVID-QA and RadQA, setting a new benchmark on
the former and demonstrating overall improvements on the latter. Code is available
at https://github.com/saptarshi059/CDQA-v1-Targetted-PreTraining/tree/main