The potential to provide patients with faster information access while
allowing medical specialists to concentrate on critical tasks makes medical
domain dialog agents appealing. However, the integration of large language
models (LLMs) into these agents presents certain limitations that may result in
serious consequences. This paper investigates the challenges and risks of using
GPT-3-based models for medical question-answering (MedQA). We perform several
evaluations contextualized against standard medical principles. We provide
a procedure for manually designing patient queries to stress-test high-risk
limitations of LLMs in MedQA systems. Our analysis reveals that LLMs fail to
respond adequately to these queries, generating erroneous medical information,
unsafe recommendations, and content that may be considered offensive.