ChatGPT and GPT-4 have recently gained immense global attention
due to their unparalleled performance in language processing. Despite
demonstrating impressive capability in various open-domain tasks, their
adequacy in highly specific fields like radiology remains untested. Radiology
text presents linguistic phenomena that differ from open-domain data because of
its specificity and complexity. Assessing the performance of large language models
(LLMs) in such specific domains is crucial not only for a thorough evaluation
of their overall performance but also for providing valuable insights into
future model design directions: whether model design should be generic or
domain-specific. To this end, we evaluate the performance of ChatGPT/GPT-4 on a
radiology natural language inference (NLI) task and compare it to other models
fine-tuned specifically on task-related data samples. We also conduct a comprehensive
investigation of ChatGPT/GPT-4's reasoning ability by introducing varying
levels of inference difficulty. Our results show that 1) GPT-4 outperforms
ChatGPT on the radiology NLI task; and 2) models fine-tuned specifically for
the task require a large number of data samples to achieve performance comparable
to ChatGPT/GPT-4. These findings demonstrate that constructing a generic model
that is capable of solving various tasks across different domains is feasible