In healthcare, multimodal data such as medical images and clinical reports is
prevalent and must be analyzed comprehensively before diagnostic decisions are
made. However, current large-scale artificial intelligence
models predominantly focus on single-modal cognitive abilities and neglect the
integration of multiple modalities. Therefore, we propose Stone Needle, a
general multimodal large-scale model framework tailored explicitly for
healthcare applications. Stone Needle serves as a comprehensive medical
multimodal model foundation, integrating various modalities such as text,
images, videos, and audio to surpass the limitations of single-modal systems.
Through the framework components of intent analysis, medical foundation models,
prompt manager, and medical language module, our architecture can perform
multimodal interaction across multiple rounds of dialogue. Because the
framework is general, it can also be tailored to specific tasks. The
experimental results demonstrate
the superior performance of our method compared to single-modal systems. The
fusion of different modalities and the ability to process complex medical
information in Stone Needle benefit accurate diagnosis, treatment
recommendations, and patient care.
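
As a rough illustration of how the four components named above (intent analysis, medical foundation models, prompt manager, and medical language module) might fit together in a multi-round dialogue, the following is a minimal sketch only. Every name in it (`Turn`, `analyze_intent`, `FOUNDATION_MODELS`, `build_prompt`, `language_module`, `Dialogue`) is hypothetical, and the models are replaced by placeholder functions; it does not reproduce the actual Stone Needle implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Turn:
    """One user turn: free text plus optional attachments keyed by modality."""
    text: str
    attachments: Dict[str, str] = field(default_factory=dict)


def analyze_intent(turn: Turn) -> List[str]:
    """Intent analysis (assumed rule): route every attached modality."""
    return sorted(turn.attachments)


# Placeholder stand-ins for medical foundation models, one per modality.
FOUNDATION_MODELS: Dict[str, Callable[[str], str]] = {
    "image": lambda ref: f"[image findings for {ref}]",
    "audio": lambda ref: f"[audio transcript for {ref}]",
    "video": lambda ref: f"[video summary for {ref}]",
}


def build_prompt(history: List[str], turn: Turn, findings: List[str]) -> str:
    """Prompt manager (assumed format): history, model findings, then the query."""
    lines = [f"Previous: {h}" for h in history]
    lines += findings
    lines.append(f"User: {turn.text}")
    return "\n".join(lines)


def language_module(prompt: str) -> str:
    """Placeholder for the medical language module; a real one would call an LLM."""
    return f"(reply based on {len(prompt.splitlines())} prompt lines)"


class Dialogue:
    """Keeps the conversation history so later turns can build on earlier ones."""

    def __init__(self) -> None:
        self.history: List[str] = []

    def step(self, turn: Turn) -> str:
        modalities = analyze_intent(turn)
        findings = [
            FOUNDATION_MODELS[m](turn.attachments[m])
            for m in modalities
            if m in FOUNDATION_MODELS
        ]
        prompt = build_prompt(self.history, turn, findings)
        reply = language_module(prompt)
        self.history.append(turn.text)
        return reply


if __name__ == "__main__":
    dialogue = Dialogue()
    print(dialogue.step(Turn("What does this scan show?", {"image": "scan_001"})))
    print(dialogue.step(Turn("Is any follow-up needed?")))
```

In this sketch, tailoring to a specific task would amount to registering a different function under a modality key or changing the prompt format, which is one possible reading of the framework's generality rather than a description of how it is achieved in practice.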