Multi-modal large language models (LLMs) have achieved powerful capabilities
for visual semantic understanding in recent years. However, little is known
about how these models comprehend visual information and interpret features
from different modalities. In this paper, we propose a new method for identifying multi-modal
neurons in transformer-based multi-modal LLMs. Through a series of experiments,
we highlight three critical properties of multi-modal neurons using four
well-designed quantitative evaluation metrics. Furthermore, we introduce a
knowledge editing method based on the identified multi-modal neurons, which
modifies a specific token into another designated token. We hope our findings
can inspire further explanatory research on the mechanisms of
multi-modal LLMs.