Artificial Intelligence Generated Content (AIGC) has garnered considerable
attention for its impressive performance, with ChatGPT emerging as a leading
AIGC model that produces high-quality responses across various applications,
including software development and maintenance. Despite its potential, the
misuse of ChatGPT poses significant concerns, especially in education and
safety-critical domains. Numerous AIGC detectors have been developed and
evaluated on natural language data. However, their performance on code-related
content generated by ChatGPT remains unexplored. To fill this gap, we present
the first empirical study evaluating existing AIGC detectors in the software
domain. We constructed a comprehensive dataset of 492.5K samples of
code-related content produced by ChatGPT, covering popular software activities
such as Q&A (115K), code summarization (126K), and code generation (226.5K).
We evaluated six AIGC detectors,
including three commercial and three open-source solutions, assessing their
performance on this dataset. Additionally, we conducted a human study to
understand human detection capabilities and compare them with those of existing AIGC
detectors. Our results indicate that AIGC detectors demonstrate lower
performance on code-related data compared to natural language data. Fine-tuning
can enhance detector performance, especially for content within the same
domain, but generalization remains a challenge. The human evaluation reveals
that detection by humans is quite challenging as well.