Python has become the most popular programming language, largely because it is
friendly to beginners. However, a recent study has found that most security
issues in Python have not been indexed by CVE and may only be fixed by 'silent'
security commits, which threatens software security and hinders the propagation
of security fixes to downstream software. It is critical to identify the hidden
security commits; however, the existing datasets and methods are insufficient
for security commit detection in Python, due to the limited data variety,
non-comprehensive code semantics, and uninterpretable learned features. In this
paper, we construct the first security commit dataset in Python, namely
PySecDB, which consists of three subsets: a base dataset, a pilot dataset,
and an augmented dataset. The base dataset contains the security
commits associated with CVE records provided by MITRE. To increase the variety
of security commits, we build the pilot dataset from GitHub by filtering
keywords within the commit messages. Since not all commits provide commit
messages, we further construct the augmented dataset by analyzing the
semantics of code changes. To build the augmented dataset, we propose a new
graph representation named CommitCPG and a multi-attributed graph learning
model named SCOPY to identify the security commit candidates through both
sequential and structural code semantics. The evaluation shows our proposed
algorithms can improve the data collection efficiency by up to 40 percentage
points. After manual verification by three security experts, PySecDB consists
of 1,258 security commits and 2,791 non-security commits. Furthermore, we
conduct an extensive case study on PySecDB and discover four common security
fix patterns that cover over 85% of security commits in Python, providing
insight into secure software maintenance, vulnerability detection, and
automated program repair.

Comment: Accepted to 2023 IEEE International Conference on Software
Maintenance and Evolution (ICSME)