Automatic Detection of Insecure Codes in Stack Overflow

Abstract

As the popularity of modern social coding platforms such as Stack Overflow grows, so do their potential security risks (e.g., insecure code can be easily embedded and distributed). To address this largely overlooked issue, we propose exploiting social coding properties in addition to code content for the automatic detection of insecure code snippets in Stack Overflow. To determine whether a given code snippet is insecure, we not only analyze the code content but also utilize various kinds of relations among users, badges, questions, answers, code snippets, and keywords in Stack Overflow. To model these rich semantic relationships, we first introduce a structured heterogeneous information network (HIN) for representation and then use a meta-path-based approach to incorporate higher-level semantics and build up relatedness over code snippets. We then propose two novel network embedding models, Snippet2vec and CodeHin2Vec, for representation learning in the HIN to automate insecure code snippet detection in Stack Overflow. More specifically, Snippet2vec learns low-dimensional representations for the nodes (i.e., code snippets) in the HIN such that both the HIN structure and semantics are maximally preserved, while CodeHin2Vec uses the HIN to depict relatedness over code snippets and generate code-to-code sequences, to which the sequence-to-sequence (seq2seq) approach from machine translation is then applied to learn representations of code snippets. Accordingly, we develop two systems, ICSD and iTrustSO, which integrate these two methods, respectively, for insecure code snippet detection in Stack Overflow. Comprehensive experiments on data collected from Stack Overflow validate the effectiveness of the developed systems in comparison with state-of-the-art baseline methods.
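
As a rough illustration of how a meta-path can relate code snippets through the network, the following minimal sketch counts instances of one assumed meta-path, snippet-answer-user-answer-snippet, on a toy HIN. The node names, the two relation tables, and the function `metapath_counts` are hypothetical and chosen only for illustration; this is not the Snippet2vec or CodeHin2Vec model, and the actual meta-paths used in this work may differ.

```python
from collections import defaultdict

# Hypothetical toy relations (assumptions for illustration only):
# which answer contains each snippet, and which user posted each answer.
snippet_in_answer = {"s1": "a1", "s2": "a2", "s3": "a3"}
answer_by_user = {"a1": "u1", "a2": "u1", "a3": "u2"}


def metapath_counts(snippet_in_answer, answer_by_user):
    """Count path instances of snippet-answer-user-answer-snippet
    between every ordered pair of distinct snippets."""
    # Group snippets by the user who posted the answer containing them.
    snippets_by_user = defaultdict(list)
    for snippet, answer in snippet_in_answer.items():
        snippets_by_user[answer_by_user[answer]].append(snippet)

    # Two snippets reachable through the same user form one path instance.
    counts = defaultdict(int)
    for snippets in snippets_by_user.values():
        for src in snippets:
            for dst in snippets:
                if src != dst:
                    counts[(src, dst)] += 1
    return dict(counts)


related = metapath_counts(snippet_in_answer, answer_by_user)
print(related)
```

In this toy network, only `s1` and `s2` are related under the chosen meta-path, because both appear in answers posted by the same user. Relatedness scores of this kind are one plausible input to the downstream representation learning.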
