12 research outputs found
Forecasting number of vulnerabilities using long short-term neural memory network
Cyber-attacks are launched through the exploitation of some existing vulnerabilities in the software, hardware, system and/or network. Machine learning algorithms can be used to forecast the number of post release vulnerabilities. Traditional neural networks work like a black box approach; hence it is unclear how reasoning is used in utilizing past data points in inferring the subsequent data points. However, the long short-term memory network (LSTM), a variant of the recurrent neural network, is able to address this limitation by introducing a lot of loops in its network to retain and utilize past data points for future calculations. Moving on from the previous finding, we further enhance the results to predict the number of vulnerabilities by developing a time series-based sequential model using a long short-term memory neural network. Specifically, this study developed a supervised machine learning based on the non-linear sequential time series forecasting model with a long short-term memory neural network to predict the number of vulnerabilities for three vendors having the highest number of vulnerabilities published in the national vulnerability database (NVD), namely microsoft, IBM and oracle. Our proposed model outperforms the existing models with a prediction result root mean squared error (RMSE) of as low as 0.072
Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs
Binary code analysis allows analyzing binary code without having access to
the corresponding source code. A binary, after disassembly, is expressed in an
assembly language. This inspires us to approach binary analysis by leveraging
ideas and techniques from Natural Language Processing (NLP), a rich area
focused on processing text of various natural languages. We notice that binary
code analysis and NLP share a lot of analogical topics, such as semantics
extraction, summarization, and classification. This work utilizes these ideas
to address two important code similarity comparison problems. (I) Given a pair
of basic blocks for different instruction set architectures (ISAs), determining
whether their semantics is similar or not; and (II) given a piece of code of
interest, determining if it is contained in another piece of assembly code for
a different ISA. The solutions to these two problems have many applications,
such as cross-architecture vulnerability discovery and code plagiarism
detection. We implement a prototype system INNEREYE and perform a comprehensive
evaluation. A comparison between our approach and existing approaches to
Problem I shows that our system outperforms them in terms of accuracy,
efficiency and scalability. And the case studies utilizing the system
demonstrate that our solution to Problem II is effective. Moreover, this
research showcases how to apply ideas and techniques from NLP to large-scale
binary code analysis.Comment: Accepted by Network and Distributed Systems Security (NDSS) Symposium
201
Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories
The lack of comprehensive sources of accurate vulnerability data represents a
critical obstacle to studying and understanding software vulnerabilities (and
their corrections). In this paper, we present an approach that combines
heuristics stemming from practical experience and machine-learning (ML) -
specifically, natural language processing (NLP) - to address this problem. Our
method consists of three phases. First, an advisory record containing key
information about a vulnerability is extracted from an advisory (expressed in
natural language). Second, using heuristics, a subset of candidate fix commits
is obtained from the source code repository of the affected project by
filtering out commits that are known to be irrelevant for the task at hand.
Finally, for each such candidate commit, our method builds a numerical feature
vector reflecting the characteristics of the commit that are relevant to
predicting its match with the advisory at hand. The feature vectors are then
exploited for building a final ranked list of candidate fixing commits. The
score attributed by the ML model to each feature is kept visible to the users,
allowing them to interpret of the predictions.
We evaluated our approach using a prototype implementation named Prospector
on a manually curated data set that comprises 2,391 known fix commits
corresponding to 1,248 public vulnerability advisories. When considering the
top-10 commits in the ranked results, our implementation could successfully
identify at least one fix commit for up to 84.03% of the vulnerabilities (with
a fix commit on the first position for 65.06% of the vulnerabilities). In
conclusion, our method reduces considerably the effort needed to search OSS
repositories for the commits that fix known vulnerabilities
On the feasibility of automated prediction of bug and non-bug issues
Context
Issue tracking systems are used to track and describe tasks in the development process, e.g., requested feature improvements or reported bugs. However, past research has shown that the reported issue types often do not match the description of the issue.
Objective
We want to understand the overall maturity of the state of the art of issue type prediction with the goal to predict if issues are bugs and evaluate if we can improve existing models by incorporating manually specified knowledge about issues.
Method
We train different models for the title and description of the issue to account for the difference in structure between these fields, e.g., the length. Moreover, we manually detect issues whose description contains a null pointer exception, as these are strong indicators that issues are bugs.
Results
Our approach performs best overall, but not significantly different from an approach from the literature based on the fastText classifier from Facebook AI Research. The small improvements in prediction performance are due to structural information about the issues we used. We found that using information about the content of issues in form of null pointer exceptions is not useful. We demonstrate the usefulness of issue type prediction through the example of labelling bugfixing commits.
Conclusions
Issue type prediction can be a useful tool if the use case allows either for a certain amount of missed bug reports or the prediction of too many issues as bug is acceptable
On the feasibility of automated prediction of bug and non-bug issues
Context: Issue tracking systems are used to track and describe tasks in the
development process, e.g., requested feature improvements or reported bugs.
However, past research has shown that the reported issue types often do not
match the description of the issue.
Objective: We want to understand the overall maturity of the state of the art
of issue type prediction with the goal to predict if issues are bugs and
evaluate if we can improve existing models by incorporating manually specified
knowledge about issues.
Method: We train different models for the title and description of the issue
to account for the difference in structure between these fields, e.g., the
length. Moreover, we manually detect issues whose description contains a null
pointer exception, as these are strong indicators that issues are bugs.
Results: Our approach performs best overall, but not significantly different
from an approach from the literature based on the fastText classifier from
Facebook AI Research. The small improvements in prediction performance are due
to structural information about the issues we used. We found that using
information about the content of issues in form of null pointer exceptions is
not useful. We demonstrate the usefulness of issue type prediction through the
example of labelling bugfixing commits.
Conclusions: Issue type prediction can be a useful tool if the use case
allows either for a certain amount of missed bug reports or the prediction of
too many issues as bug is acceptable