Feature Set Selection for Improved Classification of Static Analysis Alerts
With the extreme growth in third party cloud applications, increased exposure of applications to the internet, and the impact of successful breaches, improving the security of software being produced is imperative. Static analysis tools can alert to quality and security vulnerabilities of an application; however, they present developers and analysts with a high rate of false positives and unactionable alerts. This problem may lead to the loss of confidence in the scanning tools, possibly resulting in the tools not being used. The discontinued use of these tools may increase the likelihood of insecure software being released into production. Insecure software can be successfully attacked resulting in the compromise of one or several information security principles such as confidentiality, availability, and integrity.
Feature selection methods have the potential to improve the classification of static analysis alerts and thereby reduce false positive rates. Thus, the goal of this research effort was to improve the classification of static analysis alerts by proposing and testing a novel method leveraging feature selection. The proposed model was developed and subsequently tested on three open source PHP applications spanning several years. The results were compared to a classification model utilizing all features to gauge the classification improvement of the feature selection model. The presented model improved classification accuracy and reduced the false positive rate while using a reduced feature set.
This work contributes a real-world static analysis dataset based upon three open source PHP applications. It also enhanced an existing dataset generation framework to include additional predictive software features. However, the main contribution is a feature selection methodology that may be used to discover optimal feature sets that increase the classification accuracy of static analysis alerts.
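The comparison described above (a model using all features versus one using a selected subset) can be illustrated with a small sketch. The abstract does not say which selection algorithm or classifier was used, so everything here is an assumption made for illustration: greedy forward selection as the wrapper, a nearest-centroid classifier, and the hypothetical names `centroid_accuracy` and `forward_select`. Labels 0 and 1 stand for "false positive" and "actionable" alerts.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]


def centroid_accuracy(train_X: List[Row], train_y: List[int],
                      val_X: List[Row], val_y: List[int],
                      features: List[int]) -> float:
    """Accuracy of a nearest-centroid classifier restricted to `features`."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
    correct = 0
    for x, y in zip(val_X, val_y):
        def dist(label: int) -> float:
            # Squared Euclidean distance over the chosen features only.
            return sum((x[f] - c) ** 2 for f, c in zip(features, centroids[label]))
        pred = min(sorted(centroids), key=dist)
        correct += pred == y
    return correct / len(val_y)


def forward_select(train_X: List[Row], train_y: List[int],
                   val_X: List[Row], val_y: List[int]) -> Tuple[List[int], float]:
    """Greedily add the feature that most improves validation accuracy."""
    selected: List[int] = []
    best = 0.0
    while True:
        candidates = [
            (centroid_accuracy(train_X, train_y, val_X, val_y, selected + [f]), f)
            for f in range(len(train_X[0])) if f not in selected
        ]
        if not candidates:
            break
        # Prefer higher accuracy; break ties by lower feature index.
        acc, f = max(candidates, key=lambda c: (c[0], -c[1]))
        if acc <= best:
            break  # no remaining feature helps, so stop
        selected.append(f)
        best = acc
    return selected, best


# Toy data (invented): feature 0 is informative, feature 1 is misleading.
train_X = [[0.1, 5.0], [0.2, 3.0], [0.9, -4.0], [0.8, -2.0]]
train_y = [0, 0, 1, 1]
val_X = [[0.15, -3.0], [0.85, 4.0]]
val_y = [0, 1]

all_features_acc = centroid_accuracy(train_X, train_y, val_X, val_y, [0, 1])
selected, selected_acc = forward_select(train_X, train_y, val_X, val_y)
```

On this invented data the all-features model is misled by feature 1, while the selected subset keeps only feature 0. This shows the kind of gain the abstract reports, not its actual results.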
Neural Semantic Parsing for Syntax-Aware Code Generation
The task of mapping natural language expressions to logical forms is referred to as semantic parsing. The syntax of logical forms that are based on programming or query languages, such as Python or SQL, is defined by a formal grammar. In this thesis, we present an efficient neural semantic parser that exploits the underlying grammar of logical forms to enforce well-formed expressions. We use an encoder-decoder model for sequence prediction. Syntactically valid programs are guaranteed by means of a bottom-up shift-reduce parser that keeps track of the set of viable tokens at each decoding step. We show that the proposed model outperforms the standard encoder-decoder model across datasets and is competitive with comparable grammar-guided semantic parsing approaches.
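The idea of restricting each decoding step to a set of viable tokens can be sketched as follows. This is not the thesis's parser: for brevity, a top-down predictive stack stands in for the bottom-up shift-reduce parser, the toy grammar `E -> x | ( E + E )` is invented, and `toy_scores` is a hypothetical stand-in for the decoder's token scores.

```python
from typing import Callable, Dict, List, Set

EOS = "<eos>"


def viable_tokens(stack: List[str]) -> Set[str]:
    """Tokens the toy grammar E -> x | ( E + E ) allows next."""
    if not stack:
        return {EOS}          # a complete expression may only end
    top = stack[-1]
    if top == "E":
        return {"x", "("}     # an expression starts with x or (
    return {top}              # a pending terminal must appear next


def advance(stack: List[str], token: str) -> None:
    """Update the expectation stack after emitting `token`."""
    top = stack.pop()
    if top == "E" and token == "(":
        # Expand E -> ( E + E ); push the remainder in reverse order.
        stack.extend([")", "E", "+", "E"])
    # For E -> x, or a matched terminal, popping is all that is needed.


def constrained_greedy_decode(score_fn: Callable[[List[str]], Dict[str, float]],
                              max_len: int = 20) -> List[str]:
    """Greedy decoding that only considers grammar-viable tokens."""
    stack, output = ["E"], []
    for _ in range(max_len):
        allowed = viable_tokens(stack)
        scores = score_fn(output)
        token = max(sorted(allowed), key=lambda t: scores.get(t, float("-inf")))
        if token == EOS:
            break
        advance(stack, token)
        output.append(token)
    return output


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical scorer that always rates ')' highest, which would
    produce ill-formed output without the grammar constraint."""
    return {"(": 1.0 if not prefix else -1.0, ")": 2.0, "x": 0.5, "+": 0.0, EOS: 0.1}


decoded = constrained_greedy_decode(toy_scores)
```

With this scorer, unconstrained greedy decoding would emit `)` first, whereas the constrained decoder yields `( x + x )`. Output is only guaranteed to be complete if decoding finishes before `max_len` is reached.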
Towards Fine-Grained Localization of Privacy Behaviors
Mobile applications are required to give privacy notices to users when they
collect or share personal information. Creating consistent and concise privacy
notices can be a challenging task for developers. Previous work has attempted
to help developers create privacy notices through a questionnaire or predefined
templates. In this paper, we propose a novel approach and a framework, called
PriGen, that extends this prior work. PriGen uses static analysis to identify
Android applications' code segments that process sensitive information (i.e.
permission-requiring code segments) and then leverages a Neural Machine
Translation model to translate them into privacy captions. We present the
initial evaluation of our translation task for ~300,000 code segments.
Deep learning applied to the assessment of online student programming exercises
Massive online open courses (MOOCs) teaching coding are increasing in number and popularity. They commonly include homework assignments in which the students must write code that is evaluated by
functional tests. Functional testing may to some extent be automated;
however, provision of more qualitative evaluation and feedback may
be prohibitively labor-intensive. Provision of qualitative evaluation at
scale, automatically, is the subject of much research effort.
In this thesis, deep learning is applied to the task of performing
automatic assessment of source code, with a focus on provision of
qualitative feedback. Four tasks are considered in detail: language modeling,
detecting idiomatic code, semantic code search, and predicting variable names.
First, deep learning models are applied to the task of language modeling source code. A comparison is made between the performance of
different deep learning language models, and it is shown how language
models can be used for source code auto-completion. It is also demonstrated how language models trained on source code can be used for
transfer learning, providing improved performance on other tasks.
Next, an analysis is made on how the language models from the
previous task can be used to detect idiomatic code. It is shown that
these language models are able to locate where a student has deviated
from correct code idioms. These locations can be highlighted to the
student in order to provide qualitative feedback.
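As a rough illustration of how a language model can locate deviations from idioms, the sketch below flags tokens that receive low probability given the previous token. It assumes a count-based bigram model in place of the deep learning models used in the thesis. The corpus, the threshold, and the names `train_bigram` and `flag_unlikely` are invented for the example.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List, Tuple

START = "<s>"


def train_bigram(corpus: List[List[str]]) -> DefaultDict[str, Counter]:
    """Count how often each token follows each previous token."""
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for tokens in corpus:
        for prev, cur in zip([START] + tokens, tokens):
            counts[prev][cur] += 1
    return counts


def flag_unlikely(tokens: List[str], counts: DefaultDict[str, Counter],
                  threshold: float = 0.2) -> List[Tuple[int, str]]:
    """Return (position, token) pairs whose bigram probability is below threshold."""
    flagged = []
    for i, (prev, cur) in enumerate(zip([START] + tokens, tokens)):
        following = counts.get(prev, Counter())
        total = sum(following.values())
        prob = following[cur] / total if total else 0.0
        if prob < threshold:
            flagged.append((i, cur))
    return flagged


# Tiny invented corpus of "idiomatic" token sequences.
corpus = [
    ["for", "x", "in", "xs", ":"],
    ["for", "y", "in", "ys", ":"],
    ["for", "x", "in", "xs", ":"],
]
model = train_bigram(corpus)

# A student submission that writes "of" where the idiom uses "in".
flags = flag_unlikely(["for", "x", "of", "xs", ":"], model)
```

Here the model flags `of` at position 2, and also `xs` at position 3 because nothing was ever observed after `of`. A neural language model would assign smoother probabilities, but the highlighting step works the same way.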
Then, results are shown on semantic code search, again comparing
the performance across a variety of deep learning models. It is demonstrated how semantic code search can be used to reduce the time taken
for qualitative evaluation, by automatically pairing a student submission with an instructor’s hand-written feedback.
Finally, it is examined how deep learning can be used to predict
variable names within source code. These models can be used in a
qualitative evaluation setting where the deep learning models can be
used to suggest more appropriate variable names. It is also shown that
these models can even be used to predict the presence of functional
errors.
Novel experimental results show that: fine-tuning a pre-trained
language model is an effective way to improve performance across a
variety of tasks on source code, improving performance by 5% on average; pre-trained language models can be used as zero-shot learners across a variety of tasks, with the zero-shot performance of some architectures outperforming the fine-tuned performance of others; and
that language models can be used to detect both semantic and syntactic errors. Other novel findings include: removing the non-variable
tokens within source code has negligible impact on the performance of
models, and that these remaining tokens can be shuffled with only a
minimal decrease in performance.
Engineering and Physical Sciences Research Council (EPSRC) funding
Use of Graph Neural Networks in Aiding Defensive Cyber Operations
In an increasingly interconnected world, where information is the lifeblood
of modern society, regular cyber-attacks sabotage the confidentiality,
integrity, and availability of digital systems and information. Additionally,
cyber-attacks differ depending on the objective and evolve rapidly to evade
defensive systems. However, a typical cyber-attack demonstrates a series of
stages from attack initiation to final resolution, called an attack life cycle.
These diverse characteristics and the relentless evolution of cyber attacks
have led cyber defense to adopt modern approaches like Machine Learning to
bolster defensive measures and break the attack life cycle. Among the adopted
ML approaches, Graph Neural Networks have emerged as a promising approach for
enhancing the effectiveness of defensive measures due to their ability to
process and learn from heterogeneous cyber threat data. In this paper, we look
into the application of GNNs in helping to break each stage of one of the most
renowned attack life cycles, the Lockheed Martin Cyber Kill Chain. We address
each phase of the CKC and discuss how GNNs contribute to preparing for and
preventing an attack from a defensive standpoint. Furthermore, we discuss open
research areas and scope for further improvement.
Comment: 35 pages, 9 figures, 8 tables