4,815 research outputs found
AI Solutions for MDS: Artificial Intelligence Techniques for Misuse Detection and Localisation in Telecommunication Environments
This report considers the application of Articial Intelligence (AI) techniques to
the problem of misuse detection and misuse localisation within telecommunications
environments. A broad survey of techniques is provided, that covers inter alia
rule based systems, model-based systems, case based reasoning, pattern matching,
clustering and feature extraction, articial neural networks, genetic algorithms, arti
cial immune systems, agent based systems, data mining and a variety of hybrid
approaches. The report then considers the central issue of event correlation, that
is at the heart of many misuse detection and localisation systems. The notion of
being able to infer misuse by the correlation of individual temporally distributed
events within a multiple data stream environment is explored, and a range of techniques,
covering model based approaches, `programmed' AI and machine learning
paradigms. It is found that, in general, correlation is best achieved via rule based approaches,
but that these suffer from a number of drawbacks, such as the difculty of
developing and maintaining an appropriate knowledge base, and the lack of ability
to generalise from known misuses to new unseen misuses. Two distinct approaches
are evident. One attempts to encode knowledge of known misuses, typically within
rules, and use this to screen events. This approach cannot generally detect misuses
for which it has not been programmed, i.e. it is prone to issuing false negatives.
The other attempts to `learn' the features of event patterns that constitute normal
behaviour, and, by observing patterns that do not match expected behaviour, detect
when a misuse has occurred. This approach is prone to issuing false positives,
i.e. inferring misuse from innocent patterns of behaviour that the system was not
trained to recognise. Contemporary approaches are seen to favour hybridisation,
often combining detection or localisation mechanisms for both abnormal and normal
behaviour, the former to capture known cases of misuse, the latter to capture
unknown cases. In some systems, these mechanisms even work together to update
each other to increase detection rates and lower false positive rates. It is concluded
that hybridisation offers the most promising future direction, but that a rule or state
based component is likely to remain, being the most natural approach to the correlation
of complex events. The challenge, then, is to mitigate the weaknesses of
canonical programmed systems such that learning, generalisation and adaptation
are more readily facilitated
CERT strategy to deal with phishing attacks
Every day, internet thieves employ new ways to obtain personal identity
people and get access to their personal information. Phishing is a somehow
complex method that has recently been considered by internet thieves.The
present study aims to explain phishing, and why an organization should deal
with it and its challenges of providing. In addition, different kinds of this
attack and classification of security approaches for organizational and lay
users are addressed in this article. Finally, the CERT strategy is presented to
deal with phishing and studying some anti-phishing
Towards understanding the challenges faced by machine learning software developers and enabling automated solutions
Modern software systems are increasingly including machine learning (ML) as an integral component. However, we do not yet understand the difficulties faced by software developers when learning about ML libraries and using them within their systems. To fill that gap this thesis reports on a detailed (manual) examination of 3,243 highly-rated Q&A posts related to ten ML libraries, namely Tensorflow, Keras, scikitlearn, Weka, Caffe, Theano, MLlib, Torch, Mahout, and H2O, on Stack Overflow, a popular online technical Q&A forum. Our findings reveal the urgent need for software engineering (SE) research in this area. The second part of the thesis particularly focuses on understanding the Deep Neural Network (DNN) bug characteristics. We study 2,716 high-quality posts from Stack Overflow and 500 bug fix commits from Github about five popular deep learning libraries Caffe, Keras, Tensorflow, Theano, and Torch to understand the types of bugs, their root causes and impacts, bug-prone stage of deep learning pipeline as well as whether there are some common antipatterns found in this buggy software. While exploring the bug characteristics, our findings imply that repairing software that uses DNNs is one such unmistakable SE need where automated tools could be beneficial; however, we do not fully understand challenges to repairing and patterns that are utilized when manually repairing DNNs. So, the third part of this thesis presents a comprehensive study of bug fix patterns to address these questions. We have studied 415 repairs from Stack Overflow and 555 repairs from Github for five popular deep learning libraries Caffe, Keras, Tensorflow, Theano, and Torch to understand challenges in repairs and bug repair patterns. Our key findings reveal that DNN bug fix patterns are distinctive compared to traditional bug fix patterns and the most common bug fix patterns are fixing data dimension and neural network connectivity. Finally, we propose an automatic technique to detect ML Application Programming Interface (API) misuses. We started with an empirical study to understand ML API misuses. Our study shows that ML API misuse is prevalent and distinct compared to non-ML API misuses. Inspired by these findings, we contributed Amimla (Api Misuse In Machine Learning Apis) an approach and a tool for ML API misuse detection. Amimla relies on several technical innovations. First, we proposed an abstract representation of ML pipelines to use in misuse detection. Second, we proposed an abstract representation of neural networks for deep learning related APIs. Third, we have developed a representation strategy for constraints on ML APIs. Finally, we have developed a misuse detection strategy for both single and multi-APIs. Our experimental evaluation shows that Amimla achieves a high average accuracy of ∼80% on two benchmarks of misuses from Stack Overflow and Github
Active Learning of Discriminative Subgraph Patterns for API Misuse Detection
A common cause of bugs and vulnerabilities are the violations of usage
constraints associated with Application Programming Interfaces (APIs). API
misuses are common in software projects, and while there have been techniques
proposed to detect such misuses, studies have shown that they fail to reliably
detect misuses while reporting many false positives. One limitation of prior
work is the inability to reliably identify correct patterns of usage. Many
approaches confuse a usage pattern's frequency for correctness. Due to the
variety of alternative usage patterns that may be uncommon but correct, anomaly
detection-based techniques have limited success in identifying misuses. We
address these challenges and propose ALP (Actively Learned Patterns),
reformulating API misuse detection as a classification problem. After
representing programs as graphs, ALP mines discriminative subgraphs. While
still incorporating frequency information, through limited human supervision,
we reduce the reliance on the assumption relating frequency and correctness.
The principles of active learning are incorporated to shift human attention
away from the most frequent patterns. Instead, ALP samples informative and
representative examples while minimizing labeling effort. In our empirical
evaluation, ALP substantially outperforms prior approaches on both MUBench, an
API Misuse benchmark, and a new dataset that we constructed from real-world
software projects
Evaluating Pre-trained Language Models for Repairing API Misuses
API misuses often lead to software bugs, crashes, and vulnerabilities. While
several API misuse detectors have been proposed, there are no automatic repair
tools specifically designed for this purpose. In a recent study,
test-suite-based automatic program repair (APR) tools were found to be
ineffective in repairing API misuses. Still, since the study focused on
non-learning-aided APR tools, it remains unknown whether learning-aided APR
tools are capable of fixing API misuses. In recent years, pre-trained language
models (PLMs) have succeeded greatly in many natural language processing tasks.
There is a rising interest in applying PLMs to APR. However, there has not been
any study that investigates the effectiveness of PLMs in repairing API misuse.
To fill this gap, we conduct a comprehensive empirical study on 11
learning-aided APR tools, which include 9 of the state-of-the-art
general-purpose PLMs and two APR tools. We evaluate these models with an
API-misuse repair dataset, consisting of two variants. Our results show that
PLMs perform better than the studied APR tools in repairing API misuses. Among
the 9 pre-trained models tested, CodeT5 is the best performer in the exact
match. We also offer insights and potential exploration directions for future
research.Comment: Under review by TOSE
Identifying and analyzing pointer misuses for sophisticated memory-corruption exploit diagnosis
Software exploits are one of the major threats to internet security. To quickly respond to these attacks, it is critical to automatically diagnose such exploits and find out how they circumvent existing defense mechanisms
Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks
Predicting the number of clock cycles a processor takes to execute a block of
assembly instructions in steady state (the throughput) is important for both
compiler designers and performance engineers. Building an analytical model to
do so is especially complicated in modern x86-64 Complex Instruction Set
Computer (CISC) machines with sophisticated processor microarchitectures in
that it is tedious, error prone, and must be performed from scratch for each
processor generation. In this paper we present Ithemal, the first tool which
learns to predict the throughput of a set of instructions. Ithemal uses a
hierarchical LSTM--based approach to predict throughput based on the opcodes
and operands of instructions in a basic block. We show that Ithemal is more
accurate than state-of-the-art hand-written tools currently used in compiler
backends and static machine code analyzers. In particular, our model has less
than half the error of state-of-the-art analytical models (LLVM's llvm-mca and
Intel's IACA). Ithemal is also able to predict these throughput values just as
fast as the aforementioned tools, and is easily ported across a variety of
processor microarchitectures with minimal developer effort.Comment: Published at 36th International Conference on Machine Learning (ICML)
201
- …