Intelligent Agents for Active Malware Analysis
The main contribution of this thesis is a novel perspective on Active Malware Analysis, modeled as a decision-making process between intelligent agents. We propose solutions aimed at extracting the behaviors of malware agents with advanced Artificial Intelligence techniques. In particular, we devise novel action selection strategies for the analyzer agents, which analyze malware by selecting sequences of triggering actions aimed at maximizing the information acquired. The goal is to create informative models representing the behaviors of the malware agents observed while interacting with them during the analysis process. Such models can then be used to effectively compare a malware against others and to correctly identify the malware famil
A Bayesian Model Combination-based approach to Active Malware Analysis
Active Malware Analysis involves modeling malware behavior by executing
actions to trigger responses and explore multiple execution paths. One of the
aims is making the action selection more efficient. This paper treats Active
Malware Analysis as a Bayes-Active Markov Decision Process and uses a Bayesian
Model Combination approach to train an analyzer agent. We show an improvement
in performance against other Bayesian and stochastic approaches to Active
Malware Analysis
Bayesian Active Malware Analysis
We propose a novel technique for Active Malware Analysis (AMA) formalized as a Bayesian game between an analyzer agent and a malware agent, focusing on the decision making strategy for the analyzer. In our model, the analyzer performs an action on the system to trigger the malware into showing a malicious behavior, i.e., by activating its payload. The formalization is built upon the link between malware families and the notion of types in Bayesian games. A key point is the design of the utility function, which reflects the amount of uncertainty on the type of the adversary after the execution of an analyzer action. This allows us to devise an algorithm to play the game with the aim of minimizing the entropy of the analyzer's belief at every stage of the game in a myopic fashion. Empirical evaluation indicates that our approach results in a significant improvement both in terms of learning speed and classification score when compared to other state-of-the-art AMA techniques
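The myopic entropy-minimizing action selection described above can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the observation model `likelihood[family][action][observation]`, the family names, and the action names are all hypothetical, and the sketch scores each action by the expected entropy of the posterior belief over malware families.

```python
import math

def entropy(belief):
    """Shannon entropy (in bits) of a belief {family: probability}."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def posterior(belief, likelihood, action, obs):
    """Bayes update of the belief after seeing `obs` in response to `action`."""
    unnorm = {t: belief[t] * likelihood[t][action].get(obs, 0.0) for t in belief}
    z = sum(unnorm.values())
    if z == 0:
        # Observation impossible under every family: keep the prior (assumption).
        return dict(belief)
    return {t: v / z for t, v in unnorm.items()}

def expected_entropy(belief, likelihood, action):
    """Expected posterior entropy if the analyzer plays `action`."""
    observations = {o for t in belief for o in likelihood[t][action]}
    total = 0.0
    for obs in observations:
        p_obs = sum(belief[t] * likelihood[t][action].get(obs, 0.0) for t in belief)
        if p_obs > 0:
            total += p_obs * entropy(posterior(belief, likelihood, action, obs))
    return total

def choose_action(belief, likelihood, actions):
    """Myopic choice: the action with the lowest expected posterior entropy."""
    return min(actions, key=lambda a: expected_entropy(belief, likelihood, a))

# Hypothetical two-family example.
belief = {"family_A": 0.5, "family_B": 0.5}
likelihood = {
    "family_A": {"send_sms": {"leak": 0.9, "idle": 0.1},
                 "open_browser": {"leak": 0.5, "idle": 0.5}},
    "family_B": {"send_sms": {"leak": 0.1, "idle": 0.9},
                 "open_browser": {"leak": 0.5, "idle": 0.5}},
}
best = choose_action(belief, likelihood, ["send_sms", "open_browser"])
```

In this toy setting `open_browser` triggers the same response from both families, so it leaves the belief unchanged, while `send_sms` separates them and is therefore preferred.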
Active Learning of Points-To Specifications
When analyzing programs, large libraries pose significant challenges to
static points-to analysis. A popular solution is to have a human analyst
provide points-to specifications that summarize relevant behaviors of library
code, which can substantially improve precision and handle missing code such as
native code. We propose ATLAS, a tool that automatically infers points-to
specifications. ATLAS synthesizes unit tests that exercise the library code,
and then infers points-to specifications based on observations from these
executions. ATLAS automatically infers specifications for the Java standard
library, and produces better results for a client static information flow
analysis on a benchmark of 46 Android apps compared to using existing
handwritten specifications
Lookahead Pathology in Monte-Carlo Tree Search
Monte-Carlo Tree Search (MCTS) is an adversarial search paradigm that first
found prominence with its success in the domain of computer Go. Early
theoretical work established the game-theoretic soundness and convergence
bounds for Upper Confidence bounds applied to Trees (UCT), the most popular
instantiation of MCTS; however, there remain notable gaps in our understanding
of how UCT behaves in practice. In this work, we address one such gap by
considering the question of whether UCT can exhibit lookahead pathology -- a
paradoxical phenomenon first observed in Minimax search where greater search
effort leads to worse decision-making. We introduce a novel family of synthetic
games that offer rich modeling possibilities while remaining amenable to
mathematical analysis. Our theoretical and experimental results suggest that
UCT is indeed susceptible to pathological behavior in a range of games drawn
from this family
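For readers unfamiliar with UCT, its child-selection step is commonly based on the UCB1 rule, sketched below. This is a generic illustration and not code from the paper: the function names, the `children` layout (name mapped to total reward and visit count), and the default exploration constant are assumptions.

```python
import math

def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    mean = total_reward / visits
    return mean + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits, c=math.sqrt(2)):
    """Pick the child name with the highest UCB1 score.

    `children` maps a child name to a (total_reward, visits) pair.
    """
    return max(children, key=lambda name: ucb1(*children[name], parent_visits, c))

# Hypothetical statistics at one tree node.
children = {"a": (6.0, 10), "b": (1.0, 2)}
explore = select_child(children, parent_visits=12)        # bonus favors the rarely visited child
exploit = select_child(children, parent_visits=12, c=0.0)  # pure mean-reward comparison
```

With the exploration term switched off (`c=0.0`) the rule reduces to picking the child with the best average reward; larger `c` shifts effort toward less-visited children.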
Beyond Random Split for Assessing Statistical Model Performance
Even though a random train/test split of the dataset is common practice, it
may not always be the best approach for estimating generalization performance
in some scenarios. In fact, the usual machine
learning methodology can sometimes overestimate the generalization error when a
dataset is not representative or when rare and elusive examples are a
fundamental aspect of the detection problem. In the present work, we analyze
strategies based on the predictors' variability to split the data into training
and testing sets. Such strategies aim at guaranteeing the inclusion of rare or
unusual examples with a minimal loss of the population's representativeness and
provide a more accurate estimation about the generalization error when the
dataset is not representative. Two baseline classifiers based on decision trees
were used for testing the four splitting strategies considered. Both
classifiers were applied to CTU19, a low-representative dataset for a network
security detection problem. Preliminary results showed the importance of
applying the three alternatives to the Monte Carlo splitting strategy
in order to get a more accurate error estimation on different but feasible
scenarios
Agent Behavioral Analysis Based on Absorbing Markov Chains
We propose a novel technique to identify known behaviors of intelligent agents acting within uncertain environments. We employ Markov chains to represent the observed behavioral models of the agents and we formulate the problem as a classification task. In particular, we propose to use the long-term transition probability values of moving between states of the Markov chain as features. Additionally, we transform our models into absorbing Markov chains, enabling the use of standard techniques to compute such features. The empirical evaluation considers two scenarios: the identification of given strategies in classical games, and the detection of malicious behaviors in malware analysis. Results show that our approach can provide informative features to successfully identify known behavioral patterns. In more detail, we show that focusing on the long-term transition probability makes it possible to reduce the error introduced by noisy states and transitions that may be present in an observed behavioral model. We pay particular attention to the case of noise that may be intentionally introduced by a target agent to deceive an observer agent
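One standard technique for absorbing Markov chains is to compute long-run absorption probabilities as B = (I - Q)^-1 R, where Q holds transient-to-transient and R holds transient-to-absorbing transition probabilities. The sketch below illustrates that textbook computation in plain Python; it is not necessarily the exact feature construction used in the paper, and the function name and example chain are hypothetical.

```python
def absorption_probabilities(Q, R):
    """Return B = (I - Q)^-1 R via Gauss-Jordan elimination.

    Q: n x n list of lists, transient -> transient probabilities.
    R: n x m list of lists, transient -> absorbing probabilities.
    B[i][k] is the probability of ending in absorbing state k
    when starting from transient state i.
    """
    n = len(Q)
    # Augmented matrix [I - Q | R].
    aug = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(n)] + list(R[i])
           for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("I - Q is singular; the chain is not absorbing")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    # The left block is now the identity; the right block is B.
    return [row[n:] for row in aug]

# Hypothetical chain: two transient states between two absorbing states,
# moving one step left or right with probability 0.5 each.
Q = [[0.0, 0.5],
     [0.5, 0.0]]
R = [[0.5, 0.0],
     [0.0, 0.5]]
B = absorption_probabilities(Q, R)
```

Each row of `B` sums to 1 and could serve as a feature vector for the corresponding starting state; in this example the first transient state is absorbed on its own side with probability 2/3.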
Latent Representation and Sampling in Network: Application in Text Mining and Biology.
In classical machine learning, hand-designed features are used for learning a mapping from raw data. However, human involvement in feature design makes the process expensive. Representation learning aims to learn abstract features directly from data without direct human involvement. Raw data can take various forms. Networks are one form of data that encodes relational structure in many real-world domains. Therefore, learning abstract features for network units is an important task. In this dissertation, we propose models for incorporating temporal information given as a collection of networks from subsequent time-stamps. The primary objective of our models is to learn a better abstract feature representation of nodes and edges in an evolving network. We show that the temporal information in the abstract feature improves the performance of the link prediction task substantially. Besides applying our models to network data, we also employ them to incorporate extra-sentential information in the text domain for learning better representations of sentences. We build a context network of sentences to capture extra-sentential information. Including this information in the abstract feature representation of sentences improves various text-mining tasks substantially over a set of baseline methods. A problem with the abstract features that we learn is that they lack interpretability. In real-life applications on network data, for some tasks, it is crucial to learn interpretable features in the form of graphical structures. For this we need to mine important graphical structures along with their frequency statistics from the input dataset. However, exact algorithms for these tasks are computationally expensive, so scalable algorithms are urgently needed. To overcome this challenge, we provide efficient sampling algorithms for mining higher-order structures from network(s). We show that our sampling-based algorithms are scalable. They are also superior to a set of baseline algorithms in terms of retrieving important graphical sub-structures and collecting their frequency statistics. Finally, we show that we can use these frequent subgraph statistics and structures as features in various real-life applications. We show one application in biology and another in security. In both cases, we show that the structures and their statistics significantly improve the performance of knowledge discovery tasks in these domains