42 research outputs found

    Efficient Identification of Timed Automata: Theory and practice

    No full text
    This thesis contains a study in a subfield of artificial intelligence, learning theory, machine learning, and statistics, known as system (or language) identification. System identification is concerned with constructing (mathematical) models from observations. Such a model is an intuitive description of a complex system. One of the main advantages of models is that they can be visualized and inspected in order to provide insight into the different behaviors of a system. In addition, they can be used to perform different calculations, such as making predictions, analyzing properties, diagnosing errors, performing simulations, and many more. Models are therefore extremely useful tools for understanding, interpreting, and modifying different kinds of systems. Unfortunately, it can be very difficult to construct a model by hand. This thesis investigates the difficulty of automatically identifying models from observations. Observations of some process and its environment are given. These observations form sequences of events. Using system identification, we try to discover the logical structure underlying these event sequences. A well-known model of such a logical structure is the deterministic finite state automaton (DFA). A DFA is a language model. Hence, its identification (or inference) problem has been well studied in the grammatical inference field. Knowing this, we want to take an established method to learn a DFA and apply it to our event sequences. However, when observing a system there is often more information than just the sequence of symbols (events): the time at which these symbols occur is also available. A DFA can be used to model this time information implicitly. A disadvantage of such an approach is that it can result in an exponential blowup of both the input data and the resulting size of the model. In this thesis, we propose a different method that uses the time information directly in order to produce a timed model.
    We use a well-known DFA variant that includes the notion of time, called the timed automaton (TA). TAs are commonly used to model and reason about real-time systems. A TA models the timed information explicitly, i.e., using numbers. Because numbers are written in binary, such an explicit representation can result in exponentially more compact models than an implicit representation. Therefore, the time, space, and data required to identify TAs can also be exponentially smaller than the time, space, and data required to identify DFAs. This efficiency argument is the main reason we are interested in identifying TAs. The work in this thesis makes four major contributions to the state of the art on this topic: 1. It contains a thorough theoretical study of the complexity of identifying TAs from data. 2. It provides an algorithm for identifying a simple TA from labeled data, i.e., from event sequences for which it is known to which type of system behavior they belong. 3. It extends this algorithm to the setting of unlabeled data, i.e., from event sequences with unknown behaviors. 4. It shows how to apply this algorithm to the problem of identifying a real-time monitoring system. These contributions are of importance for anyone who is interested in identifying timed systems. Most importantly, both in our theoretical work and in our experiments we show that identifying a TA by using the time information directly is more efficient than identifying an equivalent DFA. In addition, our techniques can be applied to many interesting problems due to their generality. Examples are gaining insight into a real-time process, recognizing different process behaviors, identifying process models, and analyzing black-box systems.
    Software Technology; Electrical Engineering, Mathematics and Computer Science
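    The compactness argument in the abstract above can be illustrated with a small sketch. It assumes one particular implicit encoding, in which a DFA counts unit-time "tick" symbols with a chain of states. This encoding and the function names are illustrative assumptions, not the construction used in the thesis.

```python
def dfa_states_for_delay(max_delay: int) -> int:
    """States in a tick-counting DFA chain that tracks delays 0..max_delay.

    Assumed encoding: one state per elapsed tick, so the chain grows
    linearly in the value of the bound.
    """
    return max_delay + 1


def guard_bits_for_delay(max_delay: int) -> int:
    """Bits needed to write max_delay as a binary number in a TA guard."""
    return max(1, max_delay.bit_length())


if __name__ == "__main__":
    for max_delay in (7, 255, 65535):
        print(
            f"bound {max_delay}: "
            f"{dfa_states_for_delay(max_delay)} DFA states vs "
            f"{guard_bits_for_delay(max_delay)} guard bits"
        )
```

    Under this assumed encoding the DFA grows with the value of the time bound, while the guard grows only with the number of digits, which is the exponential gap the abstract refers to.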

    SECLEDS: Sequence Clustering in Evolving Data Streams via Multiple Medoids and Medoid Voting

    No full text
    Sequence clustering in a streaming environment is challenging because it is computationally expensive, and the sequences may evolve over time. K-medoids or Partitioning Around Medoids (PAM) is commonly used to cluster sequences since it supports alignment-based distances, and the k-centers being actual data items helps with cluster interpretability. However, offline k-medoids has no support for concept drift, while also being prohibitively expensive for clustering data streams. We therefore propose SECLEDS, a streaming variant of the k-medoids algorithm with a constant memory footprint. SECLEDS has two unique properties: i) it uses multiple medoids per cluster, producing stable high-quality clusters, and ii) it handles concept drift using an intuitive Medoid Voting scheme for approximating cluster distances. Unlike existing adaptive algorithms that create new clusters for new concepts, SECLEDS follows a fundamentally different approach, where the clusters themselves evolve with an evolving stream. Using real and synthetic datasets, we empirically demonstrate that SECLEDS produces high-quality clusters regardless of drift, stream size, data dimensionality, and number of clusters. We compare against three popular stream and batch clustering algorithms. The state-of-the-art BanditPAM is used as an offline benchmark. SECLEDS achieves an F1 score comparable to BanditPAM while reducing the number of required distance computations by 83.7%. Importantly, SECLEDS outperforms all baselines by 138.7% when the stream contains drift. We also cluster real network traffic, and provide evidence that SECLEDS can support network bandwidths of up to 1.08 Gbps while using the (expensive) dynamic time warping distance.
    Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.
    Cyber Security
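    For readers unfamiliar with the building blocks named in the abstract above, the following is a minimal sketch of the textbook dynamic time warping (DTW) distance and of picking a medoid, i.e., the data item with the smallest total distance to all others. It is not the SECLEDS algorithm: there is no streaming, no multiple medoids per cluster, and no Medoid Voting. The function names are hypothetical.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Textbook O(len(a) * len(b)) DTW with absolute-difference local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also aligned with an earlier b element
                cost[i][j - 1],      # b[j-1] also aligned with an earlier a element
                cost[i - 1][j - 1],  # advance in both sequences
            )
    return cost[n][m]


def medoid(sequences: List[Sequence[float]]) -> Sequence[float]:
    """Return the sequence with the smallest summed DTW distance to the rest."""
    return min(
        sequences,
        key=lambda s: sum(dtw_distance(s, other) for other in sequences),
    )


if __name__ == "__main__":
    data = [[1, 2, 3], [1, 2, 2, 3], [9, 9, 9]]
    print(dtw_distance(data[0], data[1]))
    print(medoid(data))
```

    Because every medoid candidate requires a full pass of pairwise distances, this offline version illustrates why the abstract calls k-medoids expensive for streams.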

    Adversarially Robust Decision Tree Relabeling

    No full text
    Decision trees are popular models for their interpretability and their success in ensemble models for structured data. However, common decision tree learning algorithms produce models that suffer from adversarial examples. Recent work on robust decision tree learning mitigates this issue by taking adversarial perturbations into account during training. While these methods generate robust shallow trees, their relative quality decreases when training deeper trees because the methods are greedy. In this work we propose robust relabeling, a post-learning procedure that optimally changes the prediction labels of decision tree leaves to maximize adversarial robustness. We show this can be achieved in polynomial time in terms of the number of samples and leaves. Our results on 10 datasets show a significant improvement in adversarial accuracy both for single decision trees and tree ensembles. Decision trees and random forests trained with a state-of-the-art robust learning algorithm also benefited from robust relabeling.
    Cyber Security

    Learning Decision Trees with Flexible Constraints and Objectives Using Integer Optimization

    No full text
    We encode the problem of learning the optimal decision tree of a given depth as an integer optimization problem. We show experimentally that our method (DTIP) can be used to learn good trees up to depth 5 from data sets of size up to 1000. In addition to being efficient, our new formulation allows for a lot of flexibility. Experiments show that we can use the trees learned by any existing decision tree algorithm as starting solutions and improve them using DTIP. Moreover, the proposed formulation allows us to easily create decision trees with optimization objectives other than accuracy and error, and constraints can be added explicitly during the tree construction phase. We show how this flexibility can be used to learn discrimination-aware classification trees, to improve learning from imbalanced data, and to learn trees that minimise false positive/negative errors.
    Accepted author manuscript; Cyber Security

    Efficient Training of Robust Decision Trees Against Adversarial Examples

    No full text
    Recently it has been shown that many machine learning models are vulnerable to adversarial examples: perturbed samples that trick the model into misclassifying them. Neural networks have received much attention, but decision trees and their ensembles achieve state-of-the-art results on tabular data, motivating research on their robustness. Recently the first methods have been proposed to train decision trees and their ensembles robustly [4, 3, 2, 1], but the state-of-the-art methods are expensive to run.
    Cyber Security

    Optimal Decision Tree Policies for Markov Decision Processes

    No full text
    Interpretability of reinforcement learning policies is essential for many real-world tasks, but learning such interpretable policies is a hard problem. In particular, rule-based policies such as decision trees and rule lists are difficult to optimize due to their non-differentiability. While existing techniques can learn verifiable decision tree policies, there is no guarantee that the learners generate a policy that performs optimally. In this work, we study the optimization of size-limited decision trees for Markov Decision Processes (MDPs) and propose OMDTs: Optimal MDP Decision Trees. Given a user-defined size limit and MDP formulation, OMDT directly maximizes the expected discounted return for the decision tree using Mixed-Integer Linear Programming. By training optimal tree policies for different MDPs we empirically study the optimality gap for existing imitation learning techniques and find that they perform sub-optimally. We show that this is due to an inherent shortcoming of imitation learning, namely that complex policies cannot be represented using size-limited trees. In such cases, it is better to directly optimize the tree for expected return. While there is generally a trade-off between the performance and interpretability of machine learning models, we find that on small MDPs, depth 3 OMDTs often perform close to optimally.
    Cyber Security

    Vulnerability Detection on Mobile Applications Using State Machine Inference

    No full text
    Although the importance of mobile applications grows every day, recent vulnerability reports indicate that many applications fail to meet modern security standards. Testing strategies alleviate the problem by identifying security violations in software implementations. This paper proposes a novel testing methodology that applies state machine learning to mobile Android applications in combination with algorithms that discover attack paths in the learned state machine. The presence of an attack path is evidence of a vulnerability in the mobile application. We apply our methods to real-life apps and show that the novel methodology is capable of identifying vulnerabilities.
    Accepted author manuscript; Cyber Security

    Hybrid connection and host clustering for community detection in spatial-temporal network data

    No full text
    Network data clustering and sequential data mining are large fields of research, but how to combine them to analyze spatial-temporal network data remains a technical challenge. This study investigates a novel combination of two sequential similarity methods (Dynamic Time Warping and N-grams with Cosine distances), with two state-of-the-art unsupervised network clustering algorithms (Hierarchical Density-based Clustering and Stochastic Block Models). A popular way to combine such methods is to first cluster the sequential network data, resulting in connection types. The hosts in the network can then be clustered conditioned on these types. In contrast, our approach clusters nodes and edges in one go, i.e., without giving the output of a first clustering step as input for a second step. We achieve this by implementing sequential distances as covariates for host clustering. While being fully unsupervised, our method outperforms many existing approaches. To the best of our knowledge, the only approaches with comparable performance require manual filtering of connections and feature engineering steps. In contrast, our method is applied to raw network traffic. We apply our pipeline to the problem of detecting infected hosts (network nodes) from logs of unlabelled network traffic (sequential data). On data from the Stratosphere IPS project (CTU-Malware-Capture-Botnet-91), which includes malicious (Conficker botnet) as well as benign hosts, we show that our method perfectly detects peripheral, benign, and malicious hosts in different clusters. We replicate our results in the well-known ISOT dataset (Storm, Waledac, Zeus botnets) with comparable performance: conjointly, 99.97% of nodes were categorized correctly.
    Cyber Security
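    One of the two sequential similarity methods named in the abstract above, N-grams with cosine distance, can be sketched as follows. This is a generic illustration under assumed choices (overlapping n-grams, raw counts as vector entries, distance defined as one minus cosine similarity); the paper's exact feature construction is not given in the abstract, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def ngram_counts(seq: Sequence[Hashable], n: int = 3) -> Counter:
    """Count overlapping n-grams of a sequence (empty if len(seq) < n)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def cosine_distance(c1: Counter, c2: Counter) -> float:
    """1 - cosine similarity of two n-gram count vectors.

    Returns 1.0 when either vector is empty (assumed convention).
    """
    dot = sum(c1[g] * c2[g] for g in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 1.0
    return 1.0 - dot / (norm1 * norm2)


if __name__ == "__main__":
    # Toy symbol sequences standing in for discretized connection behavior.
    x = ngram_counts("abcabcabc", n=3)
    y = ngram_counts("abcabcabd", n=3)
    z = ngram_counts("xyzxyzxyz", n=3)
    print(cosine_distance(x, y), cosine_distance(x, z))
```

    Unlike dynamic time warping, this distance ignores where in the sequence a pattern occurs, which makes it cheap but order-insensitive beyond the n-gram window.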

    An algorithm for learning real-time automata

    No full text
    We describe an algorithm for learning simple timed automata, known as real-time automata. The transitions of real-time automata can have a temporal constraint on the time of occurrence of the current symbol relative to the previous symbol. The learning algorithm is similar to the red-blue fringe state-merging algorithm for the problem of learning deterministic finite automata. In addition to state merges, our algorithm can perform state splits by making use of the time values in the input data. We tested our learning algorithm on randomly generated problems. The results are promising and show that learning a real-time automaton directly from timed data outperforms a method that uses sampling in order to deal with the timed data.
    Software Computer Technology; Electrical Engineering, Mathematics and Computer Science
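    The model described in the abstract above, transitions guarded by the delay since the previous symbol, can be sketched as a small acceptor. This is only an illustration of the data structure, not the learning algorithm: it assumes integer delays, closed intervals, and non-overlapping guards per (state, symbol) pair. The class name and the example automaton are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# A guarded transition: (min_delay, max_delay, target_state), bounds inclusive.
Guarded = Tuple[int, int, str]


class RealTimeAutomaton:
    """Deterministic automaton whose transitions test the delay since the last symbol."""

    def __init__(
        self,
        start: str,
        accepting: Set[str],
        transitions: Dict[Tuple[str, str], List[Guarded]],
    ) -> None:
        self.start = start
        self.accepting = accepting
        self.transitions = transitions

    def accepts(self, timed_word: Iterable[Tuple[str, int]]) -> bool:
        """Run over (symbol, delay) pairs; reject if no guard matches."""
        state = self.start
        for symbol, delay in timed_word:
            for lo, hi, target in self.transitions.get((state, symbol), []):
                if lo <= delay <= hi:
                    state = target
                    break
            else:
                return False  # no transition admits this symbol at this delay
        return state in self.accepting


# Hypothetical example: 'b' must follow 'a' within 5 time units to be accepted;
# a slower 'b' resets the automaton to its start state.
rta = RealTimeAutomaton(
    start="q0",
    accepting={"q2"},
    transitions={
        ("q0", "a"): [(0, 10, "q1")],
        ("q1", "b"): [(0, 5, "q2"), (6, 100, "q0")],
    },
)

if __name__ == "__main__":
    print(rta.accepts([("a", 3), ("b", 2)]))
    print(rta.accepts([("a", 3), ("b", 50)]))
```

    The two guards on ("q1", "b") show how the same symbol can lead to different states depending on timing; a state split along a time value, as mentioned in the abstract, would produce guards of this kind.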

    Timed Automata for Behavioral Pattern Recognition

    No full text
    We argue that timed models are a suitable framework for the detection of behavior in real-world event systems. A timed model which detects behavior is constructible by a domain expert. The inference of these timed models from data is a hard problem. We prove the inference of a class of timed automata (event recording automata) to be harder than the inference of finite automata.
    Software Computer Technology; Electrical Engineering, Mathematics and Computer Science