11 research outputs found

    Learning probability distributions generated by finite-state machines

    We review methods for inference of probability distributions generated by probabilistic automata and related models for sequence generation. We focus on methods that can be proved to learn in the inference-in-the-limit and PAC formal models. The methods we review are state-merging and state-splitting methods for probabilistic deterministic automata and the recently developed spectral method for nondeterministic probabilistic automata. In both cases, we derive them from a high-level algorithm described in terms of the Hankel matrix of the distribution to be learned, given as an oracle, and then describe how to adapt that algorithm to account for the error introduced by a finite sample. Peer reviewed. Postprint (author's final draft).
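As a rough illustration of the Hankel matrix mentioned in this abstract, the sketch below estimates a finite block H[p][s] = P(p + s) from a sample of strings, which is the finite-sample stand-in for the oracle. The function name `empirical_hankel`, the choice of prefixes and suffixes (all strings up to `max_len` symbols), and the toy sample are assumptions made for illustration, not taken from the paper.

```python
from collections import Counter
from itertools import product

def empirical_hankel(sample, alphabet, max_len):
    """Estimate the Hankel block H[p][s] = P(p + s) from a finite sample.

    Rows are indexed by prefixes p and columns by suffixes s; both range over
    all strings of at most max_len symbols (an illustrative choice).
    """
    counts = Counter(sample)
    total = len(sample)
    words = [""]
    for n in range(1, max_len + 1):
        words += ["".join(t) for t in product(alphabet, repeat=n)]
    # Relative frequency of the concatenation p + s approximates P(p + s).
    matrix = [[counts[p + s] / total for s in words] for p in words]
    return words, matrix

# Hypothetical sample drawn from some unknown distribution over {a, b}*.
sample = ["a", "ab", "a", "", "ab", "b", "a", "ab"]
words, H = empirical_hankel(sample, "ab", 1)
for p, row in zip(words, H):
    print(repr(p), row)
```

Entries that correspond to the same concatenated string (for example row "a", column "" and row "", column "a") are equal by construction, which is the defining property of a Hankel matrix over strings.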

    Off-line compression by greedy textual substitution


    Data Structures and Algorithms for the String Statistics Problem


    Effizientes Maschinelles Lernen für die Angriffserkennung (Efficient Machine Learning for Attack Detection)

    Detecting and fending off attacks on computer systems is an enduring problem in computer security. In light of a plethora of different threats and the growing automation used by attackers, we are in urgent need of more advanced methods for attack detection. In this thesis, we address this need and develop methods that detect attacks using machine learning, establishing a higher degree of automation for reactive security. Machine learning is data-driven and not void of bias. For machine learning to be applied effectively to attack detection, periodic retraining over time is therefore crucial. However, the training complexity of many learning-based approaches is substantial. We show that with the right data representation, efficient algorithms for mining substring statistics, and implementations based on probabilistic data structures, training the underlying model can be achieved in linear time. In two different scenarios, we demonstrate the effectiveness of so-called language models, which allow the content and structure of attacks to be portrayed generically: on the one hand, we learn malicious behavior of Flash-based malware using classification, and on the other hand, we detect intrusions by learning normality in industrial control networks using anomaly detection. With a data throughput of up to 580 Mbit/s during training, we not only meet our expectations with respect to runtime but also outperform related approaches by up to an order of magnitude in detection performance. The same techniques that facilitate learning in the previous scenarios can also be used for revealing malicious content embedded in passive file formats, such as Microsoft Office documents. As a further showcase, we additionally develop a method based on the efficient mining of substring statistics that is able to break obfuscations irrespective of the key length used, at up to 25 Mbit/s, and thus succeeds where related approaches fail.
These methods significantly improve detection performance and enable operation in linear time. In doing so, we counteract the trend of compensating increasing runtime requirements with resources. While the results are promising and the approaches provide urgently needed automation, they cannot and are not intended to replace human experts or traditional approaches, but are designed to assist and complement them.
German abstract (translated): Detecting and defending against attacks on end users and networks has been a persistent problem in computer security for many years. Given the large number of different attack vectors and the increasing automation of attacks, modern methods for attack detection are urgently needed. This doctoral thesis develops approaches that use machine learning to detect attacks reliably but also efficiently. They counter the automation of attacks with a correspondingly high degree of automation in defensive measures. Training such methods, however, is computationally expensive and takes place on very large amounts of data. Runtime-efficient learning methods are therefore crucial. We show that, by using efficient algorithms for the statistical analysis of strings and implementations based on probabilistic data structures, effective attack detection can be learned in linear time. Using two different use cases, we demonstrate the effectiveness of models based on the extraction of so-called n-grams: on the one hand, we consider the detection of Flash-based malware by means of classification, and on the other hand, the detection of attacks on industrial networks and SCADA systems by means of anomaly detection. While training these models, we achieve a data throughput of up to 580 Mbit/s and at the same time clearly outperform other approaches in detection performance. The same techniques that enable these learning approaches can also be used to detect malicious code that is embedded in other file formats and obfuscated with simple encryption. To this end, we develop a method that breaks simple encryption based on the statistical evaluation of strings. The approach works independently of the key length used, with a data throughput of up to 25 Mbit/s, and thus enables successful deobfuscation in cases where other approaches fail. The results achieved in terms of runtime efficiency and detection performance are promising. The methods presented enable the urgently needed automation of defensive measures, but are meant to support and complement experts and established methods rather than replace them.
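To make the idea of learning normality from n-grams with a probabilistic data structure more concrete, here is a minimal sketch that stores byte n-grams of benign traffic in a Bloom filter and scores new data by the fraction of n-grams never seen in training. The abstract does not name the data structure; the Bloom filter, the class and function names, the parameters, and the sample payloads are all assumptions for illustration, not the thesis's implementation.

```python
import hashlib

class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several hash functions by salting BLAKE2b with a seed byte.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item, digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def byte_ngrams(data, n=3):
    """Yield all overlapping byte n-grams of data."""
    return (data[i:i + n] for i in range(len(data) - n + 1))

def train(samples, n=3):
    """One pass over the data: linear in the total number of bytes for fixed n."""
    model = BloomFilter()
    for sample in samples:
        for gram in byte_ngrams(sample, n):
            model.add(gram)
    return model

def anomaly_score(model, data, n=3):
    """Fraction of n-grams in data that were not seen during training."""
    grams = list(byte_ngrams(data, n))
    if not grams:
        return 0.0
    return sum(gram not in model for gram in grams) / len(grams)

# Hypothetical payloads, used only to exercise the sketch.
normal = b"GET /index.html HTTP/1.1"
model = train([normal])
seen_score = anomaly_score(model, normal)
odd_score = anomaly_score(model, b"\x90\x90\x90\x90/bin/sh")
print(seen_score, odd_score)
```

Because a Bloom filter never reports a stored item as missing, data seen during training always scores 0.0; unseen data can be underestimated slightly when false positives occur.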

    Automated Agent Ontology Creation for Distributed Databases

    In distributed database environments, combining resources from multiple sources that require different interfaces is a universal problem. The current solution requires an expert to generate an ontology, or mapping, which contains all interconnections between the various fields in the databases. This research proposes the application of software agents to automate ontology creation for distributed database environments with minimal communication. The automatic creation of a domain ontology alleviates the need for experts to manually map one database to the other databases in the environment. Using several combined comparison methods, these agents communicate and negotiate similarities between information sources and retain these similarities for client agent queries, without manual mapping of the different data sources. They achieve an average accuracy of 57% before leader negotiation and 61% after leader negotiation. The best matching accuracy achieved in a single test is 79%. This is directly applicable to the Department of Defense (DOD), which possesses many systems that share information enabling the military to achieve its objectives. The DOD created an environment called the Joint Battlespace Infosphere (JBI) to solve this integration problem. This research improves upon the JBI's use of exact matching of field names for integrating the information within the environment. It simulates this type of interaction by demonstrating agents wrapped around different databases negotiating and generating an ontology. An agent-generated ontology is compared with an expert-generated ontology, and testing with a set of queries run against both ontologies shows that this technique can be useful in a distributed information environment.

    Ransomware detection based on opcode behaviour using k-nearest neighbours algorithm

    Ransomware is malware that represents a serious threat to a user’s information privacy. By investigating how ransomware works, we may be able to recognise its atomic behaviour. In turn, we will be able to detect ransomware at an earlier stage and with better accuracy. In this paper, we propose using the Control Flow Graph (CFG) as a technique for extracting opcode behaviour, combined with 4-grams (sequences of 4 “words”) to extract opcode sequences, which are incorporated into a Trojan ransomware detection method based on the K-Nearest Neighbors (K-NN) algorithm. The opcode CFG 4-gram can fully represent the detailed behavioural characteristics of Trojan ransomware. The proposed ransomware detection method considers the closest distance to a previously identified ransomware pattern. Experimental results show that the proposed technique using K-NN obtains its best accuracy of 98.86% with 1-gram opcodes and a 1-NN classifier.
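The combination of opcode n-grams and nearest-neighbour classification described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions: the opcode sequences are hypothetical and given directly (the CFG extraction step is not shown), cosine distance is an assumed metric since the abstract does not name one, and the function names are invented for the example.

```python
from collections import Counter
import math

def ngrams(opcodes, n=4):
    """Count overlapping opcode n-grams in a sequence."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))

def cosine_distance(a, b):
    """1 - cosine similarity between two n-gram count vectors (assumed metric)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def knn_predict(train, query, k=1, n=4):
    """Label a query by majority vote among its k closest training samples."""
    q = ngrams(query, n)
    ranked = sorted(train, key=lambda item: cosine_distance(ngrams(item[0], n), q))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled opcode sequences, for illustration only.
train = [
    (["push", "mov", "call", "xor", "mov", "call", "xor", "jmp"], "ransomware"),
    (["push", "mov", "add", "sub", "mov", "add", "ret"], "benign"),
]
query = ["mov", "call", "xor", "mov", "call", "xor", "ret"]
prediction = knn_predict(train, query, k=1, n=4)
print(prediction)
```

Setting `n=1` and `k=1` gives the 1-gram, 1-NN configuration that the abstract reports as the most accurate.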

    Substring-based Machine Translation

    Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words and to help cope with the data-sparsity issues that occur when words are simply divided at white space. In this paper, we take a different approach: not dividing lexical processing and translation into two steps, but simply viewing translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracies on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed in Neubig et al. (2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves results comparable to word-based translation for two distant language pairs.

    Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed

    The patent domain is a very important source of scientific information that is currently not used to its full potential. Searching for relevant patents is a complex task because the number of existing patents is very high and grows quickly, patent text is extremely complicated, and standard vocabulary is not used consistently or doesn’t even exist. As a consequence, pure keyword searches often fail to return satisfying results in the patent domain. Major companies employ patent professionals who are able to search patents effectively, but even they have to invest a lot of time and effort into their search. Academic scientists on the other hand do not have access to such resources and therefore often do not search patents at all, but they risk missing up-to-date information that will not be published in scientific publications until much later, if it is published at all. Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Similarly, professional patent searches expand beyond keywords by including class codes from various patent classification systems. However, classification-based searches can only be performed effectively if the user has very detailed knowledge of the system, which is usually not the case for academic scientists. Consequently, we investigated methods to automatically identify relevant classes that can then be suggested to the user to expand their query. Since every patent is assigned at least one class code, it should be possible for these assignments to be used in a similar way as the MeSH annotations in PubMed. In order to develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. 
In order to gain such knowledge, we perform an in-depth comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms and classes, respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows that the hierarchies are structurally similar, but terms and annotations differ significantly. The most important differences concern the considerably higher complexity of the IPC class definitions compared to MeSH terms and the far lower number of class assignments to the average patent compared to the number of MeSH terms assigned to PubMed documents. These differences cause problems both for inexperienced patent searchers and for professionals. On the one hand, the complex term system makes it very difficult for members of the former group to find any IPC classes that are relevant for their search task. On the other hand, the low number of IPC classes per patent points to incomplete class assignments by the patent office, which limits the recall of the classification-based searches that are frequently performed by the latter group. We approach these problems from two directions: first, by automatically assigning additional patent classes to make up for the missing assignments, and second, by automatically retrieving relevant keywords and classes that are proposed to the user so they can expand their initial search. For the automated assignment of additional patent classes, we adapt to the patent domain an approach that was successfully used for the assignment of MeSH terms to PubMed abstracts. Each document is assigned a set of IPC classes by a large set of binary Maximum-Entropy classifiers.
Our evaluation shows good performance by individual classifiers (precision/recall between 0.84 and 0.90), making the retrieval of additional relevant documents for specific IPC classes feasible. The assignment of additional classes to specific documents is more problematic, since the precision of our classifiers is not high enough to avoid false positives. However, we propose filtering methods that can help solve this problem. For the guided patent search, we demonstrate various methods to expand a user’s initial query. Our methods use both keywords and class codes that the user enters to retrieve additional relevant keywords and classes that are then suggested to the user. These additional query components are extracted from different sources such as patent text, IPC definitions, external vocabularies and co-occurrence data. The suggested expansions can help inexperienced users refine their queries with relevant IPC classes, and professionals can compose their complete query faster and more easily. We also present GoPatents, a patent retrieval prototype that incorporates some of our proposals and makes faceted browsing of a patent corpus possible.

    Self-Alignment in Words and their Applications

    Some quantities associated with periodicities in words are analyzed within the Bernoulli probabilistic model. In particular, the following problem is addressed. Assume that a string X is given, with symbols emitted randomly but independently according to some known probability distribution. Then, for each pair (W, Z) of distinct suffixes of X, the expected length of the longest common prefix of W and Z is sought. The collection of these lengths, which are called self-alignments here, plays a crucial role in several algorithmic problems on words, such as building suffix trees or inverted files, detecting squares and other regularities, computing substring statistics, etc. The asymptotically best algorithms for these problems are quite complex and thus risk being impractical. The present analysis of self-alignments and related measures suggests that, in a variety of cases, more straightforward algorithmic solutions may yield comparable or even better performance. Key words and ph..
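For a concrete string, the quantity this abstract studies in expectation can be computed directly. The brute-force sketch below measures, for every pair of distinct suffixes of a given string, the length of their longest common prefix; the function names and the example string are illustrative assumptions, and the paper's analysis concerns the expected values under a random model rather than this naive computation.

```python
from itertools import combinations

def lcp_length(w, z):
    """Length of the longest common prefix of strings w and z."""
    n = 0
    for a, b in zip(w, z):
        if a != b:
            break
        n += 1
    return n

def self_alignments(x):
    """Map each pair (i, j), i < j, of suffix start positions to the LCP length."""
    suffixes = [x[i:] for i in range(len(x))]
    return {
        (i, j): lcp_length(suffixes[i], suffixes[j])
        for i, j in combinations(range(len(x)), 2)
    }

align = self_alignments("abab")
mean_alignment = sum(align.values()) / len(align)
print(align, mean_alignment)
```

For "abab", the suffixes starting at positions 0 and 2 share the prefix "ab", and those starting at positions 1 and 3 share "b"; all other pairs differ at the first symbol.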