5 research outputs found

    Data Discovery and Anomaly Detection Using Atypicality: Theory

    A central question in the era of 'big data' is what to do with the enormous amount of information. One possibility is to characterize it through statistics, e.g., averages, or to classify it using machine learning, in order to understand the general structure of the overall data. The perspective in this paper is the opposite, namely that in some applications most of the value in the information lies in the parts that deviate from the average, that are unusual, atypical. We define what we mean by 'atypical' in an axiomatic way as data that can be encoded with fewer bits in itself rather than using the code for the typical data. We show that this definition has good theoretical properties. We then develop an implementation based on universal source coding, and apply this to a number of real-world data sets. Comment: 40 pages
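
The coding criterion in this abstract can be illustrated with a toy sketch. It assumes binary data, a fixed i.i.d. "typical" model, and a simple two-part enumerative code standing in for universal source coding; the function names and the `penalty` parameter are hypothetical and are not taken from the paper.

```python
import math

def typical_bits(bits, p_one):
    """Code length (bits) under the fixed typical model: i.i.d. with P(1) = p_one."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return -(ones * math.log2(p_one) + zeros * math.log2(1 - p_one))

def self_bits(bits, penalty=1.0):
    """Code length (bits) when the sequence is encoded 'in itself'.

    Two-part code (an assumption, not the paper's coder): send the number
    of ones, then the index of the sequence among all sequences with that
    many ones. `penalty` is an assumed overhead for signalling that the
    atypical code is in use.
    """
    n = len(bits)
    ones = sum(bits)
    return math.log2(n + 1) + math.log2(math.comb(n, ones)) + penalty

def is_atypical(bits, p_one=0.5, penalty=1.0):
    """A sequence is flagged when its own code is shorter than the typical code."""
    return self_bits(bits, penalty) < typical_bits(bits, p_one)

skewed = [1] * 30 + [0] * 2    # far from the fair-coin typical model
balanced = [0, 1] * 16         # consistent with the typical model
print(is_atypical(skewed), is_atypical(balanced))
```

Under the fair-coin model every 32-bit sequence costs 32 bits, so only sequences that the two-part code compresses well below that are flagged.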

    Processing dynamic computer network data for visual analysis

    In recent times, datasets have become larger and increasingly difficult for users to understand. Visual Analytics research therefore investigates combining automated methods with user-driven analysis. A special case of this field is security visualization. Because information such as connection data or the health status of each computer in a network is very abstract, automated methods become even more important for making interesting outliers and anomalies obvious to the user. This bachelor thesis compares two approaches to anomaly detection in security visualization. The study is based on the VAST 2013 Mini Challenge 3 and the submission by a University of Stuttgart and Peking University cooperation. The thesis concentrates on comparing two automated methods: seasonal trend decomposition (STL) for numerical data fields, such as bytes and packets, and the sample entropy (or Shannon entropy) method for categorical data fields, such as IP and port. Both approaches should enable the user to find events in the given network dataset and thus to understand the risks and attacks in the network of the VAST challenge's example company, Big Marketing. The results show that the methods find a similar number of anomalies but differ in the type of anomaly found. Because STL is applied to different variables, some variables show more scan events and others more DoS events. Combining the results from all variables, STL yields a higher number of true anomalies. On the other hand, the sample entropy is more intuitive to use and gives hints about the type of event without requiring other visualizations. In a small user study, the entropy method was clearly preferred and led to better results on the given tasks. In conclusion, this thesis recommends the entropy method in any context similar to the given benchmark and system, but also suggests that the STL method could be more efficient with different parameters of network security
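
The entropy idea for categorical fields can be sketched as follows. This is a minimal illustration of Shannon entropy computed per time window, not the thesis's implementation; the window contents and the helper names `shannon_entropy` and `entropy_per_window` are assumptions made for the example.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    if len(counts) <= 1:
        return 0.0  # empty or single-valued window carries no uncertainty
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_per_window(windows):
    """Entropy of one categorical field (e.g. destination port) per time window."""
    return [shannon_entropy(w) for w in windows]

# Hypothetical destination ports observed in three consecutive windows.
windows = [
    [80, 80, 443, 443],   # ordinary web traffic, two ports
    [80, 80, 80, 80],     # a single port only
    [21, 22, 23, 25],     # many distinct ports, as a port scan might produce
]
print(entropy_per_window(windows))
```

A window whose entropy departs sharply from its neighbours is a candidate event: a jump suggests many distinct values (for example a scan), while a drop suggests traffic concentrated on one value.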

    Information Theory and Machine Learning

    The recent successes of machine learning, especially of systems based on deep neural networks, have encouraged further research activities and raised a new set of challenges in understanding and designing complex machine learning algorithms. New applications require learning algorithms to be distributed, have transferable learning results, use computation resources efficiently, converge quickly in online settings, have performance guarantees, satisfy fairness or privacy constraints, incorporate domain knowledge on model structures, etc. A new wave of developments in statistical learning theory and information theory has set out to address these challenges. This Special Issue, "Machine Learning and Information Theory", aims to collect recent results in this direction, reflecting a diverse spectrum of visions and efforts to extend conventional theories and develop analysis tools for these complex machine learning systems.