
    Application of information theory and statistical learning to anomaly detection

    In today's highly networked world, computer intrusions and other attacks are a constant threat. The detection of such attacks, especially attacks that are new or previously unknown, is important for securing networks and computers, and a major focus of current research in this area is anomaly detection. In this dissertation, we explore applications of information theory and statistical learning to anomaly detection. Specifically, we look at two difficult detection problems in network and system security: (1) detecting covert channels, and (2) determining whether a user is a human or a bot. We link both of these problems to entropy, a measure of randomness, information content, or complexity, a concept that is central to information theory. The behavior of bots is low in entropy when tasks are rigidly repeated or high in entropy when behavior is pseudo-random; in contrast, human behavior is complex and medium in entropy. Similarly, covert channels either create regularity, resulting in low entropy, or encode extra information, resulting in high entropy, whereas legitimate traffic is characterized by complex interdependencies and moderate entropy. In addition, we utilize statistical learning algorithms, namely Bayesian learning, neural networks, and maximum likelihood estimation, in both the modeling and detection of covert channels and bots. Our results using entropy and statistical learning techniques are excellent. By using entropy to detect covert channels, we detected three different covert timing channels that were not detected by previous detection methods. Then, using entropy and Bayesian learning to detect chat bots, we detected 100% of chat bots with a false positive rate of only 0.05% in over 1400 hours of chat traces. Lastly, using neural networks and the idea of human observational proofs to detect game bots, we detected 99.8% of game bots with no false positives in 95 hours of traces. Our work shows that a combination of entropy measures and statistical learning algorithms is a powerful and highly effective tool for anomaly detection.
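
    As a minimal illustration of the entropy idea described in this abstract (not the dissertation's actual detectors), the sketch below estimates the Shannon entropy of inter-event delays with a simple histogram; the bin count and the toy bot/human timing traces are assumptions made for the example.

```python
import math
from collections import Counter

def shannon_entropy(delays, bins=16):
    """Estimate Shannon entropy (in bits) of inter-event delays via a histogram."""
    lo, hi = min(delays), max(delays)
    width = (hi - lo) / bins or 1.0          # avoid zero-width bins when all delays are equal
    counts = Counter(min(int((d - lo) / width), bins - 1) for d in delays)
    n = len(delays)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy traces (assumed values): a rigidly periodic bot vs. irregular human timing.
bot_delays = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
human_delays = [0.4, 2.1, 1.3, 0.9, 3.7, 0.6, 1.8, 2.4]

print("bot entropy:  ", shannon_entropy(bot_delays))    # near 0: highly regular
print("human entropy:", shannon_entropy(human_delays))  # higher: more varied
```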

    Topic Detection and Compressed Classification in Twitter

    In this paper we introduce a novel information propagation method for Twitter that maintains a low computational complexity. It exploits the power of Compressive Sensing in conjunction with a Kalman filter to update the states of a dynamical system. The proposed method first employs Joint Complexity, defined as the cardinality of the set of all distinct factors of a given string, represented by suffix trees, to perform topic detection. Then, based on the nature of the data, we apply the theory of Compressive Sensing to perform topic classification by recovering an indicator vector, while significantly reducing the amount of information taken from the tweets. We exploit datasets in various languages collected using the Twitter streaming API and achieve better classification accuracy when compared with state-of-the-art methods.
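
    To make the Joint Complexity definition above concrete, here is a naive sketch that counts the distinct factors two short strings share; it uses plain Python sets instead of the suffix trees the paper relies on, and the example tweets are made up.

```python
def factors(s):
    """Return the set of all distinct factors (non-empty substrings) of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def joint_complexity(a, b):
    """Number of distinct factors shared by two strings.
    Naive O(n^2) version; the paper computes this efficiently with suffix trees."""
    return len(factors(a) & factors(b))

# Hypothetical tweets: the pair on a shared topic overlaps in more factors.
t1 = "earthquake hits the city center"
t2 = "major earthquake near the city"
t3 = "new phone released this morning"

print(joint_complexity(t1, t2))  # larger: similar topic
print(joint_complexity(t1, t3))  # smaller: unrelated topic
```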

    The complexity landscape of viral genomes

    Background: Viruses are among the shortest yet highly abundant species that harbor minimal instructions to infect cells, adapt, multiply, and exist. However, despite the current substantial availability of viral genome sequences, the scientific literature lacks a complexity landscape that automatically highlights the organization, relations, and fundamental characteristics of viral genomes. Results: This work provides a comprehensive landscape of viral genome complexity (or quantity of information), identifying the most redundant and the most complex groups with respect to their genome sequence while providing their distribution and characteristics at both a global and a local scale. Moreover, we identify and quantify the abundance of inverted repeats in viral genomes. For this purpose, we measure the sequence complexity of each available viral genome using data compression, demonstrating that adequate data compressors can efficiently quantify the complexity of viral genome sequences, including subsequences better represented by algorithmic sources (e.g., inverted repeats). Using a state-of-the-art genomic compressor on an extensive viral genome database, we show that double-stranded DNA viruses are, on average, the most redundant viruses while single-stranded DNA viruses are the least; in contrast, double-stranded RNA viruses show lower redundancy than single-stranded RNA viruses. Furthermore, we extend the ability of data compressors to quantify local complexity (or information content) in viral genomes using complexity profiles, providing for the first time a direct complexity analysis of human herpesviruses. We also conceive a feature-based classification methodology that can accurately distinguish viral genomes at different taxonomic levels without direct comparisons between sequences. This methodology combines data compression with simple measures such as GC-content percentage and sequence length, followed by machine learning classifiers. Conclusions: This article presents methodologies and findings that are highly relevant for understanding the patterns of similarity and singularity between viral groups, opening new frontiers for studying the organization of viral genomes while depicting the complexity trends and classification components of these genomes at different taxonomic levels. The whole study is supported by an extensive website (https://asilab.github.io/canvas/) for exploring the viral genome characterization through dynamic and interactive approaches.
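
    As a rough illustration of the compression-based complexity measure, the sketch below uses a general-purpose compressor (lzma) rather than the specialized genomic compressor from the paper, and the sequences are made up; it only shows the principle that a more redundant sequence compresses to fewer bits per base.

```python
import lzma
import random

def bits_per_base(seq: str) -> float:
    """Approximate complexity as compressed size in bits per symbol (lower = more redundant)."""
    raw = seq.encode("ascii")
    return 8 * len(lzma.compress(raw)) / len(raw)

random.seed(0)
repetitive = "ACGT" * 2500                          # highly redundant toy "genome"
varied = "".join(random.choices("ACGT", k=10000))   # near-random toy "genome"

print(f"repetitive: {bits_per_base(repetitive):.2f} bits/base")
print(f"varied:     {bits_per_base(varied):.2f} bits/base")
```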

    Link-based similarity search to fight web spam

    We investigate the usability of similarity search in fighting Web spam, based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. In order to be successful, search engine spam never appears in isolation: we observe link farms and alliances created for the sole purpose of search engine ranking manipulation. This artificial nature and strong internal connectedness, however, have given rise to successful algorithms for identifying search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, which yields spam classifiers by spreading information along hyperlinks from whitelists and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating the similarity top lists of an unknown page under various measures such as co-citation, companion, nearest neighbors in low-dimensional projections, and SimRank. We test our method on two data sets previously used to measure spam filtering algorithms. (www.ilab.sztaki.hu/websearch)
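
    The sketch below illustrates one of the similarity measures listed above, co-citation, on a made-up toy link graph; it is not the paper's implementation, just the idea that two pages linked to by many of the same pages score as similar.

```python
def cocitation_similarity(links, a, b):
    """Co-citation similarity: Jaccard overlap of the pages that link to a and to b.
    `links` maps each page to the set of pages it links to."""
    in_a = {p for p, outs in links.items() if a in outs}
    in_b = {p for p, outs in links.items() if b in outs}
    if not in_a or not in_b:
        return 0.0
    return len(in_a & in_b) / len(in_a | in_b)

# Toy link graph (assumed): s1 and s2 are promoted by the same link farm, h is an honest page.
links = {
    "farm1": {"s1", "s2"},
    "farm2": {"s1", "s2"},
    "farm3": {"s1", "s2", "h"},
    "news":  {"h"},
}

print(cocitation_similarity(links, "s1", "s2"))  # high: co-cited by the farm pages
print(cocitation_similarity(links, "s1", "h"))   # low: few common in-links
```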

    Ionospheric Calibration of Low Frequency Radio Interferometric Observations using the Peeling Scheme: I. Method Description and First Results

    Calibration of radio interferometric observations becomes increasingly difficult towards lower frequencies. Below ~300 MHz, spatially variant refractions and propagation delays of radio waves traveling through the ionosphere cause phase rotations that can vary significantly with time, viewing direction, and antenna location. In this article we present a description and first results of SPAM (Source Peeling and Atmospheric Modeling), a new calibration method that attempts to iteratively solve for and correct ionospheric phase errors. To model the ionosphere, we construct a time-variant, two-dimensional phase screen at a fixed height above the Earth's surface. Spatial variations are described by a truncated set of discrete Karhunen-Loeve base functions, optimized for an assumed power-law spectral density of free electron density fluctuations and a given configuration of calibrator sources and antenna locations. The model is constrained using antenna-based gain phases from individual self-calibrations on the available bright sources in the field of view. Application of SPAM to three test cases, a simulated visibility data set and two selected 74 MHz VLA data sets, yields significant improvements in image background noise (5-75 percent reduction) and source peak fluxes (up to 25 percent increase) compared to the existing self-calibration and field-based calibration methods, which indicates a significant improvement in ionospheric phase calibration accuracy. Comment: 23 pages, 14 figures, 2 tables. Accepted for publication in A&A. Changes in v2: corrected a minor error in Equations A.3 and A.12; extended acknowledgments.
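
    The following is a minimal sketch of how a truncated Karhunen-Loeve basis could be built for a set of ionospheric pierce points under an assumed Kolmogorov-like power-law structure function; the point positions, turbulence scale, exponent, and number of retained modes are all assumptions for illustration, and this is not the SPAM code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ionospheric pierce points (km) for a handful of calibrator/antenna pairs.
points = rng.uniform(0.0, 100.0, size=(20, 2))

# Pairwise distances and a Kolmogorov-like power-law structure function D(r) ~ (r/r0)^(5/3).
r = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
r0 = 30.0                              # assumed turbulence scale (km)
D = (r / r0) ** (5.0 / 3.0)

# Turn the structure function into a covariance-like matrix and double-center it
# (removes the unconstrained mean phase), a common simplification for KL phase screens.
C = -0.5 * D
C -= C.mean(axis=0, keepdims=True)
C -= C.mean(axis=1, keepdims=True)

# Eigendecomposition; keep the few strongest modes as the truncated KL basis.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
kl_basis = eigvecs[:, order[:5]]       # 5 dominant spatial modes

print("captured variance fraction:",
      eigvals[order[:5]].sum() / eigvals[eigvals > 0].sum())
```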

    On utilizing weak estimators to achieve the online classification of data streams

    Author's accepted version (post-print). Available from 03/09/2021.

    Evidence-based Cybersecurity: Data-driven and Abstract Models

    Achieving computer security requires both rigorous empirical measurement and models to understand cybersecurity phenomena and the effectiveness of defenses and interventions. To address the growing scale of cyber-insecurity, my approach to protecting users employs principled and rigorous measurements and models. In this dissertation, I examine four cybersecurity phenomena. I show that data-driven and abstract modeling can reveal surprising conclusions about long-term, persistent problems, like spam and malware, and growing threats, like data breaches and cyber conflict. I present two data-driven statistical models and two abstract models. Both of the data-driven models show that the presence of heavy-tailed distributions can make naive analysis of trends and interventions misleading. First, I examine ten years of publicly reported data breaches and find that there has been no increase in size or frequency; I also find that reported and perceived increases can be explained by the heavy-tailed nature of breaches. In the second data-driven model, I examine a large spam dataset, analyzing spam concentrations across Internet Service Providers. Again, I find that the heavy-tailed nature of spam concentrations complicates analysis. Using appropriate statistical methods, I identify unique risk factors with significant impact on local spam levels. I then use the model to estimate the effect of historical botnet takedowns and find that they are frequently ineffective at reducing global spam concentrations and have highly variable local effects. Abstract models are an important tool when data are unavailable. Even without data, I evaluate both known and hypothesized interventions used by search providers to protect users from malicious websites. I present a Markov model of malware spread and study the effect of two potential interventions: blacklisting and depreferencing. I find that heavy-tailed traffic distributions obscure the effects of interventions, but with my abstract model I show that lowering search rankings is a viable alternative to blacklisting infected pages. Finally, I study how game-theoretic models can help clarify strategic decisions in cyber-conflict. I find that, in some circumstances, improving the attribution ability of adversaries may decrease the likelihood of escalating cyber conflict.
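
    As a toy illustration of why heavy-tailed distributions make naive trend analysis misleading (not the dissertation's data or models), the sketch below draws breach sizes from a fixed Pareto distribution each "year"; the yearly means swing widely even though the underlying process never changes, while the medians stay stable. The tail index and counts are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

alpha = 1.5            # assumed tail index; heavier tails -> wilder sample means
years = 10
breaches_per_year = 200

for year in range(years):
    # Pareto-distributed breach sizes (records per breach), identical every year.
    sizes = (rng.pareto(alpha, breaches_per_year) + 1.0) * 1e4
    print(f"year {year}: mean={sizes.mean():>12.0f}  median={np.median(sizes):>8.0f}")
```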