133 research outputs found

    Feature modeling and cluster analysis of malicious Web traffic

    Many attackers find Web applications attractive targets because they are widely used and have many exploitable vulnerabilities. The goal of this thesis is to study patterns of attacker activity on typical Web-based systems using four data sets collected by honeypots, each spanning almost four months. The contributions of our work include cluster analysis and modeling of the features of malicious Web traffic. Some of our main conclusions are: (1) Features of malicious sessions, such as Number of Requests, Bytes Transferred, and Duration, follow skewed distributions, including heavy-tailed ones. (2) The number of requests per unique attacker also follows skewed, including heavy-tailed, distributions, with a small number of attackers submitting most of the malicious traffic. (3) Cluster analysis provides an efficient way to distinguish between attack sessions and vulnerability scan sessions.
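The session-clustering idea above can be sketched with a toy k-means over log-scaled session features. The session values and the two-cluster setup below are invented for illustration and are not the thesis data:

```python
import math

# Hypothetical session records: (requests, bytes transferred, duration in
# seconds). The values are invented for illustration, not thesis data.
sessions = [
    (3, 1_200, 5), (4, 2_000, 7), (5, 1_500, 6),                    # attack-like
    (400, 90_000, 300), (500, 120_000, 350), (450, 100_000, 320),   # scan-like
]

def features(session):
    # Log-scale each feature, since the abstract reports that these
    # quantities follow skewed, heavy-tailed distributions.
    return tuple(math.log1p(x) for x in session)

def kmeans(points, k=2, iters=10):
    # Deterministic seeding for the sketch: first and last points.
    centers = [points[0], points[-1]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            groups[nearest].append(p)
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

# The two clusters recover the short attack-like / long scan-like split.
groups = kmeans([features(s) for s in sessions])
```

Log-scaling first matters because raw values would let Bytes Transferred dominate the distance metric; on the log scale the three features contribute comparably.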

    Stylistic structures: a computational approach to text classification

    The problem of authorship attribution has received attention both in the academic world (e.g. did Shakespeare or Marlowe write Edward III?) and outside (e.g. is this confession really the words of the accused or was it made up by someone else?). Previous studies by statisticians and literary scholars have sought "verbal habits" that characterize particular authors consistently. By and large, this has meant looking for distinctive rates of usage of specific marker words -- as in the classic study by Mosteller and Wallace of the Federalist Papers. The present study is based on the premiss that authorship attribution is just one type of text classification and that advances in this area can be made by applying and adapting techniques from the field of machine learning. Five different trainable text-classification systems are described, which differ from current stylometric practice in a number of ways, in particular by using a wider variety of marker patterns than customary and by seeking such markers automatically, without being told what to look for. A comparison of the strengths and weaknesses of these systems, when tested on a representative range of text-classification problems, confirms the importance of paying more attention than usual to alternative methods of representing distinctive differences between types of text. The thesis concludes with suggestions on how to make further progress towards the goal of a fully automatic, trainable text-classification system.
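The core idea of seeking markers automatically, rather than from a hand-chosen word list, can be sketched as follows. The two tiny "author" corpora, the usage-rate marker criterion, and the marker count of three are all invented for illustration and are far simpler than the systems the thesis describes:

```python
from collections import Counter

# Tiny invented training samples standing in for two authors' texts.
author_a = "upon the whole it is upon reflection that we act upon principle".split()
author_b = "while we act we think and while we think we act in haste".split()

def rates(tokens):
    # Relative usage rate of each word in a token list.
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

ra, rb = rates(author_a), rates(author_b)

# Discover marker words automatically: the words whose usage rates differ
# most between the two training samples, with no hand-picked list.
vocab = set(ra) | set(rb)
markers = sorted(vocab, key=lambda w: abs(ra.get(w, 0) - rb.get(w, 0)),
                 reverse=True)[:3]

def classify(text):
    # Attribute a text to whichever author's marker-word rates it matches
    # more closely (smaller total deviation wins).
    r = rates(text.split())
    dev_a = sum(abs(r.get(w, 0) - ra.get(w, 0)) for w in markers)
    dev_b = sum(abs(r.get(w, 0) - rb.get(w, 0)) for w in markers)
    return "A" if dev_a < dev_b else "B"
```

The thesis's systems go well beyond single-word rates (a wider variety of marker patterns), but the automatic-selection step shown here is the shared premiss.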

    Improving the Performance and Precision of Bioinformatics Algorithms

    Recent advances in biotechnology have enabled scientists to generate and collect huge amounts of biological experimental data. Software tools for analyzing both genomic (DNA) and proteomic (protein) data with high speed and accuracy have thus become very important in modern biological research. This thesis presents several techniques for improving the performance and precision of bioinformatics algorithms used by biologists. Improvements in the speed and reductions in the cost of automated DNA sequencers have allowed scientists to sequence the DNA of an increasing number of organisms. One way biologists can take advantage of this genomic DNA data is to use it in conjunction with expressed sequence tag (EST) and cDNA sequences to find genes and their splice sites. This thesis describes ESTmapper, a tool designed to use an eager write-only top-down (WOTD) suffix tree to efficiently align DNA sequences against known genomes. Experimental results show that ESTmapper can be much faster than previous techniques for aligning and clustering DNA sequences, and produces alignments of comparable or better quality. Peptide identification by tandem mass spectrometry (MS/MS) is becoming the dominant high-throughput proteomics workflow for protein characterization in complex samples. Biologists currently rely on protein database search engines to identify peptides producing experimentally observed mass spectra. This thesis describes two approaches for improving peptide identification precision using statistical machine learning. HMMatch (HMM MS/MS Match) is a hidden Markov model approach to spectral matching, in which many examples of a peptide fragmentation spectrum are summarized in a generative probabilistic model that captures the consensus and variation of each peak's intensity. Experimental results show that HMMatch can identify many peptides missed by traditional spectral matching and search engines.
PepArML (Peptide Identification Arbiter by Machine Learning) is a machine-learning-based framework for improving the precision of peptide identification. It uses classification algorithms to effectively utilize spectral features and scores from multiple search engines in a single model-free framework that can be trained in an unsupervised manner. Experimental results show that PepArML can improve the sensitivity of peptide identification for several synthetic protein mixtures compared with individual search engines.
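The arbiter idea, combining scores from several search engines with pseudo-labels derived from engine agreement so that no hand labeling is needed, can be sketched very roughly. The engine scores, the agreement threshold, and the midpoint decision rule below are all invented stand-ins, far simpler than PepArML itself:

```python
# Each tuple is a hypothetical (engine1, engine2, engine3) score for one
# peptide-spectrum match, on an invented 0-1 scale.
spectra = [
    (0.9, 0.8, 0.85),   # all engines confident
    (0.2, 0.1, 0.15),   # all engines doubtful
    (0.7, 0.75, 0.8),   # all engines confident
    (0.3, 0.9, 0.2),    # engines disagree: left for the arbiter
]

def consensus_label(scores, thresh=0.5):
    # Unsupervised-style pseudo-labeling: trust only unanimous engines.
    votes = [s > thresh for s in scores]
    if all(votes):
        return 1
    if not any(votes):
        return 0
    return None

# Training set built purely from engine agreement, no hand labels.
labeled = [(s, consensus_label(s)) for s in spectra
           if consensus_label(s) is not None]

def arbiter(scores):
    # Minimal combiner: compare the mean engine score against the midpoint
    # between the weakest pseudo-positive and strongest pseudo-negative.
    pos = [sum(s) / len(s) for s, y in labeled if y == 1]
    neg = [sum(s) / len(s) for s, y in labeled if y == 0]
    cut = (min(pos) + max(neg)) / 2
    return int(sum(scores) / len(scores) > cut)
```

The real framework trains classifiers over many spectral features rather than thresholding a mean, but the pattern of bootstrapping labels from multi-engine consensus is the same.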

    Understanding and Enriching Randomness Within Resource-Constrained Devices

    Random Number Generators (RNGs) find use throughout all applications of computing, from high-level statistical modeling all the way down to essential security primitives. A significant amount of prior work has investigated this space, as a poorly performing generator can have significant impacts on the algorithms that rely on it. However, the recent explosive growth of the Internet of Things (IoT) has brought forth a class of devices for which common RNG algorithms may not provide an optimal solution. Furthermore, new hardware creates opportunities that have not yet been explored with these devices. In this dissertation, we present research fostering a deeper understanding and enrichment of the state of randomness within the context of resource-constrained devices. First, we present an exploratory study into methods of generating random numbers on devices with sensors. We perform a data collection study across 37 Android devices to determine how much random data is consumed and which sensors are capable of producing sufficiently entropic data. We use the results of our analysis to create an experimental framework called SensoRNG, which serves as a prototype to test the efficacy of a sensor-based RNG. SensoRNG employs opportunistic collection of data from on-board sensors and applies a lightweight mixing algorithm to produce random numbers. We evaluate SensoRNG with the National Institute of Standards and Technology (NIST) statistical testing suite and demonstrate that a sensor-based RNG can provide high-quality random numbers with little additional overhead. Second, we explore the design, implementation, and efficacy of a Collaborative and Distributed Entropy Transfer protocol (CADET), which moves random number generation from an individual task to a collaborative one. Through the sharing of excess random data, devices that are unable to meet their own needs can be aided by contributions from other devices.
We implement and test a proof-of-concept version of CADET on a testbed of 49 Raspberry Pi 3B single-board computers, which have been underclocked to emulate resource-constrained devices. Through this, we evaluate and demonstrate the efficacy and baseline performance of remote entropy protocols of this type, as well as highlight remaining research questions and challenges. Finally, we design and implement a system called RightNoise, which automatically profiles the RNG activity of a device by using techniques adapted from language modeling. First, by performing offline analysis, RightNoise is able to mine and reconstruct, in the context of a resource-constrained device, the structure of different activities from raw RNG access logs. After recovering these patterns, the device is able to profile its own behavior in real time. We give a thorough evaluation of the algorithms used in RightNoise and show that, with only five instances of each activity type per log, RightNoise is able to reconstruct the full set of activities with over 90% accuracy. Furthermore, classification is very quick, with an average speed of 0.1 seconds per block. We finish this work by discussing real-world application scenarios for RightNoise.
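The general pattern behind a sensor-based RNG like SensoRNG can be sketched as an entropy pool fed by serialized sensor readings and whitened by a cryptographic hash. The abstract does not specify SensoRNG's actual mixing algorithm, so the SHA-256 pool below is a generic stand-in, and the sample values are invented:

```python
import hashlib
import struct

# Invented stand-in for noisy accelerometer-style readings; a real
# implementation would collect these opportunistically from hardware.
samples = [0.012, -0.431, 9.807, 0.098, -0.003, 9.811, 0.215, -0.442]

def mix(readings):
    # Serialize the raw floats into an entropy pool, then hash the pool so
    # the output bytes are uniformly distributed even though the inputs
    # are biased and correlated.
    pool = b"".join(struct.pack("<d", r) for r in readings)
    return hashlib.sha256(pool).digest()

random_bytes = mix(samples)  # 32 bytes of mixed output per pool
```

Hashing is the conventional lightweight whitening step here: it costs little on constrained hardware, and any entropy present anywhere in the pool is spread across all output bits. Quality would still need to be confirmed empirically, e.g. with the NIST statistical test suite mentioned above.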