6 research outputs found

    PageRank in Malware Categorization

    The file attached to this record is the author's final peer-reviewed version. The publisher's final version can be found by following the DOI link. In this paper, we propose a malware categorization method that uses PageRank to model malware behavior in terms of instructions. PageRank computes ranks of web pages based on structural information; applied to malware analysis, it can likewise compute ranks of instructions that reflect the structural information among those instructions. Our malware categorization method uses the computed ranks as features in machine learning algorithms. In the evaluation, we compare the effectiveness of different PageRank algorithms and also investigate bagging and boosting algorithms to improve the categorization accuracy.
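As a rough illustration of ranking instructions with PageRank, the sketch below builds a directed graph from a toy opcode trace and runs plain power iteration. The graph construction (an edge whenever one opcode directly follows another), the damping factor of 0.85, the function name `instruction_pagerank`, and the trace itself are all assumptions made for this example; the abstract does not say how the paper builds its graph or which PageRank variants it compares.

```python
from collections import defaultdict


def instruction_pagerank(opcodes, damping=0.85, iterations=50):
    """Rank opcodes by PageRank over a graph of consecutive-opcode transitions.

    Hypothetical helper: the edge definition and parameters are assumptions.
    """
    nodes = sorted(set(opcodes))
    # Directed edge a -> b whenever opcode b directly follows opcode a.
    out_edges = defaultdict(set)
    for a, b in zip(opcodes, opcodes[1:]):
        out_edges[a].add(b)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_edges[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for a in nodes:
            targets = out_edges[a]
            for b in targets:
                new_rank[b] += damping * rank[a] / len(targets)
        rank = new_rank
    return rank


# Toy opcode trace (made up for illustration, not taken from the paper).
trace = ["push", "mov", "call", "mov", "add", "mov", "call", "ret"]
ranks = instruction_pagerank(trace)

# One fixed-order feature vector per sample, ready for a classifier.
features = [ranks[op] for op in sorted(ranks)]
```

The resulting `features` list has one rank per distinct opcode in a fixed order, which is the kind of vector a bagging or boosting classifier could consume.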

    Spear Phishing Attack Detection

    This thesis addresses the problem of identifying email spear phishing attacks, which are indicative of cyber espionage. Spear phishing consists of targeted emails sent to entice a victim to open a malicious file attachment or click on a malicious link that leads to a compromise of their computer. Current detection methods fail to detect emails of this kind consistently. The SPEar phishing Attack Detection system (SPEAD) is developed to analyze all incoming emails on a network for the presence of spear phishing attacks. SPEAD analyzes the following file types: Windows Portable Executable and Common Object File Format (PE/COFF), Adobe Reader, and Microsoft Excel, Word, and PowerPoint. SPEAD's malware detection accuracy is compared against five commercially available email anti-virus solutions. Finally, this research quantifies the time required to perform this detection with email traffic loads emulating an Air Force base network. Results show that SPEAD outperforms the anti-virus products in PE/COFF malware detection with an overall accuracy of 99.68% and an accuracy of 98.2% where new malware is involved. Additionally, SPEAD is comparable to the anti-virus products when it comes to the detection of new Adobe Reader malware with a rate of 88.79%. Ultimately, SPEAD demonstrates a strong tendency to focus its detection on new malware, which is a rare and desirable trait. Finally, after less than 4 minutes of sustained maximum email throughput, SPEAD's non-optimized configuration exhibits one-hour delays in processing files and links.

    Data Mining Methods For Malware Detection

    This research investigates the use of data mining methods for malware (malicious program) detection and proposes a framework as an alternative to traditional signature detection methods. Traditional approaches that use signatures to detect malicious programs fail for new and unknown malware, for which signatures are not available. We present a data mining framework to detect malicious programs. We collected, analyzed, and processed several thousand malicious and clean programs to find the best features and build models that can classify a given program as either malware or clean. Our research is closely related to information retrieval and classification techniques and borrows a number of ideas from the field. We used a vector space model to represent the programs in our collection. Our data mining framework includes two separate and distinct classes of experiments. The first class consists of supervised learning experiments that used a dataset of several thousand malicious and clean program samples to train, validate, and test an array of classifiers. In the second class of experiments, we proposed using sequential association analysis for feature selection and automatic signature extraction. In our experiments, we achieved a detection rate as high as 98.4% and a false positive rate as low as 1.9% on novel malware.

    R.: Detection of new malicious code using n-grams signatures

    Abstract: Signature-based malicious code detection is the standard technique in all commercial anti-virus software. This method can detect a virus only after the virus has appeared and caused damage. Signature-based detection performs poorly when attempting to identify new viruses. Motivated by the standard signature-based technique for detecting viruses, and a recent successful text classification method, n-grams analysis, we explore the idea of automatically detecting new malicious code. We employ n-grams analysis to automatically generate signatures from malicious and benign software collections. The n-grams-based signatures are capable of classifying unseen benign and malicious code. The datasets used are large compared to earlier applications of n-grams analysis.
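To make the idea of n-gram signatures concrete, here is a minimal sketch, assuming byte-level n-grams, class profiles made of the most frequent n-grams with normalized frequencies, and a relative-difference distance between profiles. The abstract does not state the n-gram size, profile length, or distance measure, so `n=3`, `top=50`, the function names, and the toy byte strings are all illustrative assumptions rather than the authors' settings.

```python
from collections import Counter


def ngram_profile(samples, n=3, top=50):
    """Map the `top` most frequent byte n-grams to their normalized frequency."""
    counts = Counter()
    for data in samples:
        counts.update(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.most_common(top)}


def profile_distance(p, q):
    """Sum of squared relative frequency differences over both profiles."""
    dist = 0.0
    for gram in set(p) | set(q):
        a, b = p.get(gram, 0.0), q.get(gram, 0.0)
        # a + b > 0 because the n-gram occurs in at least one profile.
        dist += ((a - b) / ((a + b) / 2.0)) ** 2
    return dist


def classify(sample, profiles, n=3, top=50):
    """Return the label of the class profile closest to the sample's profile."""
    own = ngram_profile([sample], n, top)
    return min(profiles, key=lambda label: profile_distance(own, profiles[label]))


# Toy byte strings standing in for real binaries (made up for illustration).
malicious = [b"\x90\x90\x90\xeb\xfe\x90\x90\x90", b"\x90\x90\xeb\xfe\x90\x90"]
benign = [b"hello world hello", b"world hello world"]
profiles = {
    "malicious": ngram_profile(malicious),
    "benign": ngram_profile(benign),
}

label_a = classify(b"\x90\x90\x90\x90\xeb\xfe", profiles)
label_b = classify(b"hello world", profiles)
```

In this sketch the class profiles play the role of automatically generated signatures: no analyst writes them, and an unseen sample is assigned to whichever profile it resembles most.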

    Source code authorship attribution

    To attribute authorship means to identify the true author among many candidates for samples of work of unknown or contentious authorship. Authorship attribution is a prolific research area for natural language, but much less so for source code, with eight other research groups having published empirical results concerning the accuracy of their approaches to date. Authorship attribution of source code is the focus of this thesis. We first review, reimplement, and benchmark all existing published methods to establish a consistent set of accuracy scores. This is done using four newly constructed and significant source code collections comprising samples from academic sources, freelance sources, and multiple programming languages. The collections developed are the most comprehensive to date in the field. We then propose a novel information retrieval method for source code authorship attribution. In this method, source code features from the collection samples are tokenised, converted into n-grams, and indexed for stylistic comparison to query samples using the Okapi BM25 similarity measure. Authorship of the top ranked sample is used to classify authorship of each query, and the proportion of times that this is correct determines overall accuracy. The results show that this approach is more accurate than the best approach from the previous work for three of the four collections. The accuracy of the new method is then explored in the context of author style evolving over time, by experimenting with a collection of student programming assignments that spans three semesters with established relative timestamps. We find that it takes one full semester for individual coding styles to stabilise, which is essential knowledge for ongoing authorship attribution studies and quality control in general. We conclude the research by extending both the new information retrieval method and previous methods to provide a complete set of benchmarks for advancing the field. 
    In the final evaluation, we show that the n-gram approaches are leading the field, with accuracy scores for some collections around 90% for a one-in-ten classification problem.
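The retrieval step described in this abstract (tokenise, form n-grams, rank indexed samples against a query with Okapi BM25, take the author of the top-ranked sample) can be sketched as follows. The tokeniser regex, `n=3`, the parameters `k1=1.2` and `b=0.75`, the non-negative IDF variant, and the two toy "authors" are assumptions for illustration; the thesis's actual feature set and settings are not given here.

```python
import math
import re
from collections import Counter


def token_ngrams(source, n=3):
    """Split source code into crude tokens and join each run of n tokens."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bm25_attribute(query, collection, n=3, k1=1.2, b=0.75):
    """Return the author of the indexed sample with the highest BM25 score."""
    docs = [(author, Counter(token_ngrams(src, n))) for author, src in collection]
    lengths = [sum(tf.values()) for _, tf in docs]
    avg_len = sum(lengths) / len(docs)
    # Document frequency: number of samples containing each n-gram.
    df = Counter(gram for _, tf in docs for gram in tf)

    # Each distinct query n-gram is counted once (a simplification).
    query_grams = set(token_ngrams(query, n))
    best_author, best_score = None, float("-inf")
    for (author, tf), length in zip(docs, lengths):
        score = 0.0
        for gram in query_grams:
            if gram not in tf:
                continue
            idf = math.log((len(docs) - df[gram] + 0.5) / (df[gram] + 0.5) + 1.0)
            norm = tf[gram] + k1 * (1.0 - b + b * length / avg_len)
            score += idf * tf[gram] * (k1 + 1.0) / norm
        if score > best_score:
            best_author, best_score = author, score
    return best_author


# Two made-up authors with different loop styles.
collection = [
    ("alice", "for (int i = 0; i < n; i++) { total += values[i]; }"),
    ("bob", "int idx = 0; while (idx < n) { total = total + values[idx]; idx = idx + 1; }"),
]
guess_a = bm25_attribute("for (int j = 0; j < m; j++) { sum += data[j]; }", collection)
guess_b = bm25_attribute("int k = 0; while (k < n) { k = k + 1; }", collection)
```

Repeating this for every query sample and counting how often the returned author is correct gives the overall accuracy figure the abstract refers to.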