5 research outputs found

    Ensemble clustering via heuristic optimisation

    This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel University. Traditional clustering algorithms have different criteria and biases, and no single algorithm is the best solution for a wide range of data sets. This problem often presents a significant obstacle to analysts trying to reveal meaningful information buried in huge amounts of data. Ensemble Clustering has been proposed as a way to avoid these biases and improve the accuracy of clustering. The difficulty in developing Ensemble Clustering methods lies in effectively combining external information (provided by the input clusterings) with internal information (i.e. characteristics of the given data) to improve the accuracy of clustering. The work presented in this thesis focuses on enhancing the clustering accuracy of Ensemble Clustering by employing heuristic optimisation techniques to achieve a robust combination of relevant information during the consensus clustering stage. Two novel heuristic optimisation-based Ensemble Clustering methods, Multi-Optimisation Consensus Clustering (MOCC) and K-Ants Consensus Clustering (KACC), are developed and introduced in this thesis. These methods utilise two heuristic optimisation algorithms (Simulated Annealing and Ant Colony Optimisation) in their Ensemble Clustering frameworks, and have been shown to outperform other methods in the area. Extensive experimental results, together with a detailed analysis, are presented in this thesis.
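
The idea of using Simulated Annealing in the consensus stage can be illustrated with a small sketch. This is not the thesis's MOCC or KACC algorithm, whose details the abstract does not give. It assumes a common representation of the "external information", a co-association matrix recording how often the input clusterings place two items together, and a hypothetical objective (`agreement`) that rewards grouping pairs on which most inputs agree. All function names, the cooling schedule, and the parameter defaults are illustrative assumptions.

```python
import math
import random

def co_association(clusterings):
    """Fraction of input clusterings that put each pair of items together."""
    n = len(clusterings[0])
    return [
        [sum(c[i] == c[j] for c in clusterings) / len(clusterings) for j in range(n)]
        for i in range(n)
    ]

def agreement(labels, co):
    """Hypothetical objective: reward co-clustered pairs with co > 0.5."""
    n = len(labels)
    return sum(
        co[i][j] - 0.5
        for i in range(n) for j in range(i + 1, n)
        if labels[i] == labels[j]
    )

def anneal_consensus(clusterings, k, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over partitions; labels must lie in range(k)."""
    rng = random.Random(seed)
    co = co_association(clusterings)
    current = list(clusterings[0])          # start from one of the inputs
    current_score = agreement(current, co)
    best, best_score = list(current), current_score
    temp = t0
    for _ in range(steps):
        i = rng.randrange(len(current))
        old = current[i]
        current[i] = rng.randrange(k)       # propose moving one item
        new_score = agreement(current, co)
        delta = new_score - current_score
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature falls.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current_score = new_score
            if current_score > best_score:
                best, best_score = list(current), current_score
        else:
            current[i] = old                # reject: undo the move
        temp *= cooling
    return best, best_score

inputs = [[0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]]
consensus, score = anneal_consensus(inputs, k=2)
print(consensus, score)
```

The sketch uses only the input clusterings. Bringing in "internal information" would mean adding a data-dependent term, such as within-cluster distance, to the objective.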

    Large Scale Malware Analysis, Detection and Signature Generation.

    As the primary vehicle for most organized cybercrimes, malicious software (or malware) has become one of the most serious threats to computer systems and the Internet. With the recent advent of automated malware development toolkits, it has become relatively easy, even for marginally skilled adversaries, to create and mutate malware, bypassing Anti-Virus (AV) detection. This has led to a surge in the number of new malware threats and has created several major challenges for the AV industry. AV companies typically receive tens of thousands of suspicious samples daily. However, the overwhelming number of new malware samples easily overtaxes the available human resources at AV companies, making them less responsive to emerging threats and leading to poor detection rates. To address these issues, this dissertation proposes several new and scalable systems to facilitate malware analysis and detection, with a focus on a central theme: "automation and scalability". This dissertation makes four primary contributions. First, it builds a large-scale malware database management system called SMIT that addresses the challenge of determining whether a suspicious sample is indeed malicious. SMIT exploits the insight that most new malicious samples are simple syntactic variations of existing malware. Thus, one way to ascertain the maliciousness of an unknown sample is to check whether it is sufficiently similar to any existing malware. SMIT is designed to make such decisions efficiently using the malware's function call graph, a high-level structural representation that is less susceptible to the low-level obfuscation employed by malware writers to evade detection. Second, the dissertation develops an automatic malware clustering system called MutantX. By quickly grouping similar samples into clusters, MutantX allows malware analysts to focus on representative samples and automatically generate labels based on samples’ association with existing groups. Third, this dissertation introduces a signature-generation system, called Hancock, that automatically creates high-quality string signatures with extremely low false-positive rates. Finally, observing that two widely used malware analysis approaches (static and dynamic analysis) have their respective pros and cons, this dissertation proposes a novel system that optimally integrates static-feature and dynamic-behavior based malware clusterings, mitigating their respective shortcomings without losing their merits. Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/89760/1/huxin_1.pd

    A new efficient approach in clustering ensembles

    Abstract. Previous clustering ensemble algorithms usually use a consensus function to obtain a final partition from the outputs of the initial clusterings. In this paper, we propose a new clustering ensemble method that instead generates a new feature space from the initial clustering outputs. Multiple runs of an initial clustering algorithm such as k-means generate a new feature space that is significantly better than the pure or normalized feature space. Therefore, running a simple clustering algorithm on the generated feature space yields a final partition significantly better than one obtained from the pure data. For the initial clustering runs, we use a modification of k-means named "Intelligent kmeans", which is defined specifically for clustering ensembles. The results of the proposed method are presented using both simple k-means and Intelligent kmeans. Fast convergence and appropriate behavior are the most interesting points of the proposed method. Experimental results on real data sets show the effectiveness of the proposed method.
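
The feature-space idea in this abstract can be sketched roughly as follows. The abstract does not say how cluster outputs are encoded, so this sketch assumes a one-hot encoding of each run's labels, concatenated across runs. It uses plain Lloyd's k-means with a farthest-point initialisation rather than the paper's "Intelligent kmeans", which is not described here. Function names and defaults are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def init_centers(points, k, rng):
    """Random first center, then repeatedly add the farthest point."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    return centers

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = init_centers(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def ensemble_features(points, k, runs, seed=0):
    """Assumed encoding: concatenate one-hot labels from several runs."""
    rng = random.Random(seed)
    features = [[] for _ in points]
    for _ in range(runs):
        labels = kmeans(points, k, rng)
        for row, label in zip(features, labels):
            row.extend(1.0 if label == c else 0.0 for c in range(k))
    return features

def cluster_via_ensemble(points, k, runs=10, seed=0):
    """Cluster in the generated feature space instead of the raw data."""
    features = ensemble_features(points, k, runs, seed)
    return kmeans(features, k, random.Random(seed + 1))

data = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]]
print(cluster_via_ensemble(data, k=2, runs=5))
```

Each run contributes `k` columns, so the generated space has `k * runs` dimensions. Points that the initial runs consistently group together end up with identical or near-identical rows, which is what lets a simple final clustering separate them.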