10 research outputs found

    Lempel-Ziv Networks

    Full text link
    Sequence processing has long been a central area of machine learning research. Recurrent neural nets have been successful in processing sequences for a number of tasks; however, they are known to be both ineffective and computationally expensive when applied to very long sequences. Compression-based methods have demonstrated more robustness when processing such sequences -- in particular, an approach pairing the Lempel-Ziv Jaccard Distance (LZJD) with the k-Nearest Neighbor algorithm has shown promise on long sequence problems (up to T=200,000,000T=200,000,000 steps) involving malware classification. Unfortunately, use of LZJD is limited to discrete domains. To extend the benefits of LZJD to a continuous domain, we investigate the effectiveness of a deep-learning analog of the algorithm, the Lempel-Ziv Network. While we achieve successful proof of concept, we are unable to improve meaningfully on the performance of a standard LSTM across a variety of datasets and sequence processing tasks. In addition to presenting this negative result, our work highlights the problem of sub-par baseline tuning in newer research areas.Comment: I Can't Believe It's Not Better Workshop at NeurIPS 202

    Engineering a Simplified 0-Bit Consistent Weighted Sampling

    Full text link
    The Min-Hashing approach to sketching has become an important tool in data analysis, information retrial, and classification. To apply it to real-valued datasets, the ICWS algorithm has become a seminal approach that is widely used, and provides state-of-the-art performance for this problem space. However, ICWS suffers a computational burden as the sketch size K increases. We develop a new Simplified approach to the ICWS algorithm, that enables us to obtain over 20x speedups compared to the standard algorithm. The veracity of our approach is demonstrated empirically on multiple datasets and scenarios, showing that our new Simplified CWS obtains the same quality of results while being an order of magnitude faster

    Security Enhancing Technologies for Cloud-of-Clouds

    Get PDF

    Robust Mobile Malware Detection

    Get PDF
    The increasing popularity and use of smartphones and hand-held devices have made them the most popular target for malware attackers. Researchers have proposed machine learning-based models to automatically detect malware attacks on these devices. Since these models learn application behaviors solely from the extracted features, choosing an appropriate and meaningful feature set is one of the most crucial steps for designing an effective mobile malware detection system. There are four categories of features for mobile applications. Previous works have taken arbitrary combinations of these categories to design models, resulting in sub-optimal performance. This thesis systematically investigates the individual impact of these feature categories on mobile malware detection systems. Feature categories that complement each other are investigated and categories that add redundancy to the feature space (thereby degrading the performance) are analyzed. In the process, the combination of feature categories that provides the best detection results is identified. Ensuring reliability and robustness of the above-mentioned malware detection systems is of utmost importance as newer techniques to break down such systems continue to surface. Adversarial attack is one such evasive attack that can bypass a detection system by carefully morphing a malicious sample even though the sample was originally correctly identified by the same system. Self-crafted adversarial samples can be used to retrain a model to defend against such attacks. However, randomly using too many such samples, as is currently done in the literature, can further degrade detection performance. This work proposed two intelligent approaches to retrain a classifier through the intelligent selection of adversarial samples. The first approach adopts a distance-based scheme where the samples are chosen based on their distance from malware and benign cluster centers while the second selects the samples based on a probability measure derived from a kernel-based learning method. The second method achieved a 6% improvement in terms of accuracy. To ensure practical deployment of malware detection systems, it is necessary to keep the real-world data characteristics in mind. For example, the benign applications deployed in the market greatly outnumber malware applications. However, most studies have assumed a balanced data distribution. Also, techniques to handle imbalanced data in other domains cannot be applied directly to mobile malware detection since they generate synthetic samples with broken functionality, making them invalid. In this regard, this thesis introduces a novel synthetic over-sampling technique that ensures valid sample generation. This technique is subsequently combined with a dynamic cost function in the learning scheme that automatically adjusts minority class weight during model training which counters the bias towards the majority class and stabilizes the model. This hybrid method provided a 9% improvement in terms of F1-score. Aiming to design a robust malware detection system, this thesis extensively studies machine learning-based mobile malware detection in terms of best feature category combination, resilience against evasive attacks, and practical deployment of detection models. Given the increasing technological advancements in mobile and hand-held devices, this study will be very useful for designing robust cybersecurity systems to ensure safe usage of these devices.Doctor of Philosoph

    Avaliação da viabilidade de modelos filogenéticos na classificação de aplicações maliciosas

    Get PDF
    Orientador: André Ricardo Abed GrégioTese (Doutorado) - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defesa : Curitiba, 03/02/2023Inclui referências: p. 150-170Área de concentração: Ciência da ComputaçãoResumo: Milhares de códigos maliciosos são criados, modificados com apoio de ferramentas de automação e liberados diariamente na rede mundial de computadores. Entre essas ameaças, malware são programas projetados especificamente para interromper, danificar ou obter acesso não autorizado a um sistema ou dispositivo. Para facilitar a identificação e a categorização de comportamentos comuns, estruturas e outras características de malware, possibilitando o desenvolvimento de soluções de defesa, existem estratégias de análise que classificam malware em grupos conhecidos como famílias. Uma dessas estratégias é a Filogenia, técnica baseada na Biologia, que investiga o relacionamento histórico e evolutivo de uma espécie ou outro grupo de elementos. Além disso, a utilização de técnicas de agrupamento em conjuntos semelhantes facilita tarefas de engenharia reversa para análise de variantes desconhecidas. Uma variante se refere a uma nova versão de um código malicioso que é criada a partir de modificações de malware existentes. O presente trabalho investiga a viabilidade do uso de filogenias e de métodos de agrupamento na classificação de variantes de malware para plataforma Android. Inicialmente foram analisados 82 trabalhos correlatos para verificação de configurações de experimentos do estado da arte. Após esse estudo, foram realizados quatro experimentos para avaliar uso de métricas de similaridade e de algoritmos de agrupamento na classificação de variantes e na análise de similaridade entre famílias. Propôs-se então um Fluxo de Atividades para Agrupamento de malware com o objetivo de auxiliar na definição de parâmetros para técnicas de agrupamentos, incluindo métricas de similaridade, tipo de algoritmo de agrupamento a ser utilizado e seleção de características. Como prova de conceito, foi proposto o framework Androidgyny para análise de amostras, extração de características e classificação de variantes com base em medóides (elementos representativos médios de cada grupo) e características exclusivas de famílias conhecidas. Para validar o Androidgyny foram feitos dois experimentos: um comparativo com a ferramenta correlata Gefdroid e outro, com exemplares das 25 famílias mais populosas do dataset Androzoo.Abstract: Thousands of malicious codes are created, modified with the support of tools of automation and released daily on the world wide web. Among these threats, malware are programs specifically designed to interrupt, damage, or gain access unauthorized access to a system or device. To facilitate identification and categorization of common behaviors, structures and other characteristics of malware, enabling the development of defense solutions, there are analysis strategies that classify malware into groups known as families. One of these strategies is Phylogeny, a technique based on the Biology, which investigates the historical and evolutionary relationship of a species or other group of elements. In addition, the use of clustering techniques on similar sets facilitates reverse engineering tasks for analysis of unknown variants. a variant refers to a new version of malicious code that is created from modifications of existing malware. The present work investigates the feasibility of using phylogenies and methods of grouping in the classification of malware variants for the Android platform. Initially 82 related works were analyzed to verify experiment configurations of the state of the art. After this study, four experiments were carried out to evaluate the use of similarity measures and clustering algorithms in the classification of variants and in the similarity analysis between families. In addition to these experiments, a Flow of Activities for Malware grouping with five distinct phases. This flow has purpose of helping to define parameters for clustering techniques, including measures of similarity, type of clustering algorithm to be used and feature selection. After defining the flow of activities, the Androidgyny framework was proposed, a prototype for sample analysis, feature extraction and classification of variants based on medoids and unique features of known families. To validate Androidgyny were Two experiments were carried out: a comparison with the related tool Gefdroid and another with copies of the 25 most populous families in the Androzoo dataset
    corecore