
    JABBIC Lookups: A Backend Telemetry-Based System for Malware Triage

    In this paper, we propose JABBIC lookups, a telemetry-based system for malware triage at the interface between proprietary reputation score systems and malware analysts. JABBIC uses file download telemetry collected from client protection solutions installed on end-hosts to determine the threat level of an unknown file based on telemetry data associated with files already known to be malign. We apply word embeddings and semantic and relational similarities to triage potentially malign files, following the intuition that, while single elements in a malware download might change over time, their context, defined as the semantic and relational properties between the different elements in a malware delivery system (e.g., servers, autonomous systems, files), does not change as fast. To this end, we show that JABBIC can leverage file download telemetry to allow security vendors to manage the collection and analysis of unknown files from remote end-hosts for timely processing by more sophisticated malware analysis systems. We test and evaluate JABBIC lookups with 33M download events collected during October 2015. We show that 85.83% of the files triaged with JABBIC lookups are part of the same malware family as their past counterpart files. We also show that, if used with proprietary reputation score systems, JABBIC can triage as malicious 55.1% of files before they are detected by VirusTotal, preceding this detection by over 20 days.
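The context-similarity intuition behind this triage can be sketched in a few lines. The toy below is illustrative only: file names, servers, and AS numbers are invented, and raw co-occurrence counts with cosine similarity stand in for the learned word embeddings the paper actually uses.

```python
# Minimal sketch of context-based triage in the spirit of JABBIC lookups.
# The real system learns word embeddings over large-scale download telemetry;
# here, a file's "context" is just a bag of infrastructure tokens it came from.
import math
from collections import Counter

def context_vector(events):
    """Bag-of-context counts for one file: servers and ASes it was served by."""
    return Counter(tok for ev in events for tok in ev)

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy telemetry: each download event is (server, autonomous_system).
known_malicious = {
    "mal1.exe": context_vector([("srv-a", "AS100"), ("srv-b", "AS100")]),
}
known_benign = {
    "ok1.exe": context_vector([("srv-x", "AS200")]),
}
unknown = context_vector([("srv-a", "AS100")])  # shares infrastructure with mal1.exe

score_mal = max(cosine(unknown, v) for v in known_malicious.values())
score_ben = max(cosine(unknown, v) for v in known_benign.values())
print(score_mal > score_ben)  # True: triaged toward the malicious family
```

Even though the unknown file itself was never seen before, its delivery context (same server and AS) ties it to a known-malicious counterpart, which is the paper's core idea.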

    Intrusion detection by automatic extraction of the semantics of computer language grammars

    Interactions between a user and information systems are based on an inescapable architectural pattern: user data is integrated into requests whose analysis is carried out by an interpreter that drives the system’s activity. Attacks targeting this architecture (known as injection attacks) are very frequent and particularly severe. Most often, their detection is based only on the syntax of the user data (e.g. the presence of keywords or sub-strings typical of attacks), with limited knowledge of its semantics (i.e. the effects of the query on the information system). The automatic extraction of these semantics is, therefore, a major challenge, as it would significantly improve the performance of Intrusion Detection Systems (IDS). By leveraging recent advances in Natural Language Processing (NLP), it appears feasible to automatically and transparently infer the semantics of user inputs. This Master’s thesis provides a framework centred on the instrumentation of parsers. We focused on parsers for their pivotal role as the first layer of interaction with user inputs and their responsibility for the operations performed on an information system. Our research findings indicate the possibility of constructing an intrusion detection system based on this framework. Moreover, the focus on parser technologies demonstrates the potential for dynamically preventing the processing of malicious input (i.e. creating Intrusion Prevention Systems).
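One common way to operationalize "semantics via the parser" is to compare the parse structure a user input produces against the structure a harmless placeholder produces: if the input changes the shape of the parse, it has changed the meaning of the query. The sketch below is a deliberately tiny stand-in for the thesis's framework; the grammar, tokenizer, and function names are all invented for illustration.

```python
# Toy parser-instrumentation sketch: flag input that alters the parse skeleton.
import re

TOKEN = re.compile(r"\s*(?:(?P<kw>SELECT|FROM|WHERE|OR|AND|UNION)\b"
                   r"|(?P<lit>'[^']*'|\d+)"
                   r"|(?P<op>[=;*,()])"
                   r"|(?P<id>\w+))", re.IGNORECASE)

def skeleton(sql):
    """Sequence of token *types*; all literals collapse to one symbol."""
    out = []
    for m in TOKEN.finditer(sql):
        kind = m.lastgroup
        out.append(m.group().strip().upper() if kind == "kw" else kind)
    return out

def is_injection(template, user_input):
    benign = skeleton(template.format(v="'x'"))          # placeholder parse
    actual = skeleton(template.format(v="'" + user_input + "'"))
    return benign != actual  # structure changed => semantics changed

q = "SELECT name FROM users WHERE id = {v}"
print(is_injection(q, "alice"))          # False: input stays one literal
print(is_injection(q, "x' OR '1'='1"))   # True: input adds OR to the parse
```

A benign value remains a single literal in the parse tree, while the classic `' OR '1'='1` payload introduces new keywords and comparisons, which is exactly the kind of structural change a parser-level IDS can detect regardless of which keywords the attacker uses.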

    Facilitating forensic examinations of multi-user computer environments through session-to-session analysis of internet history

    This paper proposes a new approach to the forensic investigation of Internet history artefacts by aggregating the history from a recovered device into sessions and comparing those sessions to one another to determine whether they are one-time events or form a repetitive or habitual pattern. We describe two approaches for performing the session aggregation: fixed-length sessions and variable-length sessions. We also describe an approach for identifying repetitive pattern-of-life behaviour and show how such patterns can be extracted and represented as binary strings. Using the Jaccard similarity coefficient, a session-to-session comparison can be performed and the sessions can be analysed to determine to what extent a particular session is similar to any other session in the Internet history, and thus is highly likely to correspond to the same user. Experiments have been conducted using two sets of test data, where multiple users have access to the same computer. By identifying patterns of Internet usage that are unique to each user, our approach exhibits a high success rate in attributing particular sessions of the Internet history to the correct user. This can provide considerable help to a forensic investigator trying to establish which user was using the computer when a web-related crime was committed.
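The session comparison step is straightforward to illustrate. In the sketch below (the binary strings are invented, not from the paper's test data), each bit marks whether a given site or category was visited in that session, and the Jaccard coefficient measures how much two sessions' visited sets overlap.

```python
# Jaccard similarity between sessions encoded as equal-length binary strings.
def jaccard(a, b):
    """|intersection| / |union| of the positions set to '1' in both strings."""
    both = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")
    return both / either if either else 0.0

user1_session = "110100"
user1_later   = "110110"   # similar habits -> high similarity
user2_session = "001001"   # disjoint browsing -> no similarity

print(jaccard(user1_session, user1_later))    # 0.75
print(jaccard(user1_session, user2_session))  # 0.0
```

Sessions from the same user score high against that user's past sessions and low against everyone else's, which is what lets an investigator attribute a session of interest to a user.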

    On Leveraging Next-Generation Deep Learning Techniques for IoT Malware Classification, Family Attribution and Lineage Analysis

    Recent years have witnessed the emergence of new and more sophisticated malware targeting insecure Internet of Things (IoT) devices, as part of orchestrated large-scale botnets. Moreover, the public release of the source code of popular malware families such as Mirai [1] has spawned diverse variants, making it harder to disambiguate their ownership, lineage, and correct label. Such a rapidly evolving landscape also makes it harder to deploy and generalize effective learning models against retired, updated, and/or new threat campaigns. To mitigate such threats, there is a pressing need for effective IoT malware detection, classification, and family attribution, which provide essential steps towards initiating attack mitigation/prevention countermeasures, as well as understanding the evolutionary trajectories and tangled relationships of IoT malware. This is particularly challenging due to the lack of fine-grained empirical data about IoT malware, the diverse architectures of IoT-targeted devices, and the massive code reuse between IoT malware families. To address these challenges, in this thesis, we leverage the general lack of obfuscation in IoT malware to extract and combine static features from multi-modal views of the executable binaries (e.g., images, strings, assembly instructions), along with Deep Learning (DL) architectures for effective IoT malware classification and family attribution. Additionally, we aim to address concept drift and the limitations of inter-family classification due to the evolutionary nature of IoT malware, by detecting in-class evolving IoT malware variants and interpreting the meaning behind their mutations. To this end, we perform the following to achieve our objectives: First, we analyze 70,000 IoT malware samples collected by a specialized IoT honeypot and popular malware repositories over the past 3 years.
Consequently, we utilize features extracted from strings- and image-based representations of IoT malware to implement a multi-level DL architecture that fuses the learned features from each sub-component (i.e., images, strings) through a neural network classifier. Our in-depth experiments with four prominent IoT malware families highlight the high accuracy of the proposed approach (99.78%), which outperforms conventional single-level classifiers by relying on different representations of the target IoT malware binaries that do not require expensive feature extraction. Additionally, we utilize our IoT-tailored approach for labeling unknown malware samples, while identifying new malware strains. Second, we seek to identify when the classifier shows signs of aging, by which it fails to effectively recognize new variants and adapt to potential changes in the data. Thus, we introduce a robust and effective method that uses contrastive learning and attentive Transformer models to learn and compare semantically meaningful representations of IoT malware binaries and codes without the need for expensive target labels. We find that the evolution of IoT binaries can be used as an augmentation strategy to learn effective representations to contrast (dis)similar variant pairs. We discuss the impact and findings of our analysis and present several evaluation studies to highlight the tangled relationships of IoT malware, as well as the efficiency of our contrastively learned fine-grained feature vectors in preserving semantics and reducing out-of-vocabulary size in cross-architecture IoT malware binaries. We conclude this thesis by summarizing our findings and discussing research gaps that pave the way for future work.

    Text Classification: A Review, Empirical, and Experimental Evaluation

    The explosive and widespread growth of data necessitates the use of text classification to extract crucial information from vast amounts of data. Consequently, there has been a surge of research in both classical and deep learning text classification methods. Despite the numerous methods proposed in the literature, there is still a pressing need for a comprehensive and up-to-date survey. Existing survey papers categorize algorithms for text classification into broad classes, which can lead to the misclassification of unrelated algorithms and incorrect assessments of their qualities and behaviors using the same metrics. To address these limitations, our paper introduces a novel methodological taxonomy that classifies algorithms hierarchically into fine-grained classes and specific techniques. The taxonomy includes methodology categories, methodology techniques, and methodology sub-techniques. Our study is the first survey to utilize this methodological taxonomy for classifying algorithms for text classification. Furthermore, our study conducts empirical evaluation and experimental comparisons and rankings of different algorithms that employ the same specific sub-technique, different sub-techniques within the same technique, different techniques within the same category, and different categories.

    First Steps towards Data-Driven Adversarial Deduplication

    In traditional databases, the entity resolution problem (also known as deduplication) refers to the task of mapping multiple manifestations of virtual objects to their corresponding real-world entities. When addressing this problem, in both theory and practice, it is widely assumed that such sets of virtual objects appear as the result of clerical errors, transliterations, missing or updated attributes, abbreviations, and so forth. In this paper, we address this problem under the assumption that this situation is caused by malicious actors operating in domains in which they do not wish to be identified, such as hacker forums and markets in which the participants are motivated to remain semi-anonymous (though they wish to keep their true identities secret, they find it useful for customers to identify their products and services). We are therefore in the presence of a different, and even more challenging, problem that we refer to as adversarial deduplication. In this paper, we study this problem via examples that arise from real-world data on malicious hacker forums and markets, obtained through collaborations with a cyber threat intelligence company focusing on understanding this kind of behavior. We argue that it is very difficult, if not impossible, to find ground truth data on which to build solutions to this problem, and develop a set of preliminary experiments based on training machine learning classifiers that leverage text analysis to detect potential cases of duplicate entities. Our results are encouraging as a first step towards building tools that human analysts can use to enhance their capabilities towards fighting cyber threats.
    Authors: José Nicolás Paredes (Instituto de Ciencias e Ingeniería de la Computación, CONICET - Universidad Nacional del Sur, Bahía Blanca, Argentina); Gerardo Simari (same institute; also Arizona State University, United States); Maria Vanina Martinez (CONICET; Universidad de Buenos Aires, Argentina); Marcelo Alejandro Falappa (same institute as Paredes, Argentina).
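One concrete text-analysis feature that classifiers of this kind could consume is writing-style overlap between two handles, for example via character n-gram similarity. The sketch below is purely illustrative: the handles and posts are invented, and the paper's actual classifiers and features may differ.

```python
# Character-trigram overlap between two forum handles' posts, as one
# candidate feature for spotting duplicate (semi-anonymous) identities.
def ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def style_overlap(posts_a, posts_b, n=3):
    """Jaccard similarity of the two handles' pooled character n-grams."""
    a = set().union(*(ngrams(p, n) for p in posts_a))
    b = set().union(*(ngrams(p, n) for p in posts_b))
    return len(a & b) / len(a | b)

handle1 = ["selling fresh cc dumps, pm me", "fresh dumps daily, escrow ok"]
handle2 = ["selling fresh cc dumps daily, escrow accepted"]  # likely same actor
handle3 = ["looking for botnet rental prices"]               # unrelated

print(style_overlap(handle1, handle2) > style_overlap(handle1, handle3))  # True
```

A high overlap between two handles is only weak evidence on its own, which is why such features would feed a trained classifier, and why the paper frames its tool as assisting, rather than replacing, human analysts.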

    AUTOMATIC FEATURE ENGINEERING FOR DISCOVERING AND EXPLAINING MALICIOUS BEHAVIORS

    A key task of cybersecurity is to discover and explain malicious behaviors of malware. The understanding of malicious behaviors helps us further develop good features and apply machine learning techniques to detect various attacks. The effectiveness of machine learning techniques primarily depends on the manual feature engineering process, based on human knowledge and intuition. However, given the adversaries’ efforts to evade detection and the growing volume of publications on malicious behaviors, the feature engineering process likely draws from a fraction of the relevant knowledge. Therefore, it is necessary and important to design an automated system to engineer features for discovering malicious behaviors and detecting attacks. First, we describe a knowledge-based feature engineering technique for malware detection. It mines documents written in natural language (e.g. scientific literature), and represents and queries the knowledge about malware in a way that mirrors the human feature engineering process. We implement the idea in a system called FeatureSmith, which generates a feature set for detecting Android malware. We train a classifier using these features on a large data set of benign and malicious apps. This classifier achieves comparable performance to a state-of-the-art Android malware detector that relies on manually engineered features. In addition, FeatureSmith is able to suggest informative features that are absent from the manually engineered set and to link the features generated to abstract concepts that describe malware behaviors. Second, we propose a data-driven feature engineering technique called ReasonSmith, which explains machine learning models by ranking features based on their global importance. Instead of interpreting how neural networks make decisions for one specific sample, ReasonSmith captures general importance in terms of the whole data set. 
In addition, ReasonSmith allows us to efficiently identify data biases and artifacts by comparing feature rankings over time. We further summarize the common data biases and artifacts for malware detection problems at the level of API calls. Third, we study malware detection from a global view and explore the automatic feature engineering problem in analyzing campaigns that include a series of actions. We implement a system, ChainSmith, to bridge large-scale field measurement and manual campaign reports by extracting and categorizing IOCs (indicators of compromise) from security blogs. The semantic roles of IOCs allow us to link qualitative data (e.g. security blogs) to quantitative measurements, which brings new insights to malware campaigns. In particular, we study the effectiveness of different persuasion techniques used to entice users to download the payloads. We find that campaigns usually start with social engineering, and that the “missing codec” ruse is a common persuasion technique that generates the most suspicious downloads each day.
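Ranking features by global importance, as ReasonSmith does for neural networks, can be illustrated with a much simpler stand-in: permutation importance. Everything in this sketch is invented (the "model" is a fixed rule, the feature names are arbitrary Android-style permissions); it only demonstrates the ranking principle, not the thesis's actual method.

```python
# Global feature ranking via permutation importance: scramble one feature at a
# time and see how much the model's agreement with the labels drops.
import random

random.seed(0)

def model(sample):
    """Toy 'detector': flags a sample exactly when SEND_SMS is set."""
    return sample["SEND_SMS"]

data = [{"SEND_SMS": random.randint(0, 1), "READ_LOG": random.randint(0, 1)}
        for _ in range(200)]
labels = [model(s) for s in data]  # ground truth generated by the rule itself

def accuracy_with_shuffled(feature):
    """Model accuracy after destroying one feature's relationship to labels."""
    shuffled = [s[feature] for s in data]
    random.shuffle(shuffled)
    hits = 0
    for s, fake, y in zip(data, shuffled, labels):
        probe = dict(s)
        probe[feature] = fake
        hits += model(probe) == y
    return hits / len(data)

# The larger the accuracy drop when a feature is scrambled, the more it matters
# globally; sorting ascending puts the most important feature first.
ranking = sorted(data[0], key=accuracy_with_shuffled)
print(ranking[0])  # SEND_SMS: scrambling it hurts most
```

Scrambling the ignored `READ_LOG` feature leaves accuracy untouched, while scrambling `SEND_SMS` collapses it, so the ranking recovers what the model actually relies on across the whole data set, which is the "global importance" idea, as opposed to per-sample explanations.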

    Visualizing Incongruity: Visual Data Mining Strategies for Modeling Humor in Text

    The goal of this project is to investigate the use of visual data mining to model verbal humor. We explored various means of text visualization to identify key features of garden path jokes as compared with non-jokes. In garden path jokes, one interpretation is established in the setup, but new information indicating some alternative interpretation triggers a resolution process leading to a new interpretation. For this project we visualize text in three novel ways, assisted by some web mining to build an informal ontology, that allow us to see the differences between garden path jokes and non-jokes of similar form. We used the results of the visualizations to build a rule-based model, which was then compared with models from traditional data mining to show the usefulness of visual data mining. We also conducted additional experiments with other forms of incongruity, including visualization of ’shilling’, the introduction of false reviews into a product review set. The results are very similar to those for garden path jokes and begin to show that there is a shape to incongruity. Overall, this project shows that the proposed methodologies and tools offer a new approach to testing and generating hypotheses related to theories of humor, as well as other phenomena involving opposition, incongruities, and shifts in classification.