
    Detecting anomalous behaviour using heterogeneous data

    In this paper, we propose a method to detect anomalous behaviour using heterogeneous data. The method detects anomalies using the recently introduced Recursive Density Estimation (RDE) approach and the so-called eccentricity. It does not require prior assumptions about the type of data distribution. A simplified form of the well-known Chebyshev condition (inequality) is used for the standardised eccentricity, and it applies to any type of distribution. The method is applied to three datasets comprising credit card, loyalty card and GPS data. Experimental results show that the proposed method may simplify complex real cases of forensic investigation that require processing huge amounts of heterogeneous data to find anomalies. The proposed method can simplify the tedious job of processing the data and assist the human expert in making important decisions. In future research, the method will be applied to further data types such as natural language (e.g. email, Twitter, SMS) and images.
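
The eccentricity test in the abstract above can be illustrated with a minimal batch sketch. It assumes the standardised eccentricity takes the form 1 + ||x - mean||^2 / variance, as commonly given in the eccentricity literature, and flags a point when that value exceeds m^2 + 1. The function names, the choice m = 3 and the restriction to numeric feature vectors are assumptions made for illustration; the paper's recursive updates and its handling of heterogeneous data are not reproduced here.

```python
from typing import List, Sequence


def standardised_eccentricity(points: Sequence[Sequence[float]]) -> List[float]:
    """Return 1 + ||x - mean||^2 / variance for every point (batch form)."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    # Squared Euclidean distance of each point to the global mean.
    sq_dist = [sum((p[j] - mean[j]) ** 2 for j in range(dim)) for p in points]
    variance = sum(sq_dist) / n
    if variance == 0:
        # All points coincide, so no point is more eccentric than another.
        return [1.0] * n
    return [1.0 + d / variance for d in sq_dist]


def flag_anomalies(points: Sequence[Sequence[float]], m: float = 3.0) -> List[int]:
    """Indices whose standardised eccentricity exceeds m^2 + 1 (Chebyshev-style test)."""
    threshold = m * m + 1.0
    scores = standardised_eccentricity(points)
    return [i for i, score in enumerate(scores) if score > threshold]
```

Because the threshold comes from a Chebyshev-style bound rather than a Gaussian assumption, the same test can be applied without first fitting a distribution to the data.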

    Adaptive Learning Algorithms for Non-stationary Data

    With the wide availability of large amounts of data and the acute need to extract useful information from such data, intelligent data analysis has attracted great attention and contributed to solving many practical tasks, ranging from scientific research to industrial processes and daily life. In many cases the data evolve over time or change from one domain to another. The non-stationary nature of the data brings a new challenge for many existing learning algorithms, which are based on the stationarity assumption. This dissertation addresses three crucial problems in the effective handling of non-stationary data by investigating systematic methods for sample reweighting. Sample reweighting infers sample-dependent weights for one data collection so that it matches another data collection that exhibits a distributional difference. It is known as the density-ratio estimation problem, and the estimation results can be used in several machine learning tasks. This research proposes a set of methods for distribution matching by developing novel density-ratio methods that incorporate the characteristics of different non-stationary data analysis tasks. The contributions are summarized below. First, for the domain adaptation of classification problems, a novel discriminative density-ratio method is proposed. This approach combines three learning objectives: minimizing generalized risk on the reweighted training data, minimizing class-wise distribution discrepancy and maximizing the separation margin on the test data. To solve the discriminative density-ratio problem, two algorithms are presented on the basis of a block coordinate update optimization scheme. Experiments conducted on different domain adaptation scenarios demonstrate the effectiveness of the proposed algorithms. Second, for detecting novel instances in the test data, a locally-adaptive kernel density-ratio method is proposed. 
While traditional novelty detection algorithms are limited to detecting either emerging novel instances, which are completely new, or evolving novel instances, whose distributions differ from previously seen ones, the proposed algorithm builds on the idea of using the density ratio as a measure of evolving novelty and augments it with structural information about each data instance's neighborhood. This makes the estimation of the density ratio more reliable and results in the detection of emerging as well as evolving novelties. In addition, the proposed locally-adaptive kernel novelty detection method is applied to social media analysis and shows favorable performance over other existing approaches. Because social media streams are continuous in time, novelty there is usually characterized by a combination of emerging and evolving behaviour. One reason is the existence of large common vocabularies between different topics. Another reason is that topics are very likely to be discussed continuously across sequential batches of collections, but with different levels of intensity. Thus, the presented novelty detection algorithm demonstrates its effectiveness in social media data analysis. Lastly, an auto-tuning method for the non-parametric kernel mean matching estimator is presented. It introduces a new quality measure for evaluating the goodness of distribution matching, which reflects the normalized mean square error of the estimates. The proposed quality measure does not depend on the learner in the following step and accordingly allows the model selection procedures for importance estimation and prediction model learning to be completely separated.
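
Sample reweighting as described above assigns each training sample a weight approximating p_test(x) / p_train(x). The sketch below uses a deliberately naive histogram estimate on one-dimensional data to make the idea concrete. It is not the discriminative, locally-adaptive kernel, or kernel mean matching method of the dissertation; the function name and the `bin_width` parameter are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def histogram_density_ratio(
    train: Sequence[float], test: Sequence[float], bin_width: float = 1.0
) -> List[float]:
    """Weight each training sample by p_test(bin) / p_train(bin)."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")

    def bin_of(x: float) -> int:
        return math.floor(x / bin_width)

    train_counts = Counter(bin_of(x) for x in train)
    test_counts = Counter(bin_of(x) for x in test)

    weights = []
    for x in train:
        b = bin_of(x)
        p_train = train_counts[b] / len(train)  # always > 0 for a training point
        p_test = test_counts[b] / len(test)     # 0 if no test point falls in this bin
        weights.append(p_test / p_train)
    return weights
```

Training samples from regions the test data never visits receive weight zero, while samples from regions that are over-represented in the test data are weighted up; a learner trained with these weights then approximates training on the test distribution.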

    A framework for clustering and adaptive topic tracking on evolving text and social media data streams.

    Recent advances in and widespread usage of online web services and social media platforms, coupled with ubiquitous low-cost devices, mobile technologies, and the increasing capacity of lower-cost storage, have led to a proliferation of Big Data, ranging from news, e-commerce clickstreams, and online business transactions to continuous event logs and social media expressions. These large amounts of online data, often referred to as data streams because they are generated at extremely high throughput or velocity, can make conventional and classical data analytics methodologies obsolete. For these reasons, the management and analysis of data streams have been researched extensively in recent years. The special case of social media Big Data brings additional challenges, particularly because of the unstructured nature of the data, specifically free text. One classical approach to mining text data has been Topic Modeling. Topic Models are statistical models that can be used for discovering the abstract “topics” that may occur in a corpus of documents. Topic models have emerged as a powerful technique in machine learning and data science, providing a good balance between simplicity and complexity. They also provide sophisticated insight without the need for real natural language understanding. However, they have not been designed to cope with the type of text data that is abundant on social media platforms, but rather for traditional medium-size corpora consisting of longer documents, adhering to a specific language and typically spanning a stable set of topics. Unlike traditional document corpora, social media messages tend to be very short, sparse, and noisy, and do not adhere to a standard vocabulary, linguistic patterns, or stable topic distributions. 
They are also generated at high velocity, which imposes high demands on topic modeling, and their evolving or dynamic nature makes any set of topic modeling results quickly become stale in the face of changes in the textual content and topics discussed within social media streams. In this dissertation, we propose an integrated topic modeling framework built on top of an existing stream-clustering framework called Stream-Dashboard, which can extract, isolate, and track topics over any given time period. In this new framework, Stream-Dashboard first clusters the data stream points into homogeneous groups. Data from each group is then passed to the topic modeling framework, which extracts finer topics from the group. The proposed framework tracks the evolution of the clusters over time to detect milestones corresponding to changes in topic evolution, and to trigger an adaptation of the learned groups and topics at each milestone. The proposed approach differs from a generic Topic Modeling approach because it works in a compartmentalized fashion: the input document stream is split into distinct compartments, and Topic Modeling is applied to each compartment separately. Furthermore, we propose extensions to existing topic modeling and stream clustering methods, including: an adaptive query reformulation approach to help focus topic discovery over time; a topic modeling extension with adaptive hyper-parameters and an infinite vocabulary; and an adaptive stream clustering algorithm that incorporates the automated estimation of dynamic, cluster-specific temporal scales for adaptive forgetting, to help facilitate clustering in a fast-evolving data stream. 
Our experimental results show that the proposed adaptive forgetting clustering algorithm can mine better-quality clusters; that our proposed compartmentalized framework is able to mine topics of better quality than competitive baselines; and that the proposed framework can automatically adapt to focus on changing topics using the proposed query reformulation strategy.

    Twitter Bots’ Detection with Benford’s Law and Machine Learning

    Online Social Networks (OSNs) have grown exponentially in terms of active users and have now become an influential factor in the formation of public opinion. For this reason, the use of bots and botnets for spreading misinformation on OSNs has become a widespread concern. Identifying bots and botnets on Twitter can require complex statistical methods to score a profile based on multiple features. Benford’s Law, or the Law of Anomalous Numbers, states that, in many naturally occurring sequences of numbers, the First Significant Leading Digit (FSLD) frequencies follow a particular pattern: they are unevenly distributed and decrease as the digit increases. This principle can be applied to the first-degree egocentric network of a Twitter profile to assess its conformity to the law and, thus, classify it as a bot profile or a normal profile. This paper focuses on leveraging Benford’s Law in combination with various Machine Learning (ML) classifiers to identify bot profiles on Twitter. In addition, a comparison with other statistical methods is produced to confirm our classification results.
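
Benford’s Law gives the expected frequency of leading digit d as log10(1 + 1/d), so digit 1 is expected about 30.1% of the time and digit 9 about 4.6%. The sketch below compares observed first-digit frequencies with that expectation using a mean absolute deviation. The input is a hypothetical list of positive counts (for example, the follower counts of a profile's neighbours); the function names and the choice of deviation measure are assumptions, not the paper's pipeline, and no classification threshold is implied.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def benford_expected() -> Dict[int, float]:
    """Expected leading-digit frequencies under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_significant_digit(n: int) -> int:
    """Leading digit of a positive integer."""
    if n <= 0:
        raise ValueError("expects a positive integer")
    return int(str(n)[0])


def benford_mad(values: Iterable[int]) -> float:
    """Mean absolute deviation between observed and expected digit frequencies."""
    digits = [first_significant_digit(v) for v in values if v > 0]
    if not digits:
        raise ValueError("need at least one positive value")
    counts = Counter(digits)
    expected = benford_expected()
    total = len(digits)
    return sum(abs(counts[d] / total - expected[d]) for d in range(1, 10)) / 9
```

A deviation score like this could serve as one input feature to a classifier alongside other profile features.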

    Word Embeddings for Fake Malware Generation

    Signature-based and anomaly-based techniques are the fundamental methods for detecting malware. However, in recent years this type of threat has become more complex and sophisticated, making these techniques less effective. For this reason, researchers have resorted to state-of-the-art machine learning techniques to combat threats to information security. Nevertheless, despite the integration of machine learning models, there is still a shortage of training data that prevents these models from performing at their peak. In the past, generative models have been found to be highly effective at generating image-like data that are similar to the actual data distribution. In this paper, we apply generative modeling to opcode sequences and aim to generate malware samples by taking advantage of the contextualized embeddings from BERT. We obtained promising results when differentiating between real and generated samples. We observe that generated malware has characteristics so similar to those of actual malware that the classifiers have difficulty distinguishing between the two, falsely identifying the generated malware as actual malware almost of the time.

    A Blockchain-Based Retribution Mechanism for Collaborative Intrusion Detection

    The collaborative intrusion detection approach uses detection signatures shared among the collaborating participants to facilitate coordinated defense. In the context of collaborative intrusion detection systems (CIDS), however, there is no research focusing on the efficiency of the shared detection signatures. An inefficient detection signature costs not only IDS resources but also processing in the peer-to-peer (P2P) network. In this paper, we therefore propose a blockchain-based retribution mechanism, which aims to incentivize the participants to contribute to verifying the efficiency of detection signatures by means of distributed consensus. We implement a prototype on the Ethereum blockchain, which instantiates a token-based retribution mechanism and a smart-contract-enabled, voting-based distributed consensus. We conduct a number of experiments on the prototype, and the experimental results demonstrate the effectiveness of the proposed approach.

    Robustness of Image-Based Malware Analysis

    In previous work, “gist descriptor” features extracted from images have been used in malware classification problems and have shown promising results. In this research, we determine whether gist descriptors are robust with respect to malware obfuscation techniques, as compared to Convolutional Neural Networks (CNNs) trained directly on malware images. Using the Python Imaging Library (PIL), we create images from malware executables and from malware that we obfuscate. We conduct experiments to compare classifying these images with a CNN against extracting gist descriptor features from the images for use in classification. For the gist descriptors, we consider a variety of classification algorithms, including k-nearest neighbors, random forest, support vector machine, and multi-layer perceptron. We find that gist descriptors are more robust than CNNs with respect to the obfuscation techniques that we consider.
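
Creating an image from an executable usually means treating each byte as one grayscale pixel and wrapping the byte stream into rows of fixed width. The sketch below shows that reshaping step with the standard library only. The width of 256 and the zero padding of the last row are assumptions for illustration, since the abstract does not state the layout the authors used.

```python
from typing import List


def bytes_to_grayscale_rows(data: bytes, width: int = 256) -> List[List[int]]:
    """Wrap raw bytes into rows of `width` pixels with values 0..255."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Ceiling division; keep at least one row so empty input still yields an image.
    height = max(1, -(-len(data) // width))
    # Pad the final row with zero bytes (black pixels).
    padded = data.ljust(width * height, b"\x00")
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]
```

With Pillow installed, the same padded buffer could be passed to `Image.frombytes("L", (width, height), padded)` to obtain an 8-bit grayscale image object for a CNN or a feature extractor.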

    Impact of Location Spoofing Attacks on Performance Prediction in Mobile Networks

    Performance prediction in wireless mobile networks is essential for diverse purposes in network management and operation. In particular, the position of mobile devices is crucial to estimating performance in the mobile communication setting. This paper therefore investigates mobile communication performance based on the coordinate information of mobile devices. We analyze a recent 5G data collection and examine the feasibility of location-based performance prediction. Because location information is key to performance prediction, a relevant prediction rests on the basic assumption that the given device coordinates are correct. This paper therefore also investigates the impact of position falsification on the ML-based performance predictor and reveals significant degradation of prediction performance under such attacks, suggesting the need for effective defense mechanisms against location spoofing threats.

    A Blockchain-Based Tamper-Resistant Logging Framework

    Since its introduction in Bitcoin, the blockchain has proven to be a versatile data structure. In its role as an immutable ledger, it has grown beyond its initial use in financial transactions to record a wide variety of other useful information. In this paper, we explore the application of the blockchain outside of its traditional decentralized, financial domain. We show how, even with only a single “mining” node, a proof-of-work blockchain can be the cornerstone of a tamper-resistant logging framework. By attaching a proof-of-work to blocks of logging messages, we make it increasingly difficult for an attacker to modify those logs even after totally compromising the system. Furthermore, we discuss various strategies an attacker might take to modify the logs without detection and show how effective those evasion techniques are against statistical analysis.
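
The idea of attaching a proof-of-work to blocks of log messages can be sketched with a hash chain in which each block must hash to a digest with a fixed number of leading zero hex digits. Everything below is an illustrative assumption rather than the paper's design: the `Block` layout, the JSON serialisation, the genesis value and the hex-digit difficulty are all hypothetical choices.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List

GENESIS = "0" * 64  # hypothetical predecessor hash for the first block


@dataclass
class Block:
    prev_hash: str
    messages: List[str]
    nonce: int
    digest: str


def _digest(prev_hash: str, messages: List[str], nonce: int) -> str:
    payload = json.dumps([prev_hash, messages, nonce]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def mine_block(prev_hash: str, messages: List[str], difficulty: int = 3) -> Block:
    """Search for a nonce whose digest starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = _digest(prev_hash, messages, nonce)
        if digest.startswith(prefix):
            return Block(prev_hash, list(messages), nonce, digest)
        nonce += 1


def verify_chain(chain: List[Block], difficulty: int = 3) -> bool:
    """Check the links, the stored digests and the proof-of-work of every block."""
    prefix = "0" * difficulty
    prev = GENESIS
    for block in chain:
        recomputed = _digest(block.prev_hash, block.messages, block.nonce)
        if block.prev_hash != prev or recomputed != block.digest:
            return False
        if not block.digest.startswith(prefix):
            return False
        prev = block.digest
    return True
```

Because every block commits to the digest of its predecessor, an attacker who edits an old log entry must redo the proof-of-work for that block and every later one, which is what makes older entries progressively more expensive to alter.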