
    Data stream mining techniques: a review

    A vast, effectively unbounded volume of data is generated by the Internet and other information sources. Analyzing this massive data in real time and extracting valuable knowledge with different mining platforms has been an active area for both research and industry. However, data stream mining poses challenges that distinguish it from traditional data mining. Recently, many studies have addressed these massive data mining problems and proposed several techniques that produce impressive results. In this paper, we review real-time clustering and classification techniques for data streams. We analyze the characteristics of data stream mining and discuss the challenges and research issues of data stream mining. Finally, we present some of the platforms for data stream mining.

    kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data

    The k-Nearest Neighbors classifier is a simple yet effective and widely renowned method in data mining. Applying this model directly in the big data domain is not feasible due to time and memory restrictions. Several distributed alternatives based on MapReduce have been proposed to enable this method to handle large-scale data. However, their performance can be further improved with new designs that fit newly arising technologies. In this work we provide a new solution to perform an exact k-nearest neighbor classification based on Spark. We take advantage of its in-memory operations to classify large amounts of unseen cases against a big training dataset. The map phase computes the k-nearest neighbors in different training data splits. Afterwards, multiple reducers process the definitive neighbors from the lists obtained in the map phase. The key point of this proposal lies in the management of the test set: it is kept in memory when possible; otherwise, it is split into a minimum number of pieces, applying a MapReduce per chunk and using the caching abilities of Spark to reuse the previously partitioned training set. In our experiments we study the differences between Hadoop and Spark implementations with datasets of up to 11 million instances, showing the scaling-up capabilities of the proposed approach. As a result of this work, an open-source Spark package is available.
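    The map/reduce scheme the abstract describes can be illustrated with a minimal, Spark-free sketch: each "mapper" finds the k nearest neighbors within its own training split, and a "reducer" merges the per-split candidate lists into the definitive neighbors before a majority vote. Function names and the Euclidean distance are illustrative assumptions, not the paper's actual implementation.

    ```python
    import heapq

    def map_phase(test_point, train_split, k):
        """Compute the k nearest neighbors of a test point within one training split."""
        dists = [(sum((a - b) ** 2 for a, b in zip(test_point, x)) ** 0.5, y)
                 for x, y in train_split]  # (distance, label) pairs
        return heapq.nsmallest(k, dists)

    def reduce_phase(candidate_lists, k):
        """Merge the per-split candidate lists into the definitive k nearest neighbors."""
        merged = [c for cands in candidate_lists for c in cands]
        return heapq.nsmallest(k, merged)

    def knn_classify(test_point, train_splits, k):
        """Exact kNN over partitioned training data, followed by a majority vote."""
        candidates = [map_phase(test_point, split, k) for split in train_splits]
        neighbors = reduce_phase(candidates, k)
        labels = [y for _, y in neighbors]
        return max(set(labels), key=labels.count)
    ```

    Because each split returns its local top-k and the reducer keeps the global top-k of those candidates, the result is identical to running kNN over the unpartitioned training set, which is what makes the distributed classification exact.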

    Adapted K-Nearest Neighbors for Detecting Anomalies on Spatio–Temporal Traffic Flow

    Outlier detection is an extensive research area, which has been intensively studied in several domains such as biological sciences, medical diagnosis, surveillance, and traffic anomaly detection. This paper explores advances in the outlier detection area by finding anomalies in spatio-temporal urban traffic flow. It proposes a new approach that considers the distribution of the flows in a given time interval. The flow distribution probability (FDP) databases are first constructed from the traffic flows by considering both spatial and temporal information. The outlier detection mechanism is then applied to the incoming flow distribution probabilities: inliers are stored to enrich the FDP databases, while outliers are excluded from them. Moreover, a k-nearest neighbor method for distance-based outlier detection is investigated and adapted for FDP outlier detection. To validate the proposed framework, real data from the Odense traffic flow case are evaluated at ten locations. The results reveal that the proposed framework is able to detect the real distribution of flow outliers. Another experiment has been carried out on Beijing data; the results show that our approach outperforms the baseline algorithms for high-volume urban traffic flow.
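    The inlier/outlier update loop described above can be sketched with the classic distance-based kNN outlier score (distance to the k-th nearest neighbor): points whose score stays below a threshold enrich the database, and points above it are excluded. This is a minimal 1-D sketch under assumed names and a fixed threshold, not the paper's FDP construction.

    ```python
    def knn_outlier_score(point, database, k):
        """Distance to the k-th nearest neighbor; large values indicate outliers."""
        dists = sorted(abs(point - q) for q in database)  # 1-D distance for simplicity
        return dists[k - 1]

    def update_fdp(point, database, k, threshold):
        """Inliers are stored to enrich the database; outliers are rejected."""
        if knn_outlier_score(point, database, k) <= threshold:
            database.append(point)
            return True   # inlier, stored
        return False      # outlier, excluded
    ```

    In the paper's setting the "points" would be flow distribution probabilities rather than scalars, but the enrich-or-exclude mechanism is the same.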

    Real-time big data processing for anomaly detection : a survey

    The advent of connected devices and the omnipresence of the Internet have paved the way for intruders to attack networks, leading to cyber-attacks, financial loss, information theft in healthcare, and cyber war. Hence, network security analytics has become an important area of concern and has of late gained intensive attention among researchers, specifically in the domain of network anomaly detection, which is considered crucial for network security. However, preliminary investigations have revealed that existing approaches to detecting anomalies in networks are not effective enough, particularly in real time. The inefficacy of current approaches is mainly due to the amassment of massive volumes of data through connected devices. Therefore, it is crucial to propose a framework that effectively handles real-time big data processing and detects anomalies in networks. In this regard, this paper attempts to address the issue of detecting anomalies in real time. Accordingly, it surveys the state-of-the-art real-time big data processing technologies related to anomaly detection and the vital characteristics of the associated machine learning algorithms. The paper begins with an explanation of the essential contexts and a taxonomy of real-time big data processing, anomaly detection, and machine learning algorithms, followed by a review of big data processing technologies. Finally, the identified research challenges of real-time big data processing in anomaly detection are discussed. © 2018 Elsevier Ltd.

    Introductory Chapter: Data Streams and Online Learning in Social Media


    Anomaly detection on data streams from vehicular networks

    Vehicular networks are characterized by high-mobility nodes that are only active when the vehicle is moving, making the network unpredictable and in constant change. In such a dynamic scenario, detecting anomalies in the network is a challenging but crucial task. Veniam operates a vehicular network that ensures reliable connectivity through heterogeneous networks such as LTE, Wi-Fi and DSRC, connecting the vehicles to the Internet and to other devices spread throughout the city. Over time, nodes send data to the cloud either by real-time technologies or by delay-tolerant ones, increasing the network's dynamics. The aim of this dissertation is to propose and implement a method for detecting anomalies in a real-world vehicular network through an online analysis of the data streams that come from the vehicles to the cloud. The network's streams were explored in order to characterize the available data and select target use cases. The chosen datasets were submitted to different anomaly detection techniques, such as time series forecasting and density-based outlier detection, followed by an analysis of the trade-offs to select the algorithms that best modeled the data characteristics. The proposed solution comprises two stages: a lightweight screening step, followed by a nearest-neighbor classification. The developed system was implemented on Veniam's distributed cluster running Apache Spark, allowing a fast and scalable solution that classifies the data as soon as it reaches the cloud. 
    The performance of the method was evaluated by its precision, i.e., the percentage of true anomalies within the detected outliers, when it was submitted to datasets presenting artificial anomalies from different data sources, received either by real-time or delay-tolerant technologies.
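    The two-stage design the abstract outlines (a cheap screening pass followed by a nearest-neighbor decision) can be sketched as follows. This is a simplified scalar sketch under assumed names; the screening rule (a z-score against recent history) and the labeled reference sets are illustrative assumptions, not the dissertation's actual pipeline.

    ```python
    import statistics

    def screen(value, history, z_max=3.0):
        """Lightweight screening: flag values that deviate far from recent history."""
        mu = statistics.mean(history)
        sd = statistics.stdev(history) or 1e-9  # guard against zero variance
        return abs(value - mu) / sd > z_max

    def nn_confirm(value, known_anomalies, known_normals):
        """Nearest-neighbor classification of a screened candidate."""
        d_anom = min(abs(value - a) for a in known_anomalies)
        d_norm = min(abs(value - n) for n in known_normals)
        return d_anom < d_norm

    def detect(value, history, known_anomalies, known_normals):
        """Only candidates that pass screening reach the costlier NN step."""
        return screen(value, history) and nn_confirm(value, known_anomalies, known_normals)
    ```

    The point of the split is throughput: most stream elements are rejected by the cheap first stage, so the nearest-neighbor comparison only runs on a small fraction of the incoming data.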

    Artificial intelligence driven anomaly detection for big data systems

    The main goal of this thesis is to contribute to the research on automated performance anomaly detection and interference prediction by implementing Artificial Intelligence (AI) solutions for complex distributed systems, especially for Big Data platforms within cloud computing environments. The late detection and manual resolution of performance anomalies and system interference in Big Data systems may lead to performance violations and financial penalties. Motivated by this issue, we propose AI-based methodologies for anomaly detection and interference prediction tailored to Big Data and containerized batch platforms to better analyze system performance and effectively utilize computing resources within cloud environments. New precise and efficient performance management methods are therefore the key to handling performance anomalies and interference impacts and to improving the efficiency of data center resources. The first part of this thesis contributes to performance anomaly detection for in-memory Big Data platforms. We examine the performance of Big Data platforms and justify our choice of the in-memory Apache Spark platform. An artificial neural network-driven methodology is proposed to detect and classify performance anomalies for batch workloads based on RDD characteristics and operating system monitoring metrics. Our method is evaluated against other popular machine learning (ML) algorithms, as well as on four different monitoring datasets. The results show that our proposed method outperforms the other ML methods, typically achieving 98–99% F-scores. Moreover, we show that a random start instant, a random duration, and overlapped anomalies do not significantly impact the performance of our proposed methodology. The second contribution addresses the challenge of anomaly identification within an in-memory streaming Big Data platform by investigating agile hybrid learning techniques. 
    We develop TRACK (neural neTwoRk Anomaly deteCtion in sparK) and TRACK-Plus, two methods to efficiently train a class of machine learning models for performance anomaly detection using a fixed number of experiments. Our model uses artificial neural networks with Bayesian Optimization (BO) to find the optimal training dataset size and configuration parameters to efficiently train the anomaly detection model to high accuracy. The objective is to accelerate the search for the training dataset size, optimize the neural network configuration, and improve the performance of anomaly classification. A validation based on several datasets from a real Apache Spark Streaming system demonstrates that the proposed methodology can efficiently identify performance anomalies, near-optimal configuration parameters, and a near-optimal training dataset size, while reducing the number of experiments by up to 75% compared with naïve anomaly detection training. The last contribution overcomes the challenges of predicting the completion time of containerized batch jobs and proactively avoiding performance interference by introducing an automated prediction solution to estimate interference among colocated batch jobs within the same computing environment. An AI-driven model is implemented to predict the interference among batch jobs before it occurs within the system. Our interference detection model can estimate and alleviate the task slowdown caused by the interference, assisting system operators in making accurate job placement decisions. Our model is agnostic to the business logic internal to each job. Instead, it is learned from system performance data by applying artificial neural networks to establish the completion time prediction of batch jobs within cloud environments. 
    We compare our model with three other baseline models (a queueing-theoretic model, operational analysis, and an empirical method) on historical measurements of job completion time and CPU run-queue size (i.e., the number of active threads in the system). The proposed model captures multithreading, operating system scheduling, sleeping time, and job priorities. A validation comprising 4,500 experiments based on the DaCapo benchmarking suite was carried out, confirming the predictive efficiency and capabilities of the proposed model, which achieves as low as 10% MAPE compared with the other models.