415 research outputs found

    Efficient classification using parallel and scalable compressed model and Its application on intrusion detection

    To achieve highly efficient classification in intrusion detection, this paper proposes a compressed model that combines horizontal compression with vertical compression. OneR is used as the horizontal compression step for attribute reduction, and affinity propagation is employed as the vertical compression step to select a small set of representative exemplars from large training data. To compress larger volumes of training data scalably, a MapReduce-based parallelization approach is then implemented and evaluated for each step of the compression process, so that common but efficient classification methods can be applied directly to the compressed data. An experimental study on two publicly available intrusion detection datasets, KDD99 and CMDC2012, demonstrates that classification with the proposed compressed model can speed up detection by up to 184 times, at the cost of an accuracy difference of less than 1% on average.
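
The attribute-reduction step described in this abstract can be illustrated with a minimal sketch. It assumes categorical attributes and ranks them by the training error of a one-attribute (OneR) rule; the function names, the toy rows, and the choice to keep the top `k` attributes are hypothetical and are not taken from the paper, which does not state these details in the abstract.

```python
from collections import Counter, defaultdict


def one_r_error(rows, labels, attr):
    """Training error of the OneR rule built on a single attribute.

    For each value of the attribute, the rule predicts the majority
    class among the rows that carry that value.
    """
    by_value = defaultdict(Counter)
    for row, label in zip(rows, labels):
        by_value[row[attr]][label] += 1
    correct = sum(max(counts.values()) for counts in by_value.values())
    return 1.0 - correct / len(labels)


def rank_attributes(rows, labels):
    """Attribute indices ordered from lowest to highest OneR error."""
    n_attrs = len(rows[0])
    errors = [(one_r_error(rows, labels, a), a) for a in range(n_attrs)]
    return [a for _, a in sorted(errors)]


def keep_top_attributes(rows, labels, k):
    """Keep only the k best-ranked attributes (assumed selection policy)."""
    keep = sorted(rank_attributes(rows, labels)[:k])
    return keep, [tuple(row[a] for a in keep) for row in rows]


# Hypothetical toy connection records: (protocol, service, flag).
rows = [
    ("tcp", "http", "SF"),
    ("tcp", "http", "REJ"),
    ("udp", "dns", "SF"),
    ("tcp", "ftp", "REJ"),
    ("udp", "dns", "SF"),
    ("tcp", "http", "SF"),
]
labels = ["normal", "attack", "normal", "attack", "normal", "normal"]

kept, reduced = keep_top_attributes(rows, labels, k=2)
print(kept)        # indices of the retained attributes
print(reduced[0])  # first record after reduction
```

Continuous attributes would need to be discretized before a rule like this can be built, and the exemplar-selection step (affinity propagation) is not shown here.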

    Security in Data Mining- A Comprehensive Survey

    Data mining techniques allow individuals to extract hidden knowledge, but they also introduce a number of privacy threats. In this paper, we study some of these issues along with a detailed discussion of the applications of various data mining techniques for providing security. An efficient classification technique, when used properly, would allow a user to differentiate between a phishing website and a normal website, to classify users as normal users or criminals based on their activities on social networks (crime profiling), and to prevent users from executing malicious code by labelling it as malicious. One of the most important applications of data mining is intrusion detection, where different data mining techniques can be applied to detect an intrusion effectively and report it in real time so that necessary actions can be taken to thwart the intruder's attempts. Privacy preservation, outlier detection, anomaly detection, and phishing website classification are discussed in this paper.

    Distributed analysis of vertically partitioned sensor measurements under communication constraints

    Nowadays, large amounts of data are automatically generated by devices and sensors. They measure, for instance, parameters of production processes, environmental conditions of transported goods, energy consumption of smart homes, traffic volume, air pollution and water consumption, or pulse and blood pressure of individuals. The collection and transmission of data is enabled by electronics, software, sensors and network connectivity embedded into physical objects. The objects and the infrastructure connecting them are called the Internet of Things (IoT). By 2010, 12.5 billion devices were already connected to the IoT, a number about twice as large as the world's population at that time. The IoT provides us with data about our physical environment at a level of detail never known before in human history. Understanding such data creates opportunities to improve our way of living, learning, working, and entertaining. For instance, the information obtained from data analysis modules embedded into existing processes could help to optimize them, leading to more sustainable systems which save resources in sectors such as manufacturing, logistics, energy and utilities, the public sector, or healthcare. The IoT's inherently distributed nature, the resource constraints and dynamism of its networked participants, as well as the amounts and diverse types of data collected, challenge even the most advanced automated data analysis methods known today. Currently, there is a strong research focus on the centralization of all data in the cloud, processing it according to the paradigm of parallel high-performance computing. However, the resources of devices and sensors at the data-generating side might not suffice to transmit all data.
For instance, pervasive distributed systems such as wireless sensor networks are highly communication-constrained, as are streaming high-throughput applications, or those where data masses are simply too large to be sent over existing communication lines, like satellite connections. Hence, the IoT requires a new generation of distributed algorithms which are resource-aware and intelligently reduce the amount of data transmitted and processed throughout the analysis chain. This thesis deals with the distributed analysis of vertically partitioned sensor measurements under communication constraints, which is a particularly challenging scenario. Here, it is not the observations that are distributed over nodes, but their feature values. The learning of accurate prediction models may require the combination of information from different nodes, necessarily leading to communication. The main question is how to design communication-efficient algorithms for this scenario while preserving sufficient accuracy. The first part of the thesis introduces fundamental concepts. An overview of the IoT and its many applications is given, with a special focus on data analysis, the vertically partitioned data scenario, and accompanying research questions. Then, basic notions of machine learning and data mining are introduced. A selection of existing distributed data mining approaches is presented and discussed in more detail. Distributed learning in the vertically partitioned data scenario is then motivated by a smart manufacturing case study. In a hot rolling mill, different machines assess parameters describing the processing of single steel blocks, whose quality should be predicted as early as possible by analyzing the distributed measurements. Each machine creates not a single value series, but many of them. Their heterogeneity leads to challenging questions concerning preprocessing and finding a good representation for learning, for which solutions are proposed.
Another problem is that quality information is not given for individual blocks, but for charges of blocks. How can we nevertheless predict the quality of individual blocks? Time constraints lead to questions typical for the vertically partitioned data scenario. Which data should be analyzed locally to meet the constraints, and which should be sent to a central server? Learning from aggregated label information is a relatively novel problem in machine learning research. A new algorithm for the task is developed and evaluated, the Learning from Label Proportions by Clustering (LLPC) algorithm. The algorithm's performance is compared to three other state-of-the-art approaches in terms of accuracy and running time. It can be shown that LLPC achieves results with lower running time, while its accuracy is comparable to that of its competitors, or significantly higher. The proposed algorithm comes with many other benefits, like ease of implementation and a small memory footprint. For highly decentralized systems, the Training of Local Models from (Label) Counts (TLMC) algorithm is proposed. The method builds on LLPC, reducing communication by transferring only label counts for batches of observations between nodes. Feasibility of the approach is demonstrated by evaluating the algorithm's performance in the context of traffic flow prediction. It is shown that TLMC is much more communication-efficient than centralization of all data, but that its accuracy can nevertheless compete with that of a centrally trained global model. Finally, a communication-efficient distributed algorithm for anomaly detection is proposed, the Vertically Distributed Core Vector Machine (VDCVM). It can be shown that the proposed algorithm communicates up to an order of magnitude less data during learning in comparison to another state-of-the-art approach, or to training a global model by centralizing all data.
Nevertheless, in many relevant cases, the VDCVM achieves similar or even higher accuracy on several controlled and benchmark datasets. A main result of the thesis is that communication-efficient learning is possible in cases where features from different nodes are conditionally independent, given the target value to be predicted. Most efficient are local models, which exchange label information between nodes. In comparison to consensus algorithms, which transmit labels repeatedly, TLMC sends labels only once between nodes. Communication could be reduced even further by learning from counts of labels. In the context of traffic flow prediction, the accuracy achieved is still sufficient in comparison to centralizing all data and training a global model. In the case of anomaly detection, similar results could be achieved by utilizing a sampling approach which draws only as many observations as needed to reach a (1+ε)-approximation of the minimum enclosing ball (MEB). The developed approaches have many applications in communication-constrained settings in the sectors mentioned above. It has been shown that data can be reduced and learned from before it even enters the cloud. Decentralized processing might thus enable the analysis of big data masses as more and more devices are connected to the IoT.
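
To make the notion of a (1+ε)-approximation of the minimum enclosing ball concrete, the following is a minimal sketch of the simple farthest-point iteration by Badoiu and Clarkson. This is a swapped-in textbook technique, not the VDCVM or the sampling scheme of the thesis, and it runs on centralized data. The function name `approx_meb` and the toy points are hypothetical.

```python
import math


def approx_meb(points, eps):
    """Approximate the minimum enclosing ball of a finite point set.

    Repeatedly moves the center a shrinking step toward the farthest
    point. After ceil(1 / eps**2) steps, the ball around the returned
    center has radius at most (1 + eps) times the optimal radius.
    """
    if not points:
        raise ValueError("points must not be empty")
    if not 0.0 < eps <= 1.0:
        raise ValueError("eps must lie in (0, 1]")

    center = list(points[0])  # arbitrary starting point
    iterations = math.ceil(1.0 / (eps * eps))
    for i in range(1, iterations + 1):
        farthest = max(points, key=lambda p: math.dist(p, center))
        step = 1.0 / (i + 1)
        center = [c + step * (q - c) for c, q in zip(center, farthest)]

    radius = max(math.dist(p, center) for p in points)
    return tuple(center), radius


# Hypothetical toy data: the exact MEB has center (1, 0) and radius 1.
toy_points = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (1.0, -1.0)]
center, radius = approx_meb(toy_points, eps=0.1)
print(center, radius)
```

Only the points that are ever selected as "farthest" influence the center, which is the intuition behind core-set methods: a small subset of observations can be enough to reach the approximation guarantee.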

    Challenges for Data Mining on Sensor Data of Interlinked Processes

    In industries like steel production, interlinked production processes leave no time for assessing the physical quality of intermediate products. Failures during the process can lead to high internal costs when already defective products are passed through the entire value chain. However, process data such as machine parameters and sensor data, which are directly linked to quality, can be recorded. Based on a rolling mill case study, the paper discusses how decentralized data mining and intelligent machine-to-machine communication could be used to predict the physical quality of intermediate products online and in real time, so that quality issues are detected as early as possible. The recording of huge data masses and the distributed but sequential nature of the problem lead to challenging research questions for the next generation of data mining.

    Exploring Machine Learning Models for Federated Learning: A Review of Approaches, Performance, and Limitations

    In the growing world of artificial intelligence, federated learning is a distributed learning framework designed to preserve the privacy of individuals' data. Federated learning lays the groundwork for collaborative research in areas where the data is sensitive, and it has several implications for real-world problems. In times of crisis, when real-time decision-making is critical, federated learning allows multiple entities to work collectively without sharing sensitive data. This distributed approach makes it possible to leverage information from multiple sources and gain more diverse insights. This paper is a systematic review of the literature on privacy-preserving machine learning from the last few years, based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Specifically, we present an extensive review of supervised/unsupervised machine learning algorithms, ensemble methods, meta-heuristic approaches, blockchain technology, and reinforcement learning used in the framework of federated learning, in addition to an overview of federated learning applications. This paper reviews the literature on the components of federated learning and its applications in the last few years. The main purpose of this work is to provide researchers and practitioners with a comprehensive overview of federated learning from the machine learning point of view. A discussion of some open problems and future research directions in federated learning is also provided.

    An Efficient Deep-Learning-Based Detection and Classification System for Cyber-Attacks in IoT Communication Networks

    With the rapid expansion of intelligent resource-constrained devices and high-speed communication technologies, the Internet of Things (IoT) has earned wide recognition as the primary standard for low-power lossy networks (LLNs). Nevertheless, IoT infrastructures are vulnerable to cyber-attacks due to the constraints in computation, storage, and communication capacity of the endpoint devices. On the one hand, the majority of newly developed cyber-attacks are formed by slightly mutating previously established cyber-attacks to produce a new attack that tends to be treated as normal traffic in the IoT network. On the other hand, coupling deep learning techniques with cybersecurity has become a recent trend in many security applications due to their impressive performance. In this paper, we provide the comprehensive development of a new intelligent and autonomous deep-learning-based detection and classification system for cyber-attacks in IoT communication networks that leverages the power of convolutional neural networks, abbreviated as IoT-IDCS-CNN (IoT based Intrusion Detection and Classification System using Convolutional Neural Network). The proposed IoT-IDCS-CNN makes use of high-performance computing that employs the robust Compute Unified Device Architecture (CUDA) based Nvidia GPUs (Graphics Processing Units) and parallel processing that employs high-speed I9-core-based Intel CPUs. In particular, the proposed system is composed of three subsystems: a feature engineering subsystem, a feature learning subsystem, and a traffic classification subsystem. All subsystems were developed, verified, integrated, and validated in this research. To evaluate the developed system, we employed the Network Security Laboratory-Knowledge Discovery Databases (NSL-KDD) dataset, which includes all the key attacks in IoT computing.
The simulation results demonstrated cyber-attack classification accuracy above 99.3% for the binary-class classifier (normal vs. anomaly) and above 98.2% for the multiclass classifier (five categories). The proposed system was validated using a K-fold cross-validation method and was evaluated using the confusion matrix parameters (i.e., true negative (TN), true positive (TP), false negative (FN), false positive (FP)), along with other classification performance metrics, including precision, recall, F1-score, and false alarm rate. In testing and evaluation, the IoT-IDCS-CNN system outperformed many recent machine-learning-based IDCS systems in the same area of study.
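
The evaluation metrics named in this abstract follow directly from the four confusion matrix counts. The sketch below uses the standard textbook definitions for the binary case; the class and function names and the example counts are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion matrix counts (attack = positive class)."""
    tp: int  # attacks correctly flagged
    tn: int  # normal traffic correctly passed
    fp: int  # normal traffic wrongly flagged
    fn: int  # attacks that were missed


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(c):
    """Compute standard metrics from confusion matrix counts."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        # False alarm rate: share of normal traffic flagged as attack.
        "false_alarm_rate": _ratio(c.fp, c.fp + c.tn),
    }


# Made-up counts for illustration only.
example = ConfusionCounts(tp=90, tn=80, fp=20, fn=10)
for name, value in classification_metrics(example).items():
    print(f"{name}: {value:.3f}")
```

For a multiclass classifier, the same quantities are usually computed per class in a one-vs-rest fashion and then averaged.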