889 research outputs found

    A Family of Joint Sparse PCA Algorithms for Anomaly Localization in Network Data Streams

    Determining anomalies in data streams that are collected and transformed from various types of networks has recently attracted significant research interest. Principal Component Analysis (PCA) is arguably the most widely applied unsupervised anomaly detection technique for networked data streams due to its simplicity and efficiency. However, none of the existing PCA-based approaches addresses the problem of identifying the sources that contribute most to the observed anomaly, known as anomaly localization. In this paper, we first proposed a novel joint sparse PCA method to perform anomaly detection and localization for network data streams. Our key observation is that we can detect anomalies and localize anomalous sources by identifying a low-dimensional abnormal subspace that captures the abnormal behavior of the data. To better capture the sources of anomalies, we incorporated the structure of the network stream data in our anomaly localization framework. In addition, an extended version of PCA, multidimensional KLE, was introduced to stabilize the localization performance. We performed comprehensive experimental studies on four real-world data sets from different application domains and compared our proposed techniques with several state-of-the-art methods. Our experimental studies demonstrate the utility of the proposed methods.
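
    The idea of detecting an anomaly in an abnormal (residual) subspace and then asking which sources contribute most to it can be illustrated with a minimal sketch. This uses classical PCA on synthetic data, not the joint sparse PCA method of the paper; the function name `pca_residual_scores`, the eight-source synthetic stream, and the injected offset on source 5 are all assumptions made for illustration.

```python
import numpy as np

def pca_residual_scores(X_train, X_test, k):
    """Score test rows by their energy outside a k-dimensional normal subspace."""
    mean = X_train.mean(axis=0)
    # Rows of vt are the principal directions of the centered training data.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    P = vt[:k].T                                # (sources, k) normal-subspace basis
    centered = X_test - mean
    residual = centered - centered @ P @ P.T    # part lying in the abnormal subspace
    contrib = residual ** 2                     # per-source squared residual
    return contrib.sum(axis=1), contrib

# Synthetic stream (assumption): 8 sources driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
angles = 2 * np.pi * np.arange(8) / 8
mixing = np.vstack([np.cos(angles), np.sin(angles)])          # shape (2, 8)
X = rng.normal(size=(500, 2)) @ mixing + 0.05 * rng.normal(size=(500, 8))

# One test sample that follows the normal pattern except for source 5.
x_bad = rng.normal(size=(1, 2)) @ mixing
x_bad[0, 5] += 3.0

spe, contrib = pca_residual_scores(X, x_bad, k=2)
culprit = int(np.argmax(contrib[0]))            # source with the largest residual
print("anomaly score:", spe[0], "most suspicious source:", culprit)
```

    Ranking sources by squared residual is only a rough localization heuristic, because a dense projection spreads an anomaly across several sources; that limitation is what motivates sparse variants such as the one the abstract describes.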

    Learning from Structured Data with High Dimensional Structured Input and Output Domain

    Structured data is accumulated rapidly in many applications, e.g. Bioinformatics, Cheminformatics, social network analysis, natural language processing and text mining. Designing and analyzing algorithms for handling these large collections of structured data has received significant interest in the data mining and machine learning communities, both in the input and output domain. However, it is nontrivial to adapt traditional machine learning algorithms, e.g. SVM and linear regression, to structured data. For one thing, the structural information in the input domain and output domain is ignored if the normal algorithms are applied to structured data. For another, the major challenge in learning from high-dimensional structured data is that the input/output domain can contain tens of thousands of features and labels, or even more. With a high-dimensional structured input space and/or structured output space, learning a low-dimensional and consistent structured predictive function is important for both robustness and interpretability of the model. In this dissertation, we will present a few machine learning models that learn from data with structured input features and structured output tasks. For learning from data with structured input features, I have developed structured sparse boosting for graph classification and structured joint sparse PCA for anomaly detection and localization. Besides learning from structured input, I also investigated the interplay between structured input and output in the context of multi-task learning. In particular, I designed a multi-task learning algorithm that performs structured feature selection and task relationship inference. We will demonstrate the applications of these structured models on subgraph-based graph classification, networked data stream anomaly detection/localization, multiple cancer type prediction, neuron activity prediction and social behavior prediction. Finally, through my internship work at IBM T.J. Watson Research, I will demonstrate how to leverage structural information from mobile data (e.g. call detail records and GPS data) to derive important places from people's daily lives for transit optimization and urban planning.

    New Methods for Network Traffic Anomaly Detection

    In this thesis we examine the efficacy of applying outlier detection techniques to understand the behaviour of anomalies in communication network traffic. We have identified several shortcomings. Our main finding is that known techniques focus on characterizing either the spatial or the temporal behaviour of traffic, but rarely both. For example, DoS attacks are anomalies which violate temporal patterns, while port scans violate the spatial equilibrium of network traffic. To address this observed weakness we have designed a new method for outlier detection based on spectral decomposition of the Hankel matrix. The Hankel matrix is a spatio-temporal correlation matrix and has been used in many other domains, including climate data analysis and econometrics. Using our approach we can seamlessly integrate the discovery of both spatial and temporal anomalies. Comparison with other state-of-the-art methods in the networks community confirms that our approach can discover both DoS and port scan attacks. The spectral decomposition of the Hankel matrix is closely tied to the problem of inference in Linear Dynamical Systems (LDS). We introduce a new problem, the Online Selective Anomaly Detection (OSAD) problem, to model the situation where the objective is to report new anomalies in the system and suppress known faults. For example, in the network setting an operator may be interested in triggering an alarm for malicious attacks but not for faults caused by equipment failure. In order to solve OSAD we combine techniques from machine learning and control theory in a unique fashion. Machine learning ideas are used to learn the parameters of an underlying data generating system. Control theory techniques are used to model the feedback and modify the residual generated by the data generating state model. Experiments on synthetic and real data sets confirm that the OSAD problem captures a general scenario and tightly integrates machine learning and control theory to solve a practical problem.
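
    As a rough illustration of outlier scoring through spectral decomposition of a Hankel matrix, the sketch below builds a Hankel (lagged-window) matrix from a single time series, keeps its leading singular components, and scores each time step by the reconstruction error. It is a simplified single-series example on synthetic data, not the thesis's method; the helper names `hankel_matrix` and `hankel_residual`, the window length, the rank, and the injected spike are assumptions.

```python
import numpy as np

def hankel_matrix(series, window):
    """Stack lagged windows so that entry (i, j) equals series[i + j]."""
    n_rows = len(series) - window + 1
    return np.array([series[i:i + window] for i in range(n_rows)])

def hankel_residual(series, window, rank):
    """Per-time-step error of a rank-limited reconstruction of the Hankel matrix."""
    series = np.asarray(series, dtype=float)
    H = hankel_matrix(series, window)
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    H_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # dominant (regular) dynamics
    err = (H - H_low) ** 2
    # Average the squared error over every matrix entry that holds time step t.
    score = np.zeros(len(series))
    count = np.zeros(len(series))
    for i in range(H.shape[0]):
        score[i:i + window] += err[i]
        count[i:i + window] += 1
    return score / count

# Synthetic traffic (assumption): a periodic signal with one spike at t = 120.
t = np.arange(200)
series = np.sin(2 * np.pi * t / 25)
series[120] += 5.0

scores = hankel_residual(series, window=20, rank=2)
spike_at = int(np.argmax(scores))
print("highest-scoring time step:", spike_at)
```

    A pure sinusoid yields a rank-2 Hankel matrix, which is why `rank=2` is used here; real traffic would need the rank chosen from the singular value spectrum.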

    Cross-product Penalized Component Analysis (XCAN)

    Matrix factorization methods are extensively employed to understand complex data. In this paper, we introduce the cross-product penalized component analysis (XCAN), a sparse matrix factorization based on the optimization of a loss function that allows a trade-off between variance maximization and structural preservation. The approach is based on previous developments, notably (i) the Sparse Principal Component Analysis (SPCA) framework based on the LASSO, (ii) extensions of SPCA to constrain both modes of the factorization, like co-clustering or the Penalized Matrix Decomposition (PMD), and (iii) the Group-wise Principal Component Analysis (GPCA) method. The result is a flexible modeling approach that can be used for data exploration in a large variety of problems. We demonstrate its use with applications from different disciplines

    Modeling and performance evaluation of stealthy false data injection attacks on smart grid in the presence of corrupted measurements

    The false data injection (FDI) attack cannot be detected by the traditional anomaly detection techniques used in energy system state estimators. In this paper, we demonstrate how FDI attacks can be constructed blindly, i.e., without system knowledge, including topological connectivity and line reactance information. Our analysis reveals that existing FDI attacks become detectable (and consequently unsuccessful) by the state estimator if the data contains grossly corrupted measurements, such as those caused by device malfunction and communication errors. The proposed sparse-optimization-based stealthy attack construction strategy overcomes this limitation by separating the gross errors from the measurement matrix. Extensive theoretical modeling and experimental evaluation show that the proposed technique performs more stealthily (has less relative error) and efficiently (fast enough to meet the time requirement) compared to other methods on IEEE benchmark test systems. Comment: Keywords: Smart grid, False data injection, Blind attack, Principal component analysis (PCA), Journal of Computer and System Sciences, Elsevier, 201

    Feature Space Modeling for Accurate and Efficient Learning From Non-Stationary Data

    A non-stationary dataset is one whose statistical properties, such as the mean, variance, correlation, and probability distribution, change over a specific interval of time. In contrast, a stationary dataset is one whose statistical properties remain constant over time. Apart from the volatile statistical properties, non-stationary data poses other challenges such as time and memory management, because computational resources are limited while recent advancements in data collection technologies generate a variety of data at an alarming pace and volume. Additionally, when the collected data is complex, managing data complexity, emerging from its dimensionality and heterogeneity, can pose another challenge for effective computational learning. The problem is to enable accurate and efficient learning from non-stationary data in a continuous fashion over time while facing and managing the critical challenges of time, memory, concept change, and complexity simultaneously. Feature space modeling is one of the most effective solutions to address this problem. For non-stationary data, selecting relevant features is even more critical than for stationary data, because reducing the feature dimension can ensure the best use of computational resources and allow data mining algorithms to achieve higher accuracy and efficiency. In this dissertation, we investigated a variety of feature space modeling techniques to improve the overall performance of data mining algorithms. In particular, we built a Relief-based feature subset selection method in combination with data complexity analysis to improve the classification performance using ovarian cancer image data collected in a non-stationary batch mode. We also collected time series health sensor data in a streaming environment and deployed feature space transformation using Singular Value Decomposition (SVD). This led to reduced dimensionality of the feature space, resulting in better accuracy and efficiency produced by the Density Ratio Estimation Method in identifying potential change points in data over time. We have also built an unsupervised feature space model using matrix factorization and Lasso Regression, which was successfully deployed in conjunction with Relative Density Ratio Estimation to address botnet attacks in a non-stationary environment. The Relief-based feature model improved the accuracy of the Fuzzy Forest classifier by 16%. For the change detection framework, we observed a 9% improvement in accuracy for PCA feature transformation. Due to the unsupervised feature selection model, for 2% and 5% malicious traffic ratios, the proposed botnet detection framework exhibited on average 20% better accuracy than One Class Support Vector Machine (OSVM) and on average 25% better accuracy than Autoencoder. All these results successfully demonstrate the effectiveness of these feature space models. The fundamental theme that repeats itself in this dissertation is modeling an efficient feature space to improve both accuracy and efficiency of selected data mining models. Every contribution in this dissertation has been subsequently and successfully employed to capitalize on those advantages to solve real-world problems. Our work bridges the concepts from multiple disciplines in effective and surprising ways, leading to new insights, new frameworks, and ultimately to a cross-production of diverse fields like mathematics, statistics, and data mining.

    Incremental Sparse-PCA Feature Extraction For Data Streams

    Intruders attempt to penetrate commercial systems daily and cause considerable financial losses for individuals and organizations. Intrusion detection systems monitor network events to detect computer security threats. An extensive amount of network data is devoted to detecting malicious activities. Storing, processing, and analyzing the massive volume of data is costly and indicates the need to find efficient methods to perform network data reduction that do not require the data to be first captured and stored. A better approach allows the extraction of useful variables from data streams in real time and in a single pass. The removal of irrelevant attributes reduces the data to be fed to the intrusion detection system (IDS) and shortens the analysis time while improving the classification accuracy. This dissertation introduces an online, real-time data processing method for knowledge extraction. This incremental feature extraction is based on two approaches. First, Chunk Incremental Principal Component Analysis (CIPCA) detects intrusion in data streams. Then, two novel incremental feature extraction methods, Incremental Structured Sparse PCA (ISSPCA) and Incremental Generalized Power Method Sparse PCA (IGSPCA), find malicious elements. Metrics helped compare the performance of all methods. The IGSPCA was found to perform as well as or better than CIPCA overall in terms of dimensionality reduction, classification accuracy, and learning time. ISSPCA yielded better results for higher chunk values and greater accumulation ratio thresholds. CIPCA and IGSPCA reduced the IDS dataset to 10 principal components as opposed to 14 eigenvectors for ISSPCA. ISSPCA is more expensive in terms of learning time in comparison to the other techniques. This dissertation presents new methods that perform feature extraction from continuous data streams to find the small number of features necessary to express the most data variance. Data subsets derived from a few important variables render their interpretation easier. Another goal of this dissertation was to propose incremental sparse PCA algorithms capable of processing data with concept drift and concept shift. Experiments using WaveForm and WaveFormNoise datasets confirmed this ability. Similar to CIPCA, the ISSPCA and IGSPCA updated eigen-axes as a function of the accumulation ratio value, forming an informative eigenspace with few eigenvectors.
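
    The role of an accumulation ratio threshold in choosing how many eigen-axes to keep can be sketched as follows. The sketch assumes the common definition of the accumulation ratio as the share of total variance captured by the leading eigenvalues; the dissertation's exact update rule is not given in the abstract, and the function name `components_for_ratio` and the example eigenvalues are hypothetical.

```python
def components_for_ratio(eigenvalues, threshold):
    """Smallest number of leading eigenvalues whose variance share reaches threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)

# Hypothetical eigenvalue spectrum; a higher threshold keeps more eigen-axes.
eigenvalues = [4.0, 3.0, 2.0, 1.0]
print(components_for_ratio(eigenvalues, 0.5))
print(components_for_ratio(eigenvalues, 0.85))
```

    In an incremental setting this check would be repeated after each chunk, adding an eigen-axis only when the retained axes no longer reach the threshold.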