
    Online feature extraction based on accelerated kernel principal component analysis for data stream

    Kernel principal component analysis (KPCA) is a well-known nonlinear feature extraction method. Takeuchi et al. proposed an incremental variant of KPCA (IKPCA) that can update an eigen-space incrementally for a sequence of data. In IKPCA, however, the eigenvalue decomposition must be carried out for every single data point, even when a chunk of data arrives at one time. To reduce the computational cost of learning chunk data, this paper proposes an extended IKPCA called Chunk IKPCA (CIKPCA), in which a chunk of multiple data points is learned with a single eigenvalue decomposition. For a large data chunk, to further reduce computation time and memory usage, the chunk is first divided into several smaller chunks, and only useful data are selected based on the accumulation ratio. In the proposed CIKPCA, a small set of independent data is first selected from the reduced data so that eigenvectors in a high-dimensional feature space can be represented as a linear combination of these independent data. Then, the eigenvectors are incrementally updated while keeping only an eigenspace model consisting of the independent data, coefficients, eigenvalues, and mean information. The proposed CIKPCA can augment an eigen-feature space based on the accumulation ratio, which can itself be updated without keeping all the past data, and the eigen-feature space is rotated by solving an eigenvalue problem once per data chunk. Experimental results show that the learning time of the proposed CIKPCA is greatly reduced compared with KPCA and IKPCA, without sacrificing recognition accuracy.
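    To make the role of the accumulation ratio concrete, the sketch below shows, in plain Python and in ordinary (linear) input space, how such a ratio can gauge how much variance the kept eigen-axes explain, and how a chunk can be filtered down to the vectors the current eigenspace explains poorly. All names are illustrative; the actual CIKPCA performs this selection on independent data in the kernel-induced feature space.

```python
import numpy as np

def accumulation_ratio(eigvals, k):
    """Fraction of the total variance captured by the k leading
    eigenvalues; CIKPCA uses this ratio to decide when the
    eigen-feature space must be augmented with new axes."""
    eigvals = np.sort(eigvals)[::-1]
    return eigvals[:k].sum() / eigvals.sum()

def filter_chunk(chunk, basis, threshold=0.99):
    """Drop chunk vectors already well represented by the current basis.
    chunk: (n, d) data; basis: (d, k) orthonormal eigenvectors.
    An illustrative linear-space stand-in for the paper's kernel-space
    selection of useful data."""
    proj = basis @ (basis.T @ chunk.T)                 # reconstruction within the span
    explained = (np.linalg.norm(proj, axis=0)
                 / (np.linalg.norm(chunk.T, axis=0) + 1e-12))
    return chunk[explained < threshold]                # keep poorly explained vectors
```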

    Incremental Sparse-PCA Feature Extraction For Data Streams

    Intruders attempt to penetrate commercial systems daily and cause considerable financial losses for individuals and organizations. Intrusion detection systems monitor network events to detect computer security threats. An extensive amount of network data must be processed to detect malicious activities. Storing, processing, and analyzing this massive volume of data is costly, which indicates the need for efficient network data reduction methods that do not require the data to be first captured and stored. A better approach allows the extraction of useful variables from data streams in real time and in a single pass. The removal of irrelevant attributes reduces the data fed to the intrusion detection system (IDS) and shortens the analysis time while improving the classification accuracy. This dissertation introduces an online, real-time data processing method for knowledge extraction. This incremental feature extraction is based on two approaches. First, Chunk Incremental Principal Component Analysis (CIPCA) detects intrusions in data streams. Then, two novel incremental feature extraction methods, Incremental Structured Sparse PCA (ISSPCA) and Incremental Generalized Power Method Sparse PCA (IGSPCA), find malicious elements. Several metrics were used to compare the performance of all methods. IGSPCA was found to perform as well as or better than CIPCA overall in terms of dimensionality reduction, classification accuracy, and learning time. ISSPCA yielded better results for higher chunk values and greater accumulation ratio thresholds. CIPCA and IGSPCA reduced the IDS dataset to 10 principal components, as opposed to 14 eigenvectors for ISSPCA. ISSPCA is more expensive in terms of learning time than the other techniques. This dissertation presents new methods that perform feature extraction from continuous data streams to find the small number of features necessary to express most of the data variance. Data subsets derived from a few important variables are easier to interpret. Another goal of this dissertation was to propose incremental sparse PCA algorithms capable of processing data with concept drift and concept shift; experiments using the WaveForm and WaveFormNoise datasets confirmed this ability. Similar to CIPCA, ISSPCA and IGSPCA updated the eigen-axes as a function of the accumulation ratio value, forming an informative eigenspace with few eigenvectors.
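    The generalized power method for sparse PCA, which IGSPCA builds on, can be illustrated with a short batch sketch: a power iteration whose update is soft-thresholded so the loading vector stays sparse. This is a simplified stand-in under stated assumptions (the function name, the gamma parameter and defaults are illustrative), not the dissertation's incremental algorithm.

```python
import numpy as np

def sparse_pc(X, gamma=0.1, n_iter=200, tol=1e-8):
    """One sparse principal loading via a soft-thresholded power
    iteration, in the spirit of the generalized power method for
    sparse PCA. X: centered (n_samples, n_features) data;
    gamma: relative sparsity level in [0, 1)."""
    S = X.T @ X                                        # (unnormalized) covariance
    rng = np.random.default_rng(0)
    v = rng.standard_normal(S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = S @ v                                      # power step
        w = np.sign(w) * np.maximum(np.abs(w) - gamma * np.abs(w).max(), 0.0)
        norm = np.linalg.norm(w)
        if norm == 0.0:                                # gamma chosen too aggressively
            break
        w /= norm
        if np.linalg.norm(w - v) < tol:
            v = w
            break
        v = w
    return v                                           # sparse loading vector
```

    Further components can be obtained by deflating S with each recovered loading and repeating the iteration.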

    Dynamic Data Mining: Methodology and Algorithms

    Supervised data stream mining has become an important and challenging data mining task in modern organizations. The key challenges are threefold: (1) a possibly infinite number of streaming examples and time-critical analysis constraints; (2) concept drift; and (3) skewed data distributions. To address these three challenges, this thesis proposes the novel dynamic data mining (DDM) methodology by effectively applying supervised ensemble models to data stream mining. DDM can be loosely defined as the categorization-organization-selection of supervised ensemble models. It is inspired by the idea that although the underlying concepts in a data stream are time-varying, their distinctions can be identified, so models trained on the distinct concepts can be dynamically selected to classify incoming examples of similar concepts. First, following the general paradigm of DDM, we examine the different concept-drifting stream mining scenarios and propose corresponding effective and efficient data mining algorithms.
    • To address concept drift caused merely by changes of variable distributions, which we term pseudo concept drift, base models built on categorized streaming data are organized and selected in line with their corresponding variable distribution characteristics.
    • To address concept drift caused by changes of variable and class joint distributions, which we term true concept drift, an effective data categorization scheme is introduced, and a group of working models is dynamically organized and selected to react to the drifting concept.
    Second, we introduce an integration stream mining framework that enables the paradigm advocated by DDM to be widely applicable to other stream mining problems; with it, we easily introduce six effective algorithms for mining data streams with skewed class distributions. In addition, we introduce a new ensemble model approach for batch learning that follows the same methodology. Both theoretical and empirical studies demonstrate its effectiveness. Future work will target improving the effectiveness and efficiency of the proposed algorithms; meanwhile, we will explore the possibility of using the integration framework to solve other open stream mining research problems.
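    A minimal sketch of the categorization-organization-selection idea follows: a pool of models trained on distinct concepts is kept, and for each prediction the members that score best on a sliding window of recent labeled examples are selected and combined. The class, its interface (scikit-learn-style predict/score) and all names are hypothetical, not the thesis's actual algorithms.

```python
from collections import deque

class DynamicEnsemble:
    """Illustrative pool of concept-specific models: organize the pool,
    then select the members best matched to the current concept, as
    judged on a window of recent labeled examples."""

    def __init__(self, max_models=10, window=500):
        self.pool = deque(maxlen=max_models)   # organized model pool
        self.recent = deque(maxlen=window)     # recent (x, y) pairs

    def add_model(self, model):
        self.pool.append(model)                # a model trained on one concept

    def observe(self, x, y):
        self.recent.append((x, y))             # update window once the label arrives

    def predict(self, x, top_k=3):
        if not self.recent:                    # no evidence yet: trust the newest model
            return self.pool[-1].predict(x)
        X_val, y_val = zip(*self.recent)
        # select: rank pool members by accuracy on the recent window
        ranked = sorted(self.pool,
                        key=lambda m: m.score(list(X_val), list(y_val)),
                        reverse=True)[:top_k]
        # combine the selected models by majority vote
        votes = [m.predict(x) for m in ranked]
        return max(set(votes), key=votes.count)
```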

    Towards On-line Domain-Independent Big Data Learning: Novel Theories and Applications

    Feature extraction is an extremely important pre-processing step for pattern recognition and machine learning problems. This thesis highlights how one can best extract features from data in a fully online and purely adaptive manner. The solution to this problem is given for both labeled and unlabeled datasets, by presenting a number of novel on-line learning approaches. Specifically, the differential equation method for solving the generalized eigenvalue problem is used to derive a number of novel machine learning and feature extraction algorithms.
    In the first key contribution, the incremental eigen-solution method is used to derive a novel incremental extension of linear discriminant analysis (LDA), which is combined with an extreme learning machine (ELM) serving as a preprocessor before learning. The dynamic random expansion characteristic of ELM, combined with the proposed incremental LDA technique, is shown to offer a significant improvement in maximizing the discrimination between points in two different classes, while minimizing the distance within each class, in comparison with other standard state-of-the-art incremental and batch techniques.
    In the second contribution, the differential equation method for solving the generalized eigenvalue problem is used to derive a novel, purely incremental version of the slow feature analysis (SFA) algorithm, termed the generalized eigenvalue based slow feature analysis (GENEIGSFA) technique. The time-series expansions of an echo state network (ESN) and of radial basis functions (RBFs) serve as pre-processors before learning, and higher-order derivatives are used as a smoothing constraint on the output signal. Finally, an online extension of the generalized eigenvalue problem, derived from James Stone's criterion, is tested, evaluated and compared with the standard batch version of slow feature analysis to demonstrate its comparative effectiveness.
    In the third contribution, light-weight extensions of canonical correlation analysis (CCA) for both twinned and multiple data streams are derived using the same method of solving the generalized eigenvalue problem. The proposed method is enhanced by maximizing the covariance between data streams while simultaneously maximizing the rate of change of variances within each data stream. A recurrent set of connections, as used by the ESN, acts as a pre-processor between the inputs and the canonical projections in order to capture shared temporal information in two or more data streams.
    A solution to the problem of identifying a low-dimensional manifold in a high-dimensional data space is then presented in an incremental and adaptive manner. Finally, an online, locally optimized extension of Laplacian eigenmaps is derived, termed the generalized incremental Laplacian eigenmaps technique (GENILE). Besides being incremental, this manifold-based dimensionality reduction technique is shown, in most cases, to produce projections with better classification accuracy than the standard batch versions of these techniques, on both artificial and real datasets.
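    As an anchor for the CCA contribution, the batch formulation that the incremental extension approximates can be posed as a generalized eigenvalue problem and solved directly. The NumPy/SciPy sketch below is a plain batch version; the function name, regularization term and defaults are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def cca_gev(X, Y, k=2, reg=1e-6):
    """Batch CCA as a generalized eigenvalue problem. Returns the top-k
    canonical directions for X and the canonical correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])       # auto-covariances
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n                                  # cross-covariance
    # Generalized eigenproblem: Cxy Cyy^{-1} Cyx w = rho^2 Cxx w
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
    vals, vecs = eigh(M, Cxx)                          # solves M w = lambda Cxx w
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order], np.sqrt(np.clip(vals[order], 0.0, 1.0))
```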

    Incremental Linear Discriminant analysis for classification of Data Streams

    This paper presents a constructive method for deriving an updated discriminant eigenspace for classification when bursts of data containing new classes are added to an initial discriminant eigenspace in the form of random chunks. Specifically, we propose an incremental linear discriminant analysis (ILDA) in two forms: a sequential ILDA and a chunk ILDA. In experiments, we have tested ILDA on datasets with a small number of classes and low-dimensional features, as well as datasets with a large number of classes and high-dimensional features. We have compared the proposed ILDA against traditional batch LDA in terms of discriminability, execution time and memory usage as the volume of added data increases. The results show that the proposed ILDA can effectively evolve a discriminant eigenspace over a fast and large data stream, and extract features with superior discriminability in classification, when compared with other methods. © 2005 IEEE
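    For reference, the discriminant eigenspace maintained incrementally by ILDA is, in batch form, the solution of a generalized eigenvalue problem on the between-class and within-class scatter matrices. A minimal batch sketch follows; the function name and the small ridge term added for stability are illustrative, and the paper's contribution is updating the scatter statistics per chunk rather than recomputing them.

```python
import numpy as np
from scipy.linalg import eigh

def lda_eigenspace(X, y, reg=1e-6):
    """Batch discriminant eigenspace from scatter matrices; an ILDA
    would update Sb, Sw and the class means incrementally as each
    random chunk (possibly carrying new classes) arrives."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sb = np.zeros((d, d))                              # between-class scatter
    Sw = np.zeros((d, d))                              # within-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
        Sw += (Xc - mc).T @ (Xc - mc)
    # Discriminant directions solve Sb v = lambda Sw v
    vals, vecs = eigh(Sb, Sw + reg * np.eye(d))
    order = np.argsort(vals)[::-1]
    return vecs[:, order], vals[order]
```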

    The Virtues of Frugality - Why cosmological observers should release their data slowly

    Cosmologists will soon be in a unique position. Observational noise will gradually be replaced by cosmic variance as the dominant source of uncertainty in an increasing number of observations. We reflect on the ramifications for the discovery and verification of new models. If there are features in the full data set that call for a new model, there will be no subsequent observations to test that model's predictions. We give specific examples of the problem by discussing the pitfalls of model discovery by prior adjustment in the context of dark energy models and inflationary theories. We show how the gradual release of data can mitigate this difficulty, allowing anomalies to be identified, and new models to be proposed and tested. We advocate that observers plan for the frugal release of data from future cosmic-variance-limited observations.
    Comment: 5 pages; expanded discussion of Lambda and of blind analysis, added refs. Matches version to appear in MNRAS Letters