9 research outputs found

    Evolving Large-Scale Data Stream Analytics based on Scalable PANFIS

    Full text link
    Many distributed machine learning frameworks have recently been built to speed up large-scale data learning. However, most of the algorithms used in these frameworks are offline models that cannot cope with data streams. In fact, large-scale data are mostly generated by non-stationary data streams whose patterns evolve over time. To address this problem, we propose a novel Evolving Large-scale Data Stream Analytics framework based on a Scalable Parsimonious Network based on Fuzzy Inference System (Scalable PANFIS), where the PANFIS evolving algorithm is distributed over worker nodes in the cloud to learn large-scale data streams. The Scalable PANFIS framework incorporates an active learning (AL) strategy and two model fusion methods. AL accelerates the distributed learning process to generate an initial evolving large-scale data stream model (initial model), whereas the two model fusion methods aggregate the initial model to generate the final model. The final model represents the updated large-scale data knowledge, which can be used to infer future data. Extensive experiments on this framework are validated by measuring the accuracy and running time of four combinations of Scalable PANFIS and other Spark-based built-in algorithms. The results indicate that Scalable PANFIS with AL trains almost twice as fast as Scalable PANFIS without AL. The results also show that the rule merging and voting mechanisms yield similar accuracy across the Scalable PANFIS variants and are generally better than the Spark-based algorithms. In terms of running time, Scalable PANFIS outperforms all Spark-based algorithms when classifying numerous benchmark datasets. Comment: 20 pages, 5 figures
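
    A minimal sketch of the voting-style model fusion described above, assuming hypothetical per-partition models that expose a standard predict interface (the names fuse_by_voting and local_models are illustrative, not the Scalable PANFIS API):

        import numpy as np

        def fuse_by_voting(local_models, X):
            """Aggregate per-worker models by majority vote over their class predictions.

            local_models: objects exposing predict(X) -> array of class labels, e.g. one
                          evolving model trained on each data partition on a worker node.
            X:            (n_samples, n_features) array to classify with the fused model.
            """
            # One prediction vector per local model: shape (n_models, n_samples).
            votes = np.stack([m.predict(X) for m in local_models])
            fused = []
            for sample_votes in votes.T:
                # For every sample, keep the label predicted by most local models.
                labels, counts = np.unique(sample_votes, return_counts=True)
                fused.append(labels[np.argmax(counts)])
            return np.array(fused)

    Voting keeps every local model intact at inference time, whereas the rule merging alternative instead consolidates the local rule bases into a single final rule base.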

    Evolving Ensemble Fuzzy Classifier

    Full text link
    The concept of ensemble learning offers a promising avenue for learning from data streams in complex environments because it addresses the bias-variance dilemma better than its single-model counterpart and features a reconfigurable structure well suited to the given context. While various extensions of ensemble learning for mining non-stationary data streams can be found in the literature, most of them are crafted around a static base classifier and revisit preceding samples in a sliding window for a retraining step. This makes them computationally prohibitive and not flexible enough to cope with rapidly changing environments. Their complexity is often demanding because they involve a large collection of offline classifiers, owing to the absence of structural complexity reduction mechanisms and the lack of an online feature selection mechanism. A novel evolving ensemble classifier, namely the Parsimonious Ensemble (pENsemble), is proposed in this paper. pENsemble differs from existing architectures in that it is built upon an evolving classifier for data streams, termed the Parsimonious Classifier (pClass). pENsemble is equipped with an ensemble pruning mechanism, which estimates a localized generalization error of each base classifier. A dynamic online feature selection scenario is integrated into pENsemble, allowing input features to be selected and deselected on the fly. pENsemble adopts a dynamic ensemble structure to output the final classification decision and features a novel drift detection scenario to grow the ensemble structure. The efficacy of pENsemble has been demonstrated through rigorous numerical studies with dynamic and evolving data streams, where it delivers the most encouraging performance in attaining a trade-off between accuracy and complexity. Comment: this paper has been published in IEEE Transactions on Fuzzy Systems
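
    A schematic sketch of the ensemble management loop suggested above, assuming a generic incremental base classifier (partial_fit/predict) and a placeholder drift detector; the simple running-error pruning rule stands in for pENsemble's localized generalization error estimate:

        class SimpleEvolvingEnsemble:
            """Illustrative evolving ensemble: grow on drift, prune weak members."""

            def __init__(self, make_base, drift_detector, prune_threshold=0.45):
                self.make_base = make_base          # factory for a fresh incremental classifier
                self.drift = drift_detector         # exposes update(error) -> True when drift is flagged
                self.prune_threshold = prune_threshold
                self.members, self.errors = [], []  # base classifiers and their running error estimates

            def predict(self, x):
                # Weight each member's vote by its current accuracy estimate.
                votes = {}
                for model, err in zip(self.members, self.errors):
                    label = model.predict([x])[0]
                    votes[label] = votes.get(label, 0.0) + (1.0 - err)
                return max(votes, key=votes.get) if votes else None

            def learn_one(self, x, y):
                if not self.members:
                    first = self.make_base()
                    first.partial_fit([x], [y])
                    self.members, self.errors = [first], [0.0]
                    return
                # Test-then-train: update error estimates before learning from (x, y).
                for i, model in enumerate(self.members):
                    wrong = float(model.predict([x])[0] != y)
                    self.errors[i] = 0.99 * self.errors[i] + 0.01 * wrong
                    model.partial_fit([x], [y])
                # Grow the ensemble when the detector flags a drift on the ensemble error.
                if self.drift.update(float(self.predict(x) != y)):
                    fresh = self.make_base()
                    fresh.partial_fit([x], [y])
                    self.members.append(fresh)
                    self.errors.append(0.0)
                # Prune members whose estimated error exceeds the threshold (keep at least one).
                keep = [i for i, e in enumerate(self.errors) if e < self.prune_threshold] or [0]
                self.members = [self.members[i] for i in keep]
                self.errors = [self.errors[i] for i in keep]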

    Discovering three-dimensional patterns in real-time from data streams: An online triclustering approach

    Get PDF
    Triclustering algorithms group sets of coordinates of 3-dimensional datasets. In this paper, a new triclustering approach for data streams is introduced. It follows a streaming scheme of learning in two steps: an offline phase and an online phase. First, the offline phase provides a summary model with the components of the triclusters. Then, the online phase uses the summary model obtained in the offline stage to update the triclusters as fast as possible with genetic operators as data arrive in streaming. Results using three types of synthetic datasets and a real-world environmental sensor dataset are reported. The performance of the proposed streaming triclustering algorithm is compared to a batch triclustering algorithm, showing accurate performance both in terms of quality and running time. Funding: Ministerio de Ciencia, Innovación y Universidades TIN2017-88209-C
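
    A high-level sketch of the two-phase scheme described above: the offline phase builds a summary model, and the online phase updates the triclusters with genetic operators as each new time slice arrives. The operator and fitness functions are placeholders, not the actual implementation:

        import random

        def offline_phase(history, n_triclusters, evolve_batch):
            """Build the summary model from historical 3-D data.

            evolve_batch is assumed to be a batch triclustering routine (e.g. an
            evolutionary search) returning a list of triclusters (index sets).
            """
            return evolve_batch(history, n_triclusters)

        def online_phase(stream, triclusters, mutate, crossover, fitness):
            """Update the triclusters of the summary model as data arrive in streaming."""
            for time_slice in stream:
                for i, tc in enumerate(triclusters):
                    # Propose candidate updates with genetic operators seeded by the current model.
                    partner = random.choice(triclusters)
                    candidates = [tc, mutate(tc, time_slice), crossover(tc, partner, time_slice)]
                    # Keep the candidate of highest quality on the new slice.
                    triclusters[i] = max(candidates, key=lambda c: fitness(c, time_slice))
                yield list(triclusters)  # triclusters available in real time after each slice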

    An Incremental Construction of Deep Neuro Fuzzy System for Continual Learning of Non-stationary Data Streams

    Full text link
    Existing fuzzy neural networks (FNNs) are mostly developed under a shallow network configuration, having lower generalization power than deep structures. This paper proposes a novel self-organizing deep FNN, namely DEVFNN. Fuzzy rules can be automatically extracted from data streams or removed if they play a limited role during their lifespan. The structure of the network can be deepened on demand by stacking additional layers using a drift detection method which not only detects covariate drift, i.e. variations of the input space, but also accurately identifies real drift, i.e. dynamic changes of both feature space and target space. DEVFNN is developed under the stacked generalization principle via the feature augmentation concept, where a recently developed algorithm, namely gClass, drives the hidden layers. It is equipped with an automatic feature selection method which controls activation and deactivation of input attributes to induce varying subsets of input features. A deep network simplification procedure is put forward using the concept of hidden layer merging to prevent uncontrollable growth of the input space dimensionality caused by the feature augmentation approach used to build the deep network structure. DEVFNN works in a sample-wise fashion and is suited to data stream applications. The efficacy of DEVFNN has been thoroughly evaluated using seven datasets with non-stationary properties under the prequential test-then-train protocol. It has been compared with four popular continual learning algorithms and its shallow counterpart, against which DEVFNN demonstrates improved classification accuracy. Moreover, it is also shown that the concept drift detection method is an effective tool for controlling the depth of the network structure, while the hidden layer merging scenario is capable of simplifying the network complexity of a deep network with negligible compromise in generalization performance. Comment: This paper has been published in IEEE Transactions on Fuzzy Systems
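
    A schematic sketch of the prequential test-then-train protocol with drift-triggered deepening, using placeholder layer and drift-detector objects rather than gClass itself; the feature augmentation step simply appends the previous layer's output to the original input:

        def prequential_deepening(stream, make_layer, drift_detector):
            """Test-then-train over a labelled stream, stacking a new layer when drift is flagged.

            make_layer()   -> incremental classifier with predict/partial_fit (stand-in for gClass),
                              assumed to return a default prediction before its first update
            drift_detector -> exposes update(error) returning True when real drift is detected
            """
            layers = [make_layer()]
            correct = total = 0
            for x, y in stream:
                # Test first: pass the sample through the stack with feature augmentation.
                features, y_hat = list(x), None
                for layer in layers:
                    y_hat = layer.predict([features])[0]
                    features = list(x) + [y_hat]
                error = float(y_hat != y)
                correct, total = correct + 1 - error, total + 1
                # ... then train every layer on the representation it actually saw.
                features = list(x)
                for layer in layers:
                    layer.partial_fit([features], [y])
                    features = list(x) + [layer.predict([features])[0]]
                # Deepen the network when the drift detector fires on the prequential error.
                if drift_detector.update(error):
                    layers.append(make_layer())
            return correct / max(total, 1)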

    Deep Stacked Stochastic Configuration Networks for Lifelong Learning of Non-Stationary Data Streams

    Full text link
    The concept of the stochastic configuration network (SCN) offers a fast framework with a universal approximation guarantee for lifelong learning of non-stationary data streams. Its adaptive scope selection property enables proper random generation of hidden unit parameters, advancing conventional randomized approaches constrained by a fixed scope of random parameters. This paper proposes the deep stacked stochastic configuration network (DSSCN) for continual learning of non-stationary data streams, which contributes two major aspects: 1) DSSCN features a self-constructing methodology for the deep stacked network structure, where hidden units and hidden layers are generated automatically from continuously arriving data streams; 2) the concept of SCN is developed to randomly assign the inverse covariance matrix of the multivariate Gaussian function in the hidden node addition step, bypassing its computationally prohibitive tuning phase. Numerical evaluation and comparison with prominent data stream algorithms under two procedures, periodic hold-out and prequential test-then-train, demonstrate the advantage of the proposed methodology. Comment: This paper has been published in Information Sciences
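
    A minimal numerical sketch of the randomized hidden-node addition described in point 2): a multivariate Gaussian unit whose inverse covariance is drawn at random within an adaptive scope rather than tuned. The scope schedule, acceptance score, and least-squares weight are simplified placeholders, not the DSSCN procedure:

        import numpy as np

        def add_gaussian_node(X, residual, scopes=(1.0, 5.0, 10.0), n_trials=20, seed=0):
            """Randomly generate one multivariate Gaussian hidden unit and keep the best trial.

            X:        (n_samples, n_features) inputs
            residual: (n_samples,) training residual the new node should help explain
            Returns (center, inv_cov, output_weight) of the accepted node.
            """
            rng = np.random.default_rng(seed)
            best = None
            for scope in scopes:                      # adaptive scope: widen the random range gradually
                for _ in range(n_trials):
                    center = X[rng.integers(len(X))]  # anchor the node on a random training sample
                    # Random diagonal inverse covariance drawn within the current scope.
                    inv_cov = np.diag(rng.uniform(1.0 / scope, scope, size=X.shape[1]))
                    diff = X - center
                    act = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv_cov, diff))
                    w = act @ residual / (act @ act + 1e-12)                       # least-squares output weight
                    score = abs(act @ residual) / (np.linalg.norm(act) + 1e-12)    # SCN-style correlation score
                    if best is None or score > best[0]:
                        best = (score, center, inv_cov, w)
            return best[1], best[2], best[3]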

    Incremental learning algorithms and applications

    Get PDF
    Incremental learning refers to learning from streaming data, which arrive over time, with limited memory resources and, ideally, without sacrificing model accuracy. This setting fits various application scenarios where lifelong learning is relevant, e.g. due to changing environments, and it offers an elegant scheme for big data processing by means of its sequential treatment. In this contribution, we formalise the concept of incremental learning, discuss particular challenges which arise in this setting, and give an overview of popular approaches, their theoretical foundations, and applications which have emerged in recent years.
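
    As a concrete illustration of this setting, a small example of incremental learning with bounded memory, here using scikit-learn's partial_fit interface on mini-batches (any incremental learner could be substituted; the synthetic stream is only for demonstration):

        import numpy as np
        from sklearn.linear_model import SGDClassifier

        def train_incrementally(batches, classes):
            """Learn from a stream of (X, y) mini-batches without keeping past data in memory."""
            model = SGDClassifier(loss="log_loss")
            seen = 0
            for X, y in batches:
                if seen:                                  # prequential style: test before training
                    print(f"samples seen: {seen:6d}  accuracy on incoming batch: {model.score(X, y):.3f}")
                model.partial_fit(X, y, classes=classes)  # update the model, then discard the batch
                seen += len(y)
            return model

        # Example with a synthetic stream of 20 batches of 100 samples each:
        rng = np.random.default_rng(0)
        stream = ((rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)) for _ in range(20))
        model = train_incrementally(stream, classes=[0, 1])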

    Development of advanced autonomous learning algorithms for nonlinear system identification and control

    Full text link
    Identification of nonlinear dynamical systems, data stream analysis, and similar tasks are usually handled by autonomous learning algorithms such as evolving fuzzy and evolving neuro-fuzzy systems (ENFSs). They are characterized by a single-pass learning mode and an open structure, features that enable them to handle the fast and rapidly changing nature of data streams effectively. The underlying bottleneck of ENFSs lies in their design principle, which involves a high number of free parameters (rule premise and rule consequent) to be adapted during training; this figure can even double in the case of a type-2 fuzzy system. To address this gap, a novel ENFS, namely the Parsimonious Learning Machine (PALM), is proposed in this thesis. To reduce the number of network parameters significantly, PALM utilizes a new type of fuzzy rule based on the concept of hyperplane clustering, which requires no rule premise parameters. PALM is proposed in both type-1 and type-2 variants, both of which constitute fully dynamic rule-based systems capable of automatically generating, merging, and tuning hyperplane-based fuzzy rules in a single-pass manner. Moreover, an extension of PALM, namely the recurrent PALM (rPALM), is proposed, adopting the teacher-forcing mechanism from the deep learning literature. The efficacy of both PALM and rPALM has been evaluated through numerical studies with data streams and the identification of a nonlinear unmanned aerial vehicle system. The proposed models showcase significant improvements in computational complexity and the number of required parameters against several renowned ENFSs while attaining comparable and often better predictive accuracy.
    ENFSs have also been utilized to develop three autonomous intelligent controllers (AICons) in this thesis, namely the Generic (G) controller, the Parsimonious Controller (PAC), and the Reduced Parsimonious Controller (RedPAC). All these controllers start operating from scratch with an empty set of fuzzy rules, and no offline training is required. To cope with the dynamic behaviour of the plant, they can add, merge, or prune rules on demand. Among the three AICons, the G-controller is built upon an advanced incremental learning machine, namely the Generic Evolving Neuro-Fuzzy Inference System; the integration of generalized adaptive resonance theory provides the G-controller with a compact structure, so a faster evolution of the structure is witnessed, which lowers its computational cost. Another AICon, PAC, is rooted in PALM's architecture; since PALM depends on user-defined thresholds to adapt its structure, these thresholds are replaced in PAC with the concept of the bias-variance trade-off. In RedPAC, the network parameters are further reduced compared with the PALM-based PAC, with the number of consequent parameters reduced to one per rule. These AICons require very little expert domain knowledge and are developed by incorporating the sliding mode control (SMC) technique. In the G-controller and RedPAC, the control law and the adaptation laws for the consequent parameters are derived from the SMC algorithm to establish a stable closed-loop system; the stability of these controllers is guaranteed using a Lyapunov function, and uniform asymptotic convergence of the tracking error to zero is ensured through an auxiliary robustifying control term. For PAC, the boundedness and convergence of the closed-loop control system's tracking error and of the controller's consequent parameters are confirmed using the LaSalle-Yoshizawa theorem. The efficacy of the controllers is evaluated by observing the trajectory tracking performance of unmanned aerial vehicles on various trajectories. Their accuracy is comparable to or better than that of the benchmark controllers, while they require significantly fewer parameters to attain similar or better tracking performance.
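
    A generic sketch of the single-pass structural adaptation (add, merge, and tune rules on demand) that the thesis describes, with placeholder rule construction, update, and distance functions; the fixed thresholds are illustrative, whereas PALM/PAC derive such decisions from data-driven criteria such as the bias-variance trade-off:

        import numpy as np

        def evolve_rule_base(stream, new_rule, update_rule, rule_distance,
                             add_threshold=2.0, merge_threshold=0.3):
            """Single-pass evolution of a rule base: add, merge and tune rules on demand.

            new_rule(x, y)          -> a fresh rule initialised around the sample (placeholder)
            update_rule(rule, x, y) -> the rule with its consequent tuned on the sample
            rule_distance(r1, r2)   -> dissimilarity used both for coverage and merging
            """
            rules = []
            for x, y in stream:
                candidate = new_rule(x, y)
                if not rules:
                    rules.append(candidate)
                    continue
                # Tune the nearest rule if the sample is covered, otherwise grow a new rule.
                dists = [rule_distance(r, candidate) for r in rules]
                nearest = int(np.argmin(dists))
                if dists[nearest] > add_threshold:
                    rules.append(candidate)
                else:
                    rules[nearest] = update_rule(rules[nearest], x, y)
                # Merge rules that have drifted close to each other (keep one representative).
                merged = []
                for r in rules:
                    if not any(rule_distance(r, m) < merge_threshold for m in merged):
                        merged.append(r)
                rules = merged
            return rules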

    Scaffolding type-2 classifier for incremental learning under concept drifts

    Full text link
    The proposal of a meta-cognitive learning machine that embodies the three pillars of human learning, what-to-learn, how-to-learn, and when-to-learn, has enriched the landscape of evolving systems. The majority of meta-cognitive learning machines in the literature do not, however, offer a plug-and-play working principle and thus require supplementary learning modules for pre- or post-processing. In addition, they still rely on the type-1 neuron, which struggles to handle uncertainty. This paper proposes the Scaffolding Type-2 Classifier (ST2Class), a novel meta-cognitive scaffolding classifier that operates completely in local and incremental learning modes. It is built upon a multivariable interval type-2 Fuzzy Neural Network (FNN) driven by a multivariate Gaussian function in the hidden layer and a non-linear wavelet polynomial in the output layer. The what-to-learn module is created by virtue of a novel active learning scenario termed the uncertainty measure; the how-to-learn module is based on the renowned Schema and Scaffolding theories; and the when-to-learn module uses a standard sample-reserved strategy. The viability of ST2Class is benchmarked against state-of-the-art classifiers on 12 data streams and is validated by thorough statistical tests, in which it achieves high accuracy while retaining low complexity
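
    A small sketch of the what-to-learn idea described above: a stream sample is used for training only when the classifier's output is uncertain. The interval scoring and thresholds are generic stand-ins, not the exact ST2Class uncertainty measure, and predict_interval is an assumed method returning lower and upper class scores of an interval type-2 output:

        def what_to_learn(stream, model, budget=0.3, width_threshold=0.5, margin_threshold=0.1):
            """Active learning on a data stream: train only on uncertain samples, within a budget."""
            trained = total = 0
            for x, y in stream:
                total += 1
                lower, upper = model.predict_interval(x)        # per-class interval output bounds
                # Uncertainty: a wide output interval or a small margin between the two best classes.
                width = max(u - l for l, u in zip(lower, upper))
                scores = sorted(((l + u) / 2 for l, u in zip(lower, upper)), reverse=True)
                margin = scores[0] - scores[1]
                if (width > width_threshold or margin < margin_threshold) and trained < budget * total:
                    model.partial_fit([x], [y])                 # learn only from samples worth labelling
                    trained += 1
            return trained / total                              # fraction of the stream actually used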

    Big data techniques for real-time processing of massive data streams (Técnicas big data para el procesamiento de flujos de datos masivos en tiempo real)

    Get PDF
    Doctoral programme: Biotecnología, Ingeniería y Tecnología Química. Research line: Ingeniería, Ciencia de Datos y Bioinformática. Programme code: DBI. Line code: 111.
    Machine learning techniques have become one of the most demanded resources by companies due to the large volume of data that surrounds us these days. The main objective of these technologies is to solve complex problems in an automated way using data. One of the current perspectives of machine learning is the analysis of continuous data flows, i.e. data streams. This approach is increasingly requested by enterprises as a result of the large number of information sources producing time-indexed data at high frequency, such as sensors, Internet of Things devices, social networks, etc. However, research is nowadays more focused on the study of historical data than on data received in streaming. One of the main reasons for this is the enormous challenge that this type of data presents for the modeling of machine learning algorithms. This Doctoral Thesis is presented as a compendium of publications with a total of 10 scientific contributions in international conferences and journals with a high impact index in the Journal Citation Reports (JCR). The research developed during the PhD programme focuses on the study and analysis of real-time or streaming data through the development of new machine learning algorithms. Machine learning algorithms for real-time data require a different type of modeling than traditional ones: the model is updated online to provide accurate responses in the shortest possible time. The main objective of this Doctoral Thesis is to contribute research value to the scientific community through three new machine learning algorithms. These algorithms are big data techniques, and two of them work with online or streaming data. In this way, contributions are made to the development of one of the current trends in Artificial Intelligence. With this purpose, algorithms are developed for descriptive and predictive tasks, i.e., unsupervised and supervised learning, respectively. Their common idea is the discovery of patterns in the data.
    The first technique developed during the dissertation is a triclustering algorithm to produce three-dimensional data clusters in offline or batch mode. This big data algorithm is called bigTriGen. In general terms, an evolutionary metaheuristic is used to search for groups of data with similar patterns. The model uses genetic operators such as selection, crossover, mutation, and evaluation at each iteration. The goal of bigTriGen is to optimize the evaluation function to achieve triclusters of the highest possible quality. It is used as the basis for the second technique implemented during the Doctoral Thesis. The second algorithm focuses on the creation of groups over three-dimensional data received in real time or in streaming. It is called STriGen. Streaming modeling starts from an offline or batch model built with historical data. As soon as this model is created, it starts receiving data in real time. The model is updated in an online or streaming manner to adapt to new streaming patterns. In this way, STriGen is able to detect concept drifts and incorporate them into the model as quickly as possible, thus producing triclusters in real time and of good quality. The last algorithm developed in this dissertation follows a supervised learning approach for time series forecasting in real time. It is called StreamWNN. A model is created with historical data based on the k-nearest neighbours (KNN) algorithm. Once the model is created, data start to be received in real time. The algorithm provides real-time predictions of future data, keeping the model always updated in an incremental way and incorporating streaming patterns identified as novelties. StreamWNN also identifies anomalous data in real time, allowing this feature to be used as a security measure during its application. The developed algorithms have been evaluated with real data from devices and sensors. These new techniques have proven to be very useful, providing meaningful triclusters and accurate predictions in real time.
    Universidad Pablo de Olavide de Sevilla. Departamento de Deporte e Informática
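
    A compact sketch of the KNN-based streaming forecasting scheme outlined above: an offline model built from historical windows, incremental updates as new values arrive, and a simple distance-based novelty/anomaly flag. The window construction and thresholds are illustrative choices, not the StreamWNN implementation:

        import numpy as np

        class StreamingKNNForecaster:
            """Forecast the next value of a series from its k nearest historical windows."""

            def __init__(self, history, window=24, k=5, novelty_factor=3.0):
                self.window, self.k, self.novelty_factor = window, k, novelty_factor
                # Offline phase: build (pattern, next value) pairs from historical data.
                self.patterns = np.array([history[i:i + window] for i in range(len(history) - window)])
                self.targets = np.array([history[i + window] for i in range(len(history) - window)])
                self.recent = list(history[-window:])           # most recent window of observations

            def predict_next(self):
                dists = np.linalg.norm(self.patterns - np.array(self.recent), axis=1)
                nearest = np.argsort(dists)[:self.k]
                # Novelty/anomaly flag: the current window is far from everything seen so far.
                is_novel = dists[nearest[0]] > self.novelty_factor * np.median(dists)
                return self.targets[nearest].mean(), is_novel

            def update(self, new_value):
                # Online phase: incorporate the observed value so the model stays up to date.
                self.patterns = np.vstack([self.patterns, self.recent])
                self.targets = np.append(self.targets, new_value)
                self.recent = self.recent[1:] + [new_value]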