3 research outputs found

    Data Stream Classification Using Classifier Ensemble

    Get PDF
    For the contemporary business, the crucial factor is making smart decisions on the basis of the knowledge hidden in stored data. Unfortunately,m traditional simple methods of data analysis are not sufficient for efficient management of modern enterprizes, because they are not appropriate for the huge and growing amount of the data stored by them. Additionally data usually comes continuously in the form of so-called data stream. The great disadvantage of traditional classification methods is that they assume that statistical properties of the discovered concept are being unchanged, while in real situation, we could observe so-called concept drift, which could be caused by changes in the probabilities of classes or/and conditional probability distributions of classes. The potential for considering new training data is an important feature of machine learning methods used in security applications (spam filtering or intrusion detection) or decision support systems for marketing departments, which need to follow the changing client behavior. Unfortunately, the occurrence of concept drift dramatically decreases classification accuracy. This work presents the comprehensive study on the ensemble classifier approach applied to the problem of drifted data streams. Especially it reports the research on modifications of previously developed Weighted Aging Classifier Ensemble (WAE) algorithm, which is able to construct a valuable classifier ensemble for classification of incremental drifted stream data. We generalize WAE method and propose the general framework for this approach. Such framework can prune an classifier ensemble before or after assigning weights to individual classifiers. Additionally, we propose new classifier pruning criteria, weight calculation methods, and aging operators. We also propose rejuvenating operator, which is able to soften the aging effect, which could be useful, especially in the case if quite ”old” classifiers are high quality models, i.e., their presence increases ensemble accuracy, what could be found, e.g., in the case of recurring concept drift. The chosen characteristics of the proposed frameworks were evaluated on the basis of the wide range of computer experiments carried out on the two benchmark data streams. Obtained results confirmed the usability of proposed method to the data stream classification with the presence of incremental concept drift

    AMANDA : density-based adaptive model for nonstationary data under extreme verification latency scenarios

    Get PDF
    Gradual concept-drift refers to a smooth and gradual change in the relations between input and output data in the underlying distribution over time. The problem generates a model obsolescence and consequently a quality decrease in predictions. Besides, there is a challenging task during the stream: The extreme verification latency (EVL) to verify the labels. For batch scenarios, state-of-the-art methods propose an adaptation of a supervised model by using an unconstrained least squares importance fitting (uLSIF) algorithm or a semi-supervised approach along with a core support extraction (CSE) method. However, these methods do not properly tackle the mentioned problems due to their high computational time for large data volumes, lack in representing the right samples of the drift or even for having several parameters for tuning. Therefore, we propose a density-based adaptive model for nonstationary data (AMANDA), which uses a semi-supervised classifier along with a CSE method. AMANDA has two variations: AMANDA with a fixed cutting percentage (AMANDA-FCP); and AMANDA with a dynamic cutting percentage (AMANDADCP). Our results indicate that the two variations of AMANDA outperform the state-of-the-art methods for almost all synthetic datasets and real ones with an improvement up to 27.98% regarding the average error. We have found that the use of AMANDA-FCP improved the results for a gradual concept-drift even with a small size of initial labeled data. Moreover, our results indicate that SSL classifiers are improved when they work along with our static or dynamic CSE methods. Therefore, we emphasize the importance of research directions based on this approach.Concept-drift gradual refere-se à mudança suave e gradual na distribuição dos dados conforme o tempo passa. Este problema causa obsolescência no modelo de aprendizado e queda na qualidade das previsões. Além disso, existe um complicador durante o processamento dos dados: a latência de verificação extrema (LVE) para se verificar os rótulos. Métodos do estado da arte propõem uma adaptação do modelo supervisionado usando uma abordagem de estimação de importância baseado em mínimos quadrados ou usando uma abordagem semi-supervisionada em conjunto com a extração de instâncias centrais, na sigla em inglês (CSE). Entretanto, estes métodos não tratam adequadamente os problemas mencionados devido ao fato de requererem alto tempo computacional para processar grandes volumes de dados, falta de correta seleção das instâncias que representam a mudança da distribuição, ou ainda por demandarem o ajuste de grande quantidade de parâmetros. Portanto, propomos um modelo adaptativo baseado em densidades para dados não-estacionários (AMANDA), que tem como base um classificador semi-supervisionado e um método CSE baseado em densidade. AMANDA tem duas variações: percentual de corte fixo (AMANDAFCP); e percentual de corte dinâmico (AMANDA-DCP). Nossos resultados indicam que as duas variações da proposta superam o estado da arte em quase todas as bases de dados sintéticas e reais em até 27,98% em relação ao erro médio. Concluímos que a aplicação do método AMANDA-FCP faz com que a classificação melhore mesmo quando há uma pequena porção inicial de dados rotulados. Mais ainda, os classificadores semi-supervisionados são melhorados quando trabalham em conjunto com nossos métodos de CSE, estático ou dinâmico

    3D Classification of Power Line Scene Using Airborne Lidar Data

    Get PDF
    Failure to adequately maintain vegetation within a power line corridor has been identified as a main cause of the August 14, 2003 electric power blackout. Such that, timely and accurate corridor mapping and monitoring are indispensible to mitigate such disaster. Moreover, airborne LiDAR (Light Detection And Ranging) has been recently introduced and widely utilized in industries and academies thanks to its potential to automate the data processing for scene analysis including power line corridor mapping. However, today’s corridor mapping practice using LiDAR in industries still remains an expensive manual process that is not suitable for the large-scale, rapid commercial compilation of corridor maps. Additionally, in academies only few studies have developed algorithms capable of recognizing corridor objects in the power line scene, which are mostly based on 2-dimensional classification. Thus, the objective of this dissertation is to develop a 3-dimensional classification system which is able to automatically identify key objects in the power line corridor from large-scale LiDAR data. This dissertation introduces new features for power structures, especially for the electric pylon, and existing features which are derived through diverse piecewise (i.e., point, line and plane) feature extraction, and then constructs a classification model pool by building individual models according to the piecewise feature sets and diverse voltage training samples using Random Forests. Finally, this dissertation proposes a Multiple Classifier System (MCS) which provides an optimal committee of models from the model pool for classification of new incoming power line scene. The proposed MCS has been tested on a power line corridor where medium voltage transmission lines (115 kV and 230 kV) pass. The classification results based on the MCS applied by optimally selecting the pre-built classification models according to the voltage type of the test corridor demonstrate a good accuracy (89.07%) and computationally effective time cost (approximately 4 hours/km) without additional training fees
    corecore