107 research outputs found

    Class imbalance ensemble learning based on the margin theory

    The proportion of instances belonging to each class in a data-set plays an important role in machine learning. However, real-world data often suffer from class imbalance. Dealing with multi-class tasks with different misclassification costs of classes is harder than dealing with two-class ones. Undersampling and oversampling are two of the most popular data preprocessing techniques for dealing with imbalanced data-sets. Ensemble classifiers have been shown to be more effective than data sampling techniques at enhancing the classification performance of imbalanced data. Moreover, the combination of ensemble learning with sampling methods to tackle the class imbalance problem has led to several proposals in the literature, with positive results. The ensemble margin is a fundamental concept in ensemble learning. Several studies have shown that the generalization performance of an ensemble classifier is related to the distribution of its margins on the training examples. In this paper, we propose a novel ensemble margin based algorithm, which handles imbalanced classification by employing more low-margin examples, which are more informative than high-margin samples. This algorithm combines ensemble learning with undersampling, but instead of balancing classes randomly, as UnderBagging does, our method pays attention to constructing higher-quality balanced sets for each base classifier. In order to demonstrate the effectiveness of the proposed method in handling class-imbalanced data, UnderBagging and SMOTEBagging are used in a comparative analysis. In addition, we also compare the performances of different ensemble margin definitions, including both supervised and unsupervised margins, in class imbalance learning.
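    As a rough illustration of this idea, the sketch below (an assumed formulation, not the paper's exact algorithm) estimates an unsupervised ensemble margin with a probe ensemble of bagged trees and then builds balanced training sets that preferentially retain low-margin majority-class examples; the helper names and sampling weights are illustrative choices.

```python
# Illustrative sketch only: margin-guided undersampling for a bagging-style
# ensemble on a two-class problem. Not the paper's exact algorithm.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def vote_counts(X, y, n_trees, rng):
    """Per-class vote counts of a simple bagged probe ensemble on X."""
    classes = np.unique(y)
    preds = []
    for _ in range(n_trees):
        boot = rng.integers(0, len(y), size=len(y))   # bootstrap sample
        preds.append(DecisionTreeClassifier().fit(X[boot], y[boot]).predict(X))
    preds = np.stack(preds, axis=1)
    return np.stack([(preds == c).sum(axis=1) for c in classes], axis=1)

def margin_guided_underbagging(X, y, n_estimators=10, seed=0):
    rng = np.random.default_rng(seed)
    counts = vote_counts(X, y, n_trees=50, rng=rng)
    top2 = np.sort(counts, axis=1)[:, -2:]
    # Unsupervised margin: (votes of most voted class - votes of runner-up) / total.
    margins = (top2[:, 1] - top2[:, 0]) / counts.sum(axis=1)

    classes, sizes = np.unique(y, return_counts=True)
    min_idx = np.where(y == classes[np.argmin(sizes)])[0]
    maj_idx = np.where(y == classes[np.argmax(sizes)])[0]
    # Low-margin (more informative) majority examples are kept with higher probability.
    weight = 1.0 - margins[maj_idx] + 1e-6
    prob = weight / weight.sum()

    ensemble = []
    for _ in range(n_estimators):
        chosen = rng.choice(maj_idx, size=len(min_idx), replace=False, p=prob)
        idx = np.concatenate([min_idx, chosen])
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble
```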

    A Contribution to land cover and land use mapping in Portugal with multi-temporal Sentinel-2 data and supervised classification

    Dissertation presented as the partial requirement for obtaining a Master's degree in Geographic Information Systems and Science.
    Remote sensing techniques have been widely employed to map and monitor land cover and land use, important elements for the description of the environment. The current land cover and land use mapping paradigm takes advantage of a variety of data options with proper spatial, spectral and temporal resolutions along with advances in technology. This enabled the creation of automated data processing workflows integrated with classification algorithms to accurately map large areas with multi-temporal data. In Portugal, the General Directorate for Territory (DGT) is developing an operational Land Cover Monitoring System (SMOS), which includes an annual land cover cartography product (COSsim) based on an automatic process using supervised classification of multi-temporal Sentinel-2 data. In this context, a range of experiments are being conducted to improve map accuracy and classification efficiency. This study provides a contribution to DGT's work. A classification of the biogeographic region of Trás-os-Montes in the North of Portugal was performed for the agricultural year of 2018 using Random Forest and an intra-annual multi-temporal Sentinel-2 dataset, with stratification of the study area and a combination of manually and automatically extracted training samples, with the latter being based on existing reference datasets. This classification was compared to a benchmark classification, conducted without stratification and with training data collected automatically only. In addition, an assessment of the influence of training sample size on classification accuracy was conducted. The main focus of this study was to investigate whether the use of classification uncertainty to create an improved training dataset could increase classification accuracy. A process of extracting additional training samples from areas of high classification uncertainty was conducted, then a new classification was performed and the results were compared. Classification accuracy assessment for all proposed experiments was conducted using the overall accuracy, precision, recall and F1-score. The use of stratification and the combination of training strategies resulted in a classification accuracy of 66.7%, in contrast to 60.2% for the benchmark classification. Although this difference was not statistically significant, visual inspection of both maps indicated that stratification and the introduction of manual training contributed to mapping land cover more accurately in some areas. Regarding the influence of sample size on classification accuracy, the results indicated a small, not statistically significant difference in accuracy even after a reduction of over 90% in the sample size. This supports the findings of other studies which suggested that Random Forest has low sensitivity to variations in training sample size. However, the results might have been influenced by the training strategy employed, which uses spectral subclasses, thus creating spectral diversity in the samples independently of their size. With respect to the use of classification uncertainty to improve the training sample, a slight accuracy increase of approximately 1% was observed, which was not statistically significant.
    This result could have been affected by limitations in the process of collecting additional sampling units for some classes, which resulted in a lack of additional training for some classes (e.g. agriculture) and an overall imbalanced training dataset. Additionally, some classes had their additional training sampling units collected from a limited number of polygons, which could limit the spectral diversity of the new samples. Nevertheless, visual inspection of the map suggested that the new training contributed to reducing confusion between some classes, improving map agreement with ground truth. Further investigation can be conducted to explore more deeply the potential of classification uncertainty, especially focusing on addressing problems related to the collection of the additional samples.
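    A minimal sketch of the uncertainty-driven retraining step is given below (an assumed workflow, not DGT's operational SMOS pipeline); synthetic data stands in for the Sentinel-2 multi-temporal features, and the true labels of the most uncertain samples play the role of manually collected reference data.

```python
# Flag high-uncertainty pixels with Random Forest class probabilities,
# then retrain after adding reference samples drawn from those areas.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for Sentinel-2 band/time-series features and labels.
X_all, y_all = make_classification(n_samples=5000, n_features=20,
                                   n_informative=10, n_classes=5, random_state=0)
rng = np.random.default_rng(0)
train_idx = rng.choice(len(y_all), size=500, replace=False)

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
rf.fit(X_all[train_idx], y_all[train_idx])

# Classification uncertainty = 1 - maximum class-membership probability.
uncertainty = 1.0 - rf.predict_proba(X_all).max(axis=1)

# Collect additional training sampling units from the most uncertain areas.
extra_idx = np.argsort(uncertainty)[-300:]
new_train = np.union1d(train_idx, extra_idx)

rf_improved = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
rf_improved.fit(X_all[new_train], y_all[new_train])
```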

    Tackling Uncertainties and Errors in the Satellite Monitoring of Forest Cover Change

    This study aims at improving the reliability of automatic forest change detection. Forest change detection is of vital importance for understanding global land cover as well as the carbon cycle. Remote sensing and machine learning have been widely adopted for such studies with increasing degrees of success. However, contemporary global studies still suffer from lower-than-satisfactory accuracies and robustness problems whose causes were largely unknown. Global geographical observations are complex, as a result of hidden, interweaving geographical processes. Is it possible that some geographical complexities were not expected in contemporary machine learning? Could they cause uncertainties and errors when contemporary machine learning theories are applied for remote sensing? This dissertation adopts the philosophy of error elimination. We start by explaining the mathematical origins of possible geographic uncertainties and errors in chapter two. Uncertainties are unavoidable but might be mitigated. Errors are hidden but might be found and corrected. Then in chapter three, experiments are specifically designed to assess whether or not contemporary machine learning theories can handle these geographic uncertainties and errors. In chapter four, we identify an unreported systemic error source: the proportion distribution of classes in the training set. A subsequent Bayesian Optimal solution is designed to combine Support Vector Machine and Maximum Likelihood. In chapter five, we demonstrate how this type of error is widespread not just in classification algorithms, but also embedded in the conceptual definition of geographic classes before the classification. Finally, in chapter six, the sources of errors and uncertainties and their solutions are summarized, with theoretical implications for future studies. The most important finding is that how we design a classification largely pre-determines what we eventually get out of it. This applies to many popular contemporary classifiers, including various types of neural nets, decision trees, and support vector machines. This is a cause of the so-called overfitting problem in contemporary machine learning. Therefore, we propose that the emphasis of classification work be shifted to the planning stage before the actual classification. Geography should not just be the analysis of collected observations, but also about the planning of observation collection. This is where geography, machine learning, and survey statistics meet.
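    One common way to express the training-set class-proportion issue is as a prior-shift correction of posterior probabilities. The sketch below illustrates that idea only; it is not the dissertation's exact Bayesian combination of Support Vector Machine and Maximum Likelihood, and the priors used are assumptions for the toy example.

```python
# Rescale p(c|x) by the ratio of assumed scene priors to training-set priors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def prior_corrected_proba(proba, train_priors, target_priors):
    """Rescale each row of p(c|x) by target_prior(c) / train_prior(c) and renormalise."""
    corrected = proba * (np.asarray(target_priors) / np.asarray(train_priors))
    return corrected / corrected.sum(axis=1, keepdims=True)

# Toy example: the training set is balanced 50/50, but the class of interest
# (e.g. "forest change") is assumed to cover only 10% of the mapped scene.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.5, 0.5],
                           random_state=0)
svm = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

adjusted = prior_corrected_proba(svm.predict_proba(X),
                                 train_priors=[0.5, 0.5],
                                 target_priors=[0.9, 0.1])
labels = adjusted.argmax(axis=1)   # decisions under the assumed scene priors
```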

    Heterogeneous information fusion: combination of multiple supervised and unsupervised classification methods based on belief functions

    In real-life machine learning applications, a common problem is that raw data (e.g. remote sensing data) is sometimes inaccessible due to confidentiality and privacy constraints of corporations, making classification methods difficult to apply in the supervised context. Moreover, even when raw data is accessible, limited labeled samples can also seriously affect supervised methods. Recently, supervised and unsupervised classification (clustering) results related to specific applications have been published by more and more organizations. Therefore, the combination of supervised classification and clustering results has gained increasing attention as a way to improve the accuracy of supervised predictions. Incorporating clustering results with supervised classifications at the output level can help to lessen the reliance on information at the raw data level, so it is pertinent for improving accuracy in applications where raw data is inaccessible or training samples are limited. We focus on the combination of multiple supervised classification and clustering results at the output level based on belief functions for three purposes: (1) to improve the accuracy of classification when raw data is inaccessible or training samples are highly limited; (2) to reduce uncertain and imprecise information in the supervised results; and (3) to study how supervised classification and clustering results affect the combination at the output level. Our contributions consist of a transformation method to transfer heterogeneous information into the same frame, and an iterative fusion strategy to retain most of the trustworthy information in multiple supervised classification and clustering results.
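    As background for the output-level fusion, the sketch below shows Dempster's rule of combination over a small frame of discernment; it does not reproduce the paper's transformation of clustering outputs into that frame or its iterative fusion strategy, and the class names and mass values are invented for the example.

```python
# Dempster's rule of combination for two mass functions given as
# {frozenset of hypotheses: mass}.
def dempster_combine(m1, m2):
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb                 # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("Total conflict: the sources cannot be combined")
    # Normalise by the non-conflicting mass.
    return {h: m / (1.0 - conflict) for h, m in combined.items()}

# A supervised classifier is fairly confident the pixel is 'water', while a
# clustering-derived source only separates {'water', 'wetland'} from the rest.
supervised = {frozenset({'water'}): 0.7,
              frozenset({'water', 'forest', 'wetland'}): 0.3}
clustering = {frozenset({'water', 'wetland'}): 0.6,
              frozenset({'water', 'forest', 'wetland'}): 0.4}
print(dempster_combine(supervised, clustering))
```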

    Contributions for the improvement of specific class mapping

    A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management, specialization in Geographic Information Systems.
    The analysis of remotely sensed imagery has become a fundamental task for many environment-centred activities, not just scientific but also management related. In particular, the use of land cover maps depicting a particular study site is an integral part of many research projects, as they are not just a fundamental variable in environmental models but also base information supporting policy decisions. Land cover mapping assisted by supervised classification is today a staple tool of any analyst processing remotely sensed data, insomuch as these techniques allow users to map entire sites of interest in a comprehensive way. Many remote sensing projects are usually interested in a small number of land cover classes present in a study area and not in all classes that make up the landscape. When the focus is on a particular subset of classes of interest, conventional supervised classification may be sub-optimal for the discrimination of these specific target classes. The process of producing a non-exhaustive map, that is, one depicting only the classes of interest to the user, is called specific class mapping. This is the topic of this dissertation. Here, specific class mapping is examined to understand its origins, developments, adoption and current limitations. The main research goal is then to contribute to the understanding and improvement of this topic, while presenting its main constraints in a clear way and proposing enhanced methods within the reach of the non-expert user. In detail, this study starts by analysing the definition of specific class mapping and why the conventional multi-class supervised classification process may yield sub-optimal outcomes. Attention then turns to previous works that have tackled this problem. From here a synthesis is made, categorising and characterising previous methodologies. It is then learnt that the methodologies tackling specific class mapping fall under two broad categories, the binarisation approaches and the single-class approaches, and that both types are not without problems. This is the starting point of the development component of this dissertation, which branches out in three research lines. First, cost-sensitive learning is utilised to improve specific class mapping. Previous studies have shown that it may be susceptible to data imbalance problems present in the training data set, since the classes of interest are often a small part of the training set. As a result, the classification may be biased towards the largest classes and, thus, be sub-optimal for the discrimination of the classes of interest. Here, cost-sensitive learning is used to balance the training data set and minimise the effects of data imbalance. In this approach, errors committed in the minority class are treated as being costlier than errors committed in the majority class. Cost-sensitive approaches are typically implemented by weighting training data points according to their importance to the analysis. By shifting the weight of the data set from the majority class to the minority class, the user can inform the learning process that training data points in the minority class are as critical as the points in the majority class.
    The results of this study indicate that this simple approach is capable of improving the process of specific class mapping by increasing the accuracy with which the classes of interest are discriminated. Second, the combined use of single-class classifiers for specific class mapping is explored. Supervised algorithms for single-class classification are particularly attractive due to their reduced training requirements. Unlike other methods, which require training data for all classes present in the study site regardless of their relevance for the user's particular objective, single-class classifiers rely exclusively on training data for the class of interest. However, these methods can only solve specific classification problems with one class of interest. If more classes are important, those methods cannot be directly utilised. Here, three methodologies are proposed to combine single-class classifiers to map subsets of land cover classes. The results indicate that an intelligent combination of single-class classifiers can be used to achieve accurate results, statistically non-inferior to those of standard multi-class classification, without the need for an exhaustive training set, saving resources that can be allocated to other steps of the data analysis process. Third, the combined use of cost-sensitive and semi-supervised learning to improve specific class mapping is explored. A limitation of the specific class binary approaches is that they still require training data from secondary classes, and that may be costly. On the other hand, a limitation of the specific class single-class approaches is that, while requiring only training data from the specific classes of interest, these methods tend to overestimate the extent of the classes of interest. This is because the classifier is trained without information about the negative part of the classification space. A way to overcome this is with semi-supervised learning, where the data points for the negative class are randomly sampled from the classification space. However, that may include false negatives. To overcome this difficulty, cost-sensitive learning is utilised to mitigate the effect of these potentially mislabelled data points. Cost weights were here defined using an exponential model that assigns more weight to the negative data points that are more likely to be correctly labelled and less to the points that are more likely to be mislabelled. The results show that the accuracy achieved with the proposed method is statistically non-inferior to that achieved with standard binary classification, while requiring much less training effort.
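    The third research line can be illustrated with the small sketch below; the exponential weight model used here is an assumed form, taking distance to the positive training data as a proxy for how likely a randomly sampled negative is correctly labelled, and it does not reproduce the dissertation's exact model.

```python
# Semi-supervised specific class mapping with cost-sensitive weighting of
# randomly sampled negatives.
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.svm import SVC

def train_specific_class(X_pos, X_candidates, n_neg=500, lam=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly sample "negative" points from the classification space;
    # some of them may actually belong to the class of interest.
    neg = X_candidates[rng.choice(len(X_candidates), size=n_neg, replace=False)]

    # Exponential cost model: negatives far from every positive sample get a
    # weight near 1; negatives close to the positives are down-weighted.
    d = pairwise_distances(neg, X_pos).min(axis=1)
    neg_weights = 1.0 - np.exp(-lam * d)

    X = np.vstack([X_pos, neg])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(neg))])
    w = np.concatenate([np.ones(len(X_pos)), neg_weights])
    return SVC(kernel="rbf", probability=True).fit(X, y, sample_weight=w)
```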

    Ensemble learning in the presence of noise

    Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defence: 14-02-2019.
    The availability of large amounts of data from diverse sources greatly expands the possibilities for the intelligent exploitation of information. However, extracting knowledge from raw data is a complex task that requires the development of efficient and robust learning methods. One of the main difficulties in machine learning is the presence of noise in the data. In this thesis, we address the problem of machine learning in the presence of noise. For this purpose, we focus on the use of classifier ensembles. Our goal is to build collections of base learners whose outputs, when combined, improve not only the accuracy but also the robustness of the predictions. A first contribution of this thesis is to exploit the subsampling ratio to build accurate and robust bootstrap-based ensembles (such as bagging or random forests). The idea of using subsampling as a regularization mechanism is also exploited for the detection of noisy examples. Specifically, examples that are misclassified by a given fraction of the ensemble members are marked as noise. The optimal value of this threshold is determined by cross-validation. Noisy instances are either removed (filtering) or have their class labels corrected (cleaning). Finally, an ensemble of classifiers is built using the clean (filtered or cleaned) training data. Another contribution of this thesis is vote-boosting, a sequential ensemble method specifically designed to be robust to noise in the class labels. Vote-boosting reduces the excessive sensitivity to this type of noise of boosting-based algorithms, such as AdaBoost. In general, boosting-based algorithms progressively modify the weight distribution over the training data to emphasize misclassified instances. This greedy approach can end up assigning excessively high weights to instances whose class label is incorrect. In contrast, in vote-boosting the emphasis is based on the level of uncertainty (agreement or disagreement) of the ensemble prediction, regardless of the class label. As in boosting, vote-boosting can be analysed as a gradient descent optimization in functional space. One of the open problems in ensemble learning is how to build combinations of strong classifiers. The main difficulty is achieving diversity among the base classifiers without a significant deterioration of their performance and without excessively increasing the computational cost. In this thesis, we propose building SVM ensembles with the help of randomization and optimization mechanisms. Thanks to this combination of complementary strategies, it is possible to create SVM ensembles that are much faster to train and potentially more accurate than an individually optimized SVM. Finally, we have developed a procedure to build heterogeneous ensembles that interpolate their decisions from homogeneous ensembles composed of different types of classifiers. The optimal composition of the ensemble is determined by cross-validation.
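    A minimal sketch of the noise filtering/cleaning idea is given below: examples misclassified by more than a given fraction of a bagged ensemble are marked as noise and either removed or relabelled with the ensemble's majority vote. The thesis selects the threshold by cross-validation and the details differ; here the threshold is fixed and predictions are made on the full training set for brevity.

```python
# Ensemble-based noise filtering (drop suspects) or cleaning (relabel suspects).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def filter_or_clean(X, y, threshold=0.7, n_trees=100, clean=False, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    preds = []
    for _ in range(n_trees):
        boot = rng.integers(0, len(y), size=len(y))        # bootstrap sample
        preds.append(DecisionTreeClassifier().fit(X[boot], y[boot]).predict(X))
    preds = np.stack(preds, axis=1)

    error_rate = (preds != y[:, None]).mean(axis=1)        # per-example disagreement
    noisy = error_rate > threshold

    if clean:
        # Cleaning: replace suspect labels with the ensemble's majority vote.
        votes = np.stack([(preds == c).sum(axis=1) for c in classes], axis=1)
        y_clean = y.copy()
        y_clean[noisy] = classes[votes[noisy].argmax(axis=1)]
        return X, y_clean
    # Filtering: drop the suspect examples.
    return X[~noisy], y[~noisy]
```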

    Deep Learning Methods for Remote Sensing

    Remote sensing is a field where important physical characteristics of an area are extracted using emitted radiation, generally captured by satellite cameras, sensors onboard aerial vehicles, etc. Captured data help researchers develop solutions to sense and detect various characteristics such as forest fires, flooding, changes in urban areas, crop diseases, soil moisture, etc. The recent impressive progress in artificial intelligence (AI) and deep learning has sparked innovations in technologies, algorithms, and approaches and led to results that were unachievable until recently in multiple areas, among them remote sensing. This book consists of sixteen peer-reviewed papers covering new advances in the use of AI for remote sensing.

    Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications

    Analysis and predictive modeling of massive datasets is an extremely significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. Real data are frequently affected by outliers, uncertain labels, and uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling an even more difficult task. In the present work, we introduce a cost-sensitive learning (CSL) method to deal with the classification of imperfect data. Typically, most traditional approaches for classification demonstrate poor performance in an environment with imperfect data. We propose the use of CSL with the Support Vector Machine, a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the best performance measures for tackling imperfect data, along with addressing real problems in quality control and business analytics.
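    A minimal sketch of one standard form of cost-sensitive SVM learning on imbalanced data is shown below; the cost values and the toy data are assumptions for illustration and do not reproduce the paper's exact method or its applications.

```python
# Cost-sensitive SVM via per-class misclassification costs on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Imbalanced toy data: roughly 5% of samples belong to the minority class.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = SVC(kernel="rbf").fit(X_tr, y_tr)
# Errors on the minority class (label 1) cost ~19x more than majority errors,
# mirroring the inverse class frequencies.
cost_sensitive = SVC(kernel="rbf", class_weight={0: 1, 1: 19}).fit(X_tr, y_tr)

print("plain F1:          ", f1_score(y_te, plain.predict(X_te)))
print("cost-sensitive F1: ", f1_score(y_te, cost_sensitive.predict(X_te)))
```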

    DEVELOPING INNOVATIVE SPECTRAL AND MACHINE LEARNING METHODS FOR MINERAL AND LITHOLOGICAL CLASSIFICATION USING MULTI-SENSOR DATASETS

    The sustainable exploration of mineral resources plays a significant role in the economic development of any nation. Lithological maps and surface mineral distributions can be vital baseline data for narrowing down potential areas for geochemical and geophysical analysis. This study developed innovative spectral and Machine Learning (ML) methods for mineral and lithological classification. Multi-sensor datasets such as Airborne Visible/Infrared Imaging Spectrometer-Next Generation (AVIRIS-NG), Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER), Advanced Land Observing Satellite (ALOS) Phased Array type L-band Synthetic Aperture Radar (PALSAR), Sentinel-1, and Digital Elevation Model (DEM) data were utilized. The study mapped hydrothermal alteration minerals using Spectral Mapping Methods (SMMs), including the Spectral Angle Mapper (SAM), Spectral Information Divergence (SID), and SIDSAMtan, with high-resolution AVIRIS-NG hyperspectral data in the Hutti-Maski area (India). SIDSAMtan outperforms SID and SAM in mineral mapping. An optimum threshold selection method based on a spectral similarity matrix of target and non-target classes was developed to implement the SMMs successfully. Three new effective SMMs, namely the Dice Spectral Similarity Coefficient (DSSC), the Kumar-Johnson Spectral Similarity Coefficient (KJSSC), and their hybrid, KJDSSCtan, have been proposed, and they outperform the existing SMMs (i.e., SAM, SID, and SIDSAMtan) in the spectral discrimination of spectrally similar minerals. The developed optimum threshold selection and the proposed SMMs are recommended for accurate mineral mapping using hyperspectral data. Integrated spectral enhancement and ML methods have been developed to perform automated lithological classification using AVIRIS-NG hyperspectral data. The Support Vector Machine (SVM) outperforms Random Forest (RF) and Linear Discriminant Analysis (LDA) in lithological classification. The performance of the SVM also shows the least sensitivity to the number and uncertainty of training samples. This study proposed a multi-sensor dataset-based optimal integration of spectral, morphological, and textural characteristics of rocks for accurate lithological classification using ML models. Different input features, such as (a) spectral, (b) spectral and transformed spectral, (c) spectral and morphological, (d) spectral and textural, and (e) an optimum hybrid, were evaluated for lithological classification. The developed approach was assessed in the Chattarpur area (India), which consists of rocks with similar spectral characteristics in poorly exposed, weathered, and partially vegetated terrain. The optimal hybrid input features outperform the other input features in accurately classifying different rock types using the SVM and RF models, with accuracy ~15% higher than that obtained using spectral input features alone. The developed integrated approach of spectral enhancement and ML algorithms, together with the multi-sensor dataset-based optimal integration of spectral, morphological, and textural characteristics of rocks, is recommended for accurate lithological classification. The developed methods can be effectively utilized in other remote sensing applications, such as vegetation/forest mapping and soil classification.
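    For reference, the sketch below implements the basic Spectral Angle Mapper (SAM) similarity used as a baseline in this work; the SID, SIDSAMtan, DSSC and KJSSC measures and the optimum threshold selection procedure are not reproduced, and the threshold value is left to the user.

```python
# Spectral Angle Mapper: classify each pixel by the smallest spectral angle
# to a set of reference mineral spectra, subject to an angle threshold.
import numpy as np

def sam_angles(pixels, reference):
    """Spectral angles (radians) between each pixel spectrum and one reference."""
    cos = pixels @ reference / (np.linalg.norm(pixels, axis=1) *
                                np.linalg.norm(reference))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def sam_classify(image, references, threshold):
    """image: (rows, cols, bands); references: {mineral name: (bands,) spectrum}.
    Returns per-pixel index of the best-matching reference (-1 above threshold)
    and the list of reference names."""
    rows, cols, bands = image.shape
    pixels = image.reshape(-1, bands).astype(float)
    names = list(references)
    angles = np.stack([sam_angles(pixels, np.asarray(references[n], dtype=float))
                       for n in names], axis=1)
    best = angles.argmin(axis=1)
    labels = np.where(angles.min(axis=1) <= threshold, best, -1)
    return labels.reshape(rows, cols), names
```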