140 research outputs found

    Improving minority classes' prediction accuracy using the geometric SMOTE algorithm

    Douzas, G., Bacao, F., Fonseca, J., & Khudinyan, M. (2019). Imbalanced learning in land cover classification: Improving minority classes' prediction accuracy using the geometric SMOTE algorithm. Remote Sensing, 11(24), [3040]. https://doi.org/10.3390/rs11243040
    The automatic production of land use/land cover maps continues to be a challenging problem, with important impacts on the ability to promote sustainability and good resource management. The ability to build robust automatic classifiers and produce accurate maps can have a significant impact on the way we manage and optimize natural resources. The difficulty in achieving these results comes from many different factors, such as data quality and uncertainty. In this paper, we address the imbalanced learning problem, a common and difficult problem in remote sensing that affects the quality of classification results, by proposing Geometric-SMOTE, a novel oversampling method. Geometric-SMOTE is an oversampling algorithm that increases the quality of the generated instances relative to previous methods, such as the synthetic minority oversampling technique (SMOTE). The performance of Geometric-SMOTE on the LUCAS (Land Use/Cover Area Frame Survey) dataset is compared to that of other oversamplers using a variety of classifiers. The results show that Geometric-SMOTE significantly outperforms all the other oversamplers and improves the robustness of the classifiers. These results indicate that, when using imbalanced datasets, remote sensing researchers should consider the use of these new-generation oversamplers to increase the quality of the classification results.
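As a rough illustration of the baseline this abstract builds on, the sketch below shows plain SMOTE-style interpolation: a synthetic minority sample is placed at a random position on the segment between a minority point and one of its nearest minority neighbours. This is not the Geometric-SMOTE algorithm from the paper, whose generation mechanism is not described in the abstract. The function name, parameters, and toy data are hypothetical.

```python
import math
import random


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Illustrative only: assumes `minority` holds at least two points,
    each given as a tuple of floats.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical two-feature minority class
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Because every synthetic point is a convex combination of two existing minority points, it always falls inside the region already spanned by the minority class.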

    Using LUCAS survey and Recurrent Neural Networks to produce LCLU classification based on a Satellite Image time series of Sentinel-2

    Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence.
    The need for timely and accurate information about the territory has increased over the years, making Land Cover Land Use (LCLU) mapping one of the most common applications of remote sensing. Recently, advances in satellite technology and open access policies for remote sensing data have increased the interest in exploring satellite image time series. In addition, the attention of researchers has shifted from standard machine learning algorithms (e.g., Support Vector Machines and Random Forest) to Recurrent Neural Networks because of their ability to exploit sequential information. However, acquiring reference data to train these algorithms is still a hurdle. This study aims to evaluate the capability of a Gated Recurrent Unit to perform pixel-level LCLU classification of a satellite image time series, using Sentinel-2 imagery with the LUCAS survey as reference data. To assess the performance of our model, we compared it to state-of-the-art classifiers (SVM and RF). Because of the imbalanced nature of the LUCAS survey, we applied oversampling to this dataset to increase the performance of our models, testing three different oversampling techniques. The results showed that Recurrent Neural Networks did not outperform the other state-of-the-art algorithms when trained with a limited number of sampling units, and that oversampling the LUCAS survey increased the performance of all the classifiers. Finally, we were able to demonstrate that it is possible to produce LCLU classification of satellite image time series using only open data, namely Sentinel-2 imagery and the LUCAS survey as reference data.

    Introducing artificial data generation in active learning for land use/land cover classification

    Fonseca, J., Douzas, G., & Bacao, F. (2021). Increasing the effectiveness of active learning: Introducing artificial data generation in active learning for land use/land cover classification. Remote Sensing, 13(13), 1-20. [2619]. https://doi.org/10.3390/rs13132619
    In remote sensing, Active Learning (AL) has become an important technique to collect informative ground truth data “on-demand” for supervised classification tasks. Despite its effectiveness, it is still significantly reliant on user interaction, which makes it both expensive and time consuming to implement. Most of the current literature focuses on the optimization of AL by modifying the selection criteria and the classifiers used. Although improvements in these areas will result in more effective data collection, the use of artificial data sources to reduce human–computer interaction remains unexplored. In this paper, we introduce a new component to the typical AL framework, the data generator, a source of artificial data to reduce the amount of user-labeled data required in AL. The implementation of the proposed AL framework is done using Geometric SMOTE as the data generator. We compare the new AL framework to the original one using similar acquisition functions and classifiers over three AL-specific performance metrics in seven benchmark datasets. We show that this modification of the AL framework significantly reduces cost and time requirements for a successful AL implementation in all of the datasets used in the experiment.

    Quantifying soybean phenotypes using UAV imagery and machine learning, deep learning methods

    Crop breeding programs aim to introduce new cultivars with improved traits to help address the food crisis. Food production needs to grow at twice its current rate to feed the increasing number of people by 2050. Soybean is one of the major grains in the world, and the US alone contributes around 35 percent of world soybean production. To increase soybean production, breeders still rely on conventional breeding strategies, which are mainly a 'trial and error' process. These constraints limit the expected progress of crop breeding programs. The goal was to quantify the soybean phenotypes of plant lodging and pubescence color using UAV-based imagery and advanced machine learning. Plant lodging and soybean pubescence color are two of the most important phenotypes for soybean breeding programs. Soybean lodging and pubescence color are conventionally evaluated visually by breeders, which is time-consuming and subject to human error. The goal of this study was to investigate the potential of unmanned aerial vehicle (UAV)-based imagery and machine learning in the assessment of lodging conditions, and of deep learning in the assessment of the pubescence color of soybean breeding lines. A UAV imaging system equipped with an RGB (red-green-blue) camera was used to collect the imagery data of 1,266 four-row plots in a soybean breeding field at the reproductive stage. Soybean lodging scores and pubescence scores were visually assessed by experienced breeders. Lodging scores were grouped into four classes, i.e., non-lodging, moderate lodging, high lodging, and severe lodging. Pubescence color scores were grouped into three classes, i.e., gray, tawny, and segregation. UAV images were stitched to build orthomosaics, and soybean plots were segmented using a grid method. Twelve image features were extracted from the collected images to assess the lodging scores of each breeding line.
    Four models, i.e., extreme gradient boosting (XGBoost), random forest (RF), K-nearest neighbor (KNN), and artificial neural network (ANN), were evaluated to classify soybean lodging classes. Five data pre-processing methods were used to treat the imbalanced dataset and improve the classification accuracy. Results indicate that the pre-processing method SMOTE-ENN consistently performs well for all four classifiers (XGBoost, RF, KNN, and ANN), achieving the highest overall accuracy (OA), lowest misclassification, higher F1-score, and higher Kappa coefficient. This suggests that Synthetic Minority Over-sampling-Edited Nearest Neighbor (SMOTE-ENN) may be an excellent pre-processing method for classification tasks on unbalanced datasets. Furthermore, an overall accuracy of 96 percent was obtained using the SMOTE-ENN dataset and the ANN classifier. To classify the soybean pubescence color, seven pre-trained deep learning models, i.e., DenseNet121, DenseNet169, DenseNet201, ResNet50, InceptionResNet-V2, Inception-V3, and EfficientNet, were used, and images of each plot were fed into the model. The data were augmented using two rotation and two scaling factors to enlarge the dataset. Among the seven pre-trained deep learning models, the ResNet50 and DenseNet121 classifiers showed a higher overall accuracy of 88 percent, along with higher precision, recall, and F1-score for all three classes of pubescence color. In conclusion, the developed UAV-based high-throughput phenotyping system can gather image features to estimate and classify crucial soybean phenotypes, which will help breeders assess phenotypic variation in breeding trials. Also, RGB imagery-based classification could be a cost-effective choice for breeders and associated researchers in plant breeding programs for identifying superior genotypes. Includes bibliographical references.
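The SMOTE-ENN method named in this abstract combines SMOTE oversampling with an Edited Nearest Neighbours (ENN) cleaning step. The sketch below illustrates only the cleaning step, in its majority-vote form: a sample is dropped when its label disagrees with the majority label of its k nearest neighbours. It is a simplified illustration under assumed toy data, not the pipeline used in the study; the function name and the example points are hypothetical.

```python
import math
from collections import Counter


def edited_nearest_neighbours(points, labels, k=3):
    """Keep only samples whose label matches the majority of their k nearest neighbours."""
    kept_points, kept_labels = [], []
    for i, (p, y) in enumerate(zip(points, labels)):
        # indices of all other samples, sorted by distance to p
        others = [j for j in range(len(points)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(p, points[j]))[:k]
        majority_label, _ = Counter(labels[j] for j in nearest).most_common(1)[0]
        if majority_label == y:
            kept_points.append(p)
            kept_labels.append(y)
    return kept_points, kept_labels


# Hypothetical data: one "b" sample sits in the middle of the "a" cluster
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

kept_points, kept_labels = edited_nearest_neighbours(points, labels)
print(len(kept_points), kept_labels)
```

In this toy case only the stray "b" sample at (0.5, 0.5) is removed, which is the kind of class-overlap noise the cleaning step is meant to reduce after oversampling.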

    DATA MINING AND IMAGE CLASSIFICATION USING GENETIC PROGRAMMING

    Genetic programming (GP), a capable machine learning and search method inspired by Darwinian evolution, is an evolutionary learning algorithm that automatically evolves computer programs in the form of trees to solve problems. This thesis studies the application of GP to data mining and image processing. Knowledge discovery and data mining have been widely used in business, healthcare, and scientific fields. In data mining, classification is a supervised learning task that identifies new patterns and maps the data to predefined targets. A GP-based classifier is developed to perform these mappings. GP has been investigated in a series of studies to classify data; however, certain aspects have not previously been studied. We propose an optimized GP classifier based on a combination of subtree pruning and a new fitness function. An orthogonal least squares algorithm is also applied in the training phase to create a robust GP classifier. The proposed GP classifier is validated by 10-fold cross-validation. Three areas were studied in this thesis. The first investigation resulted in an optimized genetic-programming-based classifier that directly solves multi-class classification problems. Instead of defining static thresholds as boundaries to differentiate between multiple labels, our work presents a method of classification in which a GP system learns the relationships among experiential data and models them mathematically during the evolutionary process. Our approach has been assessed on six multiclass datasets. The second investigation was to develop a GP classifier to segment and detect brain tumors in magnetic resonance imaging (MRI) images. The findings indicated the high accuracy of brain tumor classification provided by our GP classifier. The results confirm the strong ability of the developed technique to handle complicated image classification problems.
    The third was to develop a hybrid system for multiclass imbalanced data classification using GP and SMOTE, which was tested on satellite images. The findings showed that the proposed approach improves both training and test results when the SMOTE technique is incorporated. We also compared our approach with previous GP algorithms in terms of speed. The results show that the developed classifier provides an efficient and rapid method for classification tasks that outperforms previous methods on more challenging multiclass classification problems. We tested the approaches presented in this thesis on publicly available datasets and images. The findings were statistically tested to confirm the robustness of the developed approaches.

    The Role of Synthetic Data in Improving Supervised Learning Methods: The Case of Land Use/Land Cover Classification

    A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management.
    In remote sensing, Land Use/Land Cover (LULC) maps constitute important assets for various applications, promoting environmental sustainability and good resource management. However, their production continues to be a challenging task. Various factors contribute to the difficulty of generating accurate, timely updated LULC maps, whether via automatic or photo-interpreted LULC mapping. Data preprocessing, being a crucial step for any Machine Learning (ML) task, is particularly important in the remote sensing domain because of the overwhelming amount of raw, unlabeled data continuously gathered from multiple remote sensing missions. However, a significant part of the state of the art focuses on scenarios with full access to labeled training data with relatively balanced class distributions. This thesis focuses on the challenges found in automatic LULC classification tasks, specifically in data preprocessing tasks. We focus on the development of novel Active Learning (AL) and imbalanced learning techniques to improve ML performance in situations with limited training data and/or rare classes. We also show that many of the contributions presented are successful not only in remote sensing problems, but also in various other multidisciplinary classification problems. The work presented in this thesis used open access datasets to test the contributions made in imbalanced learning and AL. All the data pulling, preprocessing and experiments are made available at https://github.com/joaopfonseca/publications. The algorithmic implementations are made available in the Python package ml-research at https://github.com/joaopfonseca/ml-research

    Survey on highly imbalanced multi-class data

    Machine learning technology has a massive impact on society because it offers solutions to many complicated problems such as classification, clustering analysis, and prediction, especially during the COVID-19 pandemic. Data distribution in machine learning has been an essential aspect in providing unbiased solutions. From the earliest literature published on highly imbalanced data until recently, machine learning research has focused mostly on binary classification problems. Research on highly imbalanced multi-class data remains largely unexplored, even as the need for better analysis and prediction in handling Big Data grows. This study reviews the models and techniques for handling highly imbalanced multi-class data, along with their strengths, weaknesses, and related domains. Furthermore, the paper uses a statistical method to explore a case study with a severely imbalanced dataset. This article aims to (1) understand the trend of highly imbalanced multi-class data through analysis of the related literature; (2) analyze the previous and current methods of handling highly imbalanced multi-class data; (3) construct a framework for highly imbalanced multi-class data. The chosen highly imbalanced multi-class dataset is also analyzed and adapted to current machine learning methods and techniques, followed by discussions on open challenges and the future direction of highly imbalanced multi-class data. Finally, this paper presents a novel framework for highly imbalanced multi-class data. We hope this research can provide insights into the potential development of better methods or techniques to handle and manipulate highly imbalanced multi-class data.

    Class imbalance ensemble learning based on the margin theory

    The proportion of instances belonging to each class in a data-set plays an important role in machine learning. However, real-world data often suffer from class imbalance. Dealing with multi-class tasks with different misclassification costs per class is harder than dealing with two-class ones. Undersampling and oversampling are two of the most popular data preprocessing techniques for dealing with imbalanced data-sets. Ensemble classifiers have been shown to be more effective than data sampling techniques at enhancing the classification performance on imbalanced data. Moreover, the combination of ensemble learning with sampling methods to tackle the class imbalance problem has led to several proposals in the literature, with positive results. The ensemble margin is a fundamental concept in ensemble learning. Several studies have shown that the generalization performance of an ensemble classifier is related to the distribution of its margins on the training examples. In this paper, we propose a novel ensemble-margin-based algorithm, which handles imbalanced classification by employing more low-margin examples, which are more informative than high-margin samples. This algorithm combines ensemble learning with undersampling, but instead of balancing classes randomly, as UnderBagging does, our method pays attention to constructing higher-quality balanced sets for each base classifier. To demonstrate the effectiveness of the proposed method in handling class-imbalanced data, UnderBagging and SMOTEBagging are used in a comparative analysis. In addition, we also compare the performance of different ensemble margin definitions, including both supervised and unsupervised margins, in class imbalance learning.
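The abstract refers to supervised and unsupervised ensemble margins without defining them. A minimal sketch of two definitions commonly used in the ensemble literature follows: the supervised margin compares the votes for the true class with the votes for the strongest other class, while the unsupervised margin compares the two most-voted classes and needs no label. These are assumed here for illustration; the paper's exact definitions may differ, and the function names and vote data are hypothetical.

```python
from collections import Counter


def supervised_margin(votes, true_label):
    """(votes for true class - most votes for any other class) / total votes, in [-1, 1]."""
    counts = Counter(votes)
    true_votes = counts.get(true_label, 0)
    best_other = max((n for c, n in counts.items() if c != true_label), default=0)
    return (true_votes - best_other) / len(votes)


def unsupervised_margin(votes):
    """(votes for top class - votes for runner-up class) / total votes, in [0, 1]."""
    ranked = [n for _, n in Counter(votes).most_common(2)]
    second = ranked[1] if len(ranked) > 1 else 0
    return (ranked[0] - second) / len(votes)


# Hypothetical votes from five base classifiers for one training example
votes = ["crop", "crop", "forest", "crop", "water"]

sup = supervised_margin(votes, true_label="crop")
unsup = unsupervised_margin(votes)
wrong = supervised_margin(votes, true_label="forest")
print(sup, unsup, wrong)
```

Under these definitions, examples with margins near zero or below are the ones the ensemble is least sure about, which matches the abstract's description of low-margin examples as the more informative ones.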

    Deep Learning for Remote Sensing Image Processing

    Remote sensing images have many applications, such as ground object detection, environmental change monitoring, urban growth monitoring, and natural disaster damage assessment. As of 2019, there were roughly 700 satellites listing “earth observation” as their primary application. Both the spatial and temporal resolutions of satellite images have improved consistently in recent years, providing opportunities to resolve fine details on the Earth's surface. In the past decade, deep learning techniques have revolutionized many applications in the field of computer vision but have not been fully explored in remote sensing image processing. In this dissertation, several state-of-the-art deep learning models have been investigated and customized for satellite image processing in the applications of landcover classification and ground object detection. First, a simple and effective Convolutional Neural Network (CNN) model is developed to detect fresh soil from tunnel digging activities near the U.S. and Mexico border by using pansharpened synthetic hyperspectral images. These tunnels’ exits are usually hidden under warehouses and are used for illegal activities, for example, by drug dealers. Detecting fresh soil nearby is an indirect way to search for these tunnels. While multispectral images have been used widely and regularly in remote sensing since the 1970s, hyperspectral imagery is becoming popular with the fast advances in hyperspectral sensors. A combination of 80 synthetic hyperspectral channels with the original eight multispectral channels collected by the WorldView-2 satellite is used by the CNN to detect fresh soil. Experimental results show that detection performance can be significantly improved by combining synthetic hyperspectral images with the original multispectral channels.
    Second, an end-to-end, pixel-level Fully Convolutional Network (FCN) model is implemented to estimate the number of refugee tents in the Rukban area near the Syrian-Jordan border using high-resolution multispectral satellite images collected by WorldView-2. Rukban is a desert area crossing the border between Syria and Jordan, and thousands of Syrian refugees have fled into this area since the Syrian civil war in 2014. In the past few years, the number of refugee shelters for the forcibly displaced Syrian refugees in this area has increased rapidly. Estimating the location and number of refugee tents has become a key factor in maintaining the sustainability of the refugee shelter camps. Manually counting the shelters is labor-intensive and sometimes prohibitive given the large quantities. In addition, these shelters/tents are usually small in size, irregular in shape, and sparsely distributed over a very large area, and could easily be missed by traditional image-analysis techniques, which makes image-based approaches challenging as well. The FCN model is also boosted by transfer learning using the knowledge in the pre-trained VGG-16 model. Experimental results show that the FCN model is very accurate, with an error of less than 2%. Last, we investigate Generative Adversarial Networks (GAN) to augment training data and improve the training of the FCN model for refugee tent detection. Segmentation-based methods like FCN require a large amount of finely labeled images for training. In practice, producing these labels is labor-intensive, time-consuming, and tedious. This data-hungry problem is currently a big hurdle for this application. Experimental results show that the GAN model is a better tool than traditional methods for data augmentation. Overall, our research made a significant contribution to remote sensing image processing.

    Computational intelligence contributions to readmission risk prediction in Healthcare systems

    136 p. The thesis tackles the problem of readmission risk prediction in healthcare systems from a machine learning and computational intelligence point of view. Readmission has been recognized as an indicator of healthcare quality with primary economic importance. We examine two specific instances of the problem, emergency department (ED) admission and heart failure (HF) patient care, using anonymized datasets from three institutions to carry out real-life computational experiments validating the proposed approaches. The main difficulties posed by this kind of dataset are the high class imbalance ratio and the low informative value of the recorded variables. This thesis reports the results of innovative class balancing approaches and new classification architectures.