155 research outputs found

    Misclassification analysis for the class imbalance problem

    In classification, the class imbalance problem typically causes the learning algorithm to be dominated by the majority classes, so that features of the minority classes are sometimes ignored. This also indirectly affects how humans visualise the data. Special care is therefore needed in the learning algorithm to improve accuracy on the minority classes. In this study, the use of misclassification analysis is investigated for data re-distribution. Several under-sampling techniques and hybrid techniques based on misclassification analysis are proposed. Benchmark data sets obtained from the University of California Irvine (UCI) machine learning repository are used to evaluate the performance of the proposed techniques. The results show that the proposed hybrid technique gives the best performance in the experiments.
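    As a rough illustration of this kind of misclassification-guided re-distribution (not the paper's exact procedure), the hedged Python sketch below uses a preliminary classifier to flag misclassified majority-class samples and keeps them, together with all minority samples and a random subset of the remaining majority samples; the dataset, classifier and selection rule are assumptions.

```python
# Hedged sketch: under-sampling the majority class guided by a preliminary
# classifier's misclassifications (illustrative, not the paper's exact method).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Imbalanced toy data standing in for a UCI benchmark set (assumption).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
majority, minority = 0, 1

# 1) Preliminary model: find which majority samples the learner gets wrong.
pred = cross_val_predict(RandomForestClassifier(n_estimators=100, random_state=0),
                         X, y, cv=5)
maj_idx = np.where(y == majority)[0]
min_idx = np.where(y == minority)[0]
hard_maj = maj_idx[pred[maj_idx] != majority]   # misclassified majority samples

# 2) Re-distribute: keep all minority samples, the misclassified majority
#    samples, and a random subset of the remaining (easy) majority samples.
easy_maj = np.setdiff1d(maj_idx, hard_maj)
rng = np.random.default_rng(0)
kept_easy = rng.choice(easy_maj, size=len(min_idx), replace=False)
keep = np.concatenate([min_idx, hard_maj, kept_easy])

# 3) Retrain on the re-distributed (under-sampled) data.
final = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[keep], y[keep])
```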

    Modelling of Floods in Urban Areas

    This Special Issue publishes the latest advances and developments in the modelling of flooding in urban areas, contributing to our scientific understanding of flooding processes and the appropriate evaluation of flood impacts. Its nine papers, written by more than forty authors, present novel methodologies including flood forecasting methods, data acquisition techniques, experimental research in urban drainage and/or sustainable drainage systems, and new numerical and simulation approaches.

    Learning from imbalanced data in face re-identification using ensembles of classifiers

    Face re-identification is a video surveillance application in which systems for video-to-video face recognition are designed using faces of individuals captured from video sequences, and seek to recognize them when they appear in archived or live videos captured over a network of video cameras. Video-based face recognition applications face challenges due to variations in capture conditions such as pose and illumination. Two further challenges in this application are: (1) the imbalanced data distributions between the face captures of the individuals to be re-identified and those of other individuals, and (2) a varying degree of imbalance during operations w.r.t. the design data. Learning from imbalanced data is challenging in general, due in part to the bias of most two-class classification systems towards correct classification of the majority (negative, or non-target) class (face images/frames captured from individuals not to be re-identified) at the expense of the minority (positive, or target) class (face images/frames captured from the individual to be re-identified), because most two-class classification systems are intended to be used under balanced data conditions. Several techniques have been proposed in the literature to learn from imbalanced data that either use data-level techniques to rebalance data (by under-sampling the majority class, up-sampling the minority class, or both) for training classifiers, or use algorithm-level methods to guide the learning process (with or without cost-sensitive approaches) such that the bias of performance towards correct classification of the majority class is neutralized. Ensemble techniques such as Bagging and Boosting algorithms have been shown to utilize these methods efficiently to address imbalance. However, these techniques face several issues in the literature: (1) some informative samples may be neglected by random under-sampling, and adding synthetic positive samples through up-sampling adds to training complexity; (2) cost factors must be pre-known or found; (3) classification systems are often optimized and compared using performance measures (like accuracy) that are unsuitable for the imbalance problem; and (4) most learning algorithms are designed and tested on a fixed imbalance level of data, which may differ from operational scenarios. The objective of this thesis is to design specialized classifier ensembles to address the issue of imbalance in the face re-identification application and, as sub-goals, to avoid the above-mentioned issues faced in the literature. In addition, achieving an efficient classifier ensemble requires a learning algorithm that designs and combines component classifiers with a suitable diversity-accuracy trade-off. To reach the objective of the thesis, four major contributions are made, presented in three chapters and summarized in the following. In Chapter 3, a new application-based sampling method is proposed to group samples for under-sampling in order to improve the diversity-accuracy trade-off between classifiers of the ensemble. The proposed sampling method takes advantage of the fact that, in face re-identification applications, facial regions of the same person appearing in a camera field of view may be regrouped based on their trajectories found by a face tracker. A partitional Bagging ensemble method is proposed that accounts for possible variations in the imbalance level of the operational data by combining classifiers that are trained on different imbalance levels. In this method, all samples are used for training classifiers and information loss is therefore avoided. In Chapter 4, a new ensemble learning algorithm called Progressive Boosting (PBoost) is proposed that progressively inserts uncorrelated groups of samples into a Boosting procedure to avoid losing information while generating a diverse pool of classifiers. From one iteration to the next, the PBoost algorithm accumulates these uncorrelated groups of samples into a set that grows gradually in size and imbalance. This algorithm is more sophisticated than the one proposed in Chapter 3 because, instead of training the base classifiers on this set, the base classifiers are trained on balanced subsets sampled from this set and validated on the whole set. Therefore, the base classifiers are more accurate while robustness to imbalance is not jeopardized. In addition, sample selection is based on weights assigned to samples according to their importance. The computational complexity of PBoost is also lower than that of Boosting ensemble techniques in the literature for learning from imbalanced data, because not all of the base classifiers are validated on all negative samples. A new loss factor is also proposed for use in PBoost to avoid biasing performance towards the negative class. Using this loss factor, the weight update of samples and the classifier contributions to final predictions are set according to the ability of classifiers to recognize both classes. In comparing the performance of the classifier systems in Chapters 3 and 4, a need arises for an evaluation space that compares classifiers in terms of a suitable performance metric over all of their decision thresholds, different imbalance levels of test data, and different preferences between classes. The F-measure is often used to evaluate two-class classifiers on imbalanced data, yet no global evaluation space was available in the literature for this measure. Therefore, in Chapter 5, a new global evaluation space for the F-measure is proposed that is analogous to the cost curves for expected cost. In this space, a classifier is represented as a curve that shows its performance over all of its decision thresholds and a range of possible imbalance levels, for the desired preference of true positive rate to precision. These properties are missing in the ROC and precision-recall spaces. This space also allows the performance of specialized ensemble learning methods for imbalance to be improved empirically under a given operating condition. Through validation, the base classifiers are combined using a modified version of the iterative Boolean combination algorithm, in which the selection criterion is the F-measure instead of the AUC and the combination is carried out for each operating condition. The proposed approaches in this thesis were validated and compared using synthetic data and videos from the Faces In Action and COX datasets, which emulate face re-identification applications. Results show that the proposed techniques outperform state-of-the-art techniques over different levels of imbalance and overlap between classes.
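    As a rough, hedged illustration of the Bagging-style idea from Chapter 3 (training each base classifier on all positive samples plus one partition of the negative samples, so that no sample is discarded overall), a minimal Python sketch follows; the trajectory-based grouping, the PBoost loss factor and the Boolean combination described above are not reproduced, and the base learner and partitioning rule are illustrative assumptions.

```python
# Hedged sketch of a Bagging-style ensemble for imbalanced data: each base
# classifier sees all positives plus one partition of the negatives, so no
# negative sample is discarded overall (illustrative only, not PBoost).
import numpy as np
from sklearn.svm import SVC

def train_partitioned_ensemble(X, y, n_groups=5, seed=0):
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    neg_groups = np.array_split(rng.permutation(neg), n_groups)
    ensemble = []
    for group in neg_groups:
        idx = np.concatenate([pos, group])
        clf = SVC(probability=True).fit(X[idx], y[idx])
        ensemble.append(clf)
    return ensemble

def predict_score(ensemble, X):
    # Average the positive-class probabilities of all base classifiers.
    return np.mean([clf.predict_proba(X)[:, 1] for clf in ensemble], axis=0)
```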

    Novel MLR-RF-Based Geospatial Techniques: A Comparison with OK

    Geostatistical estimation methods rely on experimental variograms that are often erratic, leading to subjective model fitting and to the assumption of normal distributions during conditional simulations. In contrast, Machine Learning Algorithms (MLA) are (1) free of such limitations and (2) able to incorporate information from multiple sources, and they therefore attract increasing interest for real-time resource estimation and automation. However, MLAs need to be explored for robust learning of phenomena, better accuracy, and computational efficiency. This paper compares MLAs, i.e., Multiple Linear Regression (MLR) and Random Forest (RF), with Ordinary Kriging (OK). The techniques were applied to the publicly available Walker Lake sample dataset and validated against the exhaustive Walker Lake dataset. The results of MLR were significant (p < 10 × 10−5), with a correlation coefficient of 0.81 (R-square = 0.65) compared to 0.79 (R-square = 0.62) from the RF and OK methods. Additionally, MLR was automated (free from the intermediary step of variogram modelling required in OK), produced unbiased estimates, identified key samples representing different zones, and had higher computational efficiency.
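    A minimal sketch of the kind of MLR-versus-RF comparison reported above is given below, assuming scikit-learn and a generic sample file; the file name, predictor columns and train/test split are illustrative assumptions, and the Ordinary Kriging baseline is omitted.

```python
# Illustrative comparison of Multiple Linear Regression and Random Forest
# regressors, reporting correlation coefficient and R-square as in the study
# (file name and column names are assumptions, not the authors' exact setup).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("walker_lake_samples.csv")          # hypothetical sample file
X = df[["x", "y", "secondary_var"]].values           # assumed predictor columns
target = df["v"].values                              # assumed target variable

X_tr, X_te, y_tr, y_te = train_test_split(X, target, test_size=0.3, random_state=0)

for name, model in [("MLR", LinearRegression()),
                    ("RF", RandomForestRegressor(n_estimators=500, random_state=0))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    r = np.corrcoef(y_te, pred)[0, 1]
    print(f"{name}: r = {r:.2f}, R-square = {r2_score(y_te, pred):.2f}")
```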

    Bootstrap-Inspired Techniques in Computational Intelligence


    Advancement of field-deployable, computer-vision wood identification technology

    Globally, illegal logging poses a significant threat, resulting in environmental damage as well as lost profits for legitimate wood product producers and lost taxes for governments; a global value of $30 to $100 billion is estimated to be associated with illegal logging and processing. Field identification of wood species is fundamental to combating species fraud and misrepresentation in the global wood trade. Using computer vision wood identification (CVWID) systems, wood can be identified without the need for time-consuming and costly offsite visual inspections by trained wood anatomists. While CVWID research has received significant attention, most studies have not considered the generalization capabilities of the models by testing them on field samples, and report only overall accuracy without considering misclassifications. The aim of this dissertation is to advance the design and development of CVWID systems by addressing three objectives: 1) to develop functional, field-deployable CVWID models for Peruvian and North American hardwoods, 2) to test the ability of CVWID to solve increasingly challenging problems (e.g., larger class sizes, lower anatomical diversity, and spatial heterogeneity in the context of porosity), and 3) to evaluate generalization capabilities by testing models on independent specimens not included in training and analyzing misclassifications. This research features four main sections: 1) an introduction summarizing each chapter, 2) a chapter (Chapter 2) developing a 24-class model for Peruvian hardwoods and testing its generalization capabilities with independent specimens not used in training, 3) a chapter (Chapter 3) on the design and implementation of a continental-scale 22-class model for North American diffuse-porous hardwoods using wood anatomy-driven model performance evaluation, and 4) a chapter (Chapter 4) on the development of a 17-class model for North American ring-porous hardwoods, in particular examining the model's effectiveness in dealing with the greater spatial heterogeneity of ring-porous hardwoods.
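    In outline, the specimen-level, misclassification-aware evaluation described above could look like the hedged Python sketch below; it assumes per-image predictions from a trained CVWID model are already available, aggregates them per specimen by a simple majority vote, and reports a confusion matrix rather than overall accuracy alone. The function names and the voting rule are illustrative, not the dissertation's.

```python
# Hedged sketch: specimen-level evaluation of a trained wood-ID classifier,
# reporting a confusion matrix so misclassifications are visible, not just
# overall accuracy (names and majority-vote rule are illustrative assumptions).
from collections import Counter
from sklearn.metrics import accuracy_score, confusion_matrix

def specimen_predictions(image_preds, specimen_ids):
    """Majority vote over the per-image predictions of each specimen."""
    votes = {}
    for pred, sid in zip(image_preds, specimen_ids):
        votes.setdefault(sid, []).append(pred)
    return {sid: Counter(p).most_common(1)[0][0] for sid, p in votes.items()}

def evaluate(image_preds, specimen_ids, specimen_labels, class_names):
    per_spec = specimen_predictions(image_preds, specimen_ids)
    y_true = [specimen_labels[sid] for sid in per_spec]
    y_pred = [per_spec[sid] for sid in per_spec]
    print("specimen accuracy:", accuracy_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred, labels=class_names))
```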

    Collective Machine Learning: Team Learning and Classification in Multi-Agent Systems

    This dissertation focuses on teams of multiple heterogeneous, intelligent agents (hardware or software) that collaborate to learn a task and are capable of sharing knowledge. The concept of collaborative learning in multi-agent and multi-robot systems is largely understudied, and represents an area where further research is needed to gain a deeper understanding of team learning. This work presents experimental results which illustrate the importance of heterogeneous teams of collaborative learning agents, and outlines heuristics that govern the successful construction of teams of classifiers. A number of application domains are studied in this dissertation. One approach focuses on the effects of knowledge sharing and collaboration among multiple heterogeneous, intelligent agents (hardware or software) that work together to learn a task. As each agent employs a different machine learning technique, the system consists of multiple knowledge sources and their respective heterogeneous knowledge representations. Collaboration between agents involves sharing knowledge both to speed up team learning and to refine the team's overall performance and group behavior. Experiments were performed that vary the team composition in terms of machine learning algorithms, learning strategies employed by the agents, and sharing frequency for a predator-prey cooperative pursuit task. For lifelong learning, heterogeneous learning teams were more successful than their homogeneous counterparts. Interestingly, sharing increased the learning rate, but sharing with higher frequency showed diminishing returns. Lastly, knowledge conflicts were reduced over time as more sharing took place. These results support further investigation of the merits of heterogeneous learning. This dissertation also focuses on discovering heuristics for constructing successful teams of heterogeneous classifiers, including many aspects of team learning and collaboration. In one application, multi-agent machine learning and classifier combination are utilized to learn rock facies sequences from wireline well log data. Gas and oil reservoirs have been the focus of modeling efforts for many years in an attempt to locate zones with high volumes. Certain subsurface layers and layer sequences, such as those containing shale, are known to be impermeable to gas and/or liquid. Oil and natural gas then become trapped by these layers, making it possible to drill wells to reach the supply and extract it for use. The drilling of these wells, however, is costly. Here, the focus is on how to construct a successful set of classifiers, which periodically collaborate, to increase the classification accuracy. Utilizing multiple, heterogeneous collaborative learning agents is shown to be successful for this classification problem. We were able to obtain 84.5% absolute accuracy using the Multi-Agent Collaborative Learning Architecture, an improvement of about 6.5% over the best results achieved by the Kansas Geological Survey with the same data set. Several heuristics are presented for constructing teams of multiple collaborative classifiers for predicting rock facies. Another application utilizes multi-agent machine learning and classifier combination to learn water presence using airborne polar radar data acquired from Greenland in 1999 and 2007. Ground and airborne depth-soundings of the Greenland and Antarctic ice sheets have been used for many years to determine characteristics such as ice thickness, subglacial topography, and mass balance of large bodies of ice. Ice coring efforts have supported these radar data to provide ground truth for validation of the state (wet or frozen) of the interface between the bottom of the ice sheet and the underlying bedrock. The subglacial state governs the friction, flow speed, transport of material, and overall change of the ice sheet. In this dissertation, we focus on how to construct a successful set of classifiers which periodically collaborate to increase classification accuracy. The underlying method results in radar independence, allowing model transfer from 1999 to 2007 to produce water-presence maps of the Greenland ice sheet with differing radars. We were able to obtain 86% accuracy using the Multi-Agent Collaborative Learning Architecture with this data set. Utilizing multiple, heterogeneous collaborative learning agents is shown to be successful for this classification problem as well. Several heuristics, some of which agree with those found in the other applications, are presented for constructing teams of multiple collaborative classifiers for predicting subglacial water presence. General findings from these different experiments suggest that constructing a team of classifiers using a heterogeneous mixture of homogeneous teams is preferred. Larger teams generally perform better, as decisions from multiple learners can be combined to arrive at a consensus decision. Employing heterogeneous learning algorithms integrates different error models to arrive at higher-accuracy classification from complementary knowledge bases. Collaboration, although not found to be universally useful, offers certain team configurations an advantage. Collaboration with low to medium frequency was found to be beneficial, while high-frequency collaboration was found to be detrimental to team classification accuracy. Full-mode learning, where each learner receives the entire training set for the learning phase, consistently outperforms independent-mode learning, where the training set is distributed to all learners in a team in a non-overlapping fashion. Results presented in this dissertation support the application of multi-agent machine learning and collaboration to current challenging, real-world classification problems.
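    As a minimal, hedged sketch of the consensus idea described above (far simpler than the Multi-Agent Collaborative Learning Architecture itself, with no collaboration or knowledge sharing between learners), a heterogeneous team of classifiers can be combined through voting as follows; the choice of base learners is an assumption.

```python
# Hedged sketch of a heterogeneous classifier team with consensus voting,
# in the spirit of (but far simpler than) the Multi-Agent Collaborative
# Learning Architecture described above.
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

team = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("nb", GaussianNB()),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    voting="soft",   # combine class probabilities to reach a consensus decision
)

# Full-mode training: every learner sees the entire training set.
# team.fit(X_train, y_train); team.predict(X_test)
```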

    Application of random-forest machine learning algorithm for mineral predictive mapping of Fe-Mn crusts in the World Ocean

    Mineral prospectivity mapping constitutes an efficient tool for delineating the areas of highest interest to guide future exploration. Multiple knowledge-driven approaches have been applied over the last decades to create prospectivity maps for deep-sea ferromanganese (Fe-Mn) crusts. Here, the results of a data-driven approach are presented, making use of an extensive compilation of Fe-Mn crust occurrences in the World Ocean and of the recent increase in global marine datasets. A Random Forest machine learning algorithm is applied, and the results are compared with previously established expert-driven maps. Optimal predictive conditions for the algorithm are observed for (i) a forest size of more than a hundred trees, (ii) a training dataset larger than 10%, and (iii) more than two predictors used at each node. The confusion matrix and out-of-bag errors on the remaining unused data highlight excellent predictive capabilities of the trained model, with a prediction accuracy of 87.2% for Fe-Mn crusts and 98.2% for non-crust locations and a Cohen's kappa index of 0.84, validating its application for prediction at the World scale. The slope of the seafloor, sediment thickness, sediment type, biological productivity, and abyssal mountains constitute the five strongest explanatory variables in predicting the occurrence of Fe-Mn crusts. Most ‘hand-drawn’ knowledge-driven prospective areas are also considered prospective by the random forest algorithm, with notable exceptions along the coast of the American continent. However, poor correlation is observed with knowledge-driven GIS-based criterion mapping, as the Random Forest considers most target areas from the GIS approach un-prospective. Overall, the Random Forest prediction performs better than the GIS approach in predicting a high chance of Fe-Mn crust occurrence in ISA-licensed areas, which constitutes an external validation of the predictive quality of the random forest model.
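    The evaluation reported above (out-of-bag error, confusion matrix, Cohen's kappa and variable importance) can be reproduced in outline with a standard Random Forest workflow such as the hedged sketch below; the file and column names are placeholders for the compiled occurrence data and marine geospatial layers, not the authors' exact inputs.

```python
# Illustrative Random Forest workflow matching the evaluation reported above
# (OOB error, confusion matrix, Cohen's kappa, variable importance); the
# predictor names are placeholders for the marine geospatial layers used.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv("fe_mn_crust_occurrences.csv")      # hypothetical compiled dataset
predictors = ["slope", "sediment_thickness", "sediment_type",
              "productivity", "abyssal_mountain"]     # assumed layer names
X_tr, X_te, y_tr, y_te = train_test_split(df[predictors], df["crust"],
                                          test_size=0.5, random_state=0)

rf = RandomForestClassifier(n_estimators=200,   # more than a hundred trees
                            max_features=3,     # more than two predictors per node
                            oob_score=True, random_state=0).fit(X_tr, y_tr)

print("OOB score:", rf.oob_score_)
pred = rf.predict(X_te)
print(confusion_matrix(y_te, pred))
print("Cohen's kappa:", cohen_kappa_score(y_te, pred))
print(dict(zip(predictors, rf.feature_importances_)))
```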

    Systematic analysis of the impact of slurry coating on manufacture of Li-ion battery electrodes via explainable machine learning

    The manufacturing process strongly affects the electrochemical properties and performance of lithium-ion batteries. In particular, the flow of the electrode slurry during the coating process is key to the final electrode properties and hence to the characteristics of lithium-ion cells; however, it has been given little consideration. In this paper the effect of slurry structure is studied, for a graphite anode, through the physical and rheological properties and their impact on the final electrode characteristics. As quantifying the impact of the large number of interconnected control variables on the electrode is a challenging task via traditional trial-and-error approaches, an explainable machine learning methodology together with a systematic statistical analysis method is proposed for comprehensive assessments. The analysis is based on a lab-scale experimental dataset involving 9 main factors and 6 variables of interest, covering a practical range of variables through various combinations. While the predictability of the response variables is evaluated via linear and nonlinear models, complementary techniques are utilised for variable importance, contribution, and first- and second-order effects to increase model transparency. While the coating gap is identified as the most influential factor for all considered responses, other subtle relationships are also extracted, highlighting that dimensionless numbers can serve as strong predictors for the models. The impact of slurry viscosity and surface tension on electrode thickness, coat weight and porosity is also extracted, demonstrating their importance for electrode quality. These variables have rarely been considered in previous works, as the relationships are difficult to extract by trial and error due to interdependencies. Here we demonstrate how model-based analysis can overcome these difficulties and pave the way towards an optimised electrode manufacturing process for next-generation lithium-ion batteries.
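    A hedged sketch of the kind of variable-importance analysis described above is given below, using permutation importance on a nonlinear regressor; the factor and response names are assumptions standing in for the paper's coating variables, and the paper's own explainable-ML pipeline may differ.

```python
# Hedged sketch: ranking manufacturing factors by permutation importance for
# one electrode response (factor and response names are illustrative only).
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("coating_experiments.csv")               # hypothetical lab dataset
factors = ["coating_gap", "viscosity", "surface_tension",
           "line_speed", "solids_content"]                # assumed control variables
response = "coat_weight"                                  # one assumed response variable

X_tr, X_te, y_tr, y_te = train_test_split(df[factors], df[response], random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Permutation importance on held-out data ranks the factors by their
# contribution to predicting the chosen response.
imp = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for name, mean in sorted(zip(factors, imp.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {mean:.3f}")
```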