
    Generalised Decision Level Ensemble Method for Classifying Multi-media Data

    In recent decades, multimedia data have been widely generated and used in domains such as healthcare and social media, owing to their ability to capture rich information. However, because such data are unstructured and stored separately, fusing and integrating multimedia datasets and learning from them effectively has been a major challenge for machine learning. We present a novel generalised decision level ensemble method (GDLEM) that combines multimedia datasets at the decision level. After extracting features from each multimedia dataset separately, the method trains models independently on each media dataset and then employs a generalised selection function to choose the appropriate models to construct a heterogeneous ensemble. The selection function is defined as a weighted combination of two criteria: the accuracy of individual models and the diversity among the models. The framework is tested on multimedia data and compared with other heterogeneous ensembles. The results show that the GDLEM is more flexible and effective.
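The weighted accuracy-plus-diversity criterion described above could be sketched as follows. The function names, the choice of pairwise disagreement as the diversity measure, and the weight `w` are illustrative assumptions, not the authors' exact formulation.

```python
# Hypothetical sketch of a weighted model-selection criterion combining
# accuracy and diversity, as described in the GDLEM abstract.

def pairwise_disagreement(preds_a, preds_b):
    """Fraction of instances on which two models' predictions differ."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def selection_score(accuracy, candidate_preds, ensemble_preds, w=0.5):
    """Weighted combination of a candidate model's accuracy and its
    average diversity against models already chosen for the ensemble."""
    if not ensemble_preds:              # first model: no diversity term yet
        return w * accuracy
    diversity = sum(pairwise_disagreement(candidate_preds, p)
                    for p in ensemble_preds) / len(ensemble_preds)
    return w * accuracy + (1 - w) * diversity
```

In use, one would greedily add the candidate with the highest score until the desired ensemble size is reached; setting `w = 1` recovers accuracy-only selection.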

    Machine Learning Ensemble Methods for Classifying Multi-media Data

    Multimedia data have, over recent years, been produced in many fields. They have important applications in areas as diverse as social media and healthcare, due to their capacity to capture rich information. However, their unstructured and separated nature gives rise to various problems. In particular, fusing and integrating multimedia datasets and finding effective ways to learn from them have proven to be major challenges for machine learning. In this thesis we investigated the development of ensemble methods for classifying multimedia data in two key aspects: data fusion and model selection. For data fusion, we devised two different strategies. The first is the Feature Level Ensemble Method (FLEM), which aggregates all the features into a single dataset and then generates the models used to build ensembles from this dataset. The second is the Decision Level Ensemble Method (DLEM), which generates models from each sub-dataset individually and then aggregates their outputs with a decision fusion function. For model selection we derived four different rules. The first rule, R0, uses accuracy alone to select models. Rules R1 and R2 use first accuracy and then diversity to select models. In R3, we defined a generalised function that combines accuracy and diversity with different weights to select models for an ensemble. Our methods were compared with well-known existing ensemble methods using the same dataset and another dataset that became available after our methods had been developed. The results were critically analysed, and statistical significance analyses show that our methods performed better in general and that the generalised rule R3 is the most effective for building ensembles.
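The decision-level fusion step of the DLEM could be illustrated as below. Majority voting is assumed here as the decision fusion function; the thesis may use a different combiner, and the modality names are invented for the example.

```python
# Illustrative sketch of decision-level fusion: each per-modality model
# votes, and a fusion function combines the votes per instance.

from collections import Counter

def decision_fusion(per_model_predictions):
    """Majority-vote fusion. per_model_predictions is a list of
    prediction lists, one list per modality-specific model."""
    fused = []
    for votes in zip(*per_model_predictions):   # votes for one instance
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused

# Hypothetical models trained on text, image and audio features
votes = [[1, 0, 1, 1],   # text model
         [1, 1, 0, 1],   # image model
         [0, 0, 1, 1]]   # audio model
print(decision_fusion(votes))  # -> [1, 0, 1, 1]
```

The feature-level alternative (FLEM) would instead concatenate the three feature sets before training, trading independence of the per-modality models for a single joint representation.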

    Quantitative Modelling of Climate Change Impact on Hydro-climatic Extremes

    In recent decades, climate change has made the climate more volatile, and extreme events such as severe rainstorms, heatwaves and floods are likely to become more frequent. Aiming to reveal the impact of climate change on hydro-climatic extremes in a quantitative sense, this thesis presents a comprehensive analysis in three main strands. The first strand focuses on developing a quantitative modelling framework to quantify the spatiotemporal variation of hydro-climatic extremes for the areas of concern. A spatial random sampling toolbox (SRS-GDA) is designed for randomising the regions of interest (ROIs) with different geographic locations, sizes, shapes and orientations, within which the hydro-climatic extremes are parameterised by a non-stationary distribution model whose parameters are assumed to be time-varying. The variation of these parameters with respect to the spatial features of the ROIs and to climate change is then quantified by statistical models such as the generalised linear model. The framework is applied to quantify the spatiotemporal variation of rainfall extremes in Great Britain (GB) and Australia, and is further used in a comparison study to quantify the bias between observed and climate-projected extremes. The framework is then extended to a multivariate setting to estimate the time-varying joint probability of more than one hydro-climatic variable from the perspective of non-stationarity; a case study evaluating compound floods in Ho Chi Minh City, Vietnam demonstrates its application. The second strand aims to recognise, classify and track the development of hydro-climatic extremes (e.g., severe rainstorms) through a stable computer algorithm, the SPER toolbox. The SPER toolbox can detect the boundary of an event area and extract the spatial and physical features of the event, which can be used not only for pattern recognition but also to support AI-based training for labelling and cataloguing patterns in large, grid-based, multi-scale environmental datasets. Three illustrative cases are provided and, as the front end of the AI study, an example of training a convolutional neural network to classify rainfall extremes in GB over the last century is given. The third strand supports decision making by building both theory-driven and data-driven models to simulate decisions in the context of flood forecasting and early warning, using data collected via laboratory-style experiments based on various probabilistic flood forecasts and their consequences. The research demonstrated in this thesis bridges knowledge gaps in the field and provides critical insight for managing future risks arising from hydro-climatic extremes, which is timely given the urgency of climate change and the challenges our societies face.
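The core "time-varying parameter" idea of the first strand can be shown in miniature: model the location of annual rainfall maxima as a linear function of time, mu(t) = a + b*t, and fit it by ordinary least squares. The real framework fits a full non-stationary distribution model (with GLM-type links for its parameters); this sketch, with invented data, only illustrates the concept of a trend in an extreme-value parameter.

```python
# Minimal sketch of a time-varying location parameter for annual maxima:
# mu(t) = a + b * t, fitted by ordinary least squares. Data are synthetic.

def fit_linear_trend(years, maxima):
    n = len(years)
    mean_t = sum(years) / n
    mean_y = sum(maxima) / n
    b = (sum((t - mean_t) * (y - mean_y) for t, y in zip(years, maxima))
         / sum((t - mean_t) ** 2 for t in years))
    a = mean_y - b * mean_t
    return a, b                      # mu(t) = a + b * t

years = [2000, 2001, 2002, 2003, 2004]
maxima = [50.0, 52.0, 55.0, 57.0, 60.0]   # synthetic annual maxima (mm)
a, b = fit_linear_trend(years, maxima)
print(round(b, 2))                   # -> 2.5  (positive slope: intensifying extremes)
```

A non-stationary analysis would embed such a trend inside the distribution itself (e.g. a GEV location parameter varying with time) rather than fitting the maxima directly.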

    Tree-based ensemble methods for predicting PV power generation and their comparison with support vector regression

    The variability of renewable energy resources, driven by characteristic weather fluctuations, introduces uncertainty in generation output that is greater than the relatively predictable uncertainty in demand, which the grid manages with conventional energy reserves. The high variability of renewable generation makes forecasting critical for optimal balancing and dispatch of generation plants in a smarter grid. The challenge is to improve the accuracy and the confidence level of forecasts at a reasonable computational cost. Ensemble methods such as random forest (RF) and extra trees (ET) are well suited to predicting stochastic photovoltaic (PV) generation output, as they reduce variance and bias by combining several machine learning techniques while improving stability, i.e. generalisation capability. This paper investigated the accuracy, stability and computational cost of RF and ET for predicting hourly PV generation output, and compared their performance with support vector regression (SVR), a supervised machine learning technique. All developed models have comparable predictive power and are equally applicable for predicting hourly PV output. Despite this, ET outperformed RF and SVR in terms of computational cost. The stability and algorithmic efficiency of ET make it an ideal candidate for wider deployment in PV output forecasting.
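The computational-cost difference between RF and ET comes largely from how each picks a split threshold: RF searches candidate thresholds for the best variance reduction, while ET draws the threshold at random within the feature's range. The sketch below shows that single-split difference for one feature; it is a toy illustration, not either library's implementation.

```python
# Toy contrast of RF-style (searched) vs ET-style (random) split thresholds
# for one numeric feature in a regression setting.

import random

def variance_reduction(y_left, y_right):
    """Drop in variance achieved by splitting the targets into two groups."""
    def var(ys):
        if not ys:
            return 0.0
        m = sum(ys) / len(ys)
        return sum((v - m) ** 2 for v in ys) / len(ys)
    n = len(y_left) + len(y_right)
    return var(y_left + y_right) - (len(y_left) * var(y_left)
                                    + len(y_right) * var(y_right)) / n

def rf_split(xs, ys):
    """RF-style: evaluate every candidate midpoint, keep the best one."""
    srt = sorted(xs)
    return max(((a + b) / 2 for a, b in zip(srt, srt[1:])),
               key=lambda t: variance_reduction(
                   [y for x, y in zip(xs, ys) if x <= t],
                   [y for x, y in zip(xs, ys) if x > t]))

def et_split(xs, rng):
    """ET-style: a uniformly random threshold inside the feature range."""
    return rng.uniform(min(xs), max(xs))

xs = [1, 2, 3, 10, 11, 12]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
print(rf_split(xs, ys))                       # -> 6.5 (the gap between groups)
print(min(xs) <= et_split(xs, random.Random(0)) <= max(xs))  # -> True
```

Skipping the threshold search at every node is what makes ET cheaper to train; averaging many such trees recovers accuracy.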

    On the class overlap problem in imbalanced data classification.

    Class imbalance is an active research area in the machine learning community. However, existing and recent literature has shown that class overlap has a higher negative impact on the performance of learning algorithms than class imbalance. This paper provides a detailed critical discussion and objective evaluation of class overlap in the context of imbalanced data and its impact on classification accuracy. First, we present a thorough experimental comparison of class overlap and class imbalance. Unlike previous work, our experiment was carried out on the full scale of class overlap and an extreme range of class imbalance degrees. Second, we provide an in-depth critical technical review of existing approaches to handling imbalanced datasets. Existing solutions from the selected literature are critically reviewed and categorised as class distribution-based or class overlap-based methods. Emerging techniques and the latest developments in this area are also discussed in detail. Experimental results in this paper are consistent with existing literature and show clearly that the performance of learning algorithms deteriorates across varying degrees of class overlap, whereas class imbalance does not always have an effect. The review emphasises the need for further research into handling class overlap in imbalanced datasets to effectively improve learning algorithms' performance.
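One simple, assumed way to make "degree of class overlap" concrete is the fraction of instances whose nearest neighbour belongs to the other class: near 0 for well-separated classes, high when the class regions interleave. This is an illustration of the concept on 1-D toy data, not the paper's exact measure.

```python
# Quantify class overlap as the fraction of points whose nearest
# neighbour carries the other class label (1-D toy illustration).

def nearest_neighbour_overlap(points, labels):
    crossed = 0
    for i, p in enumerate(points):
        j = min((k for k in range(len(points)) if k != i),
                key=lambda k: abs(points[k] - p))
        crossed += labels[j] != labels[i]
    return crossed / len(points)

separated = ([0.0, 0.1, 0.2, 5.0, 5.1, 5.2], [0, 0, 0, 1, 1, 1])
overlapped = ([0.0, 0.05, 0.1, 0.02, 0.07, 0.12], [0, 0, 0, 1, 1, 1])
print(nearest_neighbour_overlap(*separated))   # -> 0.0
print(nearest_neighbour_overlap(*overlapped))  # -> 1.0 (fully interleaved)
```

Note that the second dataset is perfectly balanced yet maximally overlapped, which mirrors the paper's point that overlap, not imbalance alone, is what degrades classifiers.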

    Learning from class-imbalanced data: overlap-driven resampling for imbalanced data classification.

    Classification of imbalanced datasets has attracted substantial research interest in recent years. This is because imbalanced datasets are common in several domains, such as health, finance and security, yet learning algorithms are generally not designed to handle them. Many existing solutions focus mainly on the class distribution problem. However, a number of reports have shown that class overlap has a higher negative impact on the learning process than class imbalance. This thesis thoroughly explores the impact of class overlap on learning algorithms and demonstrates how eliminating class overlap can effectively improve the classification of imbalanced datasets. Novel undersampling approaches were developed with the main objective of enhancing the presence of minority class instances in the overlapping region. This is achieved by identifying and removing majority class instances potentially residing in that region. Seven methods under two different approaches were designed for the task. Extensive experiments were carried out to evaluate the methods on simulated and well-known real-world datasets. Results showed that substantial improvement in the classification accuracy of the minority class was obtained with favourable trade-offs against majority class accuracy. Moreover, a successful application of the methods to predictive diagnostics of diseases with imbalanced records is presented. These novel overlap-based approaches have several advantages over other common resampling methods. First, the undersampling amount is independent of class imbalance and proportional to the degree of overlap, which effectively addresses class overlap while reducing the effect of class imbalance. Second, information loss is minimised because instance elimination is contained within the problematic region. Third, adaptive parameters enable the methods to generalise across different problems. It is also worth noting that these methods provide different trade-offs, offering real-world users more alternatives when selecting the best-fit solution to their problem.
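The principle behind overlap-driven undersampling could be sketched as follows: drop majority-class instances that sit in the overlapping region, identified here as those whose k nearest neighbours are mostly minority instances. The thesis describes seven more refined variants; this 1-D sketch with invented data shows only the basic idea, and all names are illustrative.

```python
# Hedged sketch of overlap-driven undersampling: remove majority-class
# points whose local neighbourhood is dominated by the minority class.

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i] (1-D distance)."""
    return sorted((j for j in range(len(points)) if j != i),
                  key=lambda j: abs(points[j] - points[i]))[:k]

def overlap_undersample(points, labels, majority=0, k=3):
    """Return indices to keep after removing overlapping majority points."""
    keep = []
    for i, lab in enumerate(labels):
        if lab == majority:
            neighbours = knn_indices(points, i, k)
            if sum(labels[j] != majority for j in neighbours) > k // 2:
                continue           # majority point inside the overlap: drop
        keep.append(i)
    return keep

points = [0.0, 0.2, 0.4, 2.0, 2.1, 2.2, 2.3, 4.0, 4.1, 4.2]
labels = [0,   0,   0,   0,   1,   1,   1,   0,   0,   0]   # 0 = majority
print(overlap_undersample(points, labels))  # -> [0, 1, 2, 4, 5, 6, 7, 8, 9]
```

Only the majority point at 2.0, which lies inside the minority cluster, is removed; majority points far from the overlap are untouched, matching the thesis's claim that elimination is contained within the problematic region.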