
    Methods to Improve the Prediction Accuracy and Performance of Ensemble Models

    The application of ensemble predictive models has been an important research area in predicting medical diagnostics, engineering diagnostics, and other related smart devices and technologies. Most current predictive models are complex and not reliable despite numerous past efforts by the research community. The performance accuracy of predictive models has not always been realised due to many factors, such as complexity and class imbalance. There is therefore a need to improve the predictive accuracy of current ensemble models and to enhance their application and reliability as non-invasive predictive tools. The research work presented in this thesis has adopted a pragmatic phased approach to propose and develop new ensemble models using multiple methods, and validated the methods through rigorous testing and implementation in different phases. The first phase comprises empirical investigations of standalone and ensemble algorithms, carried out to ascertain the effects of classifier complexity and simplicity on performance. The second phase comprises an improved ensemble model based on the integration of the Extended Kalman Filter (EKF), Radial Basis Function Network (RBFN) and AdaBoost algorithms. The third phase comprises an extended model based on early-stopping concepts, the AdaBoost algorithm, and the statistical performance of the training samples, designed to minimize overfitting of the proposed model. The fourth phase comprises an enhanced analytical multivariate logistic regression predictive model developed to minimize complexity and improve the prediction accuracy of the logistic regression model. To facilitate the practical application of the proposed models, an ensemble non-invasive analytical tool is proposed and developed. The tool bridges the gap between theoretical concepts and their practical application to predict breast cancer survivability. The empirical findings suggested that: (1) increasing the complexity and topology of algorithms does not necessarily lead to better algorithmic performance; (2) boosting by resampling performs slightly better than boosting by reweighting; (3) the proposed ensemble EKF-RBFN-AdaBoost model achieved better prediction accuracy than several established ensemble models; (4) the proposed early-stopped model converges faster and minimizes overfitting better compared with other models; (5) the proposed multivariate logistic regression concept minimizes model complexity; and (6) the proposed non-invasive analytical tool performed comparatively better than many of the benchmark analytical tools used in predicting breast cancer and diabetic ailments. The research contributions to ensemble practice are: (1) the integration and development of the EKF, RBFN and AdaBoost algorithms as an ensemble model; (2) the development and validation of an ensemble model based on early-stopping concepts, AdaBoost, and statistical properties of the training samples; (3) the development and validation of a predictive logistic regression model for breast cancer; and (4) the development and validation of a non-invasive breast cancer analytical tool based on the predictive models proposed and developed in this thesis. To validate the prediction accuracy of the ensemble models, the proposed models were applied in this thesis to breast cancer survivability and diabetes diagnostic tasks. In comparison with other established models, the simulation results showed improved predictive accuracy.
The research outlines the benefits of the proposed models, and proposes new directions for future work that could further extend and improve the models discussed in this thesis.
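
    The following minimal Python sketch illustrates one building block discussed above, boosting by resampling: each AdaBoost round draws a bootstrap sample according to the current example weights instead of passing the weights directly to the learner. The dataset (scikit-learn's Wisconsin diagnostic breast cancer data) and the decision-stump base learner are illustrative stand-ins only; the thesis's model uses an EKF-trained RBFN as the base learner.

# Hedged sketch: AdaBoost by resampling with a simple base learner. Dataset, base learner,
# and number of rounds are illustrative assumptions, not the thesis's EKF-RBFN configuration.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y = np.where(y == 0, -1, 1)                      # AdaBoost convention: labels in {-1, +1}
n, T = len(y), 25                                # T boosting rounds
w = np.full(n, 1.0 / n)                          # start with uniform sample weights
learners, alphas = [], []
rng = np.random.default_rng(0)

for t in range(T):
    # boosting by resampling: draw a bootstrap sample according to the current weights
    idx = rng.choice(n, size=n, replace=True, p=w)
    stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
    pred = stump.predict(X)
    err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)        # learner weight
    w *= np.exp(-alpha * y * pred)               # reweight samples for the next round
    w /= w.sum()
    learners.append(stump)
    alphas.append(alpha)

# Ensemble prediction: sign of the weighted vote over all rounds
F = sum(a * m.predict(X) for a, m in zip(alphas, learners))
print("training accuracy:", np.mean(np.sign(F) == y))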

    Statistical/climatic models to predict and project extreme precipitation events dominated by large-scale atmospheric circulation over the central-eastern China

    Global warming has had non-negligible effects on regional extreme precipitation changes and has increased the uncertainties meteorologists face when predicting such extremes. More importantly, floods, landslides, and waterlogging caused by extreme precipitation have had catastrophic societal impacts and led to steep economic damages across the world, in particular over central-eastern China (CEC), where heavy precipitation due to Meiyu-front and typhoon activities often causes flood disasters. There is mounting evidence that anomalous atmospheric circulation systems and water vapor transport play a dominant role in triggering and maintaining regional extreme precipitation processes. Both understanding and accurately predicting extreme precipitation events based on these anomalous signals are pressing issues in hydrological research. In this thesis, the self-organizing map (SOM) and event synchronization were used to cluster the large-scale atmospheric circulation reflected by geopotential height at 500 hPa and to quantify the level of synchronization between the identified circulation patterns and extreme precipitation over CEC. With an understanding of which patterns and corresponding water vapor transport fields were associated with extreme precipitation events, a hybrid deep learning model combining a multilayer perceptron and convolutional neural networks (MLP-CNN) was proposed to achieve binary predictions of extreme precipitation. The inputs to MLP-CNN were the anomalous fields of geopotential height at 500 hPa and vertically integrated water vapor transport (IVT). Compared with the original MLP, CNN, and two other machine learning models (random forest and support vector machine), MLP-CNN showed the best performance. Additionally, given the coarse spatial resolution of global circulation models (GCMs) and their large biases in extreme precipitation estimation, a new precipitation downscaling framework combining ensemble learning and a nonhomogeneous hidden Markov model (Ensemble-NHMM) was developed to improve the reliability of GCMs in historical simulations and future projections. The performance of downscaled precipitation from reanalysis and GCM datasets was validated against gauge observations and also compared with the results of the traditional NHMM. Finally, the Ensemble-NHMM downscaling model was applied to future scenario data from the GCM. For the projected change trends in precipitation over CEC in the early, middle and late 21st century under different emission scenarios, the possible causes were discussed in terms of both thermodynamic and dynamic factors. The main results are as follows. (1) The large-scale atmospheric circulation patterns and associated water vapor transport fields synchronized with extreme precipitation events over CEC were quantitatively identified, as well as the contribution of circulation pattern changes to extreme precipitation changes and their teleconnection with the interdecadal modes of the ocean. Firstly, based on the nonparametric Pettitt test, it was found that 23% of rain gauges had significant abrupt changes in annual extreme precipitation from 1960 to 2015. The average change point in annual extreme precipitation frequency and amount occurred near 1989. Complex network analysis showed that the rain gauges highly synchronized on extreme precipitation events can be clustered into four clusters based on modularity information. Secondly, the dominant circulation patterns over CEC were robustly identified based on the SOM.
From the period 1960–1989 to 1990–2015, the categories of identified circulation patterns remained almost unchanged. Among these, the circulation patterns characterized by obvious positive anomalies of 500 hPa geopotential height over the eastern Eurasian continent and negative anomalies over the surrounding oceans are highly synchronized with extreme precipitation events. An obvious water vapor channel originating from the northern Indian Ocean, driven by southwesterly airflow, was observed for the representative circulation patterns (those synchronized with extreme precipitation). Finally, the circulation pattern changes produced an increase in extreme precipitation frequency from 1960–1989 to 1990–2015. Empirical mode decomposition of the annual frequency variation signals of the representative circulation patterns showed that the 2–4 yr oscillation in annual frequency was closely related to the phase of the El Niño-Southern Oscillation (ENSO), while the 20–25 yr and 42–50 yr periodic oscillations were responses to the Pacific Decadal Oscillation and the Atlantic Multidecadal Oscillation. (2) A regional extreme precipitation prediction model was constructed. Two deep learning models, an MLP and a CNN, were linearly stacked and used two atmospheric variables associated with extreme precipitation, namely geopotential height at 500 hPa and IVT. The hybrid model can learn both local-scale information with the MLP and large-scale circulation information with the CNN. Validation results showed that the MLP-CNN model can predict extreme or non-extreme precipitation days with an overall accuracy of 86%. The MLP-CNN also showed excellent seasonal transferability, with 81% accuracy on a testing set drawn from different seasons than the training set. MLP-CNN significantly outperformed other machine learning models, including the MLP, CNN, random forest, and support vector machine. Additionally, the MLP-CNN can be used to produce precursor signals 1 to 2 days in advance, though the accuracy drops quickly as the number of precursor days increases. (3) The GCMs seriously underestimated extreme precipitation over CEC but showed convincing results in reproducing large-scale atmospheric circulation patterns. The accuracies of 10 GCMs in simulating extreme precipitation and large-scale atmospheric circulation were evaluated. First, five indices were selected to measure the characteristics of extreme precipitation, and the performance of the GCMs was compared to the gauge-based daily precipitation analysis dataset over the Chinese mainland. The results showed that, except for FGOALS-g3, most GCMs can reproduce the spatial distribution characteristics of average precipitation from 1960 to 2015. However, all GCMs failed to accurately estimate extreme precipitation, with large underestimation (relative bias exceeding 85%). In addition, using the circulation patterns identified from the fifth-generation reanalysis data (ERA5) as benchmarks, the GCMs can reproduce most circulation pattern (CP) types for the periods 1960–1989 and 1990–2015. In terms of the spatial similarity of the identified CPs, MPI-ESM1-2-HR was superior. (4) To improve the reliability of precipitation simulations and future projections from GCMs, a new statistical downscaling framework was proposed. This framework comprises two models: ensemble learning and the NHMM. First, extreme gradient boosting (XGBoost) and random forest (RF) were selected as the base and meta classifiers for constructing the ensemble learning model.
Based on the top 50 principal components of geopotential height at 500 hPa and IVT, this model was trained to predict the occurrence probabilities of different levels of daily precipitation (no rain, very light, light, moderate, and heavy precipitation) aggregated over multiple sites. Confusion matrix results showed that the ensemble learning model had sufficient accuracy in classifying no-rain versus rain days (>88%) and in predicting moderate precipitation events (>83%). Subsequently, precipitation downscaling was done using the probability sequences of daily precipitation as large-scale predictors for the NHMM. Statistical metrics showed that the Ensemble-NHMM downscaled results matched the gauge observations best in precipitation variability and extreme precipitation simulation, compared with the results of the model that directly used circulation variables as predictors. Finally, the downscaling model also performed well in the historical simulations of MPI-ESM1-2-HR, reproducing the change trends of annual precipitation and the means of the total extreme precipitation index. (5) Three climate scenarios with different Shared Socioeconomic Pathways and Representative Concentration Pathways (SSPs) were selected to project future precipitation change trends. The Ensemble-NHMM downscaling model was applied to the scenario data from MPI-ESM1-2-HR. Projection results showed that CEC would receive about 30% more precipitation in the future through the 2075–2100 period. Compared with the recent 26-year epoch (1990–2015), the frequency and magnitude of extreme precipitation would increase by 21.9–48.1% and 12.3–38.3% respectively under the worst emission scenario (SSP585). In particular, the southern CEC region is projected to receive more extreme precipitation than the north. Investigations of thermodynamic and dynamic factors showed that climate warming would increase the probability of stronger water vapor convergence over CEC. More wet weather states due to enhanced water vapor transport, as well as increasingly favorable large-scale atmospheric circulation and a strengthened pressure gradient, would be the factors behind the increased precipitation.
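
    A minimal sketch of the hybrid MLP-CNN idea in PyTorch, assuming illustrative grid sizes, layer widths, and feature counts rather than the architecture actually used in the thesis: an MLP branch handles local-scale predictors while a CNN branch handles gridded anomaly fields (e.g. 500 hPa geopotential height and IVT), and their features are fused for a binary extreme/non-extreme prediction.

# Hedged sketch of a hybrid MLP-CNN binary classifier. Shapes and widths are assumptions.
import torch
import torch.nn as nn

class MLPCNN(nn.Module):
    def __init__(self, n_local_features=16, grid_shape=(2, 32, 32)):
        super().__init__()
        # MLP branch for local, station-scale predictors
        self.mlp = nn.Sequential(nn.Linear(n_local_features, 64), nn.ReLU(),
                                 nn.Linear(64, 32), nn.ReLU())
        # CNN branch for anomaly fields (channels: geopotential height and IVT)
        self.cnn = nn.Sequential(nn.Conv2d(grid_shape[0], 8, 3, padding=1), nn.ReLU(),
                                 nn.MaxPool2d(2),
                                 nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(4), nn.Flatten())
        self.head = nn.Linear(32 + 16 * 4 * 4, 1)    # fused features -> extreme / non-extreme

    def forward(self, x_local, x_grid):
        z = torch.cat([self.mlp(x_local), self.cnn(x_grid)], dim=1)
        return self.head(z)                           # raw logit; train with BCEWithLogitsLoss

# Minimal usage with random tensors standing in for real predictors
model = MLPCNN()
logit = model(torch.randn(8, 16), torch.randn(8, 2, 32, 32))
print(logit.shape)  # torch.Size([8, 1])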

    Machine learning methods for the estimation of weather and animal-related power outages on overhead distribution feeders

    Doctor of Philosophy, Department of Electrical and Computer Engineering, Sanjoy Das and Anil Pahwa. Because a majority of day-to-day activities rely on electricity, it plays an important role in daily life, and in this digital world most people's lives depend on it. Without electricity, the flip of a switch would no longer produce instant light, televisions and refrigerators would be nonexistent, and hundreds of conveniences often taken for granted would be impossible. Electricity has become a basic necessity, and so any interruption in service due to disturbances in power lines causes great inconvenience to customers. Customers and utility commissions expect a high level of reliability. Power distribution systems are geographically dispersed, and exposure to the environment makes them a highly vulnerable part of power systems with respect to failures and interruption of service to customers. Following the restructuring and increased competition in the electric utility industry, distribution system reliability has acquired greater significance. A better understanding of the causes and consequences of distribution interruptions is helpful in maintaining distribution systems, designing reliable systems, installing protection devices, and addressing environmental issues. Various events, such as equipment failure, animal activity, tree fall, wind, and lightning, can negatively affect power distribution systems. Weather is one of the primary causes affecting distribution system reliability. Unfortunately, as weather-related outages are highly random, predicting their occurrence is an arduous task. To study the impact of weather on overhead distribution systems, several models, such as linear and exponential regression models, a neural network model, and ensemble methods, are presented in this dissertation. The models were extended to study the impact of animal activity on outages in overhead distribution systems. Outage, lightning, and weather data from 2005 to 2011 for four Kansas cities of various sizes were provided by Westar Energy, Topeka, and the state climate office at Kansas State University weather services. The models developed are applied to estimate daily outages. Performance tests show that the regression and neural network models estimate outages well overall but fail to estimate well in the lower and upper ranges of observed values. The introduction of committee machines inspired by the "divide and conquer" principle overcomes this problem. Simulation results show that the mixture-of-experts model is the most effective, followed by the AdaBoost model, in estimating daily outages. Similar results on the performance of these models were found for animal-caused outages.
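
    A minimal sketch of the "divide and conquer" committee-machine idea, assuming synthetic data and a hard gating rule: the weather-feature space is partitioned into regions and one expert regressor is trained per region, so that low- and high-outage regimes get their own specialists. The dissertation's mixture-of-experts model instead uses a learned gating network; everything below is an illustrative stand-in.

# Hedged sketch: region-wise experts for daily outage estimation on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                            # stand-ins for daily weather features
y = np.exp(1.5 * X[:, 0]) + rng.poisson(1.0, size=1000)   # skewed daily outage counts

gate = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # hard gating by input region
experts = {}
for k in range(3):
    mask = gate.labels_ == k
    experts[k] = GradientBoostingRegressor(random_state=0).fit(X[mask], y[mask])

def predict(X_new):
    # route each day to the expert responsible for its region of the feature space
    region = gate.predict(X_new)
    return np.array([experts[k].predict(x.reshape(1, -1))[0] for k, x in zip(region, X_new)])

print(predict(X[:5]).round(2))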

    Scalable and Ensemble Learning for Big Data

    University of Minnesota Ph.D. dissertation. May 2019. Major: Electrical/Computer Engineering. Advisor: Georgios Giannakis. 1 computer file (PDF); xi, 126 pages. The turn of the decade has trademarked society and computing research with a "data deluge." As the number of smart, highly accurate and Internet-capable devices increases, so does the amount of data that is generated and collected. While this sheer amount of data has the potential to enable high-quality inference and mining of information, it introduces numerous challenges in processing and pattern analysis, since available statistical inference and machine learning approaches do not necessarily scale well with the number of data points and their dimensionality. In addition to the challenges related to scalability, the data gathered are often noisy, dynamic, contaminated by outliers, or corrupted specifically to inhibit the inference task. Moreover, many machine learning approaches have been shown to be susceptible to adversarial attacks. At the same time, the cost of cloud and distributed computing is rapidly declining. Therefore, there is a pressing need for statistical inference and machine learning tools that are robust to attacks and scale with the volume and dimensionality of the data by harnessing the available computational resources efficiently. This thesis is centered on analytical and algorithmic foundations that aim to enable statistical inference and data analytics from large volumes of high-dimensional data. The vision is to establish a comprehensive framework based on state-of-the-art machine learning, optimization and statistical inference tools to enable truly large-scale inference, which can tap into the available (possibly distributed) computational resources and be resilient to adversarial attacks. The ultimate goal is to demonstrate, both analytically and numerically, how valuable insights from signal processing can lead to markedly improved and accelerated learning tools. To this end, the present thesis investigates two main research thrusts: i) large-scale subspace clustering; and ii) unsupervised ensemble learning. These research thrusts introduce novel algorithms that aim to tackle the issues of large-scale learning. The potential of the proposed algorithms is showcased by rigorous theoretical results and extensive numerical tests.
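
    A minimal sketch of the unsupervised ensemble learning setting in its simplest form, assuming synthetic learners of unknown accuracy: label predictions from several learners are aggregated by majority vote without access to ground truth. The thesis develops far more sophisticated aggregation schemes; this only illustrates the problem setup.

# Hedged sketch: unsupervised aggregation of several learners' labels by majority vote.
import numpy as np

rng = np.random.default_rng(1)
true = rng.integers(0, 2, size=200)                      # hidden ground truth (never used to aggregate)
accs = [0.9, 0.7, 0.6, 0.55]                             # learners of varying, unknown quality
votes = np.stack([np.where(rng.random(200) < a, true, 1 - true) for a in accs])

majority = (votes.mean(axis=0) >= 0.5).astype(int)       # aggregation without labels
print("majority-vote agreement with hidden truth:", np.mean(majority == true))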

    Cascade of classifier ensembles for reliable medical image classification

    Medical image analysis and recognition is one of the most important tools in modern medicine. Different types of imaging technologies, such as X-ray, ultrasonography, biopsy, computed tomography and optical coherence tomography, have been widely used in clinical diagnosis of various kinds of diseases. However, in clinical applications, it is usually time consuming to examine an image manually. Moreover, there is always a subjective element in the pathological examination of an image, which creates the risk that a doctor makes a wrong decision. Therefore, an automated technique can provide valuable assistance to physicians. By utilizing techniques from machine learning and image analysis, this thesis aims to construct reliable diagnostic models for medical image data so as to reduce the problems faced by medical experts in image examination. Through supervised learning on the image data, the diagnostic model can be constructed automatically. The process of image examination by human experts is very difficult to simulate, as the knowledge of medical experts is often fuzzy and not easy to quantify. Therefore, the problem of automatic diagnosis based on images is usually converted into an image classification problem. For image classification tasks, a single classifier often struggles to capture all aspects of the image data distribution. Therefore, in this thesis, a classifier ensemble based on the random subspace method is proposed to classify microscopic images. Multi-layer perceptrons are used as the base classifiers in the ensemble, and three types of feature extraction methods are selected for microscopic image description. The proposed method was evaluated on two microscopic image sets and showed promising results compared with state-of-the-art results. In order to address classification reliability in biomedical image classification problems, a novel cascade classification system is designed, in which two random-subspace-based classifier ensembles are serially connected. In the first stage of the cascade system, an ensemble of support vector machines is used as the base classifiers; the second stage consists of a neural network classifier ensemble. Using the reject option, images whose classification results cannot reach the predefined rejection threshold at the current stage are passed to the next stage for further consideration. The proposed cascade system was evaluated on a breast cancer biopsy image set and two UCI machine learning datasets; the experimental results showed that the proposed method can achieve high classification reliability and accuracy with a small rejection rate. Many computer-aided diagnosis systems face the problem of imbalanced data, as the datasets used for diagnosis often contain many more normal cases than disease cases. Classifiers that generalize over the data are not the most appropriate choice in such an imbalanced situation. To tackle this problem, a novel one-class classifier ensemble is proposed. Kernel principal component analysis models are used as the base classifiers in the ensemble; the base classifiers are trained on different types of image features and then combined using a product combining rule. The proposed one-class classifier ensemble is also embedded into the cascade scheme to improve classification reliability and accuracy. The proposed method was evaluated on two medical image sets, and favorable results were obtained compared with state-of-the-art results.
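
    A minimal sketch of a two-stage cascade with a reject option, assuming an illustrative dataset, thresholds, and ensemble sizes: stage one is a random-subspace ensemble of SVMs, samples whose confidence falls below a threshold are deferred to a stage-two ensemble of neural networks, and samples still below threshold are rejected for manual examination.

# Hedged sketch: two random-subspace ensembles connected in a cascade with reject options.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Random-subspace ensembles: each base classifier sees a random 50% subset of the features
stage1 = BaggingClassifier(SVC(probability=True), n_estimators=15,
                           max_features=0.5, bootstrap_features=True, random_state=0).fit(Xtr, ytr)
stage2 = BaggingClassifier(MLPClassifier(max_iter=1000), n_estimators=15,
                           max_features=0.5, bootstrap_features=True, random_state=0).fit(Xtr, ytr)

p1 = stage1.predict_proba(Xte).max(axis=1)
accept1 = p1 >= 0.90                                   # stage-1 confidence threshold
p2 = stage2.predict_proba(Xte[~accept1]).max(axis=1)
accept2 = p2 >= 0.80                                   # stage-2 threshold; the rest is rejected

pred = np.full(len(yte), -1)                           # -1 marks rejected samples
pred[accept1] = stage1.predict(Xte[accept1])
idx2 = np.flatnonzero(~accept1)[accept2]
pred[idx2] = stage2.predict(Xte[idx2])

covered = pred != -1
print("rejection rate:", round(1 - covered.mean(), 3),
      "accuracy on accepted:", round((pred[covered] == yte[covered]).mean(), 3))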

    Sentiment Analysis in Digital Spaces: An Overview of Reviews

    Sentiment analysis (SA) is commonly applied to digital textual data, revealing insight into opinions and feelings. Many systematic reviews have summarized existing work, but often overlook discussions of validity and scientific practices. Here, we present an overview of reviews, synthesizing 38 systematic reviews containing 2,275 primary studies. We devise a bespoke quality assessment framework designed to assess the rigor and quality of systematic review methodologies and reporting standards. Our findings show diverse applications and methods, limited reporting rigor, and challenges over time. We discuss how future research and practitioners can address these issues and highlight their importance across numerous applications. Comment: 44 pages, 4 figures, 6 tables, 3 appendices.

    User behavior modeling: Towards solving the duality of interpretability and precision

    User behavior modeling has become an indispensable tool with the proliferation of socio-technical systems that provide a highly personalized experience to users. These socio-technical systems are used in sectors as diverse as education, health, and law, as well as e-commerce and social media. The two main challenges for user behavioral modeling are building an in-depth understanding of online user behavior and using advanced computational techniques to capture behavioral uncertainties accurately. This thesis addresses both challenges by developing interpretable models that aid in understanding user behavior at scale and by developing sophisticated models that perform accurate modeling of user behavior. Specifically, we first propose two distinct interpretable approaches to understand explicit and latent user behavioral characteristics. Firstly, in Chapter 3, we propose an interpretable Gaussian Hidden Markov Model-based cluster model that leverages user activity data to identify users with similar patterns of behavioral evolution. We apply our approach to identify researchers with similar patterns of evolution in their research interests. We further show the utility of our interpretable framework in identifying differences in gender distribution and the value of awarded grants among the identified archetypes. We also demonstrate the generality of our approach by applying it to StackExchange to identify users with similar changes in usage patterns. Next, in Chapter 4, we estimate latent user behavioral characteristics by leveraging user-generated content (questions or answers) in Community Question Answering (CQA) platforms. In particular, we estimate latent aspect-based reliability representations of users in the forum to infer the trustworthiness of their answers, while simultaneously learning the semantic meaning of those answers through text representations. We empirically show that the estimated behavioral representations can accurately identify topical experts. We further propose to improve current behavioral models by modeling explicit and implicit user-to-user influence on user behavior. To this end, in Chapter 5, we propose a novel attention-based approach to incorporate influence from both a user's social connections and other similar users on their preferences in recommender systems. Additionally, we incorporate implicit influence in the item space by considering frequently co-occurring and similar feature items. Our modular approach captures the different influences efficiently and later fuses them in an interpretable manner. Extensive experiments show that incorporating user-to-user influence outperforms approaches relying solely on user data. User behavior remains broadly consistent across a platform; thus, incorporating user behavioral information can be beneficial for estimating the characteristics of user-generated content. To verify this, in Chapter 6, we focus on the task of best answer selection in CQA forums, which traditionally considers only textual features. We induce multiple connections between items of user-generated content, i.e., answers, based on the similarity and contrast in the behavior of the authoring users on the platform. These induced connections enable information sharing between connected answers and, consequently, aid in estimating the quality of an answer. We also develop convolution operators to encode these semantically different graphs and later merge them using boosting.
Finally, we propose an alternative approach to incorporating user behavioral information by jointly estimating the latent behavioral representations of users along with text representations in Chapter 7. We evaluate this approach on the offensive language prediction task on Twitter. Specifically, we learn an improved text representation by leveraging syntactic dependencies between the words in a tweet, and we estimate the abusive behavior of users, i.e., their likelihood of posting offensive content online, from their tweets. We further show that combining the textual and user behavioral features can outperform sophisticated textual baselines.
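
    A minimal sketch of the Chapter 3 idea of clustering users by behavioral evolution with a Gaussian hidden Markov model, assuming synthetic activity sequences and the third-party hmmlearn package: one HMM is fit over all sequences, each user is summarized by their latent state-occupancy profile, and the profiles are clustered into archetypes. Feature choices and cluster counts are illustrative, not the thesis's configuration.

# Hedged sketch: Gaussian HMM over user activity sequences, then clustering of state profiles.
import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
T, n_users = 50, 40
# Synthetic activity features per user over time (e.g. counts of two activity types)
sequences = [rng.normal(loc=rng.integers(0, 3), size=(T, 2)) for _ in range(n_users)]

X = np.vstack(sequences)
lengths = [T] * n_users
hmm = GaussianHMM(n_components=3, n_iter=50, random_state=0).fit(X, lengths)

# Represent each user by the fraction of time spent in each latent behavioral state
profiles = np.array([np.bincount(hmm.predict(seq), minlength=3) / T for seq in sequences])
archetypes = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(profiles)
print(archetypes)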

    Human-in-the-Loop Learning From Crowdsourcing and Social Media

    Computational social studies using public social media data have become increasingly popular because of the large amount of user-generated data available. The richness of social media data, coupled with its noise and subjectivity, raises significant challenges for computationally studying social issues in a feasible and scalable manner. Machine learning problems are, as a result, often subjective or ambiguous when humans are involved: humans solving the same problem might come to legitimate but completely different conclusions based on their personal experiences and beliefs. When building supervised learning models, particularly when using crowdsourced training data, multiple annotations per data item are usually reduced to a single label representing ground truth. This inevitably hides a rich source of diversity and subjectivity of opinions about the labels. Label distribution learning associates with each data item a probability distribution over the labels for that item; it can thus preserve the diversity of opinions, beliefs, etc. that conventional learning hides or ignores. We propose a human-in-the-loop learning framework to model and study large volumes of unlabeled subjective social media data with less human effort. We study various annotation tasks given to crowdsourced annotators and methods for aggregating their contributions in a manner that preserves subjectivity and disagreement. We introduce a strategy for learning label distributions with only five to ten labels per item by aggregating human-annotated labels over multiple, semantically related data items. We conduct experiments using our learning framework on data related to two subjective social issues (work and employment, and suicide prevention) that touch many people worldwide. Our methods can be applied to a broad variety of problems, particularly social problems, and our experimental results suggest that specific label aggregation methods can help provide reliable representative semantics at the population level.
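
    A minimal sketch of learning label distributions from a handful of crowd labels per item by pooling labels across semantically related items, assuming synthetic data, a clustering-based notion of relatedness, and an arbitrary smoothing weight; the actual aggregation strategies studied in the dissertation differ.

# Hedged sketch: per-item label distributions smoothed with pooled counts of related items.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_items, n_labels, labels_per_item = 30, 4, 6
item_vecs = rng.normal(size=(n_items, 8))                           # stand-in semantic representations
raw = rng.integers(0, n_labels, size=(n_items, labels_per_item))    # a few crowd labels per item

counts = np.zeros((n_items, n_labels))
for i, lab in enumerate(raw):
    counts[i] = np.bincount(lab, minlength=n_labels)

groups = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(item_vecs)
dists = np.zeros_like(counts)
for g in np.unique(groups):
    member = groups == g
    pooled = counts[member].sum(axis=0)                  # pooled label counts of related items
    smoothed = counts[member] + 0.5 * pooled             # blend item-level and group-level evidence
    dists[member] = smoothed / smoothed.sum(axis=1, keepdims=True)

print(dists[0].round(3), dists[0].sum())                 # a label distribution summing to 1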

    From metaheuristics to learnheuristics: Applications to logistics, finance, and computing

    A large number of decision-making processes in strategic sectors such as transport and production involve NP-hard problems, which are frequently characterized by high levels of uncertainty and dynamism. Metaheuristics have become the predominant method for solving challenging optimization problems in reasonable computing times. However, they frequently assume that inputs, objective functions and constraints are deterministic and known in advance. These strong assumptions lead to work on oversimplified problems, and the solutions may demonstrate poor performance when implemented. Simheuristics, in turn, integrate simulation into metaheuristics as a way to naturally solve stochastic problems, and, in a similar fashion, learnheuristics combine statistical learning and metaheuristics to tackle problems in dynamic environments, where inputs may depend on the structure of the solution.
The main contributions of this thesis include (i) a design for learnheuristics; (ii) a classification of works that combine statistical and machine learning with metaheuristics; and (iii) several applications in the fields of transport, production, finance and computing.
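
    A minimal sketch of the learnheuristic idea, assuming a toy selection problem: a plain stochastic hill-climbing metaheuristic evaluates its objective through a regression model learned from historical data, so that an input (the realized reward multiplier) depends on the structure of the solution (how many items are selected). The problem, the learned effect, and the model are illustrative, not a method from the thesis.

# Hedged sketch: a metaheuristic whose objective uses a learned, solution-dependent input.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 20
base = rng.uniform(1, 10, size=n)           # nominal item rewards
weights = rng.uniform(1, 5, size=n)
capacity = 30.0

# Historical observations: realized reward shrinks as more items are packed together
hist_k = rng.integers(1, n, size=200)
hist_r = 1.0 - 0.03 * hist_k + rng.normal(0, 0.02, size=200)
effect = LinearRegression().fit(hist_k.reshape(-1, 1), hist_r)   # learned input model

def objective(sel):
    k = sel.sum()
    if k == 0 or weights[sel].sum() > capacity:
        return -np.inf
    factor = effect.predict([[k]])[0]        # predicted reward multiplier for this solution size
    return factor * base[sel].sum()

sel = np.zeros(n, dtype=bool)
for _ in range(2000):                        # simple stochastic hill climbing
    cand = sel.copy()
    cand[rng.integers(n)] ^= True            # flip one item in or out
    if objective(cand) >= objective(sel):
        sel = cand

print("selected items:", np.flatnonzero(sel), "estimated reward:", round(objective(sel), 2))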