236 research outputs found

    Ensembles of probability estimation trees for customer churn prediction

    Customer churn prediction is one of the most important elements of a company's Customer Relationship Management (CRM) strategy. In this study, two strategies are investigated to increase the lift performance of ensemble classification models, i.e. (i) using probability estimation trees (PETs) instead of standard decision trees as base classifiers; and (ii) implementing alternative fusion rules based on lift weights for the combination of ensemble members' outputs. Experiments are conducted for four popular ensemble strategies on five real-life churn data sets. In general, the results demonstrate how lift performance can be substantially improved by using alternative base classifiers and fusion rules. However, the effect varies for the different ensemble strategies. In particular, the results indicate an increase of lift performance of (i) Bagging by implementing C4.4 base classifiers, (ii) the Random Subspace Method (RSM) by using lift-weighted fusion rules, and (iii) AdaBoost, by implementing both.
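As a rough illustration of the lift measure this abstract optimizes, the sketch below computes lift as the churn rate among the highest-scored customers divided by the overall churn rate. The function name `top_fraction_lift` and the `fraction` parameter are hypothetical, and the abstract does not say which cut-off the study used, so this is an assumption-laden sketch and not the authors' evaluation code.

```python
def top_fraction_lift(y_true, scores, fraction=0.1):
    """Churn rate in the top-scored fraction divided by the overall churn rate.

    y_true: list of 0/1 labels (1 = churner).
    scores: predicted churn probabilities, same length as y_true.
    fraction: share of customers to target (hypothetical default of 10%).
    """
    if len(y_true) != len(scores) or not y_true:
        raise ValueError("y_true and scores must be non-empty and equal in length")
    base_rate = sum(y_true) / len(y_true)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no churners")
    # Number of customers in the targeted segment (at least one).
    n_top = max(1, int(round(len(y_true) * fraction)))
    # Rank customers from highest to lowest predicted churn probability.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    return top_rate / base_rate
```

A lift of 1 means the ranking is no better than targeting customers at random; larger values mean churners are concentrated at the top of the ranking.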

    Listening between the Lines: Learning Personal Attributes from Conversations

    Open-domain dialogue agents must be able to converse about many topics while incorporating knowledge about the user into the conversation. In this work we address the acquisition of such knowledge, for personalization in downstream Web applications, by extracting personal attributes from conversations. This problem is more challenging than the established task of information extraction from scientific publications or Wikipedia articles, because dialogues often give merely implicit cues about the speaker. We propose methods for inferring personal attributes, such as profession, age or family status, from conversations using deep learning. Specifically, we propose several Hidden Attribute Models, which are neural networks leveraging attention mechanisms and embeddings. Our methods are trained on a per-predicate basis to output rankings of object values for a given subject-predicate combination (e.g., ranking the doctor and nurse professions high when speakers talk about patients, emergency rooms, etc.). Experiments with various conversational texts including Reddit discussions, movie scripts and a collection of crowdsourced personal dialogues demonstrate the viability of our methods and their superior performance compared to state-of-the-art baselines. Comment: published in WWW'1

    TurnGPT: a Transformer-based Language Model for Predicting Turn-taking in Spoken Dialog

    Syntactic and pragmatic completeness is known to be important for turn-taking prediction, but so far machine learning models of turn-taking have used such linguistic information in a limited way. In this paper, we introduce TurnGPT, a transformer-based language model for predicting turn-shifts in spoken dialog. The model has been trained and evaluated on a variety of written and spoken dialog datasets. We show that the model outperforms two baselines used in prior work. We also report on an ablation study, as well as attention and gradient analyses, which show that the model is able to utilize the dialog context and pragmatic completeness for turn-taking prediction. Finally, we explore the model's potential in not only detecting, but also projecting, turn-completions. Comment: Accepted to Findings of ACL: EMNLP 202

    Theoretical and Methodological Advances in Semi-supervised Learning and the Class-Imbalance Problem

    This paper focuses on the theoretical and practical generalization of two well-known, challenging situations from the field of machine learning to classification problems in which the assumption of a single binary class does not hold. Semi-supervised learning is a technique that uses large amounts of unlabeled data to improve the performance of supervised learning when the labeled data set is very limited. Specifically, this work contributes powerful and computationally efficient methodologies for learning, in a semi-supervised way, classifiers for multiple class variables. The fundamental limits of semi-supervised learning in multi-class problems are also investigated theoretically. The class-imbalance problem arises when the target variables have a probability distribution unbalanced enough to distort the solutions proposed by traditional supervised learning algorithms. In this project, a theoretical framework is proposed to separate the distortion produced by class imbalance from other factors that affect classifier accuracy. This framework is used mainly to recommend classifier assessment metrics for this situation. Finally, a measure of the degree of class imbalance in a data set, correlated with the resulting loss of accuracy, is also proposed.

    Theoretical and methodological advances in semi-supervised learning and the class-imbalance problem.

    201 p. This work focuses on the theoretical and practical generalization of two well-known, challenging situations from the field of machine learning to classification problems in which the assumption of a single binary class does not hold. Semi-supervised learning is a technique that uses large amounts of unlabeled data to improve the performance of supervised learning when the labeled data set is very limited. Specifically, this work contributes powerful and computationally efficient methodologies for learning, in a semi-supervised way, classifiers for multiple class variables. The fundamental limits of semi-supervised learning in multi-class problems are also investigated theoretically. The class-imbalance problem arises when the target variables have a probability distribution unbalanced enough to distort the solutions proposed by traditional supervised learning algorithms. In this project, a theoretical framework is proposed to separate the distortion produced by class imbalance from other factors that affect classifier accuracy. This framework is used mainly to recommend classifier evaluation metrics for this situation. Finally, a measure of the degree of class imbalance in a data set, correlated with the resulting loss of accuracy, is also proposed. Intelligent Systems Grou

    GPGPU Reliability Analysis: From Applications to Large Scale Systems

    Over the past decade, GPUs have become an integral part of mainstream high-performance computing (HPC) facilities. Since applications running on HPC systems are usually long-running, any error or failure could result in significant loss of scientific productivity and system resources. Even worse, since HPC systems face severe resilience challenges as they progress toward exascale computing, it is imperative to develop a better understanding of the reliability of GPUs. This dissertation fills this gap by providing an understanding of the effects of soft errors on the entire system and on specific applications. To understand system-level reliability, a large-scale study of GPU soft errors in the field is conducted. The occurrences of GPU soft errors are linked to several temporal and spatial features, such as specific workloads, node location, temperature, and power consumption. Further, machine learning models are proposed to predict error occurrences on GPU nodes so that the costly error protection mechanisms can be turned on and off proactively and dynamically based on the prediction results. To understand the effects of soft errors at the application level, an effective fault-injection framework is designed to characterize the reliability and resilience of GPGPU applications. The framework reduces the tremendous number of fault-injection locations to a manageable size while still preserving remarkable accuracy, and it is validated with both single-bit and multi-bit fault models for various GPGPU benchmarks. Lastly, taking advantage of the proposed fault-injection framework, this dissertation develops a hierarchical approach to understanding the error resilience characteristics of GPGPU applications at the kernel, CTA, and warp levels. In addition, given that some application outputs corrupted by soft errors may be acceptable, we present a use case showing how to enable low-overhead yet reliable GPU computing for GPGPU applications.

    Predictive Modeling of Avian Influenza in Wild Birds

    Thesis (Ph.D.), University of Alaska Fairbanks, 2013. Over the past 20 years, highly pathogenic avian influenza (HPAI), specifically Eurasian H5N1 subtypes, caused economic losses to the poultry industry and sparked fears of a human influenza pandemic. Avian influenza virus (AIV) is widespread in wild bird populations in the low-pathogenicity form (LPAI), and wild birds are thought to be the reservoir for AIV. To date, however, nearly all predictive models of AIV focus on domestic poultry and HPAI H5N1 at a small country or regional scale. Clearly, there is a need and an opportunity to explore AIV in wild birds using data-mining and machine-learning techniques. I developed predictive models using the Random Forests algorithm to describe the ecological niche of avian influenza in wild birds. In “Chapter 2 - Predictive risk modeling of avian influenza around the Pacific Rim”, I demonstrated that it was possible to separate an AIV-positivity signal from general surveillance effort. Cold winters, high temperature seasonality, and a long distance from the coast were important predictors. In “Chapter 3 - A global model of avian influenza prediction in wild birds: the importance of northern regions”, northern regions remained areas of high predicted occurrence even when using a global dataset of AIV. In surveillance data, the percentage of AIV-positive samples is typically very low, which can hamper machine learning. For “Chapter 4 - Modeling avian influenza with Random Forests: under-sampling and model selection for unbalanced prevalence in surveillance data”, I wrote custom code in the R statistical programming language to evaluate a balancing algorithm, a model selection algorithm, and an under-sampling method for their effects on model accuracy. Repeated random sub-sampling was found to be the most reliable way to improve unbalanced datasets. In these models, cold regions consistently bore the highest relative predicted occurrence scores for AIV-positivity and described a niche for LPAI that is distinct from the niche for HPAI in domestic poultry. These studies represent a novel, initial attempt at constructing models for LPAI in wild birds and demonstrated high predictive power.
    Table of contents: Chapter 1: General Introduction (Avian influenza virus, transmission, and pandemic potential; Modeling AIV; Specific aims). Chapter 2: Predictive risk modeling of avian influenza around the Pacific Rim (Data layers; Modeling methods; Model evaluation). Chapter 3: A global model of avian influenza prediction in wild birds: the importance of northern regions (Wild bird data; Environmental variable layers; Defining the outbreak niche; Predictive map; Important predictor variables; Ecological niche model). Chapter 4: Modeling avian influenza with Random Forests: under-sampling and model selection for unbalanced prevalence in surveillance data (Predictor variables; Wild bird data; Random Forests, balancing, and model selection; Predictive map; Statistical analyses; Variable importance; Cross-model comparisons; Research design). Chapter 5: General Discussion (Overview; The LPAI niche vs. the HPAI niche; Technical aspects and software; Future work; Surveillance and Adaptive Management principles). Appendices A to H: bird species lists for the source databases, metadata and data layers, and example R code (on CD).
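The repeated random sub-sampling mentioned in the abstract above can be pictured with the following sketch, which repeatedly keeps every minority-class sample and draws an equally sized random subset of the majority class. The thesis code was written in R and is not reproduced in this listing, so the function names, the 1:1 balancing ratio, and the use of Python here are all assumptions made for illustration.

```python
import random


def undersample_once(labels, rng):
    """Return sorted indices of one balanced subset of a 0/1 label list.

    All samples of the rarer class are kept; the more common class is
    randomly reduced to the same size (a 1:1 ratio, assumed here).
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    # Draw without replacement so no majority sample is repeated in a subset.
    sampled = rng.sample(majority, len(minority))
    return sorted(minority + sampled)


def repeated_undersamples(labels, n_repeats=10, seed=0):
    """Draw n_repeats independent balanced subsets (hypothetical helper)."""
    rng = random.Random(seed)
    return [undersample_once(labels, rng) for _ in range(n_repeats)]
```

Each balanced subset would then be used to fit a separate model, with the results averaged or compared across repeats; how the thesis combined them is not stated in this listing.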