1,319 research outputs found

    Absent Data Generating Classifier for Imbalanced Class Sizes

    Get PDF
    We propose an algorithm for two-class classification problems when the training data are imbalanced. This means the number of training instances in one of the classes is so low that the conventional classification algorithms become ineffective in detecting the minority class. We present a modification of the kernel Fisher discriminant analysis such that the imbalanced nature of the problem is explicitly addressed in the new algorithm formulation. The new algorithm exploits the properties of the existing minority points to learn the effects of other minority data points, had they actually existed. The algorithm proceeds iteratively by employing the learned properties and conditional sampling in such a way that it generates sufficient artificial data points for the minority set, thus enhancing the detection probability of the minority class. Implementing the proposed method on a number c©2015 Arash Pourhabib, Bani K. Mallick and Yu Ding. Article preprint--accepted for publication, Feb 201

    Class-wise Calibration: A Case Study on COVID-19 Hate Speech

    Get PDF
    Proper calibration of deep-learning models is critical for many high-stakes problems. In this paper, we show that existing calibration metrics fail to pay attention to miscalibration on individual classes, hence overlooking minority classes and causing significant issues on imbalanced classification problems. Using a COVID-19 hate-speech dataset, we first discover that in imbalanced datasets, miscalibration error on an individual class varies greatly, and error on minority classes can be magnitude times worse than what is suggested by the overall calibration performance. To address this issue, we propose a new metric based on expected miscalibration error, named as Contraharmonic Expected Calibration Error (CECE), which punishes severe miscalibration on individual classes. We further devise a novel variant of temperature scaling for imbalanced data to improve class-wise miscalibration, which re-weights the loss function by the inverse class count to tune the scaling parameter to reduce worst-case minority calibration error. Our case study on a benchmarking COVID-19 hate speech task shows the effectiveness of our calibration metric and our temperature scaling strategy

    Probability Aggregation in Time-Series: Dynamic Hierarchical Modeling of Sparse Expert Beliefs

    Get PDF
    Most subjective probability aggregation procedures use a single probability judgment from each expert, even though it is common for experts studying real problems to update their probability estimates over time. This paper advances into unexplored areas of probability aggregation by considering a dynamic context in which experts can update their beliefs at random intervals. The updates occur very infrequently, resulting in a sparse data set that cannot be modeled by standard time-series procedures. In response to the lack of appropriate methodology, this paper presents a hierarchical model that takes into account the expert’s level of self-reported expertise and produces aggregate probabilities that are sharp and well calibrated both in- and outof-sample. The model is demonstrated on a real-world data set that includes over 2300 experts making multiple probability forecasts over two years on different subsets of 166 international political events

    Statistical Challenges and Methods for Missing and Imbalanced Data

    Get PDF
    Missing data remains a prevalent issue in every area of research. The impact of missing data, if not carefully handled, can be detrimental to any statistical analysis. Some statistical challenges associated with missing data include, loss of information, reduced statistical power and non-generalizability of findings in a study. It is therefore crucial that researchers pay close and particular attention when dealing with missing data. This multi-paper dissertation provides insight into missing data across different fields of study and addresses some of the above mentioned challenges of missing data through simulation studies and application to real datasets. The first paper of this dissertation addresses the dropout phenomenon in single-cell RNA (scRNA) sequencing through a comparative analyses of some existing scRNA sequencing techniques. The second paper of this work focuses on using simulation studies to assess whether it is appropriate to address the issue of non-detects in data using a traditional substitution approach, imputation, or a non-imputation based approach. The final paper of this dissertation presents an efficient strategy to address the issue of imbalance in data at any degree (whether moderate or highly imbalanced) by combining random undersampling with different weighting strategies. We conclude generally, based on findings from this dissertation that, missingness is not always lack of information but interestingness that needs to investigated

    Absent Data Generating Classifier for Imbalanced Class Sizes

    Get PDF
    • …
    corecore