A Survey of Methods for Addressing Class Imbalance in Deep-Learning Based Natural Language Processing
Many natural language processing (NLP) tasks are naturally imbalanced, as
some target categories occur much more frequently than others in the real
world. In such scenarios, current NLP models still tend to perform poorly on
less frequent classes. Addressing class imbalance in NLP is an active research
topic, yet, finding a good approach for a particular task and imbalance
scenario is difficult.
With this survey, the first overview on class imbalance in deep-learning
based NLP, we provide guidance for NLP researchers and practitioners dealing
with imbalanced data. We first discuss various types of controlled and
real-world class imbalance. Our survey then covers approaches that have been
explicitly proposed for class-imbalanced NLP tasks or, originating in the
computer vision community, have been evaluated on them. We organize the methods
by whether they are based on sampling, data augmentation, choice of loss
function, staged learning, or model design. Finally, we discuss open problems
such as dealing with multi-label scenarios, and propose systematic benchmarking
and reporting in order to move forward on this problem as a community.
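Among the loss-function-based remedies such a survey covers, inverse-frequency class weighting is one of the simplest. The sketch below is a minimal NumPy illustration under assumed inputs (names and signature are ours, not from any surveyed paper):

```python
import numpy as np

def class_weighted_ce(logits, labels, counts):
    """Cross-entropy with inverse-frequency class weights.

    logits: (n, k) raw scores; labels: (n,) integer class ids;
    counts: (k,) training-set frequency of each class.
    """
    # Rare classes receive proportionally larger weights.
    weights = counts.sum() / (len(counts) * counts)
    # Numerically stabilized softmax probabilities.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    nll = -np.log(probs[np.arange(len(labels)), labels])
    return float((weights[labels] * nll).mean())
```

With this weighting, a misclassified minority-class example contributes more to the loss than an equally misclassified majority-class example.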
Multiple Relations Classification using Imbalanced Predictions Adaptation
The relation classification task assigns the proper semantic relation to a
pair of subject and object entities; the task plays a crucial role in various
text mining applications, such as knowledge graph construction and entities
interaction discovery in biomedical text. Current relation classification
models employ additional procedures to identify multiple relations in a single
sentence. Furthermore, they overlook the imbalanced predictions pattern. The
pattern arises from the presence of a few valid relations that need positive
labeling in a relatively large predefined relations set. We propose a multiple
relations classification model that tackles these issues through a customized
output architecture and by exploiting additional input features. Our findings
suggest that handling the imbalanced predictions leads to significant
improvements, even on a modest training design. The results demonstrate
superior performance on benchmark datasets commonly used in relation
classification. To the best of our knowledge, this work is the first to
recognize the imbalanced predictions pattern within the relation classification task.
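The abstract does not spell out the customized output architecture, but the underlying issue — few valid positives inside a large predefined relation set — is commonly countered by up-weighting positives in a multi-label loss. The function below is a hypothetical sketch of that idea, not the paper's model:

```python
import numpy as np

def weighted_multilabel_bce(scores, targets, pos_weight):
    """Multi-label BCE where the rare positive labels are up-weighted.

    scores:  (n, r) raw logits over r predefined relations;
    targets: (n, r) 0/1 multi-hot relation labels;
    pos_weight: scalar > 1 compensating for the scarcity of positives.
    """
    probs = 1.0 / (1.0 + np.exp(-scores))  # independent sigmoid per relation
    eps = 1e-12
    loss = -(pos_weight * targets * np.log(probs + eps)
             + (1 - targets) * np.log(1 - probs + eps))
    return float(loss.mean())
```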
Simpson's Bias in NLP Training
In most machine learning tasks, we evaluate a model $\theta$ on a given data
population $D$ by measuring a population-level metric $M(\theta; D)$. Examples of
such evaluation metrics include precision/recall for (binary) recognition,
the F1 score for multi-class classification, and the BLEU metric for language
generation. On the other hand, the model is trained by optimizing a
sample-level loss $L(\theta; B_t)$ at each learning step $t$, where $B_t$ is a subset
of $D$ (a.k.a. the mini-batch). Popular choices of $L$ include cross-entropy
loss, the Dice loss, and sentence-level BLEU scores. A fundamental assumption
behind this paradigm is that the mean value of the sample-level loss $L$, if
averaged over all possible samples, should effectively represent the
population-level metric $M$ of the task, that is, $\mathbb{E}_B[L(\theta; B)] \approx M(\theta; D)$.
In this paper, we systematically investigate the above assumption in several
NLP tasks. We show, both theoretically and experimentally, that some popular
designs of the sample-level loss may be inconsistent with the true
population-level metric of the task, so that models trained to optimize the
former can be substantially sub-optimal with respect to the latter, a phenomenon
we call Simpson's bias due to its deep connections with the classic paradox
known as Simpson's reversal paradox in statistics and the social sciences.
Comment: AAAI 202
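The gap between a batch-averaged objective and the population-level metric can be seen with plain precision: the mean of per-batch precisions over batches of different composition need not equal precision pooled over all the data. The counts below are invented purely for illustration:

```python
import numpy as np

def precision(tp, fp):
    return tp / (tp + fp)

# Two mini-batches with different compositions: (true positives, false positives).
batches = [(8, 2), (2, 18)]

# Mean of per-batch precision -- what a sample-level objective "sees".
per_batch = float(np.mean([precision(tp, fp) for tp, fp in batches]))

# Population-level precision over the pooled counts.
tp_all = sum(tp for tp, _ in batches)
fp_all = sum(fp for _, fp in batches)
pooled = precision(tp_all, fp_all)
```

Here the batch-averaged value is 0.45 while the pooled value is 1/3: optimizing the former does not optimize the latter, which is the flavor of inconsistency the paper studies.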
Adaptive Ranking-based Sample Selection for Weakly Supervised Class-imbalanced Text Classification
To obtain a large amount of training labels inexpensively, researchers have
recently adopted the weak supervision (WS) paradigm, which leverages labeling
rules to synthesize training labels rather than using individual annotations to
achieve competitive results for natural language processing (NLP) tasks.
However, data imbalance is often overlooked in applying the WS paradigm,
despite being a common issue in a variety of NLP tasks. To address this
challenge, we propose Adaptive Ranking-based Sample Selection (ARS2), a
model-agnostic framework to alleviate the data imbalance issue in the WS
paradigm. Specifically, it calculates a probabilistic margin score based on the
output of the current model to measure and rank the cleanliness of each data
point. Then, the ranked data are sampled based on both class-wise and
rule-aware ranking. In particular, the two sampling strategies correspond to our
motivations: (1) to train the model with balanced data batches to reduce the
data imbalance issue and (2) to exploit the expertise of each labeling rule for
collecting clean samples. Experiments on four text classification datasets with
four different imbalance ratios show that ARS2 outperformed the
state-of-the-art imbalanced learning and WS methods, leading to a 2%-57.8%
improvement in F1 score.
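A minimal sketch of the margin-and-rank idea, assuming softmax outputs over classes and weak labels from the labeling rules (function names and the selection rule are ours, not from the ARS2 code):

```python
import numpy as np

def margin_scores(probs):
    """Probabilistic margin: gap between top-1 and top-2 class probabilities.
    A larger margin suggests a cleaner (more confidently labeled) data point."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def class_wise_select(probs, weak_labels, per_class):
    """Keep the per_class highest-margin points of each weak label,
    yielding a class-balanced, cleaner training batch."""
    weak_labels = np.asarray(weak_labels)
    margins = margin_scores(probs)
    keep = []
    for c in np.unique(weak_labels):
        idx = np.where(weak_labels == c)[0]
        ranked = idx[np.argsort(margins[idx])[::-1]]  # descending margin
        keep.extend(ranked[:per_class].tolist())
    return sorted(keep)
```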
SEPSIS: I Can Catch Your Lies -- A New Paradigm for Deception Detection
Deception is the intentional practice of twisting information. It is a
nuanced societal practice deeply intertwined with human societal evolution,
characterized by a multitude of facets. This research explores the problem of
deception through the lens of psychology, employing a framework that
categorizes deception into three forms: lies of omission, lies of commission,
and lies of influence. The primary focus of this study is specifically on
investigating only lies of omission. We propose a novel framework for deception
detection leveraging NLP techniques. We curated an annotated dataset of 876,784
samples by amalgamating a popular large-scale fake news dataset and scraped
news headlines from the Twitter handle of Times of India, a well-known Indian
news media house. Each sample has been labeled with four layers, namely: (i)
the type of omission (speculation, bias, distortion, sounds factual, and
opinion), (ii) the color of the lie (black, white, etc.), (iii) the intention of
such lies (to influence, etc.), and (iv) the topic of lies (political,
educational, religious, etc.). We present a novel multi-task learning pipeline that leverages
the dataless merging of fine-tuned language models to address the deception
detection task mentioned earlier. Our proposed model achieved an F1 score of
0.87, demonstrating strong performance across all layers including the type,
color, intent, and topic aspects of deceptive content. Finally, our research
explores the relationship between lies of omission and propaganda techniques.
To accomplish this, we conducted an in-depth analysis, uncovering compelling
findings. For instance, our analysis revealed a significant correlation between
loaded language and opinion, shedding light on their interconnectedness. To
encourage further research in this field, we will release the models and
dataset under the MIT License, making them favorable for open-source
research.
NLP and ML Methods for Pre-processing, Clustering and Classification of Technical Logbook Datasets
Technical logbooks are a challenging and under-explored text type in automated event identification. These texts are typically short and written in non-standard yet technical language, posing challenges to off-the-shelf NLP pipelines. These datasets typically represent a domain (a technical field such as automotive) and an application (e.g., maintenance). The granularity of issue types described in these datasets additionally leads to class imbalance, making it challenging for models to accurately predict which issue each logbook entry describes. In this research, we focus on the problem of technical issue pre-processing, clustering, and classification by considering logbook datasets from the automotive, aviation, and facility maintenance domains. We developed MaintNet, a collaborative open-source library including logbook datasets from various domains and a pre-processing pipeline to clean unstructured datasets. Additionally, we adapted a feedback-loop strategy from computer vision for handling extreme class imbalance, which resamples the training data based on its error in the prediction process. We further investigated the benefits of using transfer learning from sources within the same domain (but different applications), from within the same application (but different domains), and from all available data to improve the performance of the classification models. Finally, we evaluated several data augmentation approaches, including synonym replacement, random swap, and random deletion, to address the issue of data scarcity in technical logbooks.
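Two of the augmentation approaches mentioned, random swap and random deletion, can be sketched in a few lines (an EDA-style illustration, not the authors' implementation):

```python
import random

def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen token positions n_swaps times."""
    out = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(out)), rng.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]
```

Synonym replacement works analogously but needs a lexical resource (e.g. a synonym dictionary) to substitute tokens, so it is omitted here.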
Loss-Modified Transformer-Based U-Net for Accurate Segmentation of Fluids in Optical Coherence Tomography Images of Retinal Diseases.
Optical coherence tomography (OCT) imaging significantly contributes to ophthalmology in the diagnosis of retinal disorders such as age-related macular degeneration and diabetic macular edema. Both diseases involve the abnormal accumulation of fluids, whose location and volume are vitally informative for detecting the severity of the disease. Automated and accurate fluid segmentation in OCT images could potentially improve current clinical diagnosis. This becomes more important considering the limitations of manual fluid segmentation as a time-consuming, subjective, and error-prone method. Deep learning techniques have been applied to various image processing tasks, and their performance has already been explored in the segmentation of fluids in OCTs. This article suggests a novel automated deep learning method with the U-Net structure as its basis. The modifications consist of applying transformers in the encoder path of the U-Net for more concentrated feature extraction. Furthermore, a custom loss function is empirically tailored to efficiently incorporate proper loss functions to deal with imbalance and noisy images: a weighted combination of Dice loss, focal Tversky loss, and weighted binary cross-entropy is employed. Several evaluation metrics are calculated. The results show high accuracy (Dice coefficient of 95.52) and robustness of the proposed method in comparison to different methods after adding extra noise to the images (Dice coefficient of 92.79). The segmentation of fluid regions in retinal OCT images is critical because it assists clinicians in diagnosing macular edema and executing therapeutic operations more quickly. This study suggests a deep learning framework and novel loss function for automated fluid segmentation of retinal OCT images with excellent accuracy and rapid convergence. [Abstract copyright: © 2023 Journal of Medical Signals & Sensors.]
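A sketch of such a weighted loss combination on flat probability/label arrays; the weights and hyperparameters below are illustrative defaults, not the values tuned in the paper:

```python
import numpy as np

def dice_loss(p, y, eps=1e-6):
    inter = (p * y).sum()
    return 1 - (2 * inter + eps) / (p.sum() + y.sum() + eps)

def focal_tversky_loss(p, y, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-6):
    tp = (p * y).sum()
    fn = ((1 - p) * y).sum()
    fp = (p * (1 - y)).sum()
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return (1 - tversky) ** gamma

def weighted_bce(p, y, w_pos=2.0, eps=1e-12):
    """Binary cross-entropy with up-weighted positive (fluid) pixels."""
    return float(-(w_pos * y * np.log(p + eps)
                   + (1 - y) * np.log(1 - p + eps)).mean())

def combined_loss(p, y, w=(0.4, 0.4, 0.2)):
    """Weighted sum of the three terms; w is an assumed mixing, not the paper's."""
    return (w[0] * dice_loss(p, y)
            + w[1] * focal_tversky_loss(p, y)
            + w[2] * weighted_bce(p, y))
```

The Dice and Tversky terms target region overlap under class imbalance, while the weighted BCE term keeps per-pixel gradients well-behaved on noisy labels.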
Development of an Automated Scoring Model Using SentenceTransformers for Discussion Forums in Online Learning Environments
Due to the limitations of public datasets, research on automatic essay scoring in Indonesian has been held back and has resulted in suboptimal accuracy. In general, the main goal of an essay scoring system is to improve execution time over manual grading with human judgment. This study uses a discussion forum in online learning to generate an assessment between the responses and the lecturer's rubric in automated essay scoring. A SentenceTransformers pre-trained model that can construct high-quality vector embeddings was proposed to identify the semantic similarity between the responses and the lecturer's rubric. The effectiveness of monolingual and multilingual models was compared. This research aims to determine the models' effectiveness and the appropriate model for Automated Essay Scoring (AES) in paired-sentence Natural Language Processing tasks. The distiluse-base-multilingual-cased-v1 model, evaluated with the Pearson correlation method, obtained the highest performance. Specifically, it obtained a correlation value of 0.63 and a mean absolute error (MAE) of 0.70. This indicates that the overall prediction result is enhanced compared to earlier regression-task research.
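In practice the embeddings would come from the named model, e.g. SentenceTransformer('distiluse-base-multilingual-cased-v1').encode(...); the scoring step itself reduces to cosine similarity between the response and rubric vectors, sketched here on precomputed arrays (the 0-1 rescaling is our assumption, not necessarily the study's):

```python
import numpy as np

def cosine_score(resp_emb, rubric_emb):
    """Cosine similarity between a student-response embedding and the
    lecturer's rubric embedding, rescaled from [-1, 1] to a [0, 1] score."""
    cos = resp_emb @ rubric_emb / (
        np.linalg.norm(resp_emb) * np.linalg.norm(rubric_emb))
    return float((cos + 1) / 2)
```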
Deep-Learning Framework for Optimal Selection of Soil Sampling Sites
This work leverages the recent advancements of deep learning in image
processing to find optimal locations that present the important characteristics
of a field. The data for training are collected at different fields in local
farms with five features: aspect, flow accumulation, slope, NDVI (normalized
difference vegetation index), and yield. The soil sampling dataset is
challenging because the ground truth is highly imbalanced binary images.
Therefore, we approached the problem with two methods: the first utilizes a
state-of-the-art model with a convolutional neural network (CNN) backbone,
while the second is a novel deep-learning design grounded in the concepts of
the transformer and self-attention. Our framework is
constructed with an encoder-decoder architecture with the self-attention
mechanism as the backbone. In the encoder, the self-attention mechanism is the
key feature extractor, which produces feature maps. In the decoder, we
introduce atrous convolution networks to concatenate, fuse the extracted
features, and then export the optimal locations for soil sampling. Currently,
the model has achieved impressive results on the testing dataset, with a mean
accuracy of 99.52%, a mean Intersection over Union (IoU) of 57.35%, and a mean
Dice Coefficient of 71.47%, while the performance metrics of the
state-of-the-art CNN-based model are 66.08%, 3.85%, and 1.98%, respectively.
This indicates that our proposed model outperforms the CNN-based method on the
soil-sampling dataset. To the best of our knowledge, our work is the first to
provide a soil-sampling dataset with multiple attributes and leverage deep
learning techniques to enable the automatic selection of soil-sampling sites.
This work lays a foundation for novel applications of data science and
machine-learning technologies to solve other emerging agricultural problems.
Comment: This paper is the full version of a poster presented at the AI in
Agriculture Conference 2023 in Orlando, FL, US
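The three reported metrics can be computed from binary masks as follows (a standard formulation, not the authors' evaluation code):

```python
import numpy as np

def binary_seg_metrics(pred, truth):
    """Pixel accuracy, IoU, and Dice coefficient for binary 0/1 masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    acc = float((pred == truth).mean())
    iou = float(inter / union) if union else 1.0
    denom = pred.sum() + truth.sum()
    dice = float(2 * inter / denom) if denom else 1.0
    return acc, iou, dice
```

On highly imbalanced masks, pixel accuracy can be near-perfect while IoU and Dice remain much lower, which is consistent with the gap between the 99.52% mean accuracy and 57.35% mean IoU reported above.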