On Identifying Hashtags in Disaster Twitter Data
Tweet hashtags have the potential to improve the search for information
during disaster events. However, a large number of disaster-related
tweets do not have any user-provided hashtags. Moreover, only a small
number of tweets that contain actionable hashtags are useful for disaster
response. To facilitate progress on automatic identification (or extraction) of
disaster hashtags for Twitter data, we construct a unique dataset of
disaster-related tweets annotated with hashtags useful for filtering actionable
information. Using this dataset, we further investigate Long Short-Term
Memory-based models within a Multi-Task Learning framework. The best-performing
model achieves an F1-score as high as 92.22%. The dataset, code, and other
resources are available on GitHub.
Hierarchical Multi-Label Classification of Scientific Documents
Automatic topic classification has been studied extensively to assist in
managing and indexing scientific documents in a digital collection. With the
large number of topics available in recent years, it has become necessary
to arrange them in a hierarchy. Therefore, automatic classification systems
need to be able to classify the documents hierarchically. In addition, each
paper is often assigned to more than one relevant topic. For example, a paper
can be assigned to several topics in a hierarchy tree. In this paper, we
introduce a new dataset for hierarchical multi-label text classification
(HMLTC) of scientific papers called SciHTC, which contains 186,160 papers and
1,233 categories from the ACM CCS tree. We establish strong baselines for HMLTC
and propose a multi-task learning approach for topic classification with
keyword labeling as an auxiliary task. Our best model achieves a Macro-F1 score
of 34.57%, which shows that this dataset provides significant research
opportunities on hierarchical scientific topic classification. We make our
dataset and code available on GitHub.
Comment: Accepted in EMNLP 2022 main conference
Learning to Infer from Unlabeled Data: A Semi-supervised Learning Approach for Robust Natural Language Inference
Natural Language Inference (NLI) or Recognizing Textual Entailment (RTE) aims
at predicting the relation between a pair of sentences (premise and hypothesis)
as entailment, contradiction or semantic independence. Although deep learning
models have shown promising performance for NLI in recent years, they rely on
large scale expensive human-annotated datasets. Semi-supervised learning (SSL)
is a popular technique for reducing the reliance on human annotation by
leveraging unlabeled data for training. However, despite its substantial
success on single sentence classification tasks where the challenge in making
use of unlabeled data is to assign "good enough" pseudo-labels, for NLI tasks,
the nature of unlabeled data is more complex: one of the sentences in the pair
(usually the hypothesis) and the class label are both missing from the data
and require human annotation, which makes SSL for NLI more challenging. In
this paper, we propose a novel way to incorporate unlabeled data in SSL for NLI
where we use a conditional language model, BART, to generate the hypotheses for
the unlabeled sentences (used as premises). Our experiments show that our SSL
framework successfully exploits unlabeled data and substantially improves
performance on four NLI datasets in low-resource settings. We release our code
at: https://github.com/msadat3/SSL_for_NLI.
Comment: Accepted in EMNLP 2022 (Findings)
MarginMatch: Improving Semi-Supervised Learning with Pseudo-Margins
We introduce MarginMatch, a new SSL approach combining consistency
regularization and pseudo-labeling, with its main novelty arising from the use
of unlabeled data training dynamics to measure pseudo-label quality. Instead of
using only the model's confidence on an unlabeled example at an arbitrary
iteration to decide if the example should be masked or not, MarginMatch also
analyzes the behavior of the model on the pseudo-labeled examples as the
training progresses, to ensure low-quality predictions are masked out.
MarginMatch brings substantial improvements on four vision benchmarks in
low-data regimes and on two large-scale datasets, emphasizing the importance of
enforcing high-quality pseudo-labels. Notably, we obtain an improvement in
error rate over the state-of-the-art of 3.25% on CIFAR-100 with only 25 labels
per class and of 3.78% on STL-10 using as few as 4 labels per class. We make
our code available at https://github.com/tsosea2/MarginMatch.
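The masking idea described above can be sketched as follows. This is only one plausible reading of the abstract, not the authors' implementation: the margin definition (logit of the assigned class minus the largest other logit), the plain average over checkpoints, both thresholds, and the names `margin` and `keep_mask` are all assumptions introduced for illustration.

```python
import math


def margin(logits, label):
    """Logit of `label` minus the largest logit among the other classes."""
    others = [z for i, z in enumerate(logits) if i != label]
    return logits[label] - max(others)


def keep_mask(logit_history, conf_threshold=0.95, margin_threshold=0.0):
    """Return one boolean per unlabeled example: True means keep it.

    logit_history[t][i] holds the model's logits for example i at
    checkpoint t (assumed layout). An example is kept only if the current
    confidence is high AND its pseudo-label has had a non-negative
    average margin across all recorded checkpoints.
    """
    latest = logit_history[-1]
    mask = []
    for i, logits in enumerate(latest):
        # Pseudo-label = argmax of the most recent logits.
        label = max(range(len(logits)), key=logits.__getitem__)
        # Numerically stable softmax confidence at the latest checkpoint.
        top = max(logits)
        exps = [math.exp(z - top) for z in logits]
        confidence = exps[label] / sum(exps)
        # Training dynamics: average margin of this label over time.
        avg_margin = sum(margin(step[i], label) for step in logit_history)
        avg_margin /= len(logit_history)
        mask.append(confidence >= conf_threshold
                    and avg_margin >= margin_threshold)
    return mask


# Example 0 is predicted as class 0 at every checkpoint; example 1 only
# flips to class 0 at the last checkpoint, so its history is unstable.
history = [
    [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]],
    [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]],
    [[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]],
]
print(keep_mask(history))
```

Both examples have the same confidence at the final checkpoint, so a confidence-only rule would treat them identically; the history term is what separates them in this sketch.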
Dynamic Deep Multi-modal Fusion for Image Privacy Prediction
With millions of images shared online on social networking sites,
effective methods for image privacy prediction are much needed. In this
paper, we propose an approach for fusing object, scene context, and image tags
modalities derived from convolutional neural networks for accurately predicting
the privacy of images shared online. Specifically, our approach identifies the
set of most competent modalities on the fly, according to each new target image
whose privacy has to be predicted. The approach considers three stages to
predict the privacy of a target image, wherein we first identify the
neighborhood images that are visually similar and/or have similar sensitive
content as the target image. Then, we estimate the competence of the modalities
based on the neighborhood images. Finally, we fuse the decisions of the most
competent modalities and predict the privacy label for the target image.
Experimental results show that our approach predicts the sensitive (or private)
content more accurately than the models trained on individual modalities
(object, scene, and tags) and prior privacy prediction works. Also, our
approach outperforms strong baselines that train meta-classifiers to obtain an
optimal combination of modalities.
Comment: Accepted by The Web Conference (WWW) 201
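The three stages described in this abstract can be illustrated with a small sketch. It is a simplified stand-in under stated assumptions, not the paper's method: neighbors are found by Euclidean distance over a single feature vector, "competence" is taken to be a modality's accuracy on those neighbors, and fusion is a competence-weighted vote. The function name `fuse_privacy`, its parameters, and the toy data are hypothetical.

```python
import math


def fuse_privacy(target_feat, target_preds, val_feats, val_labels,
                 val_preds, k=3, min_competence=0.5):
    """Predict a privacy label by fusing the most competent modalities.

    target_preds / val_preds map a modality name to its prediction(s).
    Returns (fused_label, competence_per_modality).
    """
    # Stage 1: neighborhood = the k validation images closest to the target.
    order = sorted(range(len(val_feats)),
                   key=lambda j: math.dist(target_feat, val_feats[j]))
    neighbors = order[:k]

    # Stage 2: competence = accuracy of each modality on the neighborhood
    # (an assumed, simplified competence estimate).
    competence = {}
    for name, preds in val_preds.items():
        correct = sum(preds[j] == val_labels[j] for j in neighbors)
        competence[name] = correct / len(neighbors)

    # Stage 3: keep modalities above the threshold (fall back to the single
    # best one) and take a competence-weighted vote over their decisions.
    chosen = [m for m, c in competence.items() if c >= min_competence]
    if not chosen:
        chosen = [max(competence, key=competence.get)]
    votes = {}
    for m in chosen:
        votes[target_preds[m]] = votes.get(target_preds[m], 0.0) + competence[m]
    return max(votes, key=votes.get), competence


# Toy validation set: four images with 2-D features and known labels.
val_feats = [[0, 0], [0, 1], [1, 0], [10, 10]]
val_labels = ["private", "private", "public", "public"]
val_preds = {
    "object": ["private", "private", "public", "private"],
    "scene": ["public", "public", "public", "public"],
    "tags": ["private", "public", "public", "public"],
}
target_preds = {"object": "private", "scene": "public", "tags": "public"}

label, competence = fuse_privacy([0.1, 0.1], target_preds,
                                 val_feats, val_labels, val_preds)
print(label, competence)
```

In this toy case the scene modality is excluded because it is right on only one of the three neighbors, so the fused decision follows the object modality even though two of the three modalities vote the other way.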