SciFix: Outperforming GPT3 on Scientific Factual Error Correction
Due to the prohibitively high cost of creating error correction datasets,
most Factual Claim Correction methods rely on a powerful verification model to
guide the correction process. This leads to a significant drop in performance
in domains like scientific claims, where good verification models do not always
exist. In this work, we introduce SciFix, a scientific claim correction system
that does not require a verifier but can outperform existing methods by a
considerable margin -- achieving correction accuracy of 84% on the SciFact
dataset, 77% on SciFact-Open and 72% on the CovidFact dataset, compared to next
best accuracies of 7%, 5%, and 15% on the same datasets respectively. Our
method leverages the power of prompting with LLMs during training to create a
richly annotated dataset that can be used for fully supervised training and
regularization. We additionally use a claim-aware decoding procedure to improve
the quality of corrected claims. Our method outperforms the very LLM that was
used to generate the annotated dataset -- with Few-Shot Prompting on GPT3.5
achieving 58%, 61%, and 64% on the respective datasets, a consistently lower
correction accuracy, despite using nearly 800 times as many parameters as our
model.
Comment: To appear in proceedings of EMNLP 2023 (Findings).
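The abstract does not spell out how claim-aware decoding works, so the sketch below shows one plausible variant: rescoring candidate corrections by a weighted mix of the model's score and token overlap with the original claim, so the correction stays close to the claim being fixed. All function names, candidates, and scores here are hypothetical illustrations, not the paper's actual procedure.

```python
# Hedged sketch of a claim-aware rescoring step. The candidates and their
# model scores are made-up placeholders; a real system would draw them
# from a beam search over a trained correction model.

def token_overlap(a, b):
    """Jaccard overlap between the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def claim_aware_rescore(claim, candidates, weight=0.5):
    """Pick the candidate maximizing model score plus weighted claim overlap."""
    return max(
        candidates,
        key=lambda c: c["score"] + weight * token_overlap(claim, c["text"]),
    )

claim = "aspirin increases the risk of heart attack"
candidates = [
    {"text": "aspirin reduces the risk of heart attack", "score": 0.6},
    {"text": "exercise reduces stress", "score": 0.7},
]
best = claim_aware_rescore(claim, candidates)
print(best["text"])
```

Even though the second candidate has a higher raw model score, the overlap term steers the choice toward the correction that preserves the original claim's content.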
Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment
Social media is awash with hateful content, much of which is often veiled
with linguistic and topical diversity. The benchmark datasets used for hate
speech detection do not account for such divagation as they are predominantly
compiled using hate lexicons. However, capturing hate signals becomes
challenging in neutrally-seeded malicious content. Thus, designing models and
datasets that mimic the real-world variability of hate warrants further
investigation.
To this end, we present GOTHate, a large-scale code-mixed crowdsourced
dataset of around 51k posts for hate speech detection from Twitter. GOTHate is
neutrally seeded, encompassing different languages and topics. We conduct
detailed comparisons of GOTHate with the existing hate speech datasets,
highlighting its novelty. We benchmark it with 10 recent baselines. Our
extensive empirical and benchmarking experiments suggest that GOTHate is hard
to classify in a text-only setup. Thus, we investigate how adding endogenous
signals enhances the hate speech detection task. We augment GOTHate with the
user's timeline information and ego network, bringing the overall data source
closer to the real-world setup for understanding hateful content. Our proposed
solution HEN-mBERT is a modular, multilingual, mixture-of-experts model that
enriches the linguistic subspace with latent endogenous signals from history,
topology, and exemplars. HEN-mBERT outperforms the best baseline by 2.5% and 5%
in overall macro-F1 and hate class F1, respectively. Inspired by our
experiments, in partnership with Wipro AI, we are developing a semi-automated
pipeline to detect hateful content as a part of their mission to tackle online
harm.
Comment: 15 pages, 4 figures, 11 tables. Accepted at SIGKDD'23.
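The abstract describes a mixture-of-experts design that mixes signals from history, topology, and exemplars. The snippet below is a generic numpy sketch of softmax-gated expert mixing, not HEN-mBERT's actual architecture; all dimensions and weights are arbitrary placeholders.

```python
# Minimal mixture-of-experts gating sketch (illustrative only): a softmax
# gate weights the outputs of several expert projections of the same input
# representation, one expert per signal type (e.g. history, topology,
# exemplars). Not the paper's model.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(h, experts, gate_w):
    """Combine expert outputs weighted by a softmax gate.

    h: (d,) input representation; experts: list of (d, d) matrices;
    gate_w: (n_experts, d) gating weights.
    """
    gates = softmax(gate_w @ h)                 # (n_experts,) mixing weights
    outs = np.stack([w @ h for w in experts])   # (n_experts, d) expert outputs
    return gates @ outs                         # (d,) mixed representation

rng = np.random.default_rng(0)
d, n = 8, 3
h = rng.standard_normal(d)
experts = [rng.standard_normal((d, d)) for _ in range(n)]
gate_w = rng.standard_normal((n, d))
out = moe_forward(h, experts, gate_w)
print(out.shape)
```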
Critical Analysis of Heat Exchanger Cycle for its Maintainability Using Failure Modes and Effect Analysis and Pareto Analysis
The Failure Modes and Effect Analysis (FMEA) is an efficient evaluation technique to identify potential failures in products, processes, and services. FMEA is designed to identify and prioritize failure modes. It proves to be a useful method for identifying and correcting possible failures at the earliest possible stage, so that the consequences of poor performance can be avoided. In this paper, the FMEA tool is used to detect failures of various components of the heat exchanger cycle and to identify critical failures of components that may hamper the system's performance. Further, a detailed Pareto analysis is done to find the most critical components of the cycle, the causes of their failures, and possible recommended actions. This paper can be used as a checklist to aid the maintainability of the system.
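The standard FMEA/Pareto workflow the abstract describes can be sketched briefly: score each failure mode by a Risk Priority Number (RPN = severity x occurrence x detection), then keep the top modes that account for roughly 80% of the total RPN. The failure modes and scores below are illustrative placeholders, not values from the paper.

```python
# FMEA + Pareto sketch. RPN = S * O * D on conventional 1-10 scales; the
# Pareto cut keeps the failure modes covering ~80% of total RPN. The
# heat-exchanger failure modes listed here are hypothetical examples.

def risk_priority(modes):
    """Return (name, RPN) pairs sorted by RPN, descending."""
    scored = [(name, s * o * d) for name, s, o, d in modes]
    return sorted(scored, key=lambda x: x[1], reverse=True)

def pareto_critical(scored, threshold=0.8):
    """Keep the top modes whose cumulative RPN share reaches the threshold."""
    total = sum(rpn for _, rpn in scored)
    critical, running = [], 0.0
    for name, rpn in scored:
        critical.append((name, rpn))
        running += rpn
        if running / total >= threshold:
            break
    return critical

# Hypothetical components with (severity, occurrence, detection) scores.
modes = [
    ("tube fouling",      8, 7, 5),
    ("gasket leak",       6, 4, 3),
    ("pump seal failure", 9, 3, 4),
    ("sensor drift",      3, 5, 2),
]
scored = risk_priority(modes)
print(pareto_critical(scored))
```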
Characterizing the Entities in Harmful Memes: Who is the Hero, the Villain, the Victim?
Memes can sway people's opinions over social media as they combine visual and
textual information in an easy-to-consume manner. Since memes instantly turn
viral, it becomes crucial to infer their intent and potentially associated
harmfulness to take timely measures as needed. A common problem associated with
meme comprehension lies in detecting the entities referenced and characterizing
the role of each of these entities. Here, we aim to understand whether the meme
glorifies, vilifies, or victimizes each entity it refers to. To this end, we
address the task of role identification of entities in harmful memes, i.e.,
detecting who is the 'hero', the 'villain', and the 'victim' in the meme, if
any. We utilize HVVMemes, a dataset of US Politics and Covid-19 memes,
released recently as part of the CONSTRAINT@ACL-2022 shared-task. It contains
memes, entities referenced, and their associated roles: hero, villain, victim,
and other. We further design VECTOR (Visual-semantic role dEteCToR), a robust
multi-modal framework for the task, which integrates entity-based contextual
information into the multi-modal representation, and compare it to several
standard unimodal (text-only or image-only) or multi-modal (image+text) models.
Our experimental results show that our proposed model achieves an improvement
of 4% over the best baseline and 1% over the best competing stand-alone
submission from the shared-task. Besides divulging an extensive experimental
setup with comparative analyses, we finally highlight the challenges
encountered in addressing the complex task of semantic role labeling within
memes.
Comment: Accepted at EACL 2023 (Main Track). 9 pages (main content), Limitations, Ethical Considerations + 4 pages (refs.) + appendix; 8 figures; 5 tables; Paper ID: 80.
Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model
Large language models (LLMs) have recently reached an impressive level of
linguistic capability, prompting comparisons with human language skills.
However, there have been relatively few systematic inquiries into the
linguistic capabilities of the latest generation of LLMs, and those studies
that do exist (i) ignore the remarkable ability of humans to generalize, (ii)
focus only on English, and (iii) investigate syntax or semantics and overlook
other capabilities that lie at the heart of human language, like morphology.
Here, we close these gaps by conducting the first rigorous analysis of the
morphological capabilities of ChatGPT in four typologically varied languages
(specifically, English, German, Tamil, and Turkish). We apply a version of
Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for
the four examined languages. We find that ChatGPT massively underperforms
purpose-built systems, particularly in English. Overall, our results -- through
the lens of morphology -- cast a new light on the linguistic capabilities of
ChatGPT, suggesting that claims of human-like language skills are premature and
misleading.
Comment: EMNLP 2023.
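A wug test probes morphological generalization with nonce words the model cannot have memorized. The sketch below shows the general shape of such an item for English plurals, in the style of Berko (1958); the nonce words and the fully-regular reference answers are our own examples, not items from the paper's datasets.

```python
# Illustrative wug-test item construction for English pluralization.
# The "expected" answers assume fully regular morphology, which is a
# simplification (e.g. some speakers produce "heaves" for "heaf").

def wug_prompt(nonce):
    """Build a fill-in-the-blank pluralization prompt for a nonce noun."""
    return (f"This is a {nonce}. Now there are two of them. "
            f"There are two ____.")

def regular_plural(nonce):
    """Reference answer assuming fully regular English pluralization."""
    if nonce.endswith(("s", "x", "z", "ch", "sh")):
        return nonce + "es"
    return nonce + "s"

for w in ["wug", "tass", "heaf"]:
    print(wug_prompt(w), "->", regular_plural(w))
```

Scoring an LLM then amounts to comparing its completion of the blank against the reference inflection for each nonce item.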
The gravitational-wave background null hypothesis: Characterizing noise in millisecond pulsar arrival times with the Parkes Pulsar Timing Array
The noise in millisecond pulsar (MSP) timing data can include contributions
from observing instruments, the interstellar medium, the solar wind, solar
system ephemeris errors, and the pulsars themselves. The noise environment must
be accurately characterized in order to form the null hypothesis from which
signal models can be compared, including the signature induced by
nanohertz-frequency gravitational waves (GWs). Here we describe the noise
models developed for each of the MSPs in the Parkes Pulsar Timing Array (PPTA)
third data release, which have been used as the basis of a search for the
isotropic stochastic GW background. We model pulsar spin noise, dispersion
measure variations, scattering variations, events in the pulsar magnetospheres,
solar wind variability, and instrumental effects. We also search for new timing
model parameters and detect Shapiro delays in PSR J0614-3329 and
PSR J1902-5105. The noise and timing models are validated by testing the
normalized and whitened timing residuals for Gaussianity and residual
correlations with time. We demonstrate that the choice of noise models
significantly affects the inferred properties of a common-spectrum process.
Using our detailed models, the recovered common-spectrum noise in the PPTA is
consistent with a power law with a spectral index of -13/3, the value
predicted for a stochastic GW background from a population of supermassive
black hole binaries driven solely by GW emission.
Comment: 18 pages, 10 figures. Accepted for publication in ApJ.
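The validation step the abstract describes, testing whitened residuals for Gaussianity and for leftover correlation with time, can be sketched generically. The residuals below are simulated, not PPTA data, and the function names are our own.

```python
# Sketch of noise-model validation on whitened, normalized timing residuals:
# (1) a D'Agostino-Pearson test for Gaussianity, and (2) a Pearson
# correlation against time to flag an unmodeled trend. Illustrative only.
import numpy as np
from scipy import stats

def validate_whitened_residuals(t, r, alpha=0.01):
    """Return (gaussian_ok, uncorrelated_ok) for whitened residuals r at epochs t."""
    _, p_norm = stats.normaltest(r)   # large p: no evidence against normality
    _, p_corr = stats.pearsonr(t, r)  # large p: no residual linear trend
    return p_norm > alpha, p_corr > alpha

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 500)   # simulated observing epochs (years)
r = rng.standard_normal(500)      # ideal whitened residuals: N(0, 1)
print(validate_whitened_residuals(t, r))
```

A well-specified noise model should pass both checks; a skewed residual distribution or a trend with time signals that the model is missing a component.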