2 research outputs found
Can LLMs facilitate interpretation of pre-trained language models?
Work done to uncover the knowledge encoded within pre-trained language
models relies on annotated corpora or human-in-the-loop methods. However, these
approaches are limited in terms of scalability and the scope of interpretation.
We propose using a large language model, ChatGPT, as an annotator to enable
fine-grained interpretation analysis of pre-trained language models. We
discover latent concepts within pre-trained language models by applying
hierarchical clustering over contextualized representations and then annotate
these concepts using GPT annotations. Our findings demonstrate that ChatGPT
produces accurate and semantically richer annotations compared to
human-annotated concepts. Additionally, we showcase how GPT-based annotations
empower interpretation analysis methodologies, of which we demonstrate two: a
probing framework and neuron interpretation. To facilitate further exploration
and experimentation in this field, we have made available a substantial
ConceptNet dataset comprising 39,000 annotated latent concepts.
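
The pipeline described above (hierarchical clustering over contextualized representations, then LLM annotation of the resulting clusters) can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the model name, distance threshold, and prompt wording are assumptions, and the actual LLM call is left out.

```python
# Minimal sketch: cluster contextualized token representations and draft
# a labeling prompt for an LLM annotator. Model, threshold, and prompt
# wording are illustrative assumptions, not the paper's actual setup.
from collections import defaultdict

import torch
from sklearn.cluster import AgglomerativeClustering
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

sentences = ["The bank approved the loan.", "We sat on the river bank."]

tokens, vectors = [], []
for sent in sentences:
    enc = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)
    for tok, vec in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), hidden):
        if tok not in ("[CLS]", "[SEP]"):
            tokens.append(tok)
            vectors.append(vec.numpy())

# Agglomerative (hierarchical) clustering over the contextualized vectors;
# the distance threshold is an arbitrary choice for this sketch.
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=12.0)
labels = clustering.fit_predict(vectors)

clusters = defaultdict(list)
for tok, label in zip(tokens, labels):
    clusters[label].append(tok)

# Draft one annotation prompt per latent concept; sending it to an LLM
# (e.g., ChatGPT) is omitted here.
for label, members in clusters.items():
    prompt = (
        "Provide a short semantic label for the concept shared by these "
        f"context-specific tokens: {', '.join(members)}"
    )
    print(label, prompt)
```
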
Benchmarking Arabic AI with Large Language Models
With large Foundation Models (FMs), language technologies (AI in general) are
entering a new paradigm: eliminating the need for developing large-scale
task-specific datasets and supporting a variety of tasks through set-ups
ranging from zero-shot to few-shot learning. However, understanding FMs'
capabilities requires a systematic benchmarking effort that compares their
performance with state-of-the-art (SOTA) task-specific models. With that
goal, past work has focused on the English language, with a few efforts covering
multiple languages. Our study contributes to ongoing research by evaluating FM
performance on standard Arabic NLP and speech processing, including a range of
tasks from sequence tagging to content classification across diverse domains.
We start with zero-shot learning using GPT-3.5-turbo, Whisper, and USM,
addressing 33 unique tasks using 59 publicly available datasets, resulting in 96
test setups. For a few tasks, FMs perform on par with or exceed the performance of
the SOTA models, but for the majority they under-perform. Given the importance of
prompts for FM performance, we discuss our prompt strategies in detail and
elaborate on our findings. Our future work on Arabic AI will explore few-shot
prompting, expand the range of tasks, and investigate additional open-source
models.

Comment: Foundation Models, Large Language Models, Arabic NLP, Arabic Speech,
Arabic AI, ChatGPT Evaluation, USM Evaluation, Whisper Evaluation
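
As an illustration of the zero-shot setup described above, a minimal sketch is given below; it assumes the OpenAI Python client and an invented Arabic sentiment prompt, and is not the study's actual prompt, task, or code.

```python
# Minimal zero-shot sketch (not the study's code): classify Arabic sentiment
# with gpt-3.5-turbo. The prompt wording and label set are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def zero_shot_sentiment(text: str) -> str:
    """Ask the model for a single sentiment label, with no in-context examples."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic output for benchmarking
        messages=[
            {"role": "system", "content": "You label the sentiment of Arabic text."},
            {
                "role": "user",
                "content": (
                    "Classify the sentiment of the following Arabic sentence as "
                    f"Positive, Negative, or Neutral. Answer with one word only.\n\n{text}"
                ),
            },
        ],
    )
    return response.choices[0].message.content.strip()

print(zero_shot_sentiment("الخدمة كانت ممتازة والطعام لذيذ"))
```
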