53 research outputs found
Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic
While significant progress has been made in benchmarking Large Language
Models (LLMs) across various tasks, there is a lack of comprehensive evaluation
of their abilities in responding to multi-turn instructions in less-commonly
tested languages like Arabic. Our paper offers a detailed examination of the
proficiency of open LLMs in such scenarios in Arabic. Utilizing a customized
Arabic translation of the MT-Bench benchmark suite, we employ GPT-4 as a
uniform evaluator for both English and Arabic queries to assess and compare the
performance of the LLMs on various open-ended tasks. Our findings reveal
variations in model responses on different task categories, e.g., logic vs.
literacy, when instructed in English or Arabic. We find that base models
fine-tuned on multilingual and multi-turn datasets could be competitive with
models trained from scratch on multilingual data. Finally, we hypothesize that
an ensemble of small, open LLMs could perform competitively with proprietary
LLMs on the benchmark. Comment: Accepted at SIGARAB ArabicNLP 202
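As a rough illustration of the GPT-4-as-judge setup described above, the sketch
below scores a model response to a query on a 1-10 scale. The judging prompt,
scoring scale, and model name are assumptions for illustration, not the paper's
actual evaluation protocol.

```python
# Minimal LLM-as-judge sketch (illustrative; not the paper's exact protocol).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's answer to the user's "
    "question on a scale of 1 to 10 for helpfulness, accuracy, and fluency. "
    "Reply with the number only.\n\n"
    "Question: {question}\n\nAnswer: {answer}"
)

def judge(question: str, answer: str, model: str = "gpt-4") -> int:
    """Ask the judge model for a 1-10 score of `answer` to `question`."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question,
                                                  answer=answer)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

# Score the same kind of question posed in English and in Arabic.
print(judge("What is the capital of Morocco?", "Rabat."))
print(judge("ما هي عاصمة المغرب؟", "الرباط."))
```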
Semantic Ranking for Automated Adversarial Technique Annotation in Security Text
We introduce a new method for extracting structured threat behaviors from
threat intelligence text. Our method is based on a multi-stage ranking
architecture that allows jointly optimizing for efficiency and effectiveness.
We therefore believe this problem formulation better aligns with the
real-world nature of the task, given the large number of adversary techniques
and the extensive body of threat intelligence created by security analysts.
Our findings show that the proposed system achieves state-of-the-art results
for this task, with a top-3 recall of 81% in identifying the relevant
technique among 193 top-level techniques. Our tests also demonstrate that our
system performs significantly better (+40%) than widely used large language
models when tested under a zero-shot setting.
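The abstract does not spell out the ranking stages; as a hedged sketch of a
generic retrieve-then-rerank pipeline of the kind such multi-stage systems
typically use (the model names and the toy technique catalog below are
assumptions, not the paper's), one might write:

```python
# Illustrative two-stage ranking: fast bi-encoder retrieval followed by a
# cross-encoder re-ranker. Not the paper's architecture; models and data are
# placeholders.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Tiny, hypothetical catalog of technique descriptions.
techniques = {
    "T1566": "Phishing: adversaries send spearphishing messages to gain access.",
    "T1059": "Command and Scripting Interpreter: adversaries abuse interpreters "
             "to execute commands.",
    "T1486": "Data Encrypted for Impact: adversaries encrypt data to disrupt "
             "availability.",
}
report = "The actor delivered a malicious attachment via a spearphishing email."

# Stage 1: cheap bi-encoder retrieval narrows the candidate set.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
ids, descs = list(techniques), list(techniques.values())
sims = util.cos_sim(bi_encoder.encode(report), bi_encoder.encode(descs))[0]
candidates = sorted(zip(ids, descs, sims.tolist()), key=lambda x: -x[2])[:2]

# Stage 2: a more expensive cross-encoder re-ranks the surviving candidates.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = cross_encoder.predict([(report, d) for _, d, _ in candidates])
ranked = sorted(zip([i for i, _, _ in candidates], scores.tolist()),
                key=lambda x: -x[1])
print(ranked)  # highest-scoring technique id first
```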
LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
The recent development and success of Large Language Models (LLMs)
necessitate an evaluation of their performance across diverse NLP tasks in
different languages. Although several frameworks have been developed and made
publicly available, their customization capabilities for specific tasks and
datasets are often complex for different users. In this study, we introduce the
LLMeBench framework, which can be seamlessly customized to evaluate LLMs for
any NLP task, regardless of language. The framework features generic dataset
loaders, several model providers, and pre-implements most standard evaluation
metrics. It supports in-context learning with zero- and few-shot settings. A
specific dataset and task can be evaluated for a given LLM in less than 20
lines of code while allowing full flexibility to extend the framework for
custom datasets, models, or tasks. The framework has been tested on 31 unique
NLP tasks using 53 publicly available datasets within 90 experimental setups,
involving approximately 296K data points. We open-sourced LLMeBench for the
community (https://github.com/qcri/LLMeBench/) and a video demonstrating the
framework is available online (https://youtu.be/9cC2m_abk3A). Comment: Accepted as a demo paper at EACL 202
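The "under 20 lines of code" figure refers to LLMeBench's own asset/config
interface (see the repository linked above). The sketch below is not that API;
it is only a generic illustration of the ingredients such a zero-shot
evaluation combines: a dataset of labeled examples, a model provider call, and
a standard metric. The dataset, prompt wording, and model name are assumptions.

```python
# Generic zero-shot classification loop (illustrative; not the LLMeBench API).
from openai import OpenAI
from sklearn.metrics import accuracy_score

client = OpenAI()

# Hypothetical dataset: (text, gold label) pairs for a sentiment task.
dataset = [
    ("الخدمة كانت ممتازة", "positive"),
    ("The product arrived broken.", "negative"),
]

def zero_shot_label(text: str) -> str:
    """Ask the model for a one-word sentiment label."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Classify the sentiment of the following text as "
                              "'positive' or 'negative'. Reply with one word.\n\n"
                              + text}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

gold = [label for _, label in dataset]
pred = [zero_shot_label(text) for text, _ in dataset]
print("accuracy:", accuracy_score(gold, pred))
```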
Benchmarking Arabic AI with Large Language Models
With large Foundation Models (FMs), language technologies (AI in general) are
entering a new paradigm: eliminating the need for developing large-scale
task-specific datasets and supporting a variety of tasks through set-ups
ranging from zero-shot to few-shot learning. However, understanding FMs'
capabilities requires a systematic benchmarking effort that compares FM
performance with state-of-the-art (SOTA) task-specific models. With that
goal, past work focused on the English language and included a few efforts with
multiple languages. Our study contributes to ongoing research by evaluating FMs
performance for standard Arabic NLP and Speech processing, including a range of
tasks from sequence tagging to content classification across diverse domains.
We start with zero-shot learning using GPT-3.5-turbo, Whisper, and USM,
addressing 33 unique tasks using 59 publicly available datasets resulting in 96
test setups. For a few tasks, FMs perform on par with or exceed the
performance of the SOTA models, but for the majority they under-perform. Given
the importance of prompts for FM performance, we discuss our prompt strategies
in detail and
elaborate on our findings. Our future work on Arabic AI will explore few-shot
prompting, expand the range of tasks, and investigate additional open-source
models. Comment: Foundation Models, Large Language Models, Arabic NLP, Arabic
Speech, Arabic AI, ChatGPT Evaluation, USM Evaluation, Whisper Evaluation
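As one concrete instance of the zero-shot speech setups mentioned above, the
sketch below transcribes an Arabic audio file with the open-source whisper
package; the model size and file name are placeholders, and this is not the
paper's exact configuration.

```python
# Zero-shot Arabic ASR sketch (illustrative; model size and file are placeholders).
import whisper

model = whisper.load_model("small")          # other sizes: tiny/base/medium/large
result = model.transcribe("sample_ar.wav",   # hypothetical Arabic audio file
                          language="ar")
print(result["text"])
# Word error rate against a reference transcript can then be computed with a
# standard implementation such as the jiwer package.
```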
31st Annual Meeting and Associated Programs of the Society for Immunotherapy of Cancer (SITC 2016): part two
Background
The immunological escape of tumors represents one of the main obstacles to the treatment of malignancies. The blockade of PD-1 or CTLA-4 receptors represented a milestone in the history of immunotherapy. However, immune checkpoint inhibitors seem to be effective in specific cohorts of patients. It has been proposed that their efficacy relies on the presence of an immunological response. Thus, we hypothesized that disruption of the PD-L1/PD-1 axis would synergize with our oncolytic vaccine platform PeptiCRAd.
Methods
We used murine B16OVA in vivo tumor models and flow cytometry analysis to investigate the immunological background.
Results
First, we found that high-burden B16OVA tumors were refractory to combination immunotherapy. However, with a more aggressive schedule, tumors with a lower burden were more susceptible to the combination of PeptiCRAd and PD-L1 blockade. The therapy significantly increased the median survival of mice (Fig. 7). Interestingly, the reduced growth of contralaterally injected B16F10 cells suggested the presence of a long-lasting immunological memory also against non-targeted antigens. Concerning the functional state of tumor-infiltrating lymphocytes (TILs), we found that all the immune therapies would enhance the percentage of activated (PD-1pos TIM-3neg) T lymphocytes and reduce the amount of exhausted (PD-1pos TIM-3pos) cells compared to placebo. As expected, we found that PeptiCRAd monotherapy could increase the number of antigen-specific CD8+ T cells compared to other treatments. However, only the combination with PD-L1 blockade could significantly increase the ratio between activated and exhausted pentamer-positive cells (p = 0.0058), suggesting that by disrupting the PD-1/PD-L1 axis we could decrease the amount of dysfunctional antigen-specific T cells. We observed that the anatomical location deeply influenced the state of CD4+ and CD8+ T lymphocytes. In fact, TIM-3 expression was increased 2-fold on TILs compared to splenic and lymphoid T cells. In the CD8+ compartment, the expression of PD-1 on the surface seemed to be restricted to the tumor microenvironment, while CD4+ T cells had a high expression of PD-1 also in lymphoid organs. Interestingly, we found that the levels of PD-1 were significantly higher on CD8+ T cells than on CD4+ T cells in the tumor microenvironment (p < 0.0001).
Conclusions
In conclusion, we demonstrated that the efficacy of immune checkpoint inhibitors might be strongly enhanced by their combination with cancer vaccines. PeptiCRAd was able to increase the number of antigen-specific T cells, and PD-L1 blockade prevented their exhaustion, resulting in long-lasting immunological memory and increased median survival.