11 research outputs found
Semantic Parsing by Large Language Models for Intricate Updating Strategies of Zero-Shot Dialogue State Tracking
Zero-shot Dialogue State Tracking (DST) addresses the challenge of acquiring
and annotating task-oriented dialogues, which can be time-consuming and costly.
However, DST extends beyond simple slot-filling and requires effective updating
strategies for tracking dialogue state as conversations progress. In this
paper, we propose ParsingDST, a new In-Context Learning (ICL) method, to
introduce additional intricate updating strategies in zero-shot DST. Our
approach reformulates the DST task by leveraging powerful Large Language Models
(LLMs) and translating the original dialogue text to JSON through semantic
parsing as an intermediate state. We also design a novel framework that
includes additional modules to ensure the effectiveness of updating strategies in the
text-to-JSON process. Experimental results demonstrate that our approach
outperforms existing zero-shot DST methods on MultiWOZ, exhibiting significant
improvements in Joint Goal Accuracy (JGA) and slot accuracy compared to
existing ICL methods. Our code has been released. Comment: Accepted to the Findings of EMNLP 2023 (Short Paper).
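The text-to-JSON idea can be pictured with a short sketch. The snippet below is illustrative only, assuming a generic `llm_complete` callable and an invented prompt; the paper's actual modules and updating strategies are richer than this single merge rule.

```python
import json

# Illustrative sketch: an LLM parses each turn into a JSON fragment,
# and an explicit update rule merges it into the tracked dialogue
# state. `llm_complete` is a placeholder for any LLM completion call.

PARSE_PROMPT = (
    "Translate the dialogue turn into a JSON object of slot-value "
    "pairs. Use null for slots the user cancels.\n"
    "Turn: {turn}\nJSON:"
)

def parse_turn_to_json(llm_complete, turn: str) -> dict:
    raw = llm_complete(PARSE_PROMPT.format(turn=turn))
    return json.loads(raw)

def update_state(state: dict, parsed: dict) -> dict:
    """Apply an explicit updating strategy: delete, insert, overwrite."""
    new_state = dict(state)
    for slot, value in parsed.items():
        if value is None:      # null in the JSON marks a deletion
            new_state.pop(slot, None)
        else:                  # new or revised slot value
            new_state[slot] = value
    return new_state
```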
InstructERC: Reforming Emotion Recognition in Conversation with a Retrieval Multi-task LLMs Framework
The development of emotion recognition in dialogue (ERC) has been
consistently hindered by the complexity of pipeline designs, leading to ERC
models that often overfit to specific datasets and dialogue patterns. In this
study, we propose a novel approach, InstructERC, which reformulates the ERC task
from a discriminative framework to a generative framework based on Large
Language Models (LLMs). InstructERC makes
two significant contributions: Firstly, InstructERC introduces a simple yet
effective retrieval template module, which helps the model explicitly integrate
multi-granularity dialogue supervision information by concatenating the
historical dialog content, label statement, and emotional domain demonstrations
with high semantic similarity. Furthermore, we introduce two additional emotion
alignment tasks, namely speaker identification and emotion prediction, to
implicitly model the dialogue role relationships and future emotional
tendencies in conversations. Our LLM-based plug-and-play plugin framework
significantly outperforms all previous models and achieves comprehensive SOTA
on three commonly used ERC datasets. Extensive analysis of parameter-efficient
and data-scaling experiments provides empirical guidance for applying
InstructERC in practical scenarios. Our code will be released after blind
review.
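A rough sketch of the retrieval template idea, for intuition only: dialogue history, a label statement, and the most semantically similar demonstrations are concatenated into one generative prompt. The `embed` encoder, the scoring, and the prompt wording here are assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch of retrieval-template prompt assembly. `embed`
# is a placeholder sentence encoder returning a 1-D numpy vector.

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_demos(embed, query, pool, k=3):
    """Return the k (utterance, emotion) pairs most similar to query."""
    q = embed(query)
    return sorted(pool, key=lambda d: cosine(q, embed(d[0])), reverse=True)[:k]

def build_prompt(embed, history, target, labels, pool):
    demos = retrieve_demos(embed, target, pool)
    demo_text = "\n".join(f"{u} -> {e}" for u, e in demos)
    parts = [
        "Label set: " + ", ".join(labels),              # label statement
        "Examples:\n" + demo_text,                       # demonstrations
        "Dialogue:\n" + "\n".join(history + [target]),   # dialog history
        "Emotion of the last utterance:",
    ]
    return "\n".join(parts)
```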
OccuQuest: Mitigating Occupational Bias for Inclusive Large Language Models
The emergence of large language models (LLMs) has revolutionized natural
language processing tasks. However, existing instruction-tuning datasets suffer
from occupational bias: the majority of data relates to only a few occupations,
which hampers the ability of instruction-tuned LLMs to generate helpful responses to
professional queries from practitioners in specific fields. To mitigate this
issue and promote occupation-inclusive LLMs, we create an instruction-tuning
dataset named OccuQuest, which contains 110,000+ prompt-completion pairs
and 30,000+ dialogues covering over 1,000 occupations in 26 occupational
categories. We systematically query ChatGPT, organizing requests hierarchically
by Occupation, Responsibility, Topic, and Question, to ensure comprehensive
coverage of occupational specialty inquiries. By comparing it with three
commonly used datasets (Dolly, ShareGPT, and WizardLM), we
observe that OccuQuest exhibits a more balanced distribution across
occupations. Furthermore, we assemble three test sets for comprehensive
evaluation: an occu-test set covering 25 occupational categories, an estate set
focusing on real estate, and an occu-quora set containing real-world questions
from Quora. We then fine-tune LLaMA on OccuQuest to obtain OccuLLaMA, which
significantly outperforms state-of-the-art LLaMA variants (Vicuna, Tulu, and
WizardLM) on professional questions in GPT-4 and human evaluations. Notably, on
the occu-quora set, OccuLLaMA reaches a high win rate of 86.4% against
WizardLM.
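The Occupation -> Responsibility -> Topic -> Question hierarchy suggests a straightforward way to enumerate requests. The sketch below is a guess at that organization; the template wording and the nested-dict layout are invented for illustration and are not the released pipeline.

```python
# Illustrative sketch of enumerating prompts over an Occupation ->
# Responsibility -> Topic -> Question hierarchy.

TEMPLATE = ("As a {occupation} responsible for {responsibility}, "
            "answer this {topic} question: {question}")

def build_requests(hierarchy: dict) -> list[str]:
    """hierarchy: occupation -> responsibility -> topic -> [questions]."""
    prompts = []
    for occupation, responsibilities in hierarchy.items():
        for responsibility, topics in responsibilities.items():
            for topic, questions in topics.items():
                for question in questions:
                    prompts.append(TEMPLATE.format(
                        occupation=occupation,
                        responsibility=responsibility,
                        topic=topic, question=question))
    return prompts
```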
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
Mathematical reasoning is a challenging task for large language models
(LLMs), yet its scaling relationship with respect to LLM capacity is
under-explored. In this paper, we investigate how the pre-training loss,
supervised data amount, and augmented data amount influence the reasoning
performances of a supervised LLM. We find that pre-training loss is a better
indicator of the model's performance than the model's parameter count. We apply
supervised fine-tuning (SFT) with different amounts of supervised data and
empirically find a log-linear relation between data amount and model
performance, and find that better models improve less with enlarged supervised
datasets. To augment data and improve model performance without any human
effort, we propose applying Rejection sampling Fine-Tuning (RFT). RFT
uses supervised models to generate and collect correct reasoning paths as
augmented fine-tuning datasets. We find that when the augmented samples contain
more distinct reasoning paths, RFT improves mathematical reasoning performance
more. We also find that RFT brings larger improvements for less performant
LLMs. Furthermore, combining rejection samples from multiple models pushes
LLaMA-7B to an accuracy of 49.3% on GSM8K, significantly outperforming the
supervised fine-tuning (SFT) accuracy of 35.9%. Comment: Working in Progress.
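The RFT data-collection step can be summarized in a few lines. A minimal sketch, assuming placeholder `sample_paths` and `extract_answer` functions; the paper's distinctness criterion for reasoning paths is approximated here by plain string deduplication.

```python
# Illustrative sketch of RFT data collection. `sample_paths(q, k)` is a
# placeholder that draws k reasoning paths from the supervised model,
# and `extract_answer(path)` parses the final answer.

def collect_rft_data(sample_paths, extract_answer, dataset, k=16):
    augmented = []
    for ex in dataset:                          # ex: {"question", "answer"}
        seen = set()
        for path in sample_paths(ex["question"], k):
            if extract_answer(path) != ex["answer"]:
                continue                        # reject incorrect paths
            if path.strip() in seen:
                continue                        # keep distinct paths only
            seen.add(path.strip())
            augmented.append({"question": ex["question"], "path": path})
    return augmented
```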
ChatKBQA: A Generate-then-Retrieve Framework for Knowledge Base Question Answering with Fine-tuned Large Language Models
Knowledge Base Question Answering (KBQA) aims to derive answers to natural
language questions over large-scale knowledge bases (KBs), which are generally
divided into two research components: knowledge retrieval and semantic parsing.
However, three core challenges remain, including inefficient knowledge
retrieval, retrieval errors adversely affecting semantic parsing, and the
complexity of previous KBQA methods. In the era of large language models
(LLMs), we introduce ChatKBQA, a novel generate-then-retrieve KBQA framework
built on fine-tuning open-source LLMs such as Llama-2, ChatGLM2 and Baichuan2.
ChatKBQA first generates the logical form with a fine-tuned LLM, then retrieves
and replaces entities and relations through an unsupervised retrieval method,
which improves both generation and retrieval in a simpler, more direct way.
Experimental results reveal that ChatKBQA achieves new state-of-the-art
performance on the standard KBQA datasets WebQSP and
ComplexWebQuestions (CWQ). This work also provides a new paradigm for combining
LLMs with knowledge graphs (KGs) for interpretable and knowledge-required
question answering. Our code is publicly available. Comment: Preprint.
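The generate-then-retrieve flow can be sketched compactly. Below is an illustrative version only: `llm_generate`, the bracketed-mention syntax, and the use of string similarity as the unsupervised retriever are all assumptions rather than the paper's actual components.

```python
import difflib
import re

# Illustrative sketch of generate-then-retrieve. `llm_generate` drafts
# a logical form whose entity/relation mentions are wrapped in square
# brackets; each mention is then replaced by its closest match in the
# KB vocabulary via simple string similarity.

def nearest(mention: str, vocab: list[str]) -> str:
    match = difflib.get_close_matches(mention, vocab, n=1, cutoff=0.0)
    return match[0] if match else mention

def chat_kbqa(llm_generate, question: str, kb_vocab: list[str]) -> str:
    draft = llm_generate(question)  # e.g. "(JOIN [place of birth] [obama])"
    return re.sub(r"\[([^\]]+)\]",
                  lambda m: nearest(m.group(1), kb_vocab),
                  draft)
```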
Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task
With the increasing capabilities of large language models (LLMs), these
high-performance models have achieved state-of-the-art results on a wide range
of natural language processing (NLP) tasks. However, the models' performance on
commonly-used benchmark datasets often fails to accurately reflect their
reliability and robustness when applied to real-world noisy data. To address
these challenges, we propose a unified robustness evaluation framework based on
the slot-filling task to systematically evaluate the dialogue understanding
capability of LLMs in diverse input perturbation scenarios. Specifically, we
construct an input perturbation evaluation dataset, Noise-LLM, which contains
five types of single perturbation and four types of mixed perturbation data.
Furthermore, we utilize a multi-level data augmentation method (character,
word, and sentence levels) to construct a candidate data pool, and carefully
design two automatic task demonstration construction strategies
(instance-level and entity-level) with various prompt templates. Our aim is to
assess how well various robustness methods of LLMs perform in real-world noisy
scenarios. Our experiments demonstrate that current open-source LLMs generally
achieve limited robustness to perturbations. Based on these experimental
observations, we make some forward-looking suggestions to fuel research in
this direction. Comment: Accepted at NLPCC 2023 (Oral Presentation).
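For intuition, here is a minimal sketch of what character-, word-, and sentence-level perturbations can look like. The three operations (adjacent-character swap, random word drop, filler insertion) are invented examples; Noise-LLM's actual perturbation taxonomy is broader.

```python
import random

# Illustrative multi-level input perturbations: one simple operation
# per level (character, word, sentence).

def char_perturb(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters (simulates a typo)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def word_perturb(text: str, rng: random.Random) -> str:
    """Drop one random word (simulates speech recognition loss)."""
    words = text.split()
    if len(words) < 2:
        return text
    words.pop(rng.randrange(len(words)))
    return " ".join(words)

def sentence_perturb(text: str, rng: random.Random) -> str:
    """Prepend a colloquial filler (simulates spoken-style noise)."""
    fillers = ["um,", "you know,", "well,"]
    return f"{rng.choice(fillers)} {text}"
```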
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
The rapid development of large language models has revolutionized code
intelligence in software development. However, the predominance of
closed-source models has restricted extensive research and development. To
address this, we introduce the DeepSeek-Coder series, a range of open-source
code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion
tokens. These models are pre-trained on a high-quality project-level code
corpus and employ a fill-in-the-blank task with a 16K window to enhance code
generation and infilling. Our extensive evaluations demonstrate that
DeepSeek-Coder not only achieves state-of-the-art performance among open-source
code models across multiple benchmarks but also surpasses existing
closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models
are under a permissive license that allows for both research and unrestricted
commercial use.
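The fill-in-the-blank objective mentioned above is commonly implemented as fill-in-the-middle (FIM) training. A minimal sketch under that assumption follows; the sentinel token names are the generic FIM convention, not necessarily DeepSeek-Coder's exact special tokens.

```python
import random

# Illustrative FIM instance construction: split a document into
# prefix/middle/suffix and rearrange with sentinel tokens so the model
# learns to infill the middle from its surrounding context.

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def make_fim_example(document: str, rng: random.Random) -> str:
    a, b = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    # PSM layout: the model sees prefix and suffix, then predicts middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```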
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
The rapid development of open-source large language models (LLMs) has been
truly remarkable. However, the scaling law described in previous literature
presents varying conclusions, which casts a dark cloud over scaling LLMs. We
delve into the study of scaling laws and present our distinctive findings that
facilitate the scaling of large-scale models in two commonly used open-source
configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek
LLM, a project dedicated to advancing open-source language models with a
long-term perspective. To support the pre-training phase, we have developed a
dataset that currently consists of 2 trillion tokens and is continuously
expanding. We further conduct supervised fine-tuning (SFT) and Direct
Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the
creation of DeepSeek Chat models. Our evaluation results demonstrate that
DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in
the domains of code, mathematics, and reasoning. Furthermore, open-ended
evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance
compared to GPT-3.5.
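The DPO step referenced here has a standard closed-form loss (from the original DPO formulation, not this paper). A minimal sketch for a single preference pair, with log-probabilities assumed to come from the policy and a frozen reference model:

```python
import math

# Standard DPO loss for one preference pair (chosen y_w, rejected y_l):
# push the policy to prefer y_w over y_l relative to a frozen reference
# model, with strength controlled by beta.

def dpo_loss(logp_policy_w, logp_policy_l, logp_ref_w, logp_ref_l,
             beta=0.1):
    margin = beta * ((logp_policy_w - logp_ref_w)
                     - (logp_policy_l - logp_ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log sigmoid
```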
State Evaluation and Fault Prediction of Protection System Equipment Based on Digital Twin Technology
Digital twin technology aims to build a map of physical entities in virtual space and to simulate the real-time state and dynamic characteristics of physical devices through a bi-directional interactive data flow. To guarantee the safe and stable operation of protection system equipment in intelligent substations and to improve the efficiency of protection system operation and condition maintenance, this paper proposes a protection system state evaluation and fault prediction method based on digital twin technology. The architecture, application, and operation control mode of digital twin technology in real-time state analysis of a protection system are studied. A state evaluation model based on matter-element extension and a fault prediction model based on a clustering algorithm are constructed. By analyzing the historical data of an intelligent substation protection system in actual operation, a database that can be updated and corrected in real time is built. The effectiveness and accuracy of the state evaluation and fault prediction method are verified with actual cases, providing technical support for the operation and maintenance of the protection system.
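As a rough illustration of the clustering-based fault prediction idea: cluster historical healthy operating states and flag new measurements far from every centroid. The features, cluster count, and threshold are assumptions; the paper's matter-element extension model is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative sketch: cluster historical healthy operating-state
# vectors of the protection equipment, then flag a new measurement as
# a fault precursor when it lies far from every cluster centroid.

def fit_normal_clusters(history: np.ndarray, k: int = 4) -> KMeans:
    """history: (n_samples, n_features) of healthy operating states."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(history)

def is_fault_precursor(model: KMeans, sample: np.ndarray,
                       threshold: float) -> bool:
    dists = np.linalg.norm(model.cluster_centers_ - sample, axis=1)
    return bool(dists.min() > threshold)
```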