Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models
There has been significant progress in Text-To-Speech (TTS) synthesis
technology in recent years, thanks to the advancement in neural generative
modeling. However, existing methods on any-speaker adaptive TTS have achieved
unsatisfactory performance, due to their suboptimal accuracy in mimicking the
target speakers' styles. In this work, we present Grad-StyleSpeech, which is an
any-speaker adaptive TTS framework that is based on a diffusion model that can
generate highly natural speech with extremely high similarity to target
speakers' voices, given a few seconds of reference speech. Grad-StyleSpeech
significantly outperforms recent speaker-adaptive TTS baselines on English
benchmarks. Audio samples are available at
https://nardien.github.io/grad-stylespeech-demo.
Comment: ICASSP 202
KALA: Knowledge-Augmented Language Model Adaptation
Pre-trained language models (PLMs) have achieved remarkable success on
various natural language understanding tasks. Simple fine-tuning of PLMs, on
the other hand, might be suboptimal for domain-specific tasks because they
cannot possibly cover knowledge from all domains. While adaptive pre-training
of PLMs can help them obtain domain-specific knowledge, it requires a large
training cost. Moreover, adaptive pre-training can harm the PLM's performance
on the downstream task by causing catastrophic forgetting of its general
knowledge. To overcome such limitations of adaptive pre-training for PLM
adaptation, we propose a novel domain adaptation framework for PLMs called
Knowledge-Augmented Language model Adaptation (KALA), which modulates the
intermediate hidden representations of PLMs with domain knowledge, consisting
of entities and their relational facts. We validate the performance of KALA
on question answering and named entity recognition tasks on multiple datasets
across various domains. The results show that, despite being computationally
efficient, KALA largely outperforms adaptive pre-training. Code is
available at: https://github.com/Nardien/KALA/.
Comment: NAACL 202
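The abstract above says only that KALA modulates intermediate hidden representations with domain knowledge; it does not give the form of that modulation. As a rough illustration, the sketch below assumes a feature-wise scale-and-shift driven by an entity embedding. The function names (`linear`, `modulate`), the parameter layout, and the pass-through rule for tokens without a linked entity are all hypothetical and are not taken from the KALA code.

```python
from typing import List, Optional

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Return W x + b for a small dense layer (plain Python, no framework)."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(weights, bias)
    ]


def modulate(
    hidden: Vector,
    entity: Optional[Vector],
    w_gamma: Matrix,
    b_gamma: Vector,
    w_beta: Matrix,
    b_beta: Vector,
) -> Vector:
    """Scale and shift one token's hidden state using an entity embedding.

    Assumption: tokens with no linked entity pass through unchanged.
    """
    if entity is None:
        return list(hidden)
    gamma = linear(w_gamma, b_gamma, entity)  # per-feature scale offset
    beta = linear(w_beta, b_beta, entity)     # per-feature shift
    # (1 + gamma) keeps the layer close to identity when gamma is near zero.
    return [(1.0 + g) * h + b for g, h, b in zip(gamma, hidden, beta)]


# Usage: a 2-dimensional hidden state modulated by a 2-dimensional entity vector.
example = modulate(
    hidden=[1.0, 2.0],
    entity=[1.0, 0.0],
    w_gamma=[[0.5, 0.0], [0.0, 0.0]],
    b_gamma=[0.0, 0.0],
    w_beta=[[0.0, 0.0], [1.0, 0.0]],
    b_beta=[0.0, 0.0],
)
```

Writing the scale as `1 + gamma` is a design choice of this sketch: with small initial weights the modulated layer starts near the identity, so the pre-trained representations are left largely intact.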
ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models
Emotional Text-To-Speech (TTS) is an important task in the development of
systems (e.g., human-like dialogue agents) that require natural and emotional
speech. Existing approaches, however, only aim to produce emotional TTS for
speakers seen during training, without considering generalization to
unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive
emotion-controllable TTS model that allows users to synthesize any speaker's
emotional speech using only a short, neutral speech segment and the target
emotion label. Specifically, to enable a zero-shot adaptive TTS model to
synthesize emotional speech, we propose domain adversarial learning and
guidance methods on the diffusion model. Experimental results demonstrate that
ZET-Speech successfully synthesizes natural and emotional speech with the
desired emotion for both seen and unseen speakers. Samples are at
https://ZET-Speech.github.io/ZET-Speech-Demo/.
Comment: Accepted by INTERSPEECH 202
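The abstract mentions guidance on the diffusion model without saying which kind. To show the general idea, the sketch below assumes a classifier-free-guidance-style combination of two noise estimates, one conditioned on the emotion label and one unconditioned. This is a commonly used guidance form and is not confirmed to be the ZET-Speech method; `guided_noise` and its parameters are hypothetical names.

```python
from typing import List


def guided_noise(
    eps_uncond: List[float],
    eps_cond: List[float],
    scale: float,
) -> List[float]:
    """Combine unconditional and emotion-conditioned noise estimates.

    Assumed form: eps = eps_uncond + scale * (eps_cond - eps_uncond).
    scale = 0 ignores the condition, scale = 1 returns the conditional
    estimate, and scale > 1 pushes further toward the condition.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("noise estimates must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Usage: the two estimates would come from the same denoiser, queried with
# and without the target emotion label at one sampling step.
step = guided_noise([0.0, 1.0], [1.0, 1.0], scale=2.0)
```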
Knowledge Graph-Augmented Language Models for Knowledge-Grounded Dialogue Generation
Language models have achieved impressive performance on dialogue generation
tasks. However, when generating responses for a conversation that requires
factual knowledge, they are far from perfect, due to an absence of mechanisms
to retrieve, encode, and reflect the knowledge in the generated responses. Some
knowledge-grounded dialogue generation methods tackle this problem by
leveraging facts from Knowledge Graphs (KGs); however, they do not guarantee
that the model utilizes a relevant piece of knowledge from the KG. To overcome
this limitation, we propose SUbgraph Retrieval-augmented GEneration (SURGE), a
framework for generating context-relevant and knowledge-grounded dialogues with
the KG. Specifically, our SURGE framework first retrieves the relevant subgraph
from the KG, and then enforces consistency across facts by perturbing their
word embeddings conditioned on the retrieved subgraph. Then, we utilize
contrastive learning to ensure that the generated texts have high similarity to
the retrieved subgraphs. We validate our SURGE framework on the OpendialKG and
KOMODIS datasets, showing that it generates high-quality dialogues that
faithfully reflect the knowledge from the KG.
Comment: Preprint. Under revie
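The contrastive step described above can be illustrated with a generic InfoNCE-style loss, where each generated text embedding is pulled toward its own retrieved-subgraph embedding and pushed away from the other subgraphs in the batch. The abstract does not state the exact objective, so the cosine similarity, the temperature value, and the names `cosine` and `info_nce` are assumptions of this sketch.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(
    text_embs: List[Vector],
    graph_embs: List[Vector],
    temperature: float = 0.1,
) -> float:
    """Mean InfoNCE loss; text_embs[i] is paired with graph_embs[i]."""
    losses = []
    for i, text in enumerate(text_embs):
        logits = [cosine(text, graph) / temperature for graph in graph_embs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Usage: correctly paired embeddings give a much lower loss than swapped ones.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```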
Self-Distillation for Further Pre-training of Transformers
Pre-training a large transformer model on a massive amount of unlabeled data
and fine-tuning it on labeled datasets for diverse downstream tasks has proven
to be a successful strategy for a variety of vision and natural language
processing tasks. However, direct fine-tuning of the pre-trained model may be
suboptimal if there exist large discrepancies across data domains for
pre-training and fine-tuning. To tackle this issue, several previous studies
have proposed further pre-training strategies, where we continue to pre-train
the model on the target unlabeled dataset before fine-tuning. However, all of
them solely focus on language models and we empirically find that a Vision
Transformer is vulnerable to overfitting as we continue to pre-train the model
on target unlabeled data. In order to tackle this limitation, we propose
self-distillation as a regularization for a further pre-training stage.
Specifically, we first further pre-train the initial pre-trained model on the
target unlabeled data and then consider it as a teacher for self-distillation.
Then we take the same initial pre-trained model as a student and enforce its
hidden representations to be close to those of the teacher while optimizing the
student with a masked auto-encoding objective. We empirically validate the
efficacy of self-distillation on a variety of benchmark datasets for image and
text classification tasks. Experimentally, we show that our proposed method
outperforms all the relevant baselines. Theoretically, we analyze the proposed
method with a simplified model to understand how self-distillation for further
pre-training can potentially help improve the performance of the downstream
tasks.
Comment: ICLR 202
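The student objective described above combines a masked auto-encoding loss with a term that keeps the student's hidden representations close to the teacher's. The sketch below shows one way such a combined loss could be computed on plain lists. The use of mean squared error for both terms, the `distill_weight` parameter, and the function names are assumptions of this illustration, not details from the paper.

```python
from typing import List, Sequence


def mean_squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def self_distillation_loss(
    student_hidden: List[float],
    teacher_hidden: List[float],
    reconstruction: List[float],
    target: List[float],
    mask: List[bool],
    distill_weight: float = 1.0,
) -> float:
    """Masked reconstruction loss plus a hidden-state distillation penalty.

    The reconstruction term is averaged over masked positions only, as in
    masked auto-encoding. The teacher is treated as fixed, so only the
    student would receive gradients in a real training loop.
    """
    masked_pairs = [(r, t) for r, t, m in zip(reconstruction, target, mask) if m]
    recon_loss = sum((r - t) ** 2 for r, t in masked_pairs) / max(len(masked_pairs), 1)
    distill_loss = mean_squared_error(student_hidden, teacher_hidden)
    return recon_loss + distill_weight * distill_loss


# Usage: one masked position and a two-dimensional hidden state.
loss = self_distillation_loss(
    student_hidden=[1.0, 2.0],
    teacher_hidden=[1.0, 4.0],
    reconstruction=[0.0, 1.0, 5.0],
    target=[0.0, 3.0, 9.0],
    mask=[False, True, False],
    distill_weight=0.5,
)
```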