A3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing
Recently, speech representation learning has improved many speech-related
tasks such as speech recognition, speech classification, and speech-to-text
translation. However, these tasks all concern speech understanding; in the
inverse direction, speech synthesis, the potential of representation learning
is yet to be realized, due to the challenging nature of generating
high-quality speech. To address this problem, we propose our framework,
Alignment-Aware Acoustic-Text Pretraining (A3T), which reconstructs masked
acoustic signals with text input and acoustic-text alignment during training.
In this way, the pretrained model can generate high-quality reconstructed
spectrograms, which can be applied directly to speech editing and
unseen-speaker TTS. Experiments show A3T outperforms SOTA models on speech
editing and improves multi-speaker speech synthesis without an external
speaker verification model.
Comment: under review, 12 pages, 10 figures
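The abstract gives no code; the sketch below only illustrates the core pretraining operation it describes, masking aligned spectrogram spans that the model must then reconstruct from text and alignment. The helper `mask_aligned_spans` and its interface are hypothetical.

```python
import numpy as np

def mask_aligned_spans(spec, alignment, mask_ratio=0.5, seed=0):
    """Mask whole aligned spans of a spectrogram (hypothetical helper).

    spec      -- (T, F) array of spectrogram frames
    alignment -- list of (start, end) frame spans, one per text token
    Returns the masked spectrogram and a boolean frame mask; during
    pretraining the model reconstructs exactly the masked frames from
    the text input and the acoustic-text alignment.
    """
    rng = np.random.default_rng(seed)
    masked = spec.copy()
    mask = np.zeros(len(spec), dtype=bool)
    for start, end in alignment:
        if rng.random() < mask_ratio:   # mask whole spans, not single frames
            masked[start:end] = 0.0
            mask[start:end] = True
    return masked, mask

spec = np.random.default_rng(1).random((10, 4))
masked, mask = mask_aligned_spans(spec, [(0, 3), (3, 6), (6, 10)], mask_ratio=1.0)
```

Masking contiguous aligned spans, rather than isolated frames, is what forces the model to lean on the text and alignment instead of interpolating from neighboring frames.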
Towards a Knowledge Graph based Speech Interface
Applications which use human speech as an input require a speech interface
with high recognition accuracy. The words or phrases in the recognised text are
annotated with a machine-understandable meaning and linked to knowledge graphs
for further processing by the target application. These semantic annotations of
recognised words can be represented as subject-predicate-object triples,
which collectively form a graph often referred to as a knowledge graph. This
type of knowledge representation lets speech interfaces be used with any
spoken-input application: since the information is represented in a logical,
semantic form, it can be stored and retrieved using standard web query
languages. In this work, we develop a methodology for linking speech input to
knowledge graphs and study the impact of recognition errors in the overall
process. We show that for a corpus with lower WER, the annotation and linking
of entities to the DBpedia knowledge graph succeed to a considerable extent.
DBpedia Spotlight, a tool for interlinking text documents with linked open
data, is used to link the speech recognition output to the DBpedia knowledge
graph. Such a knowledge-based speech recognition interface is useful for
applications such as question answering or spoken dialog systems.
Comment: Under Review in International Workshop on Grounding Language
Understanding, Satellite of Interspeech 201
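As an illustration of the triple representation described above, the sketch below turns Spotlight-style entity annotations into subject-predicate-object triples. The annotation field names mimic the JSON returned by DBpedia Spotlight's /annotate endpoint, but the utterance URI and the choice of predicate here are illustrative assumptions.

```python
def annotations_to_triples(utterance_uri, annotations):
    """Convert Spotlight-style entity annotations into RDF-like triples.

    `annotations` mimics the `Resources` list of a DBpedia Spotlight
    /annotate JSON response; each entry links a surface form in the
    recognised text to a DBpedia resource URI.
    """
    predicate = "http://www.w3.org/2005/11/its/rdf#taIdentRef"  # ITS 2.0 identity link
    return [(utterance_uri, predicate, ann["@URI"]) for ann in annotations]

# Hypothetical annotations for the recognised utterance "Berlin is in Germany".
triples = annotations_to_triples(
    "urn:utterance:1",
    [{"@surfaceForm": "Berlin", "@URI": "http://dbpedia.org/resource/Berlin"},
     {"@surfaceForm": "Germany", "@URI": "http://dbpedia.org/resource/Germany"}],
)
```

Once in triple form, the recognised utterance can be stored in any triple store and queried with SPARQL, which is the property the abstract highlights.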
Few-Shot Spoken Language Understanding via Joint Speech-Text Models
Recent work on speech representation models jointly pre-trained with text has
demonstrated the potential of improving speech representations by encoding
speech and text in a shared space. In this paper, we leverage such shared
representations to address the persistent challenge of limited data
availability in spoken language understanding tasks. By employing a pre-trained
speech-text model, we find that models fine-tuned on text can be effectively
transferred to speech testing data. With as little as 1 hour of labeled speech
data, our proposed approach achieves comparable performance on spoken language
understanding tasks (specifically, sentiment analysis and named entity
recognition) when compared to previous methods using speech-only pre-trained
models fine-tuned on 10 times more data. Beyond the proof-of-concept study, we
also analyze the latent representations. We find that the bottom layers of
speech-text models are largely task-agnostic and align speech and text
representations into a shared space, while the top layers are more
task-specific.
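The layer analysis in the last sentences can be probed with a simple statistic: the mean cosine similarity between paired speech and text representations at each layer. The sketch below is an illustration of that idea, not the paper's actual probing setup.

```python
import numpy as np

def crossmodal_alignment(speech_layers, text_layers):
    """Mean cosine similarity between paired speech/text vectors per layer.

    Each list element is an (N, D) array of N paired examples; a high
    value at the bottom layers would indicate a shared space.
    """
    sims = []
    for s, t in zip(speech_layers, text_layers):
        s = s / np.linalg.norm(s, axis=1, keepdims=True)
        t = t / np.linalg.norm(t, axis=1, keepdims=True)
        sims.append(float((s * t).sum(axis=1).mean()))
    return sims

# Toy check: identical pairs align perfectly, orthogonal pairs not at all.
aligned = np.eye(2)
orthogonal = np.eye(2)[::-1]
sims = crossmodal_alignment([aligned, aligned], [aligned, orthogonal])
```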
RepCodec: A Speech Representation Codec for Speech Tokenization
With the recent rapid growth of large language models (LLMs), discrete speech
tokenization has played an important role in injecting speech into LLMs.
However, this discretization gives rise to a loss of information, consequently
impairing overall performance. To improve the performance of these discrete
speech tokens, we present RepCodec, a novel speech representation codec for
semantic speech tokenization. In contrast to audio codecs which reconstruct the
raw audio, RepCodec learns a vector quantization codebook through
reconstructing speech representations from speech encoders like HuBERT or
data2vec. Together, the speech encoder, the codec encoder and the vector
quantization codebook form a pipeline for converting speech waveforms into
semantic tokens. Extensive experiments show that RepCodec, by virtue
of its enhanced information retention capacity, significantly outperforms the
widely used k-means clustering approach in both speech understanding and
generation. Furthermore, this superiority extends across various speech
encoders and languages, affirming the robustness of RepCodec. We believe our
method can facilitate large language modeling research on speech processing.
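Once a codebook is learned, tokenization reduces to a nearest-neighbour lookup over encoder frames; the same lookup underlies the k-means baseline the abstract compares against. A minimal numpy sketch of that final quantization step, with a toy random-free codebook, purely for illustration:

```python
import numpy as np

def quantize(reps, codebook):
    """Map each speech-representation frame to its nearest codebook entry.

    reps     -- (T, D) frames from a speech encoder (e.g. a HuBERT layer)
    codebook -- (K, D) learned code vectors
    Returns (T,) integer ids: the discrete semantic speech tokens.
    """
    # Squared Euclidean distance from every frame to every code vector.
    d2 = ((reps[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

codebook = np.eye(3)                       # toy 3-entry codebook
reps = np.array([[0.9, 0.1, 0.0],          # closest to code 0
                 [0.0, 0.2, 1.1]])         # closest to code 2
tokens = quantize(reps, codebook)
```

RepCodec's contribution is how the codebook is trained (reconstructing the encoder representations rather than raw audio); the lookup itself stays this simple.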
CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training
Speech or text representation generated by pre-trained models contains
modal-specific information that could be combined for benefiting spoken
language understanding (SLU) tasks. In this work, we propose a novel
pre-training paradigm termed Continuous Integrate-and-Fire Pre-Training
(CIF-PT). It relies on a simple but effective frame-to-token alignment:
continuous integrate-and-fire (CIF) to bridge the representations between
speech and text. It jointly performs speech-to-text training and language model
distillation through CIF as the pre-training (PT). Evaluated on SLU benchmark
SLURP dataset, CIF-PT outperforms the state-of-the-art model by 1.94%
accuracy and 2.71% SLU-F1 on intent classification and slot filling,
respectively. We also observe that the cross-modal representation extracted
by CIF-PT outperforms other neural interfaces for SLU tasks, including the
dominant speech representation learned from self-supervised pre-training.
Comment: Accepted by ACL 2023 Findings
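The frame-to-token alignment at the heart of CIF can be stated in a few lines: accumulate a per-frame weight and fire a token boundary whenever the accumulator crosses a threshold. The sketch below is simplified; full CIF also splits the crossing frame's representation fractionally between adjacent tokens.

```python
def cif_fire(weights, threshold=1.0):
    """Continuous integrate-and-fire over per-frame weights.

    Returns the frame index at which each token 'fires', i.e. where the
    running sum of weights crosses the threshold (simplified: the
    crossing frame's weight is not split between adjacent tokens'
    representations).
    """
    acc, boundaries = 0.0, []
    for t, w in enumerate(weights):
        acc += w
        while acc >= threshold:     # a large weight may fire several tokens
            boundaries.append(t)
            acc -= threshold
    return boundaries

# Four frames whose weights integrate to two token boundaries.
boundaries = cif_fire([0.4, 0.4, 0.4, 0.8])
```

Because the weights are learned, the model itself decides how many frames each token spans, which is what bridges the variable-rate speech frames to the token-rate text side.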
Using Ontology-Based Approaches to Representing Speech Transcripts for Automated Speech Scoring
Text representation is a process of transforming text into formats that computer systems can use for subsequent information-related tasks such as text classification. Representing text faces two main challenges: meaningfulness of representation and unknown terms. Research has shown that these challenges can be resolved by using the rich semantics in ontologies. This study aims to address these challenges by using ontology-based representation and unknown term reasoning approaches in the context of content scoring of speech, which is a less explored area compared to more common ones such as categorizing text corpora (e.g. 20 Newsgroups and Reuters).
From the perspective of language assessment, the increasing number of language learners taking second-language tests makes automatic scoring an attractive alternative to human scoring for delivering rapid and objective scores of written and spoken test responses. This study focuses on the speaking section of second-language tests and investigates ontology-based approaches to speech scoring. Most previous automated speech scoring systems for spontaneous responses of test takers assess speech primarily using acoustic features such as fluency and pronunciation, while text features are far less exploited. As content is an integral part of speech, the study is motivated by the lack of rich text features in speech scoring and is designed to examine the effects of different text features on scoring performance.
A central question to the study is how speech transcript content can be represented in an appropriate form for speech scoring. Previously used approaches from essay and speech scoring systems include bag-of-words and latent semantic analysis representations, which are adopted as baselines in this study; the experimental approaches are ontology-based, which can help improve the meaningfulness of representation units and estimate the importance of unknown terms. Two general-domain ontologies, WordNet and Wikipedia, are used respectively for ontology-based representations. In addition to comparing representation approaches, the author analyzes which parameter option leads to the best performance within a particular representation.
The experimental results show that on average, ontology-based representations slightly enhance speech scoring performance on all measurements when combined with the bag-of-words representation; reasoning about unknown terms can increase performance on one measurement (cos.w4) but decreases it on others. Due to the small data size, the significance test (t-test) shows that the enhancement from ontology-based representations is inconclusive.
The contributions of the study include: 1) it examines the effects of different representation approaches on speech scoring tasks; 2) it enhances the understanding of the mechanisms of representation approaches and their parameter options via in-depth analysis; 3) the representation methodology and framework can be applied to other tasks such as automatic essay scoring.
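The ontology-based representation idea above can be illustrated with a toy bag-of-concepts: tokens that map to the same ontology concept are counted together, which is what makes the representation units more meaningful than raw word counts. The concept table below is a hypothetical stand-in for the WordNet/Wikipedia lookups the study uses.

```python
from collections import Counter

# Toy stand-in for an ontology lookup (WordNet-style synset ids).
CONCEPTS = {
    "car": "vehicle.n.01",
    "automobile": "vehicle.n.01",   # synonym collapses onto the same unit
    "dog": "dog.n.01",
}

def bag_of_concepts(transcript_tokens):
    """Like bag-of-words, but counts ontology concepts; unknown terms
    fall back to the surface form (the gap the study's unknown-term
    reasoning is meant to fill)."""
    return Counter(CONCEPTS.get(tok, tok) for tok in transcript_tokens)

rep = bag_of_concepts(["car", "automobile", "dog", "qux"])
```

In a plain bag-of-words, "car" and "automobile" would be unrelated dimensions; collapsing them onto one concept is exactly the meaningfulness gain the abstract describes.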
Bootstrapping meaning through listening: Unsupervised learning of spoken sentence embeddings
Inducing semantic representations directly from speech signals is a highly challenging task but has many useful applications in speech mining and spoken language understanding. This study tackles the unsupervised learning of semantic representations for spoken utterances. By converting speech signals into hidden units generated from acoustic unit discovery, we propose WavEmbed, a multimodal sequential autoencoder that predicts hidden units from a dense representation of speech. We further propose S-HuBERT, which induces meaning through knowledge distillation: a sentence embedding model is first trained on hidden units and then passes its knowledge to a speech encoder through contrastive learning. The best-performing model achieves a moderate correlation (0.5–0.6) with human judgments, without relying on any labels or transcriptions. Furthermore, these models can also be easily extended to leverage textual transcriptions of speech to learn much better speech embeddings that are strongly correlated with human annotations. Our proposed methods are applicable to the development of purely data-driven systems for speech mining, indexing and search.
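The contrastive-learning step of this kind of distillation can be sketched with an InfoNCE-style objective, where each speech embedding's positive is the sentence embedding of its own utterance (the diagonal of the similarity matrix). The numpy version below is illustrative only, not the paper's implementation.

```python
import numpy as np

def contrastive_loss(speech_emb, text_emb, tau=0.07):
    """InfoNCE-style loss: positives are the matching (diagonal) pairs."""
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / tau                      # (N, N) similarity matrix
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))       # cross-entropy on the diagonal

# The loss is lower when speech embeddings line up with their own sentences.
emb = np.eye(4)
matched = contrastive_loss(emb, emb)
mismatched = contrastive_loss(emb, emb[::-1])
```

Minimizing this loss pulls each speech embedding toward its own sentence embedding and away from the other utterances in the batch, which is how the sentence model's semantic space is transferred to the speech encoder.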
Building phonetic categories: an argument for the role of sleep
The current review provides specific predictions for the role of sleep-mediated memory consolidation in the formation of new speech sound representations. Specifically, this discussion will highlight selected literature on the different ideas concerning category representation in speech, followed by a broad overview of memory consolidation and how it relates to human behavior, as relevant to speech/perceptual learning. In combining behavioral and physiological accounts from animal models with insights from the human consolidation literature on auditory skill/word learning, we are in the early stages of understanding how the transfer of experiential information between brain structures during sleep manifests in changes to online perception. We conclude that this process is crucial in perceptual learning and the formation of novel categories, and speculate further that habitual disruption of this process leads to impoverished representations of speech sounds.
The brain’s conversation with itself: neural substrates of dialogic inner speech
Inner speech has been implicated in important aspects of normal and atypical cognition, including the development of auditory hallucinations. Studies to date have focused on covert speech elicited by simple word or sentence repetition, while ignoring richer and arguably more psychologically significant varieties of inner speech. This study compared neural activation for inner speech involving conversations (‘dialogic inner speech’) with single-speaker scenarios (‘monologic inner speech’). Inner speech-related activation differences were then compared with activations relating to Theory-of-Mind (ToM) reasoning and visual perspective-taking in a conjunction design. Generation of dialogic (compared with monologic) scenarios was associated with a widespread bilateral network including left and right superior temporal gyri, precuneus, posterior cingulate and left inferior and medial frontal gyri. Activation associated with dialogic scenarios and ToM reasoning overlapped in areas of right posterior temporal cortex previously linked to mental state representation. Implications for understanding verbal cognition in typical and atypical populations are discussed.