FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency
Text-based speech editing (TSE) techniques are designed to enable users to
edit the output audio by modifying the input text transcript instead of the
audio itself. Despite much progress in neural network-based TSE techniques,
current approaches focus on reducing the difference between the generated
speech segment and the reference target in the editing region, ignoring the
local and global fluency of that segment within the context of the original
utterance. To maintain speech fluency, we propose a fluency-aware speech
editing model, termed \textit{FluentEditor}, which incorporates fluency-aware
training criteria into TSE training. Specifically, the \textit{acoustic
consistency constraint} encourages smooth transitions between the edited
region and its neighboring acoustic segments, consistent with the ground
truth, while the \textit{prosody consistency constraint} ensures that the
prosody attributes within the edited region remain consistent with the
overall style of the original utterance. The subjective and objective
experimental results on VCTK
demonstrate that our \textit{FluentEditor} outperforms all advanced baselines
in terms of naturalness and fluency. The audio samples and code are available
at \url{https://github.com/Ai-S2-Lab/FluentEditor}.
Comment: Submitted to ICASSP'2024
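The abstract describes the two constraints only at a high level. A minimal
PyTorch sketch of what such fluency-aware training criteria could look like,
assuming an L1 reconstruction term plus a boundary-delta term (acoustic
consistency) and an utterance-statistics term (prosody consistency); the
window size, feature choices, and weights are illustrative assumptions, not
the paper's actual formulation:

```python
import torch
import torch.nn.functional as F

def fluency_aware_loss(pred_mel, gt_mel, edit_mask,
                       lambda_ac=1.0, lambda_pc=1.0, win=5):
    """Hypothetical fluency-aware criterion for text-based speech editing.

    pred_mel, gt_mel: (B, T, n_mels) mel-spectrograms
    edit_mask: (B, T) boolean mask, True inside the edited region
    """
    # Base reconstruction loss on the edited region.
    recon = F.l1_loss(pred_mel[edit_mask], gt_mel[edit_mask])

    # Acoustic consistency (assumed form): penalize mismatch of
    # frame-to-frame deltas around the edit boundaries so the transition
    # into and out of the edited region stays smooth.
    pred_delta = pred_mel[:, 1:] - pred_mel[:, :-1]
    gt_delta = gt_mel[:, 1:] - gt_mel[:, :-1]
    # Boundary frames are where the edit mask flips between neighbors.
    boundary = edit_mask[:, 1:] ^ edit_mask[:, :-1]          # (B, T-1)
    # Dilate the boundary by `win` frames to cover a transition window.
    boundary = F.max_pool1d(boundary.float().unsqueeze(1),
                            kernel_size=2 * win + 1, stride=1,
                            padding=win).squeeze(1).bool()
    ac_loss = F.l1_loss(pred_delta[boundary], gt_delta[boundary])

    # Prosody consistency (assumed form): match mean/std of the edited
    # frames to the statistics of the whole original utterance as a
    # crude proxy for its overall style.
    edited = pred_mel[edit_mask]                              # (N, n_mels)
    ref = gt_mel.flatten(0, 1)                                # (B*T, n_mels)
    pc_loss = ((edited.mean(0) - ref.mean(0)).abs().mean()
               + (edited.std(0) - ref.std(0)).abs().mean())

    return recon + lambda_ac * ac_loss + lambda_pc * pc_loss
```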
When Online Auction Meets Virtual Reality: An Empirical Investigation
The online auction is becoming increasingly popular in e-commerce: it allows a product to be sold to the buyer with the highest bid. However, the lack of authentic product details for thorough evaluation still poses challenges to its success. Recently, virtual reality (VR) has been introduced to online auctions. We employ a unique dataset to investigate the effects of VR on auction outcomes and bidding activities. Results show that VR enhances buyers’ bidding competition, which in turn increases auction success and price, resulting in a competitive effect. Additionally, we find that VR boosts buyers’ strategic responses to the bidding war, leading to a late-bidding effect. These findings contribute to both the theory and practice of VR and online auctions in selling houses.
Validity-Preserving Delta Debugging via Generator
Reducing test inputs that trigger bugs is crucial for efficient debugging.
Delta debugging is the most popular approach for this purpose. When test inputs
need to conform to certain specifications, existing delta debugging practice
encounters a validity problem: it blindly applies reduction rules, producing a
large number of invalid test inputs that do not satisfy the required
specifications. This diminishes overall effectiveness and efficiency, a
problem that becomes even more pronounced when the specifications extend
beyond syntactic structures. Our key insight is that we should leverage
input generators, which
are aware of these specifications, to generate valid reduced inputs, rather
than straightforwardly performing reduction on test inputs. In this paper, we
propose a generator-based delta debugging method, namely GReduce, which derives
validity-preserving reducers. Specifically, given a generator and its
execution, demonstrating how the bug-inducing test input is generated, GReduce
searches for other executions on the generator that yield reduced, valid test
inputs. To evaluate the effectiveness, efficiency, and versatility of GReduce,
we apply GReduce and the state-of-the-art reducer Perses in three domains:
graphs, deep learning models, and JavaScript programs. The inputs reduced by
GReduce are 28.5%, 34.6%, and 75.6% of the size of those produced by Perses,
and GReduce takes 17.5%, 0.6%, and 65.4% of the time taken by Perses.
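GReduce's key move is to search over executions of the generator rather than
to mutate the test input directly. A toy Python sketch of that idea, assuming
a generator driven by a recorded sequence of random choices and a stand-in
bug oracle; the generator, oracle, and greedy loop are illustrative
assumptions, not GReduce's actual algorithm or interface:

```python
import random

def generate(choices):
    """Toy spec-aware generator: builds a list of even numbers.
    Replaying any (modified) choice sequence always yields a *valid*
    input by construction -- evenness is the 'specification' here."""
    return [2 * c for c in choices]

def triggers_bug(inp):
    # Stand-in bug oracle: the system under test fails whenever
    # the value 42 appears in the input.
    return 42 in inp

def reduce_choices(choices):
    """Greedy reduction over the *generator execution* (the choice
    sequence), not over the test input itself, so every candidate
    input is valid by construction."""
    changed = True
    while changed:
        changed = False
        for i in range(len(choices)):
            candidate = choices[:i] + choices[i + 1:]  # drop one decision
            if triggers_bug(generate(candidate)):
                choices = candidate
                changed = True
                break
    return choices

choices = [random.randrange(100) for _ in range(20)] + [21]  # 2*21 == 42
reduced = reduce_choices(choices)
print(generate(reduced))  # minimal valid input still triggering the bug: [42]
```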
FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models
Stutter removal is an essential scenario in the field of speech editing.
However, when the speech recording contains stutters, existing text-based
speech editing approaches still suffer from: 1) over-smoothing in the edited
speech; 2) a lack of robustness due to the noise introduced by stuttering;
and 3) the need for users to manually determine the region to edit in order
to remove the stutters. To tackle these challenges, we propose
FluentSpeech, a stutter-oriented automatic speech editing model. Specifically,
1) we propose a context-aware diffusion model that iteratively refines the
modified mel-spectrogram with the guidance of context features; 2) we introduce
a stutter predictor module to inject the stutter information into the hidden
sequence; 3) we also propose a stutter-oriented automatic speech editing (SASE)
dataset that contains spontaneous speech recordings with time-aligned stutter
labels to train the automatic stutter localization model. Experimental results
on VCTK and LibriTTS datasets demonstrate that our model achieves
state-of-the-art performance on speech editing. Further experiments on our SASE
dataset show that FluentSpeech can effectively improve the fluency of
stuttering speech in terms of objective and subjective metrics. Code and audio
samples can be found at https://github.com/Zain-Jiang/Speech-Editing-Toolkit.
Comment: Accepted by ACL 2023 (Findings)
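A minimal sketch of the general pattern the abstract describes, a
context-conditioned diffusion model that iteratively refines only the edited
mel-spectrogram region; the layer sizes, conditioning scheme, and update rule
are illustrative assumptions, not FluentSpeech's actual architecture:

```python
import torch
import torch.nn as nn

class ContextAwareDenoiser(nn.Module):
    """Toy context-conditioned denoiser: predicts the noise added to the
    edited mel region, conditioned on per-frame context features."""
    def __init__(self, n_mels=80, d_cond=256, d_hidden=256):
        super().__init__()
        self.in_proj = nn.Linear(n_mels + d_cond + 1, d_hidden)
        self.backbone = nn.GRU(d_hidden, d_hidden, batch_first=True)
        self.out_proj = nn.Linear(d_hidden, n_mels)

    def forward(self, noisy_mel, context, t):
        # noisy_mel: (B, T, n_mels); context: (B, T, d_cond); t: (B,) step
        t_embed = t.float().view(-1, 1, 1).expand(-1, noisy_mel.size(1), 1)
        h, _ = self.backbone(
            self.in_proj(torch.cat([noisy_mel, context, t_embed], dim=-1)))
        return self.out_proj(h)  # predicted noise, (B, T, n_mels)

@torch.no_grad()
def edit_region(model, mel, context, edit_mask, steps=50):
    """Iteratively refine only the edited frames; unedited frames are
    clamped to the original recording at every step, the standard
    infilling trick for diffusion-based editing."""
    x = torch.randn_like(mel)
    for t in reversed(range(steps)):
        eps = model(x, context, torch.full((mel.size(0),), t))
        x = x - eps / steps              # crude, illustrative update rule
        x[~edit_mask] = mel[~edit_mask]  # keep context frames fixed
    return x
```

In this framing, the abstract's stutter predictor would supply `edit_mask`
automatically instead of requiring the user to mark the region by hand.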
Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech
Polyphone disambiguation aims to capture accurate pronunciation knowledge
from natural text sequences for reliable Text-to-speech (TTS) systems. However,
previous approaches require substantial annotated training data and additional
efforts from language experts, making it difficult to extend high-quality
neural TTS systems to out-of-domain daily conversations and countless languages
worldwide. This paper tackles the polyphone disambiguation problem from a
concise and novel perspective: we propose Dict-TTS, a semantic-aware generative
text-to-speech model that uses an online dictionary, an existing source of
prior pronunciation knowledge in natural language. Specifically, we design a
semantics-to-pronunciation attention (S2PA) module to match the semantic
patterns between the input text sequence and the prior semantics in the
dictionary and to obtain the corresponding pronunciations; the S2PA module
can be easily trained with the end-to-end TTS model without any annotated phoneme
labels. Experimental results in three languages show that our model outperforms
several strong baseline models in terms of pronunciation accuracy and improves
the prosody modeling of TTS systems. Further extensive analyses demonstrate
that each design in Dict-TTS is effective. The code is available at
\url{https://github.com/Zain-Jiang/Dict-TTS}.
Comment: Accepted by NeurIPS 2022
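A minimal sketch of what a semantics-to-pronunciation attention module could
look like in PyTorch: each token attends over the semantic embeddings of its
candidate dictionary senses, and the attention weights softly select among
the matching pronunciations, so gradients flow end-to-end without phoneme
labels. The dimensions and the soft-selection scheme are assumptions, not
the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class S2PA(nn.Module):
    """Illustrative semantics-to-pronunciation attention."""
    def __init__(self, d_model=256):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)

    def forward(self, token_h, entry_sem, entry_pron):
        # token_h:    (B, T, d)    encoder states of the text sequence
        # entry_sem:  (B, T, E, d) semantic embeddings of E dictionary senses
        # entry_pron: (B, T, E, d) embeddings of the matching pronunciations
        q = self.q(token_h).unsqueeze(2)              # (B, T, 1, d)
        k = self.k(entry_sem)                         # (B, T, E, d)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5  # (B, T, E)
        attn = F.softmax(scores, dim=-1)
        # Soft pronunciation selection: differentiable, so it trains
        # jointly with the TTS model without annotated phoneme labels.
        return (attn.unsqueeze(-1) * entry_pron).sum(2)  # (B, T, d)
```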