Automatic Measurement of Pre-aspiration
Pre-aspiration is defined as the period of glottal friction occurring in
sequences of vocalic/consonantal sonorants and phonetically voiceless
obstruents. We propose two machine learning methods for automatic measurement
of pre-aspiration duration: a feedforward neural network that works at the
frame level, and a structured prediction model that relies on manually
designed feature functions and works at the segment level. The input for both
algorithms is a speech signal of an arbitrary length containing a single
obstruent, and the output is a pair of time points marking the
pre-aspiration boundaries. We train both models on a set of manually annotated
examples. Results suggest that the structured model is superior to the
frame-based model as it yields higher accuracy in predicting the boundaries and
generalizes to new speakers and new languages. Finally, we demonstrate the
applicability of our structured prediction algorithm by replicating a linguistic
analysis of pre-aspiration in Aberystwyth English with high correlation.
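For intuition, the frame-based variant can be sketched as a per-frame classifier whose decisions are collapsed into a single boundary pair. The sketch below is illustrative only: the 40-dimensional frame features, the 10 ms frame shift, and all names are assumptions, not the paper's code.

```python
import numpy as np
import torch
import torch.nn as nn

class FrameClassifier(nn.Module):
    """Feedforward network labeling each frame as pre-aspiration or not."""
    def __init__(self, n_feats=40, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_feats, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),   # logits for {other, pre-aspiration}
        )

    def forward(self, frames):      # frames: (T, n_feats) acoustic features
        return self.net(frames)     # (T, 2) per-frame logits

def boundaries_from_logits(logits, frame_shift_ms=10.0):
    """Collapse per-frame decisions into one (onset, offset) pair in ms."""
    labels = logits.argmax(dim=-1).cpu().numpy()
    idx = np.flatnonzero(labels == 1)
    if idx.size == 0:
        return None                 # no pre-aspiration detected
    return idx[0] * frame_shift_ms, (idx[-1] + 1) * frame_shift_ms
```

The segment-level structured model instead scores candidate boundary pairs directly, which is what the abstract credits with the better generalization.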
Speaking Style Conversion in the Waveform Domain Using Discrete Self-Supervised Units
We introduce DISSC, a novel, lightweight method that converts the rhythm,
pitch contour and timbre of a recording to a target speaker in a textless
manner. Unlike DISSC, most voice conversion (VC) methods focus primarily on
timbre, and ignore people's unique speaking style (prosody). The proposed
approach uses a pretrained, self-supervised model for encoding speech to
discrete units, which makes it simple, effective, and fast to train. All
conversion modules are trained only on reconstruction-like tasks, making the
approach suitable for any-to-many VC with no paired data. We introduce a suite of quantitative
and qualitative evaluation metrics for this setup, and empirically demonstrate
that DISSC significantly outperforms the evaluated baselines. Code and samples
are available at https://pages.cs.huji.ac.il/adiyoss-lab/dissc/.
Comment: Accepted at EMNLP 202
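As a rough illustration of the unit-based pipeline, the sketch below encodes speech into discrete units with a pretrained self-supervised model and run-length deduplicates them, which discards the source rhythm so that durations can be re-predicted for the target speaker. Using torchaudio's HuBERT and an externally fitted k-means quantizer is an assumption for the example, not necessarily DISSC's exact components.

```python
import torch
import torchaudio

# Pretrained self-supervised encoder (assumed choice for this sketch).
bundle = torchaudio.pipelines.HUBERT_BASE
encoder = bundle.get_model().eval()

def speech_to_units(wav, kmeans):
    """Encode a (1, num_samples) waveform into deduplicated discrete units.

    `kmeans` is assumed to be a quantizer fitted on encoder features
    (e.g., scikit-learn KMeans) with a `predict` method.
    """
    with torch.no_grad():
        feats, _ = encoder.extract_features(wav)
    units = kmeans.predict(feats[-1].squeeze(0).numpy())  # frame cluster ids
    # Run-length deduplication: drops the source rhythm, keeping only the
    # sequence of content units whose durations are later re-predicted.
    return [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
```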
I Hear Your True Colors: Image Guided Audio Generation
We propose Im2Wav, an image guided open-domain audio generation system. Given
an input image or a sequence of images, Im2Wav generates a semantically
relevant sound. Im2Wav is based on two Transformer language models that
operate over a hierarchical discrete audio representation obtained from a
VQ-VAE based model. We first produce a low-level audio representation using a
language model. Then, we upsample the audio tokens using an additional language
model to generate a high-fidelity audio sample. We use the rich semantics of a
pre-trained CLIP embedding as a visual representation to condition the language
model. In addition, to steer the generation process towards the conditioning
image, we apply the classifier-free guidance method. Results suggest that
Im2Wav significantly outperforms the evaluated baselines in both fidelity and
relevance evaluation metrics. Additionally, we provide an ablation study to
better assess the impact of each of the method's components on overall
performance. Lastly, to better evaluate image-to-audio models, we propose an
out-of-domain image dataset, denoted as ImageHear. ImageHear can be used as a
benchmark for evaluating future image-to-audio models. Samples and code can be
found inside the manuscript.
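The classifier-free guidance step mentioned above can be illustrated in a few lines: the token language model is run with and without the CLIP image embedding, and the two sets of logits are blended. The `lm` interface and the guidance scale below are illustrative assumptions, not Im2Wav's actual API.

```python
import torch

def guided_logits(lm, tokens, clip_emb, scale=3.0):
    """Classifier-free guidance over next-token logits.

    `lm` is assumed to be a token language model taking an optional
    conditioning vector; `scale` > 0 pushes sampling toward the image.
    """
    cond = lm(tokens, cond=clip_emb)   # logits conditioned on the CLIP embedding
    uncond = lm(tokens, cond=None)     # unconditional (null-conditioned) logits
    return uncond + scale * (cond - uncond)
```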
Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation
We propose a self-supervised representation learning model for the task of
unsupervised phoneme boundary detection. The model is a convolutional neural
network that operates directly on the raw waveform. It is optimized to identify
spectral changes in the signal using the Noise-Contrastive Estimation
principle. At test time, a peak detection algorithm is applied over the model
outputs to produce the final boundaries. As such, the proposed model is trained
in a fully unsupervised manner, with no manual annotations in the form of target
boundaries or phonetic transcriptions. We compare the proposed approach to
several unsupervised baselines on both the TIMIT and Buckeye corpora. Results
suggest that our approach surpasses the baseline models and reaches
state-of-the-art performance on both data sets. Furthermore, we experimented
with expanding the training set with additional examples from the Librispeech
corpus. We evaluated the resulting model on distributions and languages that
were not seen during the training phase (English, Hebrew and German) and showed
that utilizing additional untranscribed data is beneficial for model
performance.
Comment: Interspeech 2020 paper
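Conceptually, the test-time boundary extraction reduces to peak-picking over a framewise dissimilarity curve. The sketch below uses cosine dissimilarity between adjacent encoder frames and SciPy's peak detector; the parameter values are illustrative, not the paper's tuned settings.

```python
import numpy as np
from scipy.signal import find_peaks

def boundaries_from_embeddings(z, prominence=0.1, frame_shift_ms=10.0):
    """Peak-pick phoneme boundaries from frame embeddings.

    z: (T, D) array of encoder outputs for consecutive frames.
    Returns boundary times in milliseconds.
    """
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    dissim = 1.0 - np.sum(z[:-1] * z[1:], axis=1)   # high where the spectrum changes
    peaks, _ = find_peaks(dissim, prominence=prominence)
    return (peaks + 1) * frame_shift_ms
```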