A Fully Time-domain Neural Model for Subband-based Speech Synthesizer
This paper introduces a deep neural network model for a subband-based speech
synthesizer. The model exploits the narrow bandwidth of the subband signals
to reduce the complexity of the time-domain speech generator. We employ
multi-level wavelet analysis/synthesis to decompose the signal into subbands
and reconstruct it in the time domain. Inspired by WaveNet, a convolutional neural
network (CNN) model predicts the subband speech signals fully in the time domain.
Because the subbands have a narrow bandwidth, a simple network architecture is enough
to learn their simple patterns accurately. In the ground-truth
experiments with teacher forcing, the subband synthesizer significantly
outperforms the fullband model in both subjective and objective
measures. In addition, by conditioning the model on the phoneme sequence using
a pronunciation dictionary, we achieve a fully time-domain neural model
for a subband-based text-to-speech (TTS) synthesizer, which is nearly end-to-end.
The speech generated by the subband TTS is comparable in quality to that of the
fullband one, with a lighter network architecture for each subband.
Comment: 5 pages, 3 figur
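The multi-level wavelet analysis/synthesis step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which wavelet the authors used; the Haar wavelet below is an assumption chosen for brevity, and the function names (`haar_analysis`, `decompose`, and so on) are hypothetical. The sketch only shows that a signal split into subbands can be reconstructed exactly in the time domain.

```python
import numpy as np

def haar_analysis(x):
    """One Haar analysis level: split an even-length signal into low/high subbands."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)   # coarse approximation (lower half-band)
    high = (even - odd) / np.sqrt(2.0)  # detail (upper half-band)
    return low, high

def haar_synthesis(low, high):
    """Inverse of haar_analysis: interleave the recovered even and odd samples."""
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2.0)
    x[1::2] = (low - high) / np.sqrt(2.0)
    return x

def decompose(x, levels):
    """Multi-level analysis. Returns [high_1, ..., high_L, low_L]."""
    subbands = []
    low = np.asarray(x, dtype=float)
    for _ in range(levels):
        low, high = haar_analysis(low)
        subbands.append(high)
    subbands.append(low)
    return subbands

def reconstruct(subbands):
    """Multi-level synthesis: undo decompose() starting from the coarsest level."""
    low = subbands[-1]
    for high in reversed(subbands[:-1]):
        low = haar_synthesis(low, high)
    return low

# Toy signal whose length is divisible by 2**levels (assumed for simplicity).
rng = np.random.default_rng(0)
signal = rng.standard_normal(16)
bands = decompose(signal, levels=3)
restored = reconstruct(bands)
```

Each subband is shorter than the original signal, which is the property the abstract credits for allowing a simpler per-subband network.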
Spoofing Detection in Voice Biometrics: Cochlear Modelling and Perceptually Motivated Features
Automatic speaker verification (ASV) is one of the most widely adopted biometric
technologies. However, ASV is vulnerable to spoofing attacks that can significantly affect its
reliability. Among the different variants of spoofing attacks, replay attacks pose a major threat as
they do not require any expert knowledge to implement and are difficult to detect. The primary focus
of this thesis is on understanding and developing biologically inspired models and techniques to
detect replay attacks.
This thesis develops a novel framework for implementing an active cochlear filter model as a frontend
spectral analyser for spoofing attack detection to leverage the remarkable sensitivity and
selectivity of the mammalian auditory system over a broad range of intensities and frequencies. In
particular, the developed model aims to mimic the active mechanism in the cochlea, enabling sharp
frequency tuning and level-dependent compression, which amplify and tune to low-energy signals
to make a broad dynamic range of signals audible. Experimental evaluations of the developed models
in the context of replay detection systems exhibit a significant performance improvement,
highlighting the potential benefits of the use of biologically inspired front ends.
In addition, since replay detection relies on discerning channel characteristics and the effect of
the acoustic environment, acoustic cues essential for speech perception such as amplitude- and
frequency-modulation (AM, FM) features are also investigated. Finally, to capture discriminative
cues present in the temporal domain, the temporal masking psychoacoustic phenomenon in auditory
processing is exploited, and the usefulness of the masking pattern is investigated. This led to a novel
feature parameterisation which helps improve replay attack detection.
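As a rough illustration of the amplitude- and frequency-modulation (AM, FM) cues mentioned above, the sketch below extracts an envelope and an instantaneous frequency from the analytic signal of a single band. This is a generic textbook approach, not the thesis's feature pipeline; the function names and the test tone are assumptions made for the example. In practice such features would typically be computed per filterbank channel.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal: zero the negative frequencies, double the positive ones."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep DC unchanged
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep the Nyquist bin unchanged
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def am_fm_features(x, sample_rate):
    """Return (AM envelope, FM instantaneous frequency in Hz) for a narrowband signal."""
    z = analytic_signal(x)
    envelope = np.abs(z)                          # amplitude modulation
    phase = np.unwrap(np.angle(z))
    inst_freq = np.diff(phase) * sample_rate / (2.0 * np.pi)  # frequency modulation
    return envelope, inst_freq

# Hypothetical test tone: 100 Hz cosine, 0.1 s at 8 kHz (an integer number of cycles).
sample_rate = 8000
t = np.arange(800) / sample_rate
tone = np.cos(2 * np.pi * 100 * t)
envelope, inst_freq = am_fm_features(tone, sample_rate)
```

For this unmodulated tone the envelope stays near 1 and the instantaneous frequency stays near 100 Hz; a replayed recording would be expected to perturb both tracks through the extra channel and room response.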
Modeling of Polish Intonation for Statistical-Parametric Speech Synthesis
Wydział Neofilologii (Faculty of Modern Languages)
This work presents an attempt to build a neurobiologically inspired, convolutional neural
network-based model of the mappings from discrete high-level linguistic categories to a
continuous fundamental frequency signal in Polish neutral read speech. After a brief
introduction to the research problem in the context of intonation, speech synthesis and the
phonetics-phonology gap, the work describes the training of the model on a special speech
corpus and an evaluation of the naturalness of the F0 contour produced by the trained model
through ABX and MOS perception experiments conducted with the help of a specially built
Neural Source Filter resynthesizer. Finally, an in-depth exploration of the
phonology-to-phonetics mappings learned by the model is presented; the Layer-wise Relevance
Propagation explainability method was used to perform an extensive quantitative analysis of
the relevance of 1297 specially engineered linguistic input features, and of
their groupings at various levels of abstraction, for the specific contours of the fundamental frequency.
The work ends with an in-depth interpretation of these results, a discussion of the advantages
and disadvantages of the method, and a list of potential future improvements.
The research presented in this work was partially carried out under the Harmonia research grant no. UMO-2014/14/M/HS2/00631 awarded by the National Science Centre (Narodowe Centrum Nauki).
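To make the relevance analysis more concrete, the following is a minimal sketch of the Layer-wise Relevance Propagation epsilon rule for a single dense layer. It is not the thesis's model, which is convolutional and uses 1297 input features; the layer sizes, the zero-bias assumption and the function name `lrp_epsilon_dense` are all illustrative assumptions.

```python
import numpy as np

def lrp_epsilon_dense(inputs, weights, relevance_out, eps=1e-9):
    """Redistribute output relevance to inputs of a bias-free dense layer (LRP-epsilon).

    inputs: shape (d_in,), weights: shape (d_in, d_out), relevance_out: shape (d_out,).
    """
    z = inputs @ weights                              # pre-activations of the layer
    z_stab = z + eps * np.where(z >= 0, 1.0, -1.0)    # stabiliser avoids division by zero
    s = relevance_out / z_stab
    # Each input receives relevance in proportion to its contribution inputs[i] * weights[i, j].
    return inputs * (weights @ s)

# Toy example: four binary "linguistic" input features, two output units.
rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 2))
inputs = np.array([1.0, 0.0, 1.0, 0.0])
output = inputs @ weights
relevance_in = lrp_epsilon_dense(inputs, weights, relevance_out=output)
```

With the output itself used as the starting relevance, the input relevances sum to approximately the same total (the conservation property), and inactive features receive zero relevance. Summing such scores over groups of features is one way to obtain the kind of grouped analysis the abstract describes.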