64 research outputs found
Karaoker: Alignment-free singing voice synthesis with speech training data
Existing singing voice synthesis (SVS) models are usually trained on singing
data and depend on either error-prone time-alignment and duration features or
explicit music score information. In this paper, we propose Karaoker, a
multispeaker Tacotron-based model conditioned on voice characteristic features
that is trained exclusively on spoken data without requiring time-alignments.
Karaoker synthesizes singing voice following a multi-dimensional template
extracted from a source waveform of an unseen speaker/singer. The model is
jointly conditioned with a single deep convolutional encoder on continuous data
including pitch, intensity, harmonicity, formants, cepstral peak prominence and
octaves. We extend the text-to-speech training objective with feature
reconstruction, classification and speaker identification tasks that guide the
model to an accurate result. Besides multi-tasking, we also employ a
Wasserstein GAN training scheme as well as new losses on the acoustic model's
output to further refine the quality of the model. Comment: Submitted to INTERSPEECH 202
Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis
A large part of the expressive speech synthesis literature focuses on
learning prosodic representations of the speech signal which are then modeled
by a prior distribution during inference. In this paper, we compare different
prior architectures on the task of predicting phoneme-level prosodic
representations extracted with an unsupervised FVAE model. We use both
subjective and objective metrics to show that normalizing-flow-based prior
networks can result in more expressive speech at the cost of a slight drop in
quality. Furthermore, we show that the synthesized speech has higher
variability, for a given text, due to the nature of normalizing flows. We also
propose a Dynamical VAE model that can generate higher-quality speech, although
with decreased expressiveness and variability compared to the flow-based
models. Comment: Submitted to ICASSP 202
Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis
This paper proposes an Expressive Speech Synthesis model that utilizes
token-level latent prosodic variables in order to capture and control
utterance-level attributes, such as character acting voice and speaking style.
Current works aim to explicitly factorize such fine-grained and utterance-level
speech attributes into different representations extracted by modules that
operate in the corresponding level. We show that the fine-grained latent space
also captures coarse-grained information, which is more evident as the
dimension of latent space increases in order to capture diverse prosodic
representations. Therefore, a trade-off arises between the diversity of the
token-level and utterance-level representations and their disentanglement. We
alleviate this issue by first capturing rich speech attributes in a
token-level latent space and then separately training a prior network that,
given the input text, learns utterance-level representations in order to
predict the phoneme-level posterior latents extracted during the previous step.
Both qualitative and quantitative evaluations are used to demonstrate the
effectiveness of the proposed approach. Audio samples are available on our demo
page. Comment: Submitted to ICASSP 202
Fine-grained Noise Control for Multispeaker Speech Synthesis
A text-to-speech (TTS) model typically factorizes speech attributes such as
content, speaker and prosody into disentangled representations. Recent works aim
to additionally model the acoustic conditions explicitly, in order to
disentangle the primary speech factors, i.e. linguistic content, prosody and
timbre from any residual factors, such as recording conditions and background
noise. This paper proposes unsupervised, interpretable and fine-grained noise
and prosody modeling. We incorporate adversarial training, representation
bottleneck and utterance-to-frame modeling in order to learn frame-level noise
representations. To the same end, we perform fine-grained prosody modeling via
a Fully Hierarchical Variational AutoEncoder (FVAE) which additionally results
in more expressive speech synthesis. Comment: Accepted to INTERSPEECH 202
Self-supervised learning for robust voice cloning
Voice cloning is a difficult task which requires robust and informative
features incorporated in a high quality TTS system in order to effectively copy
an unseen speaker's voice. In our work, we utilize features learned in a
self-supervised framework via the Bootstrap Your Own Latent (BYOL) method,
which is shown to produce high quality speech representations when specific
audio augmentations are applied to the vanilla algorithm. We further extend the
augmentations in the training procedure to aid the resulting features to
capture the speaker identity and to make them robust to noise and acoustic
conditions. The learned features are used as pre-trained utterance-level
embeddings and as inputs to a Non-Attentive Tacotron based architecture, aiming
to achieve multispeaker speech synthesis without utilizing additional speaker
features. This method enables us to train our model on an unlabeled
multispeaker dataset as well as use unseen speaker embeddings to copy a
speaker's voice. Subjective and objective evaluations are used to validate the
proposed model, as well as its robustness to the acoustic conditions of the
target utterance. Comment: Accepted to INTERSPEECH 202
Will the US Economy Recover in 2010? A Minimal Spanning Tree Study
We calculated the cross correlations between the half-hourly time series of
the ten Dow Jones US economic sectors over the period February 2000 to August
2008, the two-year intervals 2002--2003, 2004--2005, 2008--2009, and also over
11 segments within the present financial crisis, to construct minimal spanning
trees (MSTs) of the US economy at the sector level. In all MSTs, a core-fringe
structure is found, with consumer goods, consumer services, and the industrials
consistently making up the core, and basic materials, oil and gas, healthcare,
telecommunications, and utilities residing predominantly on the fringe. More
importantly, we find that the MSTs can be classified into two distinct,
statistically robust, topologies: (i) star-like, with the industrials at the
center, associated with low-volatility economic growth; and (ii) chain-like,
associated with high-volatility economic crisis. Finally, we present
statistical evidence, based on the emergence of a star-like MST in Sep 2009,
and the MST staying robustly star-like throughout the Greek Debt Crisis, that
the US economy is on track to a recovery. Comment: elsarticle class, includes amsmath.sty, graphicx.sty and url.sty. 68
pages, 16 figures, 8 tables. Abridged version of the manuscript presented at
the Econophysics Colloquium 2010, incorporating reviewer comments
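A minimal sketch of how an MST can be built from cross correlations and checked for a star-like or chain-like shape. The abstract does not state the distance measure, so the commonly used mapping d = sqrt(2(1 - rho)) is an assumption here, as are the function names and the synthetic toy series (the sector labels are borrowed from the abstract only for readability; the numbers are not real market data).

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def minimal_spanning_tree(series):
    """Kruskal's algorithm on the assumed distance d = sqrt(2 * (1 - rho)).

    `series` maps a node name to its list of observations.
    Returns a list of (node_a, node_b, distance) edges.
    """
    names = sorted(series)
    edges = sorted(
        (math.sqrt(max(0.0, 2.0 * (1.0 - pearson(series[a], series[b])))), a, b)
        for a, b in combinations(names, 2)
    )
    parent = {name: name for name in names}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for dist, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two components, so keep it
            parent[root_a] = root_b
            tree.append((a, b, dist))
    return tree

def max_degree(tree):
    """Largest node degree: N - 1 for a star on N nodes, 2 for a chain."""
    degree = {}
    for a, b, _ in tree:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return max(degree.values())

# Four mutually orthogonal, zero-mean toy signals (synthetic).
w0 = [1, -1, 1, -1, 1, -1, 1, -1]
w1 = [1, 1, -1, -1, 1, 1, -1, -1]
w2 = [1, 1, 1, 1, -1, -1, -1, -1]
w3 = [1, -1, -1, 1, 1, -1, -1, 1]

def mix(a, b, weight=1.0):
    """Elementwise a + weight * b."""
    return [p + weight * q for p, q in zip(a, b)]

# Star-like toy case: every other series is the hub plus its own noise.
star = {
    "industrials": w0,
    "consumer goods": mix(w0, w1, 0.5),
    "consumer services": mix(w0, w2, 0.5),
    "basic materials": mix(w0, w3, 0.5),
}
# Chain-like toy case: each series shares a component only with its neighbour.
chain = {"a": w0, "b": mix(w0, w1), "c": mix(w1, w2), "d": mix(w2, w3)}

star_tree = minimal_spanning_tree(star)
chain_tree = minimal_spanning_tree(chain)
print(max_degree(star_tree), max_degree(chain_tree))
```

The maximum degree is only one crude indicator of topology; the statistical robustness tests mentioned in the abstract are not reproduced here.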
Radiotherapy combined with nivolumab or temozolomide for newly diagnosed glioblastoma with unmethylated MGMT promoter: An international randomized phase III trial
BACKGROUND: Addition of temozolomide (TMZ) to radiotherapy (RT) improves overall survival (OS) in patients with glioblastoma (GBM), but previous studies suggest that patients with tumors harboring an unmethylated MGMT promoter derive minimal benefit. The aim of this open-label, phase III CheckMate 498 study was to evaluate the efficacy of nivolumab (NIVO) + RT compared with TMZ + RT in newly diagnosed GBM with unmethylated MGMT promoter. METHODS: Patients were randomized 1:1 to standard RT (60 Gy) + NIVO (240 mg every 2 weeks for eight cycles, then 480 mg every 4 weeks) or RT + TMZ (75 mg/m2 daily during RT and 150-200 mg/m2/day 5/28 days during maintenance). The primary endpoint was OS. RESULTS: A total of 560 patients were randomized, 280 to each arm. Median OS (mOS) was 13.4 months (95% CI, 12.6 to 14.3) with NIVO + RT and 14.9 months (95% CI, 13.3 to 16.1) with TMZ + RT (hazard ratio [HR], 1.31; 95% CI, 1.09 to 1.58; P = .0037). Median progression-free survival was 6.0 months (95% CI, 5.7 to 6.2) with NIVO + RT and 6.2 months (95% CI, 5.9 to 6.7) with TMZ + RT (HR, 1.38; 95% CI, 1.15 to 1.65). Response rates were 7.8% (9/116) with NIVO + RT and 7.2% (8/111) with TMZ + RT; grade 3/4 treatment-related adverse event (TRAE) rates were 21.9% and 25.1%, and any-grade serious TRAE rates were 17.3% and 7.6%, respectively. CONCLUSIONS: The study did not meet the primary endpoint of improved OS; TMZ + RT demonstrated a longer mOS than NIVO + RT. No new safety signals were detected with NIVO in this study. The difference between the study treatment arms is consistent with the use of TMZ + RT as the standard of care for GBM. ClinicalTrials.gov NCT02617589
Liquidity risk in spot foreign exchange markets
SIGLE. Available from British Library Document Supply Centre-DSC:DXN038509 / BLDSC - British Library Document Supply Centre. GB, United Kingdom
EVIDENCE-BASED HEALTH PROMOTION: EXPLORING THE EVOLUTION OF THE EFFECTIVENESS OF SCHOOL-BASED ANTI-BULLYING INTERVENTIONS OVER TIME
The objectives of this thesis were to explore how the effectiveness of school-based anti-bullying interventions (SBABIs) evolves over time and to assess whether the medium-term or long-term effectiveness of SBABIs can be predicted from their short-term effectiveness. The first step was a literature review to understand the study designs and evaluation techniques that researchers have used to assess effectiveness. This review described the methodologies by which researchers collected evidence and drew conclusions about the effectiveness of their SBABIs. To address the thesis objectives, a collaborative project was established, named SET-Bullying (“Statistical modelling of the Effectiveness of school based anti-bullying interventions and Time”). The literature review was used to identify potentially eligible studies. After a call for collaboration was sent to the corresponding authors of these studies, the project included data from two of them: the DFE-SHEFFIELD study from the United Kingdom and the RESPEKT study from Norway. Both studies used pupils' self-reported frequencies of being bullied and of bullying others as the effectiveness measure, but used different instruments to elicit this information. The subsequent step of this thesis was therefore to harmonize the data from the two studies using polychoric principal components analysis, so that the same analysis could be performed on both. The data from both studies were analysed using mixed-effects models in order to take into account the hierarchical structure of the data (i.e. the responses of pupils from the same school may be more correlated with each other than with the responses of pupils from different schools) and its longitudinal structure (i.e. the same pupils are more likely to respond in a similar way across the repeated measurements of each study).
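As an illustration only, one common way to write a mixed-effects model with the two sources of clustering described above (pupils within schools, repeated measurements within pupils) is sketched below. The symbols, the choice of random intercepts, and the time-by-intervention interaction are assumptions made for this sketch; the abstract does not give the actual model specification.

```latex
% y_{tij}: harmonized bullying score of pupil i in school j at measurement t (assumed notation)
y_{tij} = \beta_0 + \beta_1\,\mathrm{time}_{t} + \beta_2\,\mathrm{intervention}_{j}
        + \beta_3\,(\mathrm{time}_{t} \times \mathrm{intervention}_{j})
        + u_{j} + v_{ij} + \varepsilon_{tij},
\qquad
u_{j} \sim \mathcal{N}(0, \sigma_u^2),\quad
v_{ij} \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{tij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)
```

In this form, the school-level intercept u_j captures the hierarchical correlation, the pupil-level intercept v_ij captures the longitudinal correlation, and beta_3 describes how the intervention effect changes over time.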
With regard to the primary objective of the thesis, it was observed that effectiveness (where present) may evolve either linearly or with a “delayed effect”, that is, minimal change in effectiveness over the first study measurements followed by a sharper change at the later measurements. This finding is only hypothesis-generating at this point. Should it be confirmed in future studies, it would have important implications for the design, implementation and evaluation of SBABIs. Regarding the secondary objective of this thesis, there were some preliminary findings suggesting that medium-term or long-term effectiveness can be predicted from short-term effectiveness. However, these predictions were in some cases highly variable. Future research should focus on making these predictions more accurate, to allow for dynamic and adaptable delivery of SBABIs. Doctorat en Santé Publique (Doctorate in Public Health)