64 research outputs found

    Karaoker: Alignment-free singing voice synthesis with speech training data

    Existing singing voice synthesis (SVS) models are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time-alignments. Karaoker synthesizes singing voice following a multi-dimensional template extracted from a source waveform of an unseen speaker/singer. The model is jointly conditioned with a single deep convolutional encoder on continuous data including pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves. We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model to an accurate result. In addition to multi-tasking, we also employ a Wasserstein GAN training scheme as well as new losses on the acoustic model's output to further refine the quality of the model. Comment: Submitted to INTERSPEECH 202

    Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis

    A large part of the expressive speech synthesis literature focuses on learning prosodic representations of the speech signal, which are then modeled by a prior distribution during inference. In this paper, we compare different prior architectures on the task of predicting phoneme-level prosodic representations extracted with an unsupervised FVAE model. We use both subjective and objective metrics to show that normalizing-flow-based prior networks can result in more expressive speech at the cost of a slight drop in quality. Furthermore, we show that the synthesized speech has higher variability for a given text, due to the nature of normalizing flows. We also propose a Dynamical VAE model that can generate higher-quality speech, although with decreased expressiveness and variability compared to the flow-based models. Comment: Submitted to ICASSP 202

    Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis

    This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitly factorize such fine-grained and utterance-level speech attributes into different representations extracted by modules that operate at the corresponding level. We show that the fine-grained latent space also captures coarse-grained information, which is more evident as the dimension of the latent space increases in order to capture diverse prosodic representations. Therefore, a trade-off arises between the diversity of the token-level and utterance-level representations and their disentanglement. We alleviate this issue by first capturing rich speech attributes in a token-level latent space and then separately training a prior network that, given the input text, learns utterance-level representations in order to predict the phoneme-level posterior latents extracted during the previous step. Both qualitative and quantitative evaluations are used to demonstrate the effectiveness of the proposed approach. Audio samples are available on our demo page. Comment: Submitted to ICASSP 202

    Fine-grained Noise Control for Multispeaker Speech Synthesis

    A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations. Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors, i.e. linguistic content, prosody and timbre, from any residual factors, such as recording conditions and background noise. This paper proposes unsupervised, interpretable and fine-grained noise and prosody modeling. We incorporate adversarial training, representation bottleneck and utterance-to-frame modeling in order to learn frame-level noise representations. To the same end, we perform fine-grained prosody modeling via a Fully Hierarchical Variational AutoEncoder (FVAE), which additionally results in more expressive speech synthesis. Comment: Accepted to INTERSPEECH 202

    Self-supervised learning for robust voice cloning

    Voice cloning is a difficult task which requires robust and informative features incorporated in a high-quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high-quality speech representations when specific audio augmentations are applied to the vanilla algorithm. We further extend the augmentations in the training procedure to help the resulting features capture the speaker identity and to make them robust to noise and acoustic conditions. The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture, aiming to achieve multispeaker speech synthesis without utilizing additional speaker features. This method enables us to train our model on an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice. Subjective and objective evaluations are used to validate the proposed model, as well as its robustness to the acoustic conditions of the target utterance. Comment: Accepted to INTERSPEECH 202

    Will the US Economy Recover in 2010? A Minimal Spanning Tree Study

    We calculated the cross correlations between the half-hourly time series of the ten Dow Jones US economic sectors over the period February 2000 to August 2008, the two-year intervals 2002--2003, 2004--2005, 2008--2009, and also over 11 segments within the present financial crisis, to construct minimal spanning trees (MSTs) of the US economy at the sector level. In all MSTs, a core-fringe structure is found, with consumer goods, consumer services, and the industrials consistently making up the core, and basic materials, oil and gas, healthcare, telecommunications, and utilities residing predominantly on the fringe. More importantly, we find that the MSTs can be classified into two distinct, statistically robust topologies: (i) star-like, with the industrials at the center, associated with low-volatility economic growth; and (ii) chain-like, associated with high-volatility economic crisis. Finally, we present statistical evidence, based on the emergence of a star-like MST in Sep 2009, and the MST staying robustly star-like throughout the Greek Debt Crisis, that the US economy is on track to a recovery. Comment: elsarticle class, includes amsmath.sty, graphicx.sty and url.sty. 68 pages, 16 figures, 8 tables. Abridged version of the manuscript presented at the Econophysics Colloquium 2010, incorporating reviewer comment

    Radiotherapy combined with nivolumab or temozolomide for newly diagnosed glioblastoma with unmethylated MGMT promoter: An international randomized phase III trial

    BACKGROUND: Addition of temozolomide (TMZ) to radiotherapy (RT) improves overall survival (OS) in patients with glioblastoma (GBM), but previous studies suggest that patients with tumors harboring an unmethylated MGMT promoter derive minimal benefit. The aim of this open-label, phase III CheckMate 498 study was to evaluate the efficacy of nivolumab (NIVO) + RT compared with TMZ + RT in newly diagnosed GBM with unmethylated MGMT promoter. METHODS: Patients were randomized 1:1 to standard RT (60 Gy) + NIVO (240 mg every 2 weeks for eight cycles, then 480 mg every 4 weeks) or RT + TMZ (75 mg/m² daily during RT and 150-200 mg/m²/day 5/28 days during maintenance). The primary endpoint was OS. RESULTS: A total of 560 patients were randomized, 280 to each arm. Median OS (mOS) was 13.4 months (95% CI, 12.6 to 14.3) with NIVO + RT and 14.9 months (95% CI, 13.3 to 16.1) with TMZ + RT (hazard ratio [HR], 1.31; 95% CI, 1.09 to 1.58; P = .0037). Median progression-free survival was 6.0 months (95% CI, 5.7 to 6.2) with NIVO + RT and 6.2 months (95% CI, 5.9 to 6.7) with TMZ + RT (HR, 1.38; 95% CI, 1.15 to 1.65). Response rates were 7.8% (9/116) with NIVO + RT and 7.2% (8/111) with TMZ + RT; grade 3/4 treatment-related adverse event (TRAE) rates were 21.9% and 25.1%, and any-grade serious TRAE rates were 17.3% and 7.6%, respectively. CONCLUSIONS: The study did not meet the primary endpoint of improved OS; TMZ + RT demonstrated a longer mOS than NIVO + RT. No new safety signals were detected with NIVO in this study. The difference between the study treatment arms is consistent with the use of TMZ + RT as the standard of care for GBM. ClinicalTrials.gov NCT02617589

    Liquidity risk in spot foreign exchange markets

    SIGLE. Available from British Library Document Supply Centre (DSC): DXN038509 / BLDSC - British Library Document Supply Centre. GB, United Kingdom

    EVIDENCE-BASED HEALTH PROMOTION: EXPLORING THE EVOLUTION OF THE EFFECTIVENESS OF SCHOOL-BASED ANTI-BULLYING INTERVENTIONS OVER TIME

    The objectives of this thesis were to explore how the effectiveness of school-based anti-bullying interventions (SBABIs) evolves over time and to assess whether the medium-term or long-term effectiveness of SBABIs can be predicted from their short-term effectiveness. The first step was a literature review to understand the study designs and evaluation techniques that researchers used to assess effectiveness. This literature review described the methodologies with which researchers collected evidence and drew conclusions about the effectiveness of their SBABIs. To address the thesis objectives, a collaborative project was established, named SET-Bullying (“Statistical modelling of the Effectiveness of school based anti-bullying interventions and Time”). The above-mentioned literature review was used to identify potentially eligible studies. After a call for collaboration was addressed to the corresponding authors of these studies, the project included data from two of them, the DFE-SHEFFIELD study from the United Kingdom and the RESPEKT study from Norway. Both studies used pupil self-reported frequencies of being bullied and of bullying others as an effectiveness measure, but used different instruments to elicit this information. Thus, the subsequent step of this thesis was to harmonize the data from these studies using polychoric principal components analysis, so that the same analysis could be performed on the data from both studies. The data from both studies were analysed using mixed-effects models in order to take into account the hierarchical structure (i.e. the responses of pupils from the same school may be more correlated with each other than the responses of pupils from different schools) and the longitudinal structure (i.e. the same pupils are more likely to respond in a similar way across the repeated measurements of each study) of the data.
    With regard to the primary objective of the thesis, it was observed that effectiveness (where it is observed) may evolve either in a linear fashion or with a “delayed effect”. The latter refers to a minimal evolution of effectiveness over the first study measurements and a sharper evolution at the later study measurements. This finding is only hypothesis-generating at this point. Should it be confirmed in future studies, it would have important implications for the design, implementation and evaluation of SBABIs. With regard to the secondary objective of this thesis, there were some preliminary findings on the possibility of predicting the medium-term or long-term effectiveness from the short-term effectiveness. However, these predictions in some cases seemed to be very variable. Future research should focus on making these predictions more accurate, so as to allow for dynamic and adaptable delivery of SBABIs. Doctorate in Public Health