73 research outputs found
Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge
In this paper, we describe the systems developed by the SJTU X-LANCE team for
LIMMITS 2023 Challenge, and we mainly focus on the winning system on
naturalness for track 1. The aim of this challenge is to build a multi-speaker
multi-lingual text-to-speech (TTS) system for Marathi, Hindi and Telugu. Each
of the languages has a male and a female speaker in the given dataset. In track
1, only 5 hours of data from each speaker can be selected to train the TTS model.
Our system is based on the recently proposed VQTTS that utilizes VQ acoustic
features rather than mel-spectrograms. We introduce additional speaker embeddings
and language embeddings to VQTTS for controlling the speaker and language
information. In the cross-lingual evaluations where we need to synthesize
speech in a cross-lingual speaker's voice, we provide a native speaker's
embedding to the acoustic model and the target speaker's embedding to the
vocoder. In the subjective MOS listening test on naturalness, our system
achieves 4.77, which ranks first.
Comment: Accepted by ICASSP 2023 Special Session for Grand Challenge
Acoustic BPE for Speech Generation with Discrete Tokens
Discrete audio tokens derived from self-supervised learning models have
gained widespread usage in speech generation. However, the current practice of
directly utilizing audio tokens poses challenges for sequence modeling due to
the length of the token sequence. Additionally, this approach places the burden
on the model to establish correlations between tokens, further complicating the
modeling process. To address this issue, we propose acoustic BPE which encodes
frequent audio token patterns by utilizing byte-pair encoding. Acoustic BPE
effectively reduces the sequence length and leverages the prior morphological
information present in token sequence, which alleviates the modeling challenges
of token correlation. Through comprehensive investigations on a speech language
model trained with acoustic BPE, we confirm the notable advantages it offers,
including faster inference and improved syntax capturing capabilities. In
addition, we propose a novel rescore method to select the optimal synthetic
speech among multiple candidates generated by a rich-diversity TTS system.
Experiments prove that rescore selection aligns closely with human preference,
which highlights acoustic BPE's potential for other speech generation tasks.
Comment: 5 pages, 2 figures; accepted to ICASSP 202
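The core idea in the abstract above, applying byte-pair encoding to discrete audio tokens, can be illustrated with a minimal sketch. It assumes the audio tokens are plain integer IDs; the function names, the `first_new_id` parameter, and the stopping rule are illustrative choices, not the authors' implementation.

```python
from collections import Counter


def apply_merge(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges, first_new_id):
    """Learn up to `num_merges` merges over sequences of integer token IDs.

    `first_new_id` (hypothetical parameter) must be larger than any base
    token ID so merged units do not collide with the original vocabulary.
    """
    seqs = [list(s) for s in sequences]
    merges = []  # list of ((left, right), new_id), in the order learned
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))  # adjacent token pairs
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:  # nothing repeats any more, so stop early
            break
        new_id = first_new_id + len(merges)
        merges.append((best_pair, new_id))
        seqs = [apply_merge(seq, best_pair, new_id) for seq in seqs]
    return merges, seqs


def encode(seq, merges):
    """Shorten a new token sequence by replaying the learned merges in order."""
    seq = list(seq)
    for pair, new_id in merges:
        seq = apply_merge(seq, pair, new_id)
    return seq
```

Each merge replaces a frequent pair of adjacent tokens with a single new unit, so the encoded sequence is never longer than the original. This shortening is the effect the abstract attributes to acoustic BPE.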
Effects of Climate Change and Human Activities on Surface Runoff in the Luan River Basin
Quantifying the effects of climate change and human activities on runoff changes is the focus of climate change and hydrological research. This paper presents an integrated method employing the Budyko-based Fu model, hydrological modeling, and climate elasticity approaches to separate the effects of the two driving factors on surface runoff in the Luan River basin, China. The Budyko-based Fu model and the double mass curve method are used to analyze runoff changes during the period 1958~2009. Then two types of hydrological models (the distributed Soil and Water Assessment Tool model and the lumped SIMHYD model) and seven climate elasticity methods (including a nonparametric method and six Budyko-based methods) are applied to estimate the contributions of climate change and human activities to runoff change. The results show that all quantification methods are effective, and the results obtained by the nine methods are generally consistent. During the study period, climate change accounted for 28.3~46.8% of the runoff change, while human activities accounted for 53.2~71.7%, indicating that both factors have significant effects on the runoff decline in the basin, and that the effects of human activities are relatively stronger than those of climate change.
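As a rough illustration of the Budyko-based Fu model named in the abstract above, the sketch below evaluates Fu's equation, E/P = 1 + PET/P - [1 + (PET/P)^w]^(1/w), and a simple additive split of a runoff change into climate and human shares. The function names and the additive split are assumptions made for illustration; the abstract does not specify the paper's exact attribution procedure.

```python
def fu_evaporation_ratio(aridity_index, omega):
    """Fu's equation for the long-term evaporation ratio E/P.

    aridity_index is PET/P; omega is the catchment parameter (> 1).
    """
    phi = aridity_index
    return 1.0 + phi - (1.0 + phi ** omega) ** (1.0 / omega)


def fu_runoff(precip, pet, omega):
    """Long-term mean runoff Q = P - E, neglecting storage change."""
    return precip * (1.0 - fu_evaporation_ratio(pet / precip, omega))


def split_runoff_change(q_base, q_impacted, q_climate_only):
    """Split a runoff change into climate and human fractions.

    Assumed additive decomposition (illustrative only):
      total change   = q_impacted - q_base
      climate change = q_climate_only - q_base, where q_climate_only is the
                       runoff simulated with the later-period climate but
                       the baseline catchment parameter
      human change   = total change - climate change
    Returns (climate_fraction, human_fraction).
    """
    total = q_impacted - q_base
    climate = q_climate_only - q_base
    human = total - climate
    return climate / total, human / total
```

In this sketch, `q_climate_only` could come from `fu_runoff` evaluated with later-period precipitation and potential evapotranspiration while holding `omega` at its baseline value.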
Research on The Offensive Characteristics of La Liga Team Based on Social Network Analysis
To explore how social network parameters differ between the network of passes before scoring and the network of passes before missing the goal, and to explore the correlation between social network parameters and team performance, this paper establishes the offensive pass networks of the 20 La Liga teams from 2017 to 2018 and calculates 11 social network parameters. The Pearson correlation test is used to explore the linear correlation between the 11 social network parameters and team performance. The results show that the linear correlation with team performance is stronger for the network parameters of passing before scoring than for those of passing before missing the goal. These results can provide reliable and effective information to football coaches to help improve match performance.
Rapid detection of sulfamethoxazole in plasma and food samples with in-syringe membrane SPE coupled with solid-phase fluorescence spectrometry
In this work, an in-syringe membrane solid-phase extraction (MSPE) device was fabricated for the on-site sampling of sulfamethoxazole (SMX) in food samples, followed by solid-phase fluorescence spectral analysis. The samples and fluorescamine (FA) were added to a syringe for derivatization. Then, the SMX derivative was extracted by a membrane in the syringe SPE device. Subsequently, the derivative on the membrane was measured immediately without an additional elution procedure. The method was successfully applied to plasma, milk, and egg samples for trace SMX detection, with recoveries of 98%–102% and RSDs from 1% to 6%. Compared with liquid chromatography, direct detection of the concentrated analyte significantly improved the sensitivity. Moreover, fluorescamine made it unnecessary to separate SMX from interferences. Consequently, it is a time-saving, low-cost, and easy-to-operate method, which demonstrates the potential of in-syringe SPE as a promising candidate for on-site analysis.
Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS
Self-supervised learning (SSL) proficiency in speech-related tasks has driven
research into utilizing discrete tokens for speech tasks like recognition and
translation, which offer lower storage requirements and great potential to
employ natural language processing techniques. However, these studies, mainly
single-task focused, faced challenges like overfitting and performance
degradation in speech recognition tasks, often at the cost of sacrificing
performance in multi-task scenarios. This study presents a comprehensive
comparison and optimization of discrete tokens generated by various leading SSL
models in speech recognition and synthesis tasks. We aim to explore the
universality of speech discrete tokens across multiple speech tasks.
Experimental results demonstrate that discrete tokens achieve comparable
results against systems trained on FBank features in speech recognition tasks
and outperform mel-spectrogram features in speech synthesis in subjective and
objective metrics. These findings suggest that universal discrete tokens have
enormous potential in various speech-related tasks. Our work is open-source and
publicly available to facilitate research in this direction
Chitosan/Al2O3-HA nanocomposite beads for efficient removal of estradiol and chrysoidin from aqueous solution
Alumina, as a support material, was loaded together with chitosan and hydroxyapatite to form chitosan/Al2O3-HA composite beads, which were used for estradiol and chrysoidin removal from aqueous solution in the present work. The physicochemical properties of the beads were studied with Scanning Electron Microscopy (SEM), Fourier Transform Infrared Spectrometry (FTIR), thermogravimetric analysis (TGA) and Brunauer-Emmett-Teller (BET) surface area analysis. FTIR spectra confirmed that the chitosan was loaded successfully on Al2O3-HA, and that functional groups were immobilized onto the surface of the beads after the synthesis. The adsorption conditions, including pH, the amount of adsorbent, initial concentration and time, were evaluated during the batch experiments. Isotherm data best matched the Langmuir model, and the pseudo-second-order model best described the adsorption kinetics. The maximum adsorption capacity was found to be 39.78 mg/g and 23.26 mg/g for estradiol and chrysoidin, respectively. The adsorbed estradiol and chrysoidin were completely eluted from the composite beads with an eluent of 0.1 M H2SO4/MeOH, and the regenerated material was used in several cycles without deterioration in its initial performance. This study suggests that the developed composite beads have high potential for the efficient removal of estradiol and chrysoidin from aqueous solution.
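For readers unfamiliar with the two models named in the abstract above, the sketch below writes out the standard Langmuir isotherm and the integrated pseudo-second-order rate law. The function and parameter names are illustrative, and no fitted constants other than the reported maximum capacities appear in the abstract, so any Langmuir constant or rate constant passed in is a hypothetical value.

```python
def langmuir_uptake(c_eq, q_max, k_l):
    """Langmuir isotherm: q_e = q_max * K_L * C_e / (1 + K_L * C_e).

    c_eq  : equilibrium concentration (e.g. mg/L)
    q_max : maximum adsorption capacity (mg/g)
    k_l   : Langmuir constant (L/mg), hypothetical here
    """
    return q_max * k_l * c_eq / (1.0 + k_l * c_eq)


def pseudo_second_order_uptake(t, q_eq, k2):
    """Pseudo-second-order kinetics: q_t = k2 * q_e^2 * t / (1 + k2 * q_e * t).

    t    : contact time
    q_eq : equilibrium uptake (mg/g)
    k2   : rate constant (g/(mg * time)), hypothetical here
    """
    return k2 * q_eq ** 2 * t / (1.0 + k2 * q_eq * t)
```

In both expressions the uptake rises monotonically and saturates, approaching `q_max` at high concentration and `q_eq` at long contact time.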
UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding
The utilization of discrete speech tokens, divided into semantic tokens and
acoustic tokens, has been proven superior to traditional acoustic feature
mel-spectrograms in terms of naturalness and robustness for text-to-speech
(TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow
zero-shot speaker adaptation through auto-regressive (AR) continuation of
acoustic tokens extracted from a short speech prompt. However, these AR models
are restricted to generating speech only in a left-to-right direction, making
them unsuitable for speech editing where both preceding and following contexts
are provided. Furthermore, these models rely on acoustic tokens, which have
audio quality limitations imposed by the performance of audio codec models. In
this study, we propose a unified context-aware TTS framework called UniCATS,
which is capable of both speech continuation and editing. UniCATS comprises two
components, an acoustic model CTX-txt2vec and a vocoder CTX-vec2wav.
CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the
input text, enabling it to incorporate the semantic context and maintain
seamless concatenation with the surrounding context. Following that,
CTX-vec2wav utilizes contextual vocoding to convert these semantic tokens into
waveforms, taking into consideration the acoustic context. Our experimental
results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in terms
of speech resynthesis from semantic tokens. Moreover, we show that UniCATS
achieves state-of-the-art performance in both speech continuation and editing
- …