The Royalflush System for VoxCeleb Speaker Recognition Challenge 2022
In this technical report, we describe the Royalflush submissions for the
VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22). Our submissions
cover track 1, supervised speaker verification, and track 3, semi-supervised
speaker verification. For track 1, we develop a
powerful U-Net-based speaker embedding extractor with a symmetric architecture.
The proposed system achieves 2.06% in EER and 0.1293 in MinDCF on the
validation set. Compared with the state-of-the-art ECAPA-TDNN, it obtains a
relative improvement of 20.7% in EER and 22.70% in MinDCF. For track 3, we
employ the joint training of source domain supervision and target domain
self-supervision to obtain a speaker embedding extractor. A subsequent
clustering step yields pseudo-speaker labels for the target domain. We then
adapt the speaker embedding extractor on all source and target domain data in a
supervised manner, so that it can fully leverage information from both domains.
Moreover, clustering and supervised domain adaptation can be repeated until the
performance converges on the validation set. Our final submission is a fusion
of 10 models and achieves 7.75% EER and 0.3517 MinDCF on the validation set.
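The relative improvements quoted for track 1 can be unpacked with a little arithmetic. The sketch below assumes the usual definition, relative improvement = (baseline - system) / baseline, which the abstract does not state explicitly. The helper names are hypothetical, and the baseline values it prints are implied by that assumption rather than reported in the abstract.

```python
def implied_baseline(system_value, relative_improvement):
    """Return the baseline b satisfying (b - system_value) / b == relative_improvement.

    Assumes the common "relative reduction" definition; the abstract does not
    spell out which definition it uses.
    """
    if not 0.0 <= relative_improvement < 1.0:
        raise ValueError("relative_improvement must be in [0, 1)")
    return system_value / (1.0 - relative_improvement)


def relative_improvement(baseline, system_value):
    """Relative reduction of a lower-is-better metric such as EER or MinDCF."""
    return (baseline - system_value) / baseline


# Figures quoted in the abstract (validation set, track 1).
eer_system = 2.06        # EER in percent
mindcf_system = 0.1293   # MinDCF

# Baseline values implied by the quoted 20.7% and 22.70% relative improvements.
eer_baseline = implied_baseline(eer_system, 0.207)
mindcf_baseline = implied_baseline(mindcf_system, 0.2270)

print(f"implied baseline EER:    {eer_baseline:.2f}%")
print(f"implied baseline MinDCF: {mindcf_baseline:.4f}")
```

Under this assumption, the quoted improvements correspond to a baseline EER of roughly 2.6% and a baseline MinDCF of roughly 0.167.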
LE-SSL-MOS: Self-Supervised Learning MOS Prediction with Listener Enhancement
Recently, researchers have shown an increasing interest in automatically
predicting the subjective evaluation for speech synthesis systems. This
prediction is a challenging task, especially on the out-of-domain test set. In
this paper, we propose a novel fusion model for MOS prediction that combines
supervised and unsupervised approaches. In the supervised aspect, we developed
an SSL-based predictor called LE-SSL-MOS. The LE-SSL-MOS utilizes pre-trained
self-supervised learning models and further improves prediction accuracy by
utilizing the opinion scores of each utterance in the listener enhancement
branch. The unsupervised part has two components. First, we fine-tuned the
unit language model (ULM) using highly intelligible domain data to improve the
correlation of an unsupervised metric, SpeechLMScore. Second, we
utilized ASR confidence as a new metric with the help of ensemble learning. To
our knowledge, this is the first architecture that fuses supervised and
unsupervised methods for MOS prediction. With these approaches, our
experimental results on the VoiceMOS Challenge 2023 show that LE-SSL-MOS
performs better than the baseline. Our fusion system achieved an absolute
improvement of 13% over LE-SSL-MOS on the noisy and enhanced speech track. Our
system ranked 1st and 2nd, respectively, in the French speech synthesis track
and the challenge's noisy and enhanced speech track.
Comment: accepted in IEEE-ASRU202
Hierarchical Prosody Analysis Improves Categorical and Dimensional Emotion Recognition
Extracting reliable speech features is one of the most fundamental difficulties in emotion recognition systems.
The extraction of spectral features has drawn much research attention, whereas prosody features, which carry emotional cues, have often been extracted only as utterance-level statistics. However, the detailed prosody of different linguistic units can contain a large amount of emotion-related information. In this paper, we propose a novel hierarchical prosody analysis strategy based on wavelet decomposition that models multi-level emotion transition phenomena. Our approach was evaluated on the IEMOCAP corpus and outperformed state-of-the-art alternatives on both categorical and dimensional emotion recognition tasks, advancing the capture of dynamics in emotion expressions.
13th Asia Pacific Signal and Information Processing Association Annual Summit and Conference 2021 (APSIPA ASC), 14-17 December 2021, Tokyo, Japan
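To illustrate what a multi-level wavelet decomposition of a prosody contour looks like, here is a minimal sketch using the orthonormal Haar wavelet in plain Python. This is a swapped-in, simplified technique: the abstract does not say which wavelet or transform the authors use. The function names and the F0 values are made up for illustration.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Repeatedly split the approximation, collecting detail bands fine to coarse."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


# Hypothetical F0 contour in Hz (8 frames), not taken from any corpus.
f0 = [120.0, 122.0, 130.0, 128.0, 140.0, 150.0, 145.0, 135.0]
approx, details = haar_decompose(f0, levels=3)

# details[0] holds frame-to-frame changes; details[2] holds the slowest trend.
for level, band in enumerate(details, start=1):
    print(f"level {level}: {len(band)} coefficient(s)")
```

Each detail band summarizes variation at a different time scale, which is the sense in which such a decomposition is hierarchical; relating the scales to linguistic units is left to the paper's method.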
Chiral and Polar Duality Design of Heteroanionic Compounds: Sr18Ge9O5S31 Based on [Sr3OGeS3]2+ and [Sr3SGeS3]2+ Groups
Chirality and polarity are the two most important and representative symmetry-dependent properties. For polar structures, all the twofold axes perpendicular to the principal axis of symmetry should be removed. For chiral structures, all the mirror-related symmetries and inversion axes should be removed. Especially for duality (polarity and chirality), all of the above symmetries should be broken and that also represents the highest-level challenge. Herein, a new symmetry-breaking strategy that employs heteroanionic groups to construct hourglass-like [Sr3OGeS3]2+ and [Sr3SGeS3]2+ groups to design and synthesize a new oxychalcogenide Sr18Ge9O5S31 with chiral-polar duality is proposed. The presence of two enantiomers of Sr18Ge9O5S31 is confirmed by the single-crystal X-ray diffraction. Its optical activity and ferroelectricity are also studied by solid-state circular dichroism spectroscopy and piezoresponse force microscopy, respectively. Further property measurements show that Sr18Ge9O5S31 possesses excellent nonlinear optical properties, including the strong second harmonic generation efficiency (≈2.5 × AGS), large bandgap (3.61 eV), and wide mid-infrared transparent region (≈15.3 µm). These indicate that the unique microstructure groups of heteroanionic materials are conducive to realizing symmetry-breaking and are able to provide some inspiration for exploring the chiral-polar duality materials.
Self-Raman 1176 nm Laser Generation from Nd:YVO4 Crystal by Resonator Cavity Coating
Crystal coating is an important process in laser crystal applications. Based on the crystal characteristics of neodymium-doped yttrium vanadate (Nd:YVO4), its intrinsic parameters, and optical film design theory, Ta2O5 and SiO2 were selected as the high and low refractive index materials, respectively. The optical properties and surface roughness of the films were characterized by OptiLayer and Zygo interferometers, and the effects of ion source bias on refractive index and surface roughness were investigated to determine the optimal ion source parameters. Optical monitoring and quartz crystal control were combined to accurately control the thickness of each film layer and to reduce the monitoring error of film thickness. The prepared crystal device was successfully applied to the 1176 nm laser output system.