
73 research outputs found

Deep Cross-Modal Audio-Visual Generation

Author: Chen Lele
Duan Zhiyao
Srivastava Sudhanshu
Xu Chenliang
Publication venue
Publication date: 01/01/2017
Field of study

Cross-modal audio-visual perception has been a long-lasting topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite works in computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluations demonstrate that our model has the ability to generate one modality, i.e., audio/visual, from the other modality, i.e., visual/audio, to a good extent. Our experiments on various design choices along with the datasets will facilitate future research in this new problem space

arXiv.org e-Print Archive

Crossref

ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed

Author: Chen Meiying
Duan Zhiyao
Publication venue
Publication date: 03/06/2023
Field of study

Recent developments in neural speech synthesis and vocoding have sparked a renewed interest in voice conversion (VC). Beyond timbre transfer, achieving controllability on para-linguistic parameters such as pitch and speed is critical in deploying VC systems in many application scenarios. Existing studies, however, either only provide utterance-level global control or lack interpretability on the controls. In this paper, we propose ControlVC, the first neural voice conversion system that achieves time-varying controls on pitch and speed. ControlVC uses pre-trained encoders to compute pitch and linguistic embeddings from the source utterance and speaker embeddings from the target utterance. These embeddings are then concatenated and converted to speech using a vocoder. It achieves speed control through TD-PSOLA pre-processing on the source utterance, and achieves pitch control by manipulating the pitch contour before feeding it to the pitch encoder. Systematic subjective and objective evaluations are conducted to assess the speech quality and controllability. Results show that, on non-parallel and zero-shot conversion tasks, ControlVC significantly outperforms two other self-constructed baselines on speech quality, and it can successfully achieve time-varying pitch and speed control. Comment: Audio samples: https://bit.ly/3PsrKLJ; Code: https://github.com/MelissaChen15/control-v

arXiv.org e-Print Archive
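
The ControlVC abstract above says pitch control is achieved by manipulating the pitch contour before it reaches the pitch encoder. The following is a minimal sketch of what a time-varying manipulation could look like, not the authors' implementation. It assumes a frame-wise F0 contour in Hz where 0.0 marks an unvoiced frame; the function names `scale_pitch_contour` and `linear_ramp` are hypothetical.

```python
def scale_pitch_contour(f0_hz, factors):
    """Scale a frame-wise pitch contour by a time-varying factor.

    f0_hz   : per-frame F0 values in Hz; 0.0 marks an unvoiced frame (assumption)
    factors : per-frame multipliers, same length as f0_hz
    """
    if len(f0_hz) != len(factors):
        raise ValueError("f0_hz and factors must have the same length")
    # Unvoiced frames stay unvoiced; voiced frames are multiplied by their factor.
    return [f * k if f > 0.0 else 0.0 for f, k in zip(f0_hz, factors)]


def linear_ramp(n_frames, start, end):
    """Per-frame factors that move linearly from `start` to `end`."""
    if n_frames == 1:
        return [start]
    step = (end - start) / (n_frames - 1)
    return [start + i * step for i in range(n_frames)]


# Toy contour: flat 200 Hz with one unvoiced frame in the middle.
f0 = [200.0, 200.0, 0.0, 200.0, 200.0]
ramp = linear_ramp(len(f0), 1.0, 2.0)  # glide up by one octave over the utterance
shifted = scale_pitch_contour(f0, ramp)
```

Because the factor is specified per frame, the same helper covers both a global shift (constant factors) and a local edit (factors of 1.0 everywhere except a chosen region).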

Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss

Author: Chen Lele
Duan Zhiyao
Maddox Ross K.
Xu Chenliang
Publication venue
Publication date: 09/05/2019
Field of study

We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer audio to high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on the landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content. We, humans, are sensitive to temporal discontinuities and subtle artifacts in video. To avoid those pixel jittering problems and to enforce the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate a sharper image with well-synchronized facial movements, we propose a novel regression-based discriminator structure, which considers sequence-level information along with frame-level information. Thoughtful experiments on several datasets and real-world samples demonstrate significantly better results obtained by our method than the state-of-the-art methods in both quantitative and qualitative comparisons

arXiv.org e-Print Archive

Crossref

SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing

Author: Ding Siwen
Duan Zhiyao
Zhang You
Publication venue
Publication date: 04/11/2022
Field of study

Voice anti-spoofing systems are crucial auxiliaries for automatic speaker verification (ASV) systems. A major challenge is caused by unseen attacks empowered by advanced speech synthesis technologies. Our previous research on one-class learning has improved the generalization ability to unseen attacks by compacting the bona fide speech in the embedding space. However, such compactness lacks consideration of the diversity of speakers. In this work, we propose speaker attractor multi-center one-class learning (SAMO), which clusters bona fide speech around a number of speaker attractors and pushes away spoofing attacks from all the attractors in a high-dimensional embedding space. For training, we propose an algorithm for the co-optimization of bona fide speech clustering and bona fide/spoof classification. For inference, we propose strategies to enable anti-spoofing for speakers without enrollment. Our proposed system outperforms existing state-of-the-art single systems with a relative improvement of 38% on equal error rate (EER) on the ASVspoof2019 LA evaluation set

arXiv.org e-Print Archive
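
The SAMO abstract above describes clustering bona fide speech around several speaker attractors and pushing spoofed speech away from all of them. One way to picture scoring for a speaker without enrollment is to take the highest similarity between an utterance embedding and any attractor. The sketch below assumes cosine similarity and that max-over-attractors rule; both are illustrative assumptions rather than details stated in the abstract, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def bona_fide_score(embedding, attractors):
    """Highest cosine similarity between an embedding and any attractor.

    A high score means the utterance lies close to at least one attractor;
    a low score means it is far from all of them (assumed scoring rule).
    """
    return max(cosine(embedding, a) for a in attractors)


# Toy 2-D example with two speaker attractors.
attractors = [[1.0, 0.0], [0.0, 1.0]]
near = bona_fide_score([0.9, 0.1], attractors)    # close to the first attractor
far = bona_fide_score([-1.0, -1.0], attractors)   # far from both attractors
```

A decision threshold on this score would then separate bona fide from spoofed utterances; how that threshold is chosen is outside the scope of this sketch.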

Multi-Scale Simulation of Surface Segregation and Oxygen Reduction Reaction on Platinum Alloy Surface: Density Functional Theory, Monte Carlo Simulation, and Kinetic Analysis

Author: Duan Zhiyao
Publication venue
Publication date: 25/09/2013
Field of study

A proton-exchange membrane fuel cell (PEMFC) is an electrochemical device that directly converts the chemical energy stored in hydrogen fuel to electricity at low temperature. In order to improve the efficiency and reduce the cost of PEMFCs, numerous research efforts have been devoted to developing cheaper yet more efficient electrocatalysts by alloying Pt with 3d transition metals. A number of Pt-based alloys were identified to have better activity for catalyzing the oxygen reduction reaction (ORR) than pure Pt catalysts. Although Pt-based catalysts have better ORR activity, the underlying reasons for the improvement are unclear and controversial. Atomistic simulation can provide molecular-level information on physical and chemical processes in materials. In my research, state-of-the-art simulation methods were employed to elucidate the surface structures of Pt-based alloys, the reaction mechanism of ORR on Pt and Pt-based surfaces, and the degradation of Pt and Pt-based nanoparticle catalysts. Based on the research topics, my research is primarily composed of four parts, as follows. Firstly, surface segregation phenomena in Pt3Ti, Pt-Pd, and Pt3Fe alloys were investigated with density functional theory (DFT) and Monte Carlo (MC) simulations. Through the computational study, the driving forces and mechanisms of surface segregation were clarified, and the surface composition profiles were quantitatively predicted. For example, the DFT simulation suggested that an off-stoichiometric effect accounted for the experimentally observed Pt segregation to the outermost layer of the Pt3Ti (111) surface. Our MC simulations predicted that in a Pt3Ti (111) sample with a Pt concentration slightly above 75 at. %, Pt atoms would segregate to the surface to form a pure Pt outermost layer, while the ordered Pt3Ti crystal structure would be maintained in the second layer and below. 
The knowledge of surface structures of Pt-based alloys was acquired through the surface segregation study, which laid the groundwork for further studies on the catalytic properties of the alloy surfaces. Secondly, first-principles DFT calculations were employed to elucidate the reaction mechanism of ORR on Pt and Pt/M (111) and (100) surfaces (M = Ni, Co, Fe). The chemical intermediates involved in ORR bind less strongly to Pt/M surfaces than to their pure Pt counterparts because the subsurface transition metals modify the electronic structure of the Pt overlayer. The ORR mechanism also shifts on the modified Pt overlayers. It was found that ORR proceeds through an OOH dissociation mechanism on the Pt (111) surface, while on Pt/M (111) surfaces ORR proceeds through an HOOH dissociation mechanism. The significance of the changed ORR mechanism is that ORR activity, as measured by the barrier of the rate-determining step, is greatly enhanced on Pt/M (111) surfaces. For example, on the Pt/Ni (111) surface, O2 hydrogenation is the rate-determining step with a barrier of 0.15 eV, compared to O hydrogenation with a barrier of 0.79 eV on the Pt (111) surface. We also determined the ORR mechanism on the Pt (100) and Pt/Ni (100) surfaces to be an O2 dissociation mechanism. There is no mechanism change between the Pt (100) and Pt/Ni (100) surfaces, since the subsurface Ni has much less effect on the (100) surface than on the (111) surface. The results from our calculations explain the experimentally observed enhancement of ORR activity on Pt/M (111) surfaces and the relative ORR activity of Pt (111) and Pt (100) surfaces. In the third part, a kinetic Monte Carlo (KMC) algorithm is implemented to study the kinetics of ORR based on the mechanistic information obtained in the second study. Information on the elementary reactions involved in ORR, such as the adsorption sites of the reactants and products, activation energies, etc., is input into the KMC code. 
The KMC simulation can simulate the dynamics of ORR and output the current density (joules/cm2/s) generated from the reactions. The simulated current density, which is a measure of ORR activity, can then be directly compared to experimental measurement. In the study, the kinetics of ORR on Pt (111) and Pt (100) surfaces were simulated. The simulated current density of ORR on Pt (111) and Pt (100) at an electrode potential of 0.8 V is of the same order of magnitude as the experimental measurement, although the actual value is about two times lower. The reasonable agreement with experiments in turn indicates that the preceding mechanistic study is reliable. Apart from the activity issue, the Pt nanoparticle catalyst also faces a degradation problem due to the highly oxidizing environment in the cathode of the PEMFC. In the final part, the degradation of the Pt nanoparticle catalyst through Pt dissolution is studied using grand-canonical Monte Carlo (GCMC) simulation. The Pt dissolution process was found to be initiated through the dissolution of under-coordinated Pt atoms sitting on the corners and edges of the nanoparticle. After the initial dissolution of Pt atoms on corners and edges, more under-coordinated Pt atoms are generated and the dissolution process accelerates. A smaller Pt nanoparticle is more vulnerable to the Pt dissolution process than a larger one. A Pt nanoparticle with about 5 nm diameter is stable in the environment. It was also found that Au atoms segregated to the under-coordinated sites would stabilize the nanoparticle, because Au atoms do not dissolve and the dissolution process is therefore not initiated. The simulation explains the stabilizing effect of Au observed in experiments.

D-Scholarship@Pitt
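
The thesis abstract above compares rate-determining barriers of 0.15 eV and 0.79 eV and feeds activation energies into a KMC code. As a rough illustration of why that barrier difference matters, and of how a KMC step selects events, the sketch below uses the Arrhenius expression and the standard rejection-free event-selection scheme. The temperature (300 K), the prefactor (1e13 per second), and the toy rates are assumptions for illustration only; this is not the thesis's KMC code.

```python
import math
import random

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_rate(barrier_ev, temperature_k=300.0, prefactor=1.0e13):
    """Rate constant k = A * exp(-Ea / (kB * T)); A and T are assumed values."""
    return prefactor * math.exp(-barrier_ev / (K_B_EV * temperature_k))


def kmc_step(rates, rng):
    """One rejection-free KMC step.

    Picks an event index with probability proportional to its rate and
    draws an exponentially distributed time increment from the total rate.
    """
    total = sum(rates)
    threshold = rng.random() * total
    cumulative = 0.0
    chosen = len(rates) - 1  # fallback guards against floating-point round-off
    for i, rate in enumerate(rates):
        cumulative += rate
        if threshold < cumulative:
            chosen = i
            break
    dt = -math.log(1.0 - rng.random()) / total
    return chosen, dt


# Barriers quoted in the abstract for the rate-determining steps.
rate_pt_ni = arrhenius_rate(0.15)  # Pt/Ni (111)
rate_pt = arrhenius_rate(0.79)     # Pt (111)
ratio = rate_pt_ni / rate_pt       # depends only on the barrier difference and T

# Toy two-event system with relative rates 3:1 (illustrative, not ORR data).
rng = random.Random(0)
counts = [0, 0]
elapsed = 0.0
for _ in range(1000):
    event, dt = kmc_step([3.0, 1.0], rng)
    counts[event] += 1
    elapsed += dt
```

With identical prefactors the ratio reduces to exp(0.64 eV / (kB T)), which at the assumed 300 K is on the order of 10^10. In the toy run, the faster event is selected about three times as often as the slower one, which is the behavior a KMC event loop relies on.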