73 research outputs found
Deep Cross-Modal Audio-Visual Generation
Cross-modal audio-visual perception has been a long-lasting topic in
psychology and neurology, and various studies have discovered strong
correlations in human perception of auditory and visual stimuli. Despite works
in computational multimodal modeling, the problem of cross-modal audio-visual
generation has not been systematically studied in the literature. In this
paper, we make the first attempt to solve this cross-modal generation problem
leveraging the power of deep generative adversarial training. Specifically, we
use conditional generative adversarial networks to achieve cross-modal
audio-visual generation of musical performances. We explore different encoding
methods for audio and visual signals, and work on two scenarios:
instrument-oriented generation and pose-oriented generation. Being the first to
explore this new problem, we compose two new datasets with pairs of images and
sounds of musical performances of different instruments. Our experiments using
both classification and human evaluations demonstrate that our model has the
ability to generate one modality, i.e., audio/visual, from the other modality,
i.e., visual/audio, to a good extent. Our experiments on various design choices
along with the datasets will facilitate future research in this new problem
space.
ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed
Recent developments in neural speech synthesis and vocoding have sparked a
renewed interest in voice conversion (VC). Beyond timbre transfer, achieving
controllability on para-linguistic parameters such as pitch and speed is
critical in deploying VC systems in many application scenarios. Existing
studies, however, either only provide utterance-level global control or lack
interpretability on the controls. In this paper, we propose ControlVC, the
first neural voice conversion system that achieves time-varying controls on
pitch and speed. ControlVC uses pre-trained encoders to compute pitch and
linguistic embeddings from the source utterance and speaker embeddings from the
target utterance. These embeddings are then concatenated and converted to
speech using a vocoder. It achieves speed control through TD-PSOLA
pre-processing on the source utterance, and achieves pitch control by
manipulating the pitch contour before feeding it to the pitch encoder.
Systematic subjective and objective evaluations are conducted to assess the
speech quality and controllability. Results show that, on non-parallel and
zero-shot conversion tasks, ControlVC significantly outperforms two other
self-constructed baselines on speech quality, and it can successfully achieve
time-varying pitch and speed control.
Comment: Audio samples: https://bit.ly/3PsrKLJ; Code:
https://github.com/MelissaChen15/control-v
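The pitch-control idea in the abstract above (editing the pitch contour before it reaches the pitch encoder) can be illustrated with a minimal sketch. It is not the ControlVC implementation: the function name, the semitone parameterization, and the convention that a value of 0 marks an unvoiced frame are assumptions made for illustration only.

```python
def shift_pitch_contour(f0_hz, semitone_shifts):
    """Apply a per-frame pitch shift (in semitones) to an F0 contour.

    Assumption: frames with f0 <= 0 are unvoiced and are left unchanged.
    """
    if len(f0_hz) != len(semitone_shifts):
        raise ValueError("f0_hz and semitone_shifts must have the same length")
    shifted = []
    for f0, shift in zip(f0_hz, semitone_shifts):
        if f0 <= 0:
            shifted.append(f0)  # keep unvoiced frames as they are
        else:
            # One semitone corresponds to a frequency ratio of 2**(1/12).
            shifted.append(f0 * 2.0 ** (shift / 12.0))
    return shifted


# Hypothetical 5-frame contour with a shift that rises from 0 to +12 semitones.
contour = [220.0, 220.0, 0.0, 220.0, 220.0]
shifts = [0.0, 3.0, 6.0, 9.0, 12.0]
modified = shift_pitch_contour(contour, shifts)
```

Because the shift is given per frame rather than per utterance, the same function covers both global and time-varying control; a global shift is simply a constant list.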
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking face video, which is
robust to different face shapes, view angles, facial characteristics, and noisy
audio conditions. Instead of learning a direct mapping from audio to video
frames, we propose first to transfer audio to high-level structure, i.e., the
facial landmarks, and then to generate video frames conditioned on the
landmarks. Compared to a direct audio-to-image approach, our cascade approach
avoids fitting spurious correlations between audiovisual signals that are
irrelevant to the speech content. Humans are sensitive to temporal
discontinuities and subtle artifacts in video. To avoid such pixel jitter
and to encourage the network to focus on audiovisual-correlated regions,
we propose a novel dynamically adjustable pixel-wise loss with an attention
mechanism. Furthermore, to generate a sharper image with well-synchronized
facial movements, we propose a novel regression-based discriminator structure,
which considers sequence-level information along with frame-level information.
Thoughtful experiments on several datasets and real-world samples demonstrate
significantly better results obtained by our method than the state-of-the-art
methods in both quantitative and qualitative comparisons.
SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing
Voice anti-spoofing systems are crucial auxiliaries for automatic speaker
verification (ASV) systems. A major challenge is caused by unseen attacks
empowered by advanced speech synthesis technologies. Our previous research on
one-class learning has improved the generalization ability to unseen attacks by
compacting the bona fide speech in the embedding space. However, such
compactness lacks consideration of the diversity of speakers. In this work, we
propose speaker attractor multi-center one-class learning (SAMO), which
clusters bona fide speech around a number of speaker attractors and pushes away
spoofing attacks from all the attractors in a high-dimensional embedding space.
For training, we propose an algorithm for the co-optimization of bona fide
speech clustering and bona fide/spoof classification. For inference, we propose
strategies to enable anti-spoofing for speakers without enrollment. Our
proposed system outperforms existing state-of-the-art single systems with a
relative improvement of 38% on equal error rate (EER) on the ASVspoof2019 LA
evaluation set.
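To make the multi-center idea in the abstract above concrete, the sketch below scores an utterance embedding by its highest cosine similarity to any speaker attractor, so bona fide speech close to some attractor scores high and speech far from all of them scores low. The abstract does not specify the scoring rule; the max-cosine rule, the function names, and the toy 2-D vectors are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def bona_fide_score(embedding, attractors):
    """Assumed scoring rule: similarity to the closest speaker attractor."""
    return max(cosine_similarity(embedding, center) for center in attractors)


# Hypothetical attractors for two speakers in a toy 2-D embedding space.
attractors = [[1.0, 0.0], [0.0, 1.0]]
near = bona_fide_score([0.9, 0.1], attractors)   # close to the first attractor
far = bona_fide_score([-1.0, -1.0], attractors)  # far from every attractor
```

A decision threshold on this score would then separate bona fide from spoofed speech; choosing that threshold is outside the scope of this sketch.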
Multi-Scale Simulation of Surface Segregation and Oxygen Reduction Reaction on Platinum Alloy Surface: Density Functional Theory, Monte Carlo Simulation, and Kinetic Analysis
A proton-exchange membrane fuel cell (PEMFC) is an electrochemical device that directly converts the chemical energy stored in hydrogen fuel into electricity at low temperature. To improve the efficiency and reduce the cost of PEMFCs, numerous research efforts have been devoted to developing cheaper yet more efficient electrocatalysts by alloying Pt with 3d transition metals. A number of Pt-based alloys have been identified as having better activity for catalyzing the oxygen reduction reaction (ORR) than pure Pt catalysts. Although Pt-based alloy catalysts have better ORR activity, the underlying reasons for the improvement are unclear and controversial. Atomistic simulation can provide molecular-level information on physical and chemical processes in materials. In my research, state-of-the-art simulation methods were employed to elucidate the surface structures of Pt-based alloys, the reaction mechanism of ORR on Pt and Pt-based surfaces, and the degradation of Pt and Pt-based nanoparticle catalysts. Organized by topic, my research is composed of four parts, as follows.
First, surface segregation phenomena in Pt3Ti, Pt-Pd, and Pt3Fe alloys were investigated with density functional theory (DFT) and Monte Carlo (MC) simulations. Through this computational study, the driving forces and mechanisms of surface segregation were clarified, and the surface composition profiles were quantitatively predicted. For example, the DFT simulations suggested that an off-stoichiometric effect accounts for the experimentally observed Pt segregation to the outermost layer of the Pt3Ti (111) surface. Our MC simulations predicted that, in a Pt3Ti (111) sample with a Pt concentration slightly above 75 at. %, Pt atoms would segregate to the surface to form a pure Pt outermost layer, while the ordered Pt3Ti crystal structure would be maintained in the second layer and below. The knowledge of the surface structures of Pt-based alloys acquired through the surface segregation study laid the groundwork for further studies of the catalytic properties of the alloy surfaces.
Second, first-principles DFT calculations were employed to elucidate the reaction mechanism of ORR on Pt and Pt/M (111) and (100) surfaces (M = Ni, Co, Fe). The chemical intermediates involved in ORR bind less strongly on Pt/M surfaces than on their pure Pt counterparts because the subsurface transition metals modify the electronic structure of the Pt overlayer. The ORR mechanism also shifts on the modified Pt overlayers. It was found that ORR proceeds through the OOH dissociation mechanism on the Pt (111) surface, while on Pt/M (111) surfaces it proceeds through the HOOH dissociation mechanism. The significance of the changed mechanism is that ORR activity, as measured by the barrier of the rate-determining step, is greatly enhanced on Pt/M (111) surfaces. For example, on the Pt/Ni (111) surface, O2 hydrogenation is the rate-determining step with a barrier of 0.15 eV, compared to O hydrogenation with a barrier of 0.79 eV on the Pt (111) surface. We also determined the ORR mechanism on the Pt (100) and Pt/Ni (100) surfaces to be the O2 dissociation mechanism. There is no mechanism change between the Pt (100) and Pt/Ni (100) surfaces because the subsurface Ni has much less effect on the (100) surface than on the (111) surface. The results of our calculations explain the experimentally observed enhancement of ORR activity on Pt/M (111) surfaces and the relative ORR activity of the Pt (111) and Pt (100) surfaces.
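To give a rough sense of what the two rate-determining barriers quoted above (0.15 eV and 0.79 eV) imply, the sketch below compares them through a simple Arrhenius factor. This is a back-of-the-envelope estimate, not part of the original study: it assumes equal pre-exponential factors, a temperature of 300 K, and no dependence on coverage or electrode potential.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_rate_ratio(barrier_low_ev, barrier_high_ev, temperature_k):
    """Ratio k_low / k_high of two Arrhenius rates with equal prefactors."""
    delta = barrier_high_ev - barrier_low_ev
    return math.exp(delta / (BOLTZMANN_EV_PER_K * temperature_k))


# Barriers from the text: 0.15 eV on Pt/Ni (111) vs 0.79 eV on Pt (111).
# The temperature of 300 K is an assumption made for this illustration.
ratio = arrhenius_rate_ratio(0.15, 0.79, 300.0)
```

Under these assumptions the ratio is on the order of 10^10, which shows how strongly a 0.64 eV difference in barrier height affects the rate of a single elementary step. Measured activity differences are far smaller, because real kinetics also depend on coverages, prefactors, and the applied potential.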
In the third part, a kinetic Monte Carlo (KMC) algorithm was implemented to study the kinetics of ORR based on the mechanistic information obtained in the second study. Information on the elementary reactions involved in ORR, such as the adsorption sites of the reactants and products and the activation energies, is input into the KMC code. The KMC simulation models the dynamics of ORR and outputs the current density (coulombs/cm2/s, i.e., A/cm2) generated by the reactions. The simulated current density, which is a measure of ORR activity, can then be directly compared to experimental measurements. In this study, the kinetics of ORR on Pt (111) and Pt (100) surfaces were simulated. The simulated current density of ORR on Pt (111) and Pt (100) at an electrode potential of 0.8 V is of the same order of magnitude as the experimental measurement, although the actual value is about two times lower. This reasonable agreement with experiment in turn indicates that the preceding mechanistic study is reliable.
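The core loop of a KMC simulation like the one described above can be sketched as follows. This is a generic rejection-free (Gillespie-type) step, not the code used in the study: the function name and the two event rates are hypothetical, and in a real ORR model the rates would be derived from the activation energies of the elementary reactions.

```python
import math
import random


def kmc_step(rates, rng):
    """One rejection-free KMC step: returns (event_index, time_increment)."""
    total = sum(rates)
    if total <= 0:
        raise ValueError("at least one event must have a positive rate")
    # Pick an event with probability proportional to its rate.
    threshold = rng.random() * total
    cumulative = 0.0
    chosen = len(rates) - 1  # fallback guards against rounding at the upper end
    for index, rate in enumerate(rates):
        cumulative += rate
        if threshold < cumulative:
            chosen = index
            break
    # Advance the clock by an exponentially distributed waiting time.
    dt = -math.log(1.0 - rng.random()) / total
    return chosen, dt


# Hypothetical system with two competing events (rates in 1/s).
rng = random.Random(0)
rates = [1.0, 3.0]
counts = [0, 0]
elapsed = 0.0
for _ in range(10000):
    event, dt = kmc_step(rates, rng)
    counts[event] += 1
    elapsed += dt
```

Counting how many charge-transferring events occur per unit simulated time and per unit surface area is what would let such a simulation report a current density for comparison with experiment.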
Apart from the activity issue, Pt nanoparticle catalysts also face a degradation problem due to the highly oxidizing environment at the cathode of a PEMFC. In the final part, the degradation of Pt nanoparticle catalysts through Pt dissolution was studied using grand-canonical Monte Carlo (GCMC) simulation. The Pt dissolution process was found to be initiated by the dissolution of under-coordinated Pt atoms sitting on the corners and edges of the nanoparticle. After the initial dissolution of Pt atoms on corners and edges, more under-coordinated Pt atoms are generated and the dissolution process accelerates. Smaller Pt nanoparticles are more vulnerable to dissolution than larger ones. A Pt nanoparticle with a diameter of about 5 nm is stable in this environment. It was also found that Au atoms segregated to the under-coordinated sites stabilize the nanoparticle, because Au atoms do not dissolve and the dissolution process is therefore not initiated. The simulation explains the stabilizing effect of Au observed in experiments.