WavSpA: Wavelet Space Attention for Boosting Transformers' Long Sequence Learning Ability
Transformer and its variants are fundamental neural architectures in deep
learning. Recent works show that learning attention in the Fourier space can
improve the long sequence learning capability of Transformers. We argue that
the wavelet transform is a better choice because it captures both position
and frequency information with linear time complexity. Therefore, in this
paper, we systematically study the synergy between wavelet transform and
Transformers. We propose Wavelet Space Attention (WavSpA) that facilitates
attention learning in a learnable wavelet coefficient space which replaces the
attention in Transformers by (1) applying forward wavelet transform to project
the input sequences to multi-resolution bases, (2) conducting attention
learning in the wavelet coefficient space, and (3) reconstructing the
representation in input space via backward wavelet transform. Extensive
experiments on the Long Range Arena demonstrate that learning attention in the
wavelet space using either fixed or adaptive wavelets can consistently improve
Transformer's performance and also significantly outperform learning in Fourier
space. We further show our method can enhance Transformer's reasoning
extrapolation capability over distance on the LEGO chain-of-reasoning task.
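The three-step procedure above can be sketched with a single-level Haar transform, one of the fixed wavelets the abstract mentions. This is a minimal illustrative sketch, not the paper's implementation: the function names are invented here, the attention is plain softmax self-attention in NumPy rather than a full Transformer layer, and a real WavSpA layer would use learnable or adaptive wavelet filters and multiple decomposition levels.

```python
import numpy as np

def haar_forward(x):
    # Single-level Haar wavelet transform along the sequence axis.
    # x: (seq_len, d_model); seq_len must be even.
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)   # low-frequency (coarse) coefficients
    detail = (even - odd) / np.sqrt(2)   # high-frequency (fine) coefficients
    return approx, detail

def haar_inverse(approx, detail):
    # Exact inverse of haar_forward (perfect reconstruction).
    even = (approx + detail) / np.sqrt(2)
    odd = (approx - detail) / np.sqrt(2)
    x = np.empty((2 * approx.shape[0], approx.shape[1]))
    x[0::2], x[1::2] = even, odd
    return x

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def wavspa_attention(x):
    # (1) forward wavelet transform: project the sequence onto
    #     multi-resolution bases
    approx, detail = haar_forward(x)
    coeffs = np.concatenate([approx, detail], axis=0)
    # (2) attention learning in the wavelet coefficient space
    out = softmax_attention(coeffs, coeffs, coeffs)
    # (3) reconstruct the representation in input space via the
    #     backward (inverse) wavelet transform
    half = out.shape[0] // 2
    return haar_inverse(out[:half], out[half:])

x = np.random.default_rng(0).normal(size=(8, 4))
y = wavspa_attention(x)
assert y.shape == x.shape
```

Because the Haar transform here is orthogonal and exactly invertible, steps (1) and (3) cancel when the attention is the identity, so the layer degrades gracefully; the gain comes from letting attention mix coarse and fine coefficients directly.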
Dior-CVAE: Diffusion Priors in Variational Dialog Generation
Conditional variational autoencoders (CVAEs) have been used recently for
diverse response generation, by introducing latent variables to represent the
relationship between a dialog context and its potential responses. However, the
diversity of the generated responses brought by a CVAE model is limited due to
the oversimplified assumption of an isotropic Gaussian prior. We propose
Dior-CVAE, a hierarchical CVAE model with an informative prior produced by a
diffusion model. Dior-CVAE derives a series of layer-wise latent variables
using an attention mechanism and infuses them into the decoder layers accordingly. We
propose memory dropout in the latent infusion to alleviate posterior collapse.
The prior distribution of the latent variables is parameterized by a diffusion
model to introduce a multimodal distribution. Overall, experiments on two
popular open-domain dialog datasets indicate the advantages of our approach
over previous Transformer-based variational dialog models in dialog response
generation. We publicly release the code for reproducing Dior-CVAE and all
baselines at
https://github.com/SkyFishMoon/Latent-Diffusion-Response-Generation
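Memory dropout, as the abstract describes it, forces the decoder to rely on the latent variable rather than ignore it. The sketch below is one plausible reading of such a mechanism, not the actual Dior-CVAE formulation (see the linked repository for that): encoder memory vectors are randomly replaced by the latent variable during latent infusion, so the decoder's cross-attention cannot reconstruct the response from the memory alone. All names and the replacement scheme are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def memory_dropout(memory, z, p=0.5):
    # Hypothetical sketch: with probability p, replace an encoder memory
    # vector with the latent variable z, so the decoder must attend to z
    # to recover the dropped information (a posterior-collapse mitigation).
    mask = rng.random(memory.shape[0]) < p
    out = memory.copy()
    out[mask] = z
    return out

memory = rng.normal(size=(10, 16))   # encoder hidden states (toy values)
z = rng.normal(size=(16,))           # one layer-wise latent variable
mixed = memory_dropout(memory, z, p=0.5)
assert mixed.shape == memory.shape
```

At inference time such a mechanism would typically be disabled, like ordinary dropout; the diffusion-parameterized prior then supplies the multimodality in z.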
Text-based Editing of Talking-head Video
Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression, and scene illumination per frame. To edit a video, the user only needs to edit the transcript; an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation into a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis.