Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition
The Transformer architecture, based on self-attention and multi-head
attention, has achieved remarkable success in offline end-to-end Automatic
Speech Recognition (ASR). However, self-attention and multi-head attention
cannot be easily applied to streaming or online ASR. For self-attention in
Transformer ASR, the softmax normalization function-based attention mechanism
makes it impossible to highlight important speech information. For multi-head
attention in Transformer ASR, it is not easy to model monotonic alignments in
different heads. To overcome these two limitations, we integrate sparse
attention and monotonic attention into Transformer-based ASR. The sparse
mechanism introduces a learned sparsity scheme that enables each
self-attention structure to better fit the corresponding head. The monotonic
attention deploys regularization to prune redundant heads in the multi-head
attention structure. Experiments show that our method effectively improves the
attention mechanism on widely used speech recognition benchmarks. Comment: Accepted to DSAA 202
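The abstract does not specify the learned sparsity scheme, but the idea of replacing softmax so that unimportant positions receive exactly zero weight can be illustrated with sparsemax, a well-known sparse alternative to softmax. The sketch below is an illustrative stand-in in plain NumPy, not the paper's actual mechanism; all function names and shapes are assumptions.

```python
import numpy as np

def sparsemax(z):
    """Sparse alternative to softmax: projects the score vector onto the
    probability simplex, driving low-scoring entries exactly to zero."""
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum      # entries kept in the support
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max    # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)

def sparse_attention(q, K, V):
    """Single-query attention where softmax is swapped for sparsemax, so
    many attention weights are exactly 0 and key frames stand out."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = sparsemax(scores)
    return weights @ V, weights
```

With a spread-out score vector such as `[3.0, 1.0, 0.2]`, sparsemax assigns all mass to the top entry and exact zeros elsewhere, whereas softmax would keep every weight positive.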
High-Fidelity 3D Head Avatars Reconstruction through Spatially-Varying Expression Conditioned Neural Radiance Field
One crucial aspect of 3D head avatar reconstruction lies in the details of
facial expressions. Although recent NeRF-based photo-realistic 3D head avatar
methods achieve high-quality avatar rendering, they still encounter challenges
in retaining intricate facial expression details because they overlook the
potential of specific expression variations at different spatial positions when
conditioning the radiance field. Motivated by this observation, we introduce a
novel Spatially-Varying Expression (SVE) conditioning. The SVE can be obtained
by a simple MLP-based generation network, encompassing both spatial positional
features and global expression information. Benefiting from rich and diverse
information of the SVE at different positions, the proposed SVE-conditioned
neural radiance field can deal with intricate facial expressions and achieve
realistic rendering and geometry details of high-fidelity 3D head avatars.
Additionally, to further elevate the geometric and rendering quality, we
introduce a new coarse-to-fine training strategy, including a geometry
initialization strategy at the coarse stage and an adaptive importance sampling
strategy at the fine stage. Extensive experiments indicate that our method
outperforms other state-of-the-art (SOTA) methods in rendering and geometry
quality on mobile phone-collected and public datasets. Comment: 9 pages, 5 figures
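The core idea of the SVE conditioning, a small network that maps a 3D position plus a global expression code to a per-position expression feature, can be sketched as follows. All dimensions, names, and the two-layer MLP are hypothetical stand-ins for the paper's generation network.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    """Two-layer ReLU MLP; stands in for the SVE generation network."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

# Hypothetical sizes: 3D position, 64-d global expression code,
# 32-d spatially-varying expression (SVE) feature per sample point.
P, E, H, S = 3, 64, 128, 32
W1 = rng.normal(0, 0.1, (P + E, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, S));     b2 = np.zeros(S)

def sve_condition(positions, global_expr):
    """Unlike a single global condition, each 3D sample point gets its
    own expression feature, computed from (position, global expression)."""
    n = positions.shape[0]
    inp = np.concatenate([positions, np.tile(global_expr, (n, 1))], axis=1)
    return mlp(inp, W1, b1, W2, b2)

pts = rng.normal(size=(1024, P))   # sample points along camera rays
expr = rng.normal(size=(E,))       # global expression vector
sve = sve_condition(pts, expr)     # (1024, 32): varies across positions
```

The per-point feature `sve` would then condition the radiance field alongside the positional encoding, letting the same global expression produce different local effects at the mouth versus the eyes.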
Printing of Fine Metal Electrodes for Organic Thin-Film Transistors
Attributed to the excellent mechanical flexibility and compatibility with low-cost and high-throughput printing processes, the organic thin-film transistor (OTFT) is a promising technology of choice for a wide range of flexible and large-area electronics applications. Among various printing techniques, drop-on-demand inkjet printing is one of the most versatile ones to form patterned electrodes, with the advantages of mask-less patterning, non-contact operation, low cost, and scalability to large-area manufacturing. However, the limited positional accuracy of the inkjet printer system and the spreading of the ink droplets on the substrate surface, which is influenced by both the ink properties and the substrate surface energy, make it difficult to obtain fine-line morphologies and define the exact channel length as required, especially for relatively narrow-line and short-channel patterns. This chapter introduces the printing of uniform fine silver electrodes and the down-scaling of the channel length by controlling ink wetting on a polymer substrate. All-solution-processed/printable OTFTs with short channels (<20 ”m) are also demonstrated by incorporating fine inkjet-printed silver electrodes into a low-voltage (<3 V) OTFT architecture. This work would provide a commercially competitive manufacturing approach to developing printable low-voltage OTFTs for low-power electronics applications.
Video Deepfake Classification Using Particle Swarm Optimization-based Evolving Ensemble Models
The recent breakthrough of deep learning based generative models has led to the escalated generation of photo-realistic synthetic videos with significant visual quality. Automated reliable detection of such forged videos requires the extraction of fine-grained discriminative spatial-temporal cues. To tackle such challenges, we propose weighted and evolving ensemble models comprising 3D Convolutional Neural Networks (CNNs) and CNN-Recurrent Neural Networks (RNNs) with Particle Swarm Optimization (PSO) based network topology and hyper-parameter optimization for video authenticity classification. A new PSO algorithm is proposed, which embeds Muller's method and fixed-point iteration based leader enhancement, reinforcement learning-based optimal search action selection, a petal spiral simulated search mechanism, and cross-breed elite signal generation based on adaptive geometric surfaces. The PSO variant optimizes the RNN topologies in CNN-RNN, as well as key learning configurations of 3D CNNs, with the attempt to extract effective discriminative spatial-temporal cues. Both weighted and evolving ensemble strategies are used for ensemble formulation with aforementioned optimized networks as base classifiers. In particular, the proposed PSO algorithm is used to identify optimal subsets of optimized base networks for dynamic ensemble generation to balance between ensemble complexity and performance. Evaluated using several well-known synthetic video datasets, our approach outperforms existing studies and various ensemble models devised by other search methods with statistical significance for video authenticity classification. The proposed PSO model also illustrates statistical superiority over a number of search methods for solving optimization problems pertaining to a variety of artificial landscapes with diverse geometrical layouts.
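For readers unfamiliar with PSO, the base loop that the paper's variant builds on can be sketched as canonical particle swarm optimization. The leader enhancement, RL-based action selection, spiral search, and cross-breed elite generation described above are deliberately omitted; this is the textbook algorithm only, with assumed hyper-parameters.

```python
import numpy as np

def pso(f, dim, n_particles=30, iters=200, bounds=(-5.0, 5.0), seed=0):
    """Canonical PSO: each particle is pulled toward its personal best
    and the swarm's global best. The paper's enhancements sit on top of
    this loop and are not reproduced here."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))     # positions
    v = np.zeros_like(x)                            # velocities
    pbest = x.copy()
    pbest_f = np.array([f(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()              # global best
    w, c1, c2 = 0.7, 1.5, 1.5                       # inertia / cognitive / social
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        fx = np.array([f(p) for p in x])
        improved = fx < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], fx[improved]
        g = pbest[pbest_f.argmin()].copy()
    return g, pbest_f.min()

# Minimize the 5-dimensional sphere function as a smoke test.
best, val = pso(lambda p: float(np.sum(p**2)), dim=5)
```

In the paper's setting, `f` would score a candidate RNN topology or 3D-CNN configuration (e.g. by validation loss) rather than an analytic test function.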
VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection
In recent years, transformer-based detectors have demonstrated remarkable
performance in 2D visual perception tasks. However, their performance in
multi-view 3D object detection remains inferior to that of state-of-the-art
(SOTA) convolutional neural network based detectors. In this work, we investigate
this issue from the perspective of bird's-eye-view (BEV) feature generation.
Specifically, we examine the BEV feature generation method employed by the
transformer-based SOTA, BEVFormer, and identify its two limitations: (i) it
only generates attention weights from BEV, which precludes the use of lidar
points for supervision, and (ii) it aggregates camera view features to the BEV
through deformable sampling, which only selects a small subset of features and
fails to exploit all information. To overcome these limitations, we propose a
novel BEV feature generation method, dual-view attention, which generates
attention weights from both the BEV and camera view. This method encodes all
camera features into the BEV feature. By combining dual-view attention with the
BEVFormer architecture, we build a new detector named VoxelFormer. Extensive
experiments are conducted on the nuScenes benchmark to verify the superiority
of dual-view attention and VoxelFormer. We observe that even when adopting only
3 encoders and 1 historical frame during training, VoxelFormer still
outperforms BEVFormer significantly. When trained in the same setting,
VoxelFormer surpasses BEVFormer by 4.9% NDS. Code is available at:
https://github.com/Lizhuoling/VoxelFormer-public.git
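The contrast the abstract draws, deformable sampling that reads only a few camera features versus attention that aggregates all of them, can be illustrated with a toy dense formulation in which the logits depend on both the BEV query and the camera-view features. Shapes, projections, and names below are assumptions, not VoxelFormer's actual implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dual_view_attention(bev_q, cam_feats, Wb, Wc):
    """Toy dual-view attention: logits come from BOTH the BEV queries and
    the camera-view features, and every camera feature contributes to
    each BEV cell (no sparse deformable sampling)."""
    logits = (bev_q @ Wb) @ (cam_feats @ Wc).T   # (n_bev, n_cam)
    attn = softmax(logits, axis=-1)              # rows sum to 1
    return attn @ cam_feats                      # dense aggregation

rng = np.random.default_rng(0)
d = 16
bev_q = rng.normal(size=(4, d))        # 4 BEV cells (queries)
cam_feats = rng.normal(size=(32, d))   # 32 camera-view features
Wb = rng.normal(size=(d, d))
Wc = rng.normal(size=(d, d))
out = dual_view_attention(bev_q, cam_feats, Wb, Wc)   # (4, 16)
```

Because the weights are produced on both sides, a lidar-derived occupancy target could in principle supervise them, which is exactly the option the abstract says BEV-only weight generation precludes.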
3D Face Arbitrary Style Transfer
Style transfer of 3D faces has attracted increasing attention. However,
previous methods mainly use images of artistic faces for style transfer while
ignoring arbitrary style images such as abstract paintings. To solve this
problem, we propose a novel method, namely Face-guided Dual Style Transfer
(FDST). To begin with, FDST employs a 3D decoupling module to separate facial
geometry and texture. Then we propose a style fusion strategy for facial
geometry. Subsequently, we design an optimization-based DDSG mechanism for
textures that can guide the style transfer by two style images. Besides the
normal style image input, DDSG can utilize the original face input as another
style input as the face prior. By this means, high-quality face arbitrary style
transfer results can be obtained. Furthermore, FDST can be applied in many
downstream tasks, including region-controllable style transfer, high-fidelity
face texture reconstruction, large-pose face reconstruction, and artistic face
reconstruction. Comprehensive quantitative and qualitative results show that
our method can achieve comparable performance. All source codes and pre-trained
weights will be released to the public.
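The DDSG idea of steering one optimization with two style inputs, the artistic style image plus the original face as a prior, can be sketched as a weighted two-term objective over Gram-matrix style statistics. This is a hedged illustration using the classic Gram-matrix style loss; the actual DDSG mechanism and any network features it uses are not specified in the abstract.

```python
import numpy as np

def gram(F):
    """Gram matrix of a (channels, pixels) feature map - a standard
    proxy for texture/style statistics."""
    return F @ F.T / F.shape[1]

def dual_style_loss(F_out, F_style, F_face, alpha=0.5):
    """Toy dual-style-guidance objective: the optimized texture is pulled
    toward the style image's statistics AND toward the original face
    (used as a second 'style' input acting as a face prior)."""
    g = gram(F_out)
    l_style = np.mean((g - gram(F_style)) ** 2)
    l_face = np.mean((g - gram(F_face)) ** 2)
    return alpha * l_style + (1 - alpha) * l_face
```

Sweeping `alpha` toward 1 would favor the abstract painting's statistics; toward 0 it would preserve the face prior, which matches the region-controllable behavior the abstract describes.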