Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e., audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
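As a concrete illustration of the log-mel feature representation discussed above, the following sketch computes log-mel spectra with librosa. It is not taken from the article; the file name, sampling rate, and filterbank parameters are placeholder assumptions.

import numpy as np
import librosa

# Load a waveform (the file name and 16 kHz sampling rate are placeholders).
waveform, sr = librosa.load("example.wav", sr=16000)

# Mel spectrogram: short-time Fourier transform followed by a mel filterbank.
# 25 ms windows / 10 ms hops and 64 mel bands are common choices, not values from the article.
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=400, hop_length=160, n_mels=64)

# Log compression yields the log-mel spectrogram typically fed to CNNs or LSTMs.
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, n_frames)

Raw-waveform models skip this feature extraction step and consume the waveform samples directly.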
Learning Enriched Features for Real Image Restoration and Enhancement
With the goal of recovering high-quality image content from its degraded
version, image restoration enjoys numerous applications, such as in
surveillance, computational photography, medical imaging, and remote sensing.
Recently, convolutional neural networks (CNNs) have achieved dramatic
improvements over conventional approaches for image restoration tasks. Existing
CNN-based methods typically operate either on full-resolution or on
progressively low-resolution representations. In the former case, spatially
precise but contextually less robust results are achieved, while in the latter
case, semantically reliable but spatially less accurate outputs are generated.
In this paper, we present a novel architecture with the collective goals of
maintaining spatially-precise high-resolution representations through the
entire network and receiving strong contextual information from the
low-resolution representations. The core of our approach is a multi-scale
residual block containing several key elements: (a) parallel multi-resolution
convolution streams for extracting multi-scale features, (b) information
exchange across the multi-resolution streams, (c) spatial and channel attention
mechanisms for capturing contextual information, and (d) attention based
multi-scale feature aggregation. In a nutshell, our approach learns an enriched
set of features that combines contextual information from multiple scales,
while simultaneously preserving the high-resolution spatial details. Extensive
experiments on five real image benchmark datasets demonstrate that our method,
named MIRNet, achieves state-of-the-art results for a variety of image
processing tasks, including image denoising, super-resolution, and image
enhancement. The source code and pre-trained models are available at
https://github.com/swz30/MIRNet.
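For readers unfamiliar with the attention mechanisms listed in (c), the following minimal PyTorch sketch shows a squeeze-and-excitation style channel-attention branch. It illustrates the general mechanism only, not the MIRNet implementation; the class name, reduction factor, and layer choices are assumptions.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (illustrative sketch)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global spatial average per channel
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.fc(self.pool(x))  # per-channel weights in [0, 1]
        return x * weights               # reweight the feature maps

# Usage: reweight a batch of feature maps.
features = torch.randn(4, 64, 32, 32)
out = ChannelAttention(64)(features)
print(out.shape)  # torch.Size([4, 64, 32, 32])

A spatial-attention branch works analogously, producing a single-channel map over spatial positions instead of a per-channel weight vector.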
LadleNet: Translating Thermal Infrared Images to Visible Light Images Using A Scalable Two-stage U-Net
The translation of thermal infrared (TIR) images to visible light (VI) images
presents a challenging task with potential applications spanning various
domains such as TIR-VI image registration and fusion. Leveraging supplementary
information derived from TIR image conversions can significantly enhance model
performance and generalization across these applications. However, prevailing
issues within this field include suboptimal image fidelity and limited model
scalability. In this paper, we introduce an algorithm, LadleNet, based on the
U-Net architecture. LadleNet employs a two-stage U-Net concatenation structure,
augmented with skip connections and refined feature aggregation techniques,
resulting in a substantial enhancement in model performance. LadleNet comprises
'Handle' and 'Bowl' modules: the Handle module constructs an abstract semantic
space, while the Bowl module decodes this semantic space to yield the mapped VI
images. The Handle module is extensible, in that its network architecture can be
replaced with semantic segmentation networks to establish more abstract semantic
spaces and further improve model performance. Consequently, we propose LadleNet+, which
replaces LadleNet's Handle module with the pre-trained DeepLabv3+ network,
thereby endowing the model with enhanced semantic space construction
capabilities. The proposed method is evaluated and tested on the KAIST dataset,
accompanied by quantitative and qualitative analyses. Compared to existing
methodologies, our approach achieves state-of-the-art performance in terms of
image clarity and perceptual quality. The source code will be made available at
https://github.com/Ach-1914/LadleNet/tree/main/
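To make the two-stage concatenation structure concrete, here is a minimal PyTorch sketch in which a first small U-Net (standing in for the 'Handle' module) maps a one-channel TIR input to an intermediate semantic representation, and a second small U-Net (the 'Bowl') decodes it into a three-channel VI image. The layer sizes, channel counts, and the TinyUNet/TwoStageTranslator names are illustrative assumptions, not LadleNet's actual architecture.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, the basic U-Net building block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """A minimal single-skip U-Net, used only to illustrate the two-stage idea."""
    def __init__(self, in_ch, out_ch, width=32):
        super().__init__()
        self.enc = conv_block(in_ch, width)
        self.down = nn.MaxPool2d(2)
        self.mid = conv_block(width, width * 2)
        self.up = nn.ConvTranspose2d(width * 2, width, 2, stride=2)
        self.dec = conv_block(width * 2, width)
        self.out = nn.Conv2d(width, out_ch, 1)

    def forward(self, x):
        e = self.enc(x)
        m = self.mid(self.down(e))
        d = self.dec(torch.cat([self.up(m), e], dim=1))  # skip connection
        return self.out(d)

class TwoStageTranslator(nn.Module):
    """First stage ('Handle') builds an intermediate semantic representation from
    the TIR input; second stage ('Bowl') decodes it into a 3-channel VI image."""
    def __init__(self, semantic_ch=16):
        super().__init__()
        self.handle = TinyUNet(in_ch=1, out_ch=semantic_ch)  # TIR -> semantic space
        self.bowl = TinyUNet(in_ch=semantic_ch, out_ch=3)    # semantic space -> VI

    def forward(self, tir):
        return self.bowl(self.handle(tir))

# Usage with a dummy single-channel thermal image.
vi = TwoStageTranslator()(torch.randn(1, 1, 128, 128))
print(vi.shape)  # torch.Size([1, 3, 128, 128])

In this scheme, LadleNet+ would correspond to swapping the first stage for a pre-trained semantic segmentation network such as DeepLabv3+.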