Speech Enhancement with Multi-granularity Vector Quantization
With advances in deep learning, neural network based speech enhancement (SE)
has developed rapidly in the last decade. Meanwhile, self-supervised
pre-trained models and vector quantization (VQ) have achieved excellent
performance on many speech-related tasks, yet they remain relatively unexplored for SE.
Since our previous work showed that discretizing noisy speech representations
with a VQ module is beneficial for speech denoising, in this work
we study the impact of applying VQ at different layers with different
numbers of codebooks. Different VQ modules indeed enable the extraction of
multi-granularity speech features. Via an attention mechanism, the
contextual features extracted by a pre-trained model are fused with the local
features extracted by the encoder, so that both global and local information
are preserved to reconstruct the enhanced speech. Experimental results on the
Valentini dataset show that the proposed model improves SE performance,
and they also reveal the impact of the choice of pre-trained model.
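The central VQ operation, discretizing a continuous representation by snapping each frame to its nearest codebook entry, can be sketched as a nearest-neighbour lookup. This is a generic illustration, not the authors' implementation; the codebook here is hand-picked rather than learned.

```python
import numpy as np

def vector_quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (L2 distance).

    features: (T, D) array of continuous speech representations.
    codebook: (K, D) array of code vectors (learned in a real system).
    Returns the quantized features (T, D) and the code indices (T,).
    """
    # Pairwise squared distances between frames and codes: shape (T, K)
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)          # nearest code per frame
    return codebook[idx], idx

# Toy example: 4 frames of 2-D features, 3 codes
feats = np.array([[0.0, 0.1], [1.0, 0.9], [0.1, 0.0], [2.0, 2.1]])
codes = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
quantized, indices = vector_quantize(feats, codes)  # indices -> [0, 1, 0, 2]
```

Using several such codebooks at different layers (with different sizes K) is what yields features at multiple granularities: a small codebook gives a coarse discretization, a large one a fine-grained one.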
DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding
Recent research has demonstrated impressive results in video-to-speech
synthesis, which involves reconstructing speech solely from visual input.
However, previous works have struggled to synthesize speech accurately due to a
lack of sufficient guidance for the model to infer the correct content with the
appropriate sound. To resolve this issue, they have adopted an extra speaker
embedding as speaking-style guidance, extracted from reference audio.
Nevertheless, it is not always possible to obtain audio corresponding to
the video input, especially at inference time. In this
paper, we present a novel vision-guided speaker embedding extractor built on a
self-supervised pre-trained model and a prompt tuning technique. In doing so,
rich speaker embedding information can be produced solely from the input visual
information, and no extra audio is necessary at inference time.
Using the extracted vision-guided speaker embedding
representations, we further develop a diffusion-based video-to-speech synthesis
model, called DiffV2S, conditioned on those speaker embeddings and the
visual representation extracted from the input video. The proposed DiffV2S not
only maintains the phoneme details contained in the input video frames, but also
creates a highly intelligible mel-spectrogram in which the identities
of multiple speakers are all preserved. Our experimental results show that
DiffV2S achieves state-of-the-art performance compared to previous
video-to-speech synthesis techniques.

Comment: ICCV 202
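The conditional diffusion idea, starting from noise and repeatedly denoising a mel-spectrogram while feeding the speaker embedding and visual features into every step, can be sketched with a DDPM-style reverse loop. Everything here is an assumption for illustration: `denoiser` is a stub standing in for the learned network, and the shapes, step count, and noise schedule are arbitrary, not DiffV2S's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x_t, t, speaker_emb, visual_feat):
    """Stand-in for the learned noise predictor. A real model is a neural
    network conditioned on the speaker embedding and visual features;
    here we return zeros just so the sampling loop below is runnable."""
    return np.zeros_like(x_t)

def sample_mel(speaker_emb, visual_feat, shape=(80, 32), steps=10):
    """DDPM-style reverse process: start from Gaussian noise and iteratively
    denoise, conditioning every step on the two guidance signals."""
    betas = np.linspace(1e-4, 0.02, steps)       # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)               # pure-noise mel-spectrogram
    for t in reversed(range(steps)):
        eps = denoiser(x, t, speaker_emb, visual_feat)
        # Posterior mean of x_{t-1} given the predicted noise eps
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                # add noise except at the last step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

mel = sample_mel(speaker_emb=np.zeros(256), visual_feat=np.zeros((32, 512)))
```

The point of the sketch is the conditioning pattern: because the guidance signals enter the denoiser at every step, the same sampler produces different speaker identities simply by swapping the embedding.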
A 16-Channel Neural Recording System-on-Chip With CHT Feature Extraction Processor in 65-nm CMOS
Next-generation invasive neural interfaces require fully implantable wireless systems that can record from a large number of channels simultaneously. However, transferring the recorded data from the implant to an external receiver is a significant challenge due to the high throughput. To address this challenge, this article presents a neural recording system-on-chip that achieves high resource and wireless bandwidth efficiency by employing on-chip feature extraction. An energy- and area-efficient 10-bit, 20-kS/s front end amplifies and digitizes the neural signals within the local field potential (LFP) and action potential (AP) bands. The raw data from each channel are decomposed into spectral features using a compressed Hadamard transform (CHT) processor. The features to be computed are selected by a machine learning algorithm such that the overall data rate is reduced by 80% without compromising classification performance. Moreover, the CHT feature extractor allows waveform reconstruction on the receiver side for monitoring or additional post-processing. The proposed approach was validated through in vivo and offline experiments. The prototype, fabricated in 65-nm CMOS, also includes wireless power and data receiver blocks to demonstrate the energy and area efficiency of the complete system. The overall signal chain consumes 2.6 μW and occupies 0.021 mm² per channel, pointing toward its feasibility for 1000-channel single-die neural recording systems.
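The compress-then-reconstruct pipeline can be illustrated with a software Walsh-Hadamard transform: transform the signal, keep only a small subset of coefficients, and invert on the receiver side. As a stand-in for the paper's ML-based feature selection (which is trained offline), this sketch simply keeps the largest-magnitude coefficients; keeping n/5 of them matches the reported ~80% rate reduction.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (length must be a power of two)."""
    x = x.astype(float).copy()
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b   # butterfly
        h *= 2
    return x

def compress(signal, keep):
    """Keep `keep` Hadamard coefficients (largest-magnitude here, as a
    stand-in for the paper's ML-driven selection)."""
    coeffs = fwht(signal)
    order = np.argsort(-np.abs(coeffs))[:keep]
    return order, coeffs[order]

def reconstruct(order, values, n):
    """Receiver-side waveform reconstruction from the retained features."""
    coeffs = np.zeros(n)
    coeffs[order] = values
    return fwht(coeffs) / n    # inverse WHT = forward WHT scaled by 1/n

n = 64
sig = np.sin(2 * np.pi * 3 * np.arange(n) / n)
order, vals = compress(sig, keep=n // 5)   # 12 of 64 coefficients ≈ 80% reduction
approx = reconstruct(order, vals, n)       # approximate waveform at the receiver
```

Transmitting only the (index, value) pairs instead of raw samples is what cuts the wireless data rate, while the linearity of the transform is what makes receiver-side reconstruction possible.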
Music Augmentation and Denoising For Peak-Based Audio Fingerprinting
Audio fingerprinting is a well-established solution for song identification
from short recording excerpts. Popular methods rely on the extraction of sparse
representations, generally spectral peaks, and have proven to be accurate,
fast, and scalable to large collections. However, real-world applications of
audio identification often happen in noisy environments, which can cause these
systems to fail. In this work, we tackle this problem by introducing and
releasing a new audio augmentation pipeline that adds noise to music snippets
in a realistic way, by stochastically mimicking real-world scenarios. We then
propose and release a deep learning model that removes noisy components from
spectrograms in order to improve peak-based fingerprinting systems' accuracy.
We show that the addition of our model improves the identification performance
of commonly used audio fingerprinting systems, even under noisy conditions.
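The sparse representation these systems rely on, spectral peaks, is just the set of local maxima of a magnitude spectrogram. A minimal sketch of peak picking (a generic version, not the specific fingerprinters evaluated in the paper; the neighborhood size and threshold are illustrative):

```python
import numpy as np

def spectral_peaks(spec, neighborhood=3, threshold=0.0):
    """Return (freq_bin, time_frame) coordinates of local maxima in a
    magnitude spectrogram -- the sparse 'constellation' that peak-based
    audio fingerprinting systems hash and match against a database."""
    f, t = spec.shape
    pad = neighborhood // 2
    # Pad with -inf so border bins compare against a full neighborhood
    padded = np.pad(spec, pad, mode="constant", constant_values=-np.inf)
    peaks = []
    for i in range(f):
        for j in range(t):
            window = padded[i:i + neighborhood, j:j + neighborhood]
            if spec[i, j] > threshold and spec[i, j] == window.max():
                peaks.append((i, j))
    return peaks

# Toy spectrogram with two clear peaks
spec = np.zeros((8, 8))
spec[2, 3] = 5.0
spec[6, 1] = 4.0
print(spectral_peaks(spec, threshold=1.0))  # -> [(2, 3), (6, 1)]
```

Additive noise creates spurious local maxima and can mask genuine ones, which is exactly why a denoising front end that cleans the spectrogram before peak picking improves identification accuracy.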
Low-Complexity Audio Embedding Extractors
Solving tasks such as speaker recognition, music classification, or semantic
audio event tagging with deep learning models typically requires
computationally demanding networks. General-purpose audio embeddings (GPAEs)
are dense representations of audio signals that allow lightweight, shallow
classifiers to tackle various audio tasks. The idea is that a single complex
feature extractor would extract dense GPAEs, while shallow MLPs can produce
task-specific predictions. If the extracted dense representations are general
enough to allow the simple downstream classifiers to generalize to a variety of
tasks in the audio domain, a single costly forward pass suffices to solve
multiple tasks in parallel. In this work, we try to reduce the cost of GPAE
extractors to make them suitable for resource-constrained devices. We use
efficient MobileNets trained on AudioSet using Knowledge Distillation from a
Transformer ensemble as efficient GPAE extractors. We explore how to obtain
high-quality GPAEs from the model, study how model complexity relates to the
quality of extracted GPAEs, and conclude that low-complexity models can
generate competitive GPAEs, paving the way for analyzing audio streams on edge
devices w.r.t. multiple audio classification and recognition tasks.

Comment: In Proceedings of the 31st European Signal Processing Conference,
EUSIPCO 2023. Source code available at:
https://github.com/fschmid56/EfficientAT_HEA
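The "one costly pass, many cheap heads" pattern can be sketched as follows. Everything here is illustrative: the extractor is a fixed random projection standing in for a trained MobileNet, and the heads are untrained linear classifiers standing in for the shallow MLPs.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 128

# Stand-in for the costly GPAE extractor (e.g. a distilled MobileNet):
# a single fixed projection shared by all downstream tasks.
W_extract = rng.standard_normal((16000, EMB_DIM)) / np.sqrt(16000)

def extract_gpae(audio):
    """The one expensive forward pass producing a dense embedding."""
    return np.tanh(audio @ W_extract)

class ShallowHead:
    """Cheap task-specific classifier operating on the shared embedding."""
    def __init__(self, n_classes):
        self.W = rng.standard_normal((EMB_DIM, n_classes)) * 0.01

    def predict(self, emb):
        return int(np.argmax(emb @ self.W))

audio = rng.standard_normal(16000)     # 1 s of audio at 16 kHz
emb = extract_gpae(audio)              # single costly pass
tagger = ShallowHead(n_classes=10)     # e.g. audio event tagging
speaker = ShallowHead(n_classes=4)     # e.g. speaker recognition
tags, spk = tagger.predict(emb), speaker.predict(emb)
```

Because the embedding is computed once and reused by every head, shrinking the extractor (the point of this work) cuts the dominant cost for all downstream tasks at once.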