327 research outputs found

    Speech Enhancement with Multi-granularity Vector Quantization

    With advances in deep learning, neural network based speech enhancement (SE) has developed rapidly over the last decade. Meanwhile, self-supervised pre-trained models and vector quantization (VQ) have achieved excellent performance on many speech-related tasks, yet they remain less explored for SE. Since our previous work showed that using a VQ module to discretize noisy speech representations is beneficial for speech denoising, in this work we study the impact of applying VQ at different layers with different numbers of codebooks. The different VQ modules thereby extract speech features at multiple granularities. Through an attention mechanism, the contextual features extracted by a pre-trained model are fused with the local features extracted by the encoder, so that both global and local information are preserved to reconstruct the enhanced speech. Experimental results on the Valentini dataset show that the proposed model improves SE performance; the impact of the choice of pre-trained model is also examined.
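    As a rough illustration of the two ingredients the abstract mentions, the sketch below (not the authors' code; all layer sizes and module choices are assumptions) shows a nearest-neighbour vector-quantization layer and an attention step that fuses contextual features from a pre-trained model with local encoder features.

    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        """Nearest-neighbour codebook lookup with a straight-through estimator."""
        def __init__(self, num_codes: int = 256, dim: int = 128):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, z):                      # z: (batch, time, dim)
            flat = z.reshape(-1, z.shape[-1])
            # Squared distance to every codeword, then pick the closest one.
            dist = (flat.pow(2).sum(1, keepdim=True)
                    - 2 * flat @ self.codebook.weight.t()
                    + self.codebook.weight.pow(2).sum(1))
            q = self.codebook(dist.argmin(dim=1)).view_as(z)
            # Straight-through: gradients flow to z as if quantization were identity.
            return z + (q - z).detach()

    class AttentionFusion(nn.Module):
        """Local encoder features attend to contextual (pre-trained) features."""
        def __init__(self, dim: int = 128, heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, local_feats, contextual_feats):
            fused, _ = self.attn(query=local_feats,
                                 key=contextual_feats,
                                 value=contextual_feats)
            # Residual connection keeps the fine-grained local information.
            return local_feats + fused

    # Toy usage with matching feature dimensions (illustrative only).
    local = torch.randn(2, 100, 128)            # from the SE encoder
    context = torch.randn(2, 100, 128)          # from a pre-trained speech model
    quantized = VectorQuantizer()(context)      # several such modules at different
    fused = AttentionFusion()(local, quantized) # layers give multiple granularities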

    DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding

    Recent research has demonstrated impressive results in video-to-speech synthesis, which reconstructs speech solely from visual input. However, previous works have struggled to synthesize speech accurately because the model lacks sufficient guidance to infer the correct content with the appropriate sound. To resolve this issue, they adopt an extra speaker embedding, derived from reference audio, as speaking-style guidance. Nevertheless, it is not always possible to obtain audio for the corresponding video input, especially at inference time. In this paper, we present a novel vision-guided speaker embedding extractor that uses a self-supervised pre-trained model and prompt tuning. In this way, rich speaker embedding information can be produced from the input visual information alone, and no extra audio is needed at inference time. Using the extracted vision-guided speaker embeddings, we further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video. The proposed DiffV2S not only retains the phoneme details contained in the input video frames, but also produces a highly intelligible mel-spectrogram in which the identities of multiple speakers are preserved. Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques. Comment: ICCV 202
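    To make the conditioning idea concrete, here is a minimal, generic diffusion-style sketch (not the DiffV2S implementation; every dimension, network, and noise schedule below is an assumption) in which a denoiser predicts the noise added to a mel-spectrogram while being conditioned on frame-level visual features and a per-utterance speaker embedding.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConditionalDenoiser(nn.Module):
        def __init__(self, mel_bins=80, vis_dim=512, spk_dim=256, hidden=512):
            super().__init__()
            self.cond_proj = nn.Linear(vis_dim + spk_dim, hidden)
            self.time_emb = nn.Embedding(1000, hidden)       # diffusion timesteps
            self.net = nn.Sequential(
                nn.Linear(mel_bins + hidden, hidden), nn.SiLU(),
                nn.Linear(hidden, mel_bins),
            )

        def forward(self, noisy_mel, t, visual_feats, speaker_emb):
            # noisy_mel: (B, T, mel_bins); visual_feats: (B, T, vis_dim);
            # speaker_emb: (B, spk_dim), broadcast over time.
            spk = speaker_emb.unsqueeze(1).expand(-1, visual_feats.size(1), -1)
            cond = self.cond_proj(torch.cat([visual_feats, spk], dim=-1))
            cond = cond + self.time_emb(t).unsqueeze(1)
            return self.net(torch.cat([noisy_mel, cond], dim=-1))  # predicted noise

    # One simplified DDPM-style training step: corrupt the target mel at a random
    # timestep and regress the injected noise from the conditioned denoiser.
    model = ConditionalDenoiser()
    mel = torch.randn(2, 120, 80)
    vis = torch.randn(2, 120, 512)
    spk = torch.randn(2, 256)
    t = torch.randint(0, 1000, (2,))
    alpha_bar = torch.linspace(0.9999, 0.02, 1000)[t].view(-1, 1, 1)  # toy schedule
    noise = torch.randn_like(mel)
    noisy = alpha_bar.sqrt() * mel + (1 - alpha_bar).sqrt() * noise
    loss = F.mse_loss(model(noisy, t, vis, spk), noise)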

    A 16-Channel Neural Recording System-on-Chip With CHT Feature Extraction Processor in 65-nm CMOS

    Next-generation invasive neural interfaces require fully implantable wireless systems that can record from a large number of channels simultaneously. However, transferring the recorded data from the implant to an external receiver is a significant challenge due to the high throughput. To address this challenge, this article presents a neural recording system-on-chip that achieves high resource and wireless-bandwidth efficiency by employing on-chip feature extraction. An energy- and area-efficient 10-bit, 20-kS/s front end amplifies and digitizes the neural signals within the local field potential (LFP) and action potential (AP) bands. The raw data from each channel are decomposed into spectral features by a compressed Hadamard transform (CHT) processor. The features to be computed are selected by a machine learning algorithm such that the overall data rate is reduced by 80% without compromising classification performance. Moreover, the CHT feature extractor allows waveform reconstruction on the receiver side for monitoring or additional post-processing. The proposed approach was validated through in vivo and offline experiments. The prototype, fabricated in 65-nm CMOS, also includes wireless power and data receiver blocks to demonstrate the energy and area efficiency of the complete system. The overall signal chain consumes 2.6 μW and occupies 0.021 mm² per channel, pointing toward its feasibility for 1000-channel single-die neural recording systems.
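    The compression idea can be sketched at the software level (NumPy, not silicon; the kept-coefficient set below is an arbitrary placeholder for the machine-learning-selected features): project each window of samples onto a subset of Hadamard basis vectors, transmit only those coefficients, and reconstruct an approximate waveform on the receiver side.

    import numpy as np

    def hadamard(n: int) -> np.ndarray:
        """Sylvester construction; n must be a power of two."""
        H = np.array([[1]])
        while H.shape[0] < n:
            H = np.kron(H, np.array([[1, 1], [1, -1]]))
        return H

    N = 64                                  # samples per analysis window (assumed)
    H = hadamard(N)
    keep = np.arange(0, N, 5)               # ~20% of coefficients, i.e. ~80% rate
                                            # reduction, as in the abstract

    x = np.random.randn(N)                  # one window of digitized neural data
    features = H[keep] @ x / N              # on-implant: compressed features only

    # Receiver side: zero-fill the missing coefficients and invert (H is orthogonal
    # up to scaling), giving an approximate reconstruction for monitoring.
    coeffs = np.zeros(N)
    coeffs[keep] = features
    x_hat = H.T @ coeffs
    print(f"kept {len(keep)}/{N} coefficients, "
          f"relative error {np.linalg.norm(x - x_hat) / np.linalg.norm(x):.2f}")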

    Music Augmentation and Denoising For Peak-Based Audio Fingerprinting

    Audio fingerprinting is a well-established solution for identifying songs from short recording excerpts. Popular methods rely on the extraction of sparse representations, generally spectral peaks, and have proven to be accurate, fast, and scalable to large collections. However, real-world audio identification often happens in noisy environments, which can cause these systems to fail. In this work, we tackle this problem by introducing and releasing a new audio augmentation pipeline that adds noise to music snippets in a realistic way, stochastically mimicking real-world scenarios. We then propose and release a deep learning model that removes noisy components from spectrograms in order to improve the accuracy of peak-based fingerprinting systems. We show that adding our model improves the identification performance of commonly used audio fingerprinting systems, even under noisy conditions.
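    A small illustrative sketch of the core augmentation step such a pipeline relies on (not the released pipeline; the SNR range and helper name are assumptions): mix a background-noise recording into a clean music snippet at a randomly drawn signal-to-noise ratio.

    import numpy as np

    def mix_at_random_snr(music, noise, snr_db_range=(0.0, 15.0), rng=None):
        """Return music with noise added at a random SNR drawn from snr_db_range."""
        rng = rng or np.random.default_rng()
        snr_db = rng.uniform(*snr_db_range)

        # Loop/crop the noise so it covers the whole snippet.
        reps = int(np.ceil(len(music) / len(noise)))
        noise = np.tile(noise, reps)[: len(music)]

        # Scale the noise so the mixture hits the sampled SNR.
        music_power = np.mean(music ** 2) + 1e-12
        noise_power = np.mean(noise ** 2) + 1e-12
        gain = np.sqrt(music_power / (noise_power * 10 ** (snr_db / 10)))
        mixture = music + gain * noise

        # Avoid clipping when writing back to fixed-point audio formats.
        peak = np.max(np.abs(mixture))
        return mixture / peak if peak > 1.0 else mixture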

    Low-Complexity Audio Embedding Extractors

    Solving tasks such as speaker recognition, music classification, or semantic audio event tagging with deep learning models typically requires computationally demanding networks. General-purpose audio embeddings (GPAEs) are dense representations of audio signals that allow lightweight, shallow classifiers to tackle various audio tasks. The idea is that a single complex feature extractor produces dense GPAEs, while shallow MLPs produce task-specific predictions. If the extracted representations are general enough for simple downstream classifiers to generalize across a variety of audio tasks, a single costly forward pass suffices to solve multiple tasks in parallel. In this work, we aim to reduce the cost of GPAE extractors to make them suitable for resource-constrained devices. We use efficient MobileNets, trained on AudioSet using knowledge distillation from a Transformer ensemble, as efficient GPAE extractors. We explore how to obtain high-quality GPAEs from the model, study how model complexity relates to the quality of the extracted GPAEs, and conclude that low-complexity models can generate competitive GPAEs, paving the way for analyzing audio streams on edge devices with respect to multiple audio classification and recognition tasks. Comment: In Proceedings of the 31st European Signal Processing Conference, EUSIPCO 2023. Source code available at: https://github.com/fschmid56/EfficientAT_HEA
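    The usage pattern the abstract describes can be sketched as follows (placeholder networks and sizes, not the released MobileNet checkpoints): one frozen embedding extractor runs a single forward pass per clip, and several shallow MLP heads reuse that embedding for different downstream tasks.

    import torch
    import torch.nn as nn

    embed_dim = 768                           # illustrative embedding size

    # Stand-in for an efficient pre-trained GPAE extractor (frozen at inference).
    extractor = nn.Sequential(
        nn.Conv1d(1, 32, kernel_size=9, stride=4), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, embed_dim),
    ).eval()

    # Shallow task-specific heads sharing the same embedding.
    speaker_head = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, 50))
    tagging_head = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, 10))

    waveform = torch.randn(4, 1, 16000)       # a batch of 1-second audio clips
    with torch.no_grad():
        embedding = extractor(waveform)       # the single costly forward pass
    speaker_logits = speaker_head(embedding)  # cheap per-task predictions
    tag_logits = tagging_head(embedding)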