Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

Authors: Geiger Jürgen; Jin Wenyu; Mousa Amr El-Desoky; Pohjalainen Jouni; Schuller Björn; Zhang Zixing
Publication date: 01/01/2018

Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.

arXiv.org e-Print Archive

OPUS Augsburg

Design of a Virtual Assistant to Improve Interaction Between the Audience and the Presenter

Authors: Cobos-Guzman S.; De Miguel L.; Nuere S.
Publication venue: 'Universidad Internacional de La Rioja'
Publication date: 09/05/2022

This article presents a novel design of a Virtual Assistant as part of a human-machine interaction system to improve communication between the presenter and the audience in educational or general presentations (e.g., auditoriums with 200 people). The main goal of the proposed model is to design an interaction framework that increases the audience's attention to key aspects of the presentation. In this manner, collaboration between the presenter and the Virtual Assistant could improve the level of learning among the audience. The design of the Virtual Assistant relies on non-anthropomorphic forms with ‘live’ characteristics, generating an intuitive and self-explanatory interface. A set of intuitive and useful virtual interactions to support the presenter was designed. This design was validated with various types of audience through a psychological study based on a discrete emotions questionnaire, confirming the adequacy of the proposed solution. The human-machine interaction system supporting the Virtual Assistant should automatically recognize the attention level of the audience from audiovisual resources and synchronize the Virtual Assistant with the presentation. The system involves a complex artificial intelligence architecture embracing perception of high-level features from audio and video, knowledge representation, and reasoning for pervasive and affective computing, as well as reinforcement learning to teach the intelligent agent to decide on the best strategy to increase the level of attention of the audience.

Re-UNIR

On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement

Authors: Jensen Jesper; Kolbæk Morten; Tan Zheng-Hua
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 04/12/2018

The majority of deep neural network (DNN) based speech enhancement algorithms rely on the mean-square error (MSE) criterion of short-time spectral amplitudes (STSA), which has no apparent link to human perception, e.g. speech intelligibility. Short-Time Objective Intelligibility (STOI), a popular state-of-the-art speech intelligibility estimator, on the other hand, relies on linear correlation of speech temporal envelopes. This raises the question of whether a DNN training criterion based on envelope linear correlation (ELC) can lead to improved speech intelligibility performance of DNN based speech enhancement algorithms compared to algorithms based on the STSA-MSE criterion. In this paper we derive that, under certain general conditions, the STSA-MSE and ELC criteria are practically equivalent, and we provide empirical data to support our theoretical results. Furthermore, our experimental findings suggest that the standard STSA minimum-MSE estimator is near optimal if the objective is to enhance noisy speech in a manner that is optimal with respect to the STOI speech intelligibility estimator.
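To make the two criteria named in this abstract concrete, here is a minimal sketch, assuming each signal is reduced to a short list of non-negative amplitude values for one frequency band. The function names `stsa_mse` and `envelope_linear_correlation`, and the toy envelopes, are illustrative assumptions and are not taken from the paper. The sketch only shows what each quantity measures; it does not reproduce the conditions under which the paper derives their practical equivalence.

```python
import math


def stsa_mse(clean, enhanced):
    """Mean-square error between two equal-length amplitude sequences."""
    if len(clean) != len(enhanced) or not clean:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((c - e) ** 2 for c, e in zip(clean, enhanced)) / len(clean)


def envelope_linear_correlation(clean, enhanced):
    """Pearson (linear) correlation between two equal-length envelopes."""
    if len(clean) != len(enhanced) or not clean:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(clean)
    mean_c = sum(clean) / n
    mean_e = sum(enhanced) / n
    # Centre both envelopes so the correlation ignores constant offsets.
    dc = [c - mean_c for c in clean]
    de = [e - mean_e for e in enhanced]
    num = sum(x * y for x, y in zip(dc, de))
    den = math.sqrt(sum(x * x for x in dc) * sum(y * y for y in de))
    # A constant envelope has no defined correlation; return 0.0 by convention.
    return num / den if den > 0.0 else 0.0


# Hypothetical toy envelopes (illustrative values, not data from the paper).
clean_env = [0.2, 0.8, 1.0, 0.5, 0.1]
scaled_env = [2.0 * a for a in clean_env]

print("MSE:", stsa_mse(clean_env, scaled_env))
print("ELC:", envelope_linear_correlation(clean_env, scaled_env))
```

In this toy case the scaled envelope is perfectly correlated with the clean one while its MSE is non-zero, which illustrates why the relationship between the two criteria is not obvious without further analysis.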

arXiv.org e-Print Archive

VBN

Deep neural network techniques for monaural speech enhancement: state of the art analysis

Author: Ochieng Peter
Publication date: 20/06/2023

Deep neural network (DNN) techniques have become pervasive in domains such as natural language processing and computer vision, where they have achieved great success in tasks such as machine translation and image generation. Due to this success, these data-driven techniques have been applied in the audio domain. More specifically, DNN models have been applied in the speech enhancement domain to achieve denoising, dereverberation and multi-speaker separation in monaural speech enhancement. In this paper, we review some dominant DNN techniques being employed to achieve speech separation. The review looks at the whole pipeline of speech enhancement: feature extraction, how DNN-based tools model both global and local features of speech, and model training (supervised and unsupervised). We also review the use of pre-trained speech-enhancement models to boost the speech enhancement process. The review is geared towards covering the dominant trends with regard to DNN application in speech enhancement in speech obtained via a single speaker.

arXiv.org e-Print Archive

DeepWiVe: deep-learning-aided wireless video transmission

Authors: Gunduz D; Tung T-Y
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 25/11/2021

We present DeepWiVe, the first-ever end-to-end joint source-channel coding (JSCC) video transmission scheme that leverages the power of deep neural networks (DNNs) to directly map video signals to channel symbols, combining video compression, channel coding, and modulation steps into a single neural transform. Our DNN decoder predicts residuals without distortion feedback, which improves the video quality by accounting for occlusion/disocclusion and camera movements. We simultaneously train different bandwidth allocation networks for the frames to allow variable bandwidth transmission. Then, we train a bandwidth allocation network using reinforcement learning (RL) that optimizes the allocation of limited available channel bandwidth among video frames to maximize the overall visual quality. Our results show that DeepWiVe can overcome the cliff-effect, which is prevalent in conventional separation-based digital communication schemes, and achieve graceful degradation with the mismatch between the estimated and actual channel qualities. DeepWiVe outperforms H.264 video compression followed by low-density parity check (LDPC) codes in all channel conditions by up to 0.0485 in terms of the multi-scale structural similarity index measure (MS-SSIM), and H.265 + LDPC by up to 0.0069 on average. We also illustrate the importance of optimizing bandwidth allocation in JSCC video transmission by showing that our optimal bandwidth allocation policy is superior to uniform allocation as well as a heuristic policy benchmark.

arXiv.org e-Print Archive

Spiral - Imperial College Digital Repository

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia