2,248 research outputs found
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that stills
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks
Design of a Virtual Assistant to Improve Interaction Between the Audience and the Presenter
This article presents a novel design of a Virtual Assistant as part of a human-machine interaction system to improve communication between the presenter and the audience that can be used in education or general presentations for improving interaction during the presentations (e.g., auditoriums with 200 people). The main goal of the proposed model is the design of a framework of interaction to increase the level of attention of the public in key aspects of the presentation. In this manner, the collaboration between the presenter and Virtual Assistant could improve the level of learning among the public. The design of the Virtual Assistant relies on non-anthropomorphic forms with ‘live’ characteristics generating an intuitive and self-explainable interface. A set of intuitive and useful virtual interactions to support the presenter was designed. This design was validated from various types of the public with a psychological study based on a discrete emotions’ questionnaire confirming the adequacy of the proposed solution. The human-machine interaction system supporting the Virtual Assistant should automatically recognize the attention level of the audience from audiovisual resources and synchronize the Virtual Assistant with the presentation. The system involves a complex artificial intelligence architecture embracing perception of high-level features from audio and video, knowledge representation, and reasoning for pervasive and affective computing and reinforcement learning to teach the intelligent agent to decide on the best strategy to increase the level of attention of the audience
On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement
The majority of deep neural network (DNN) based speech enhancement algorithms
rely on the mean-square error (MSE) criterion of short-time spectral amplitudes
(STSA), which has no apparent link to human perception, e.g. speech
intelligibility. Short-Time Objective Intelligibility (STOI), a popular
state-of-the-art speech intelligibility estimator, on the other hand, relies on
linear correlation of speech temporal envelopes. This raises the question if a
DNN training criterion based on envelope linear correlation (ELC) can lead to
improved speech intelligibility performance of DNN based speech enhancement
algorithms compared to algorithms based on the STSA-MSE criterion. In this
paper we derive that, under certain general conditions, the STSA-MSE and ELC
criteria are practically equivalent, and we provide empirical data to support
our theoretical results. Furthermore, our experimental findings suggest that
the standard STSA minimum-MSE estimator is near optimal, if the objective is to
enhance noisy speech in a manner which is optimal with respect to the STOI
speech intelligibility estimator
Deep neural network techniques for monaural speech enhancement: state of the art analysis
Deep neural networks (DNN) techniques have become pervasive in domains such
as natural language processing and computer vision. They have achieved great
success in these domains in task such as machine translation and image
generation. Due to their success, these data driven techniques have been
applied in audio domain. More specifically, DNN models have been applied in
speech enhancement domain to achieve denosing, dereverberation and
multi-speaker separation in monaural speech enhancement. In this paper, we
review some dominant DNN techniques being employed to achieve speech
separation. The review looks at the whole pipeline of speech enhancement from
feature extraction, how DNN based tools are modelling both global and local
features of speech and model training (supervised and unsupervised). We also
review the use of speech-enhancement pre-trained models to boost speech
enhancement process. The review is geared towards covering the dominant trends
with regards to DNN application in speech enhancement in speech obtained via a
single speaker.Comment: conferenc
DeepWiVe: deep-learning-aided wireless video transmission
We present DeepWiVe , the first-ever end-to-end joint source-channel coding (JSCC) video transmission scheme that leverages the power of deep neural networks (DNNs) to directly map video signals to channel symbols, combining video compression, channel coding, and modulation steps into a single neural transform. Our DNN decoder predicts residuals without distortion feedback, which improves the video quality by accounting for occlusion/disocclusion and camera movements. We simultaneously train different bandwidth allocation networks for the frames to allow variable bandwidth transmission. Then, we train a bandwidth allocation network using reinforcement learning (RL) that optimizes the allocation of limited available channel bandwidth among video frames to maximize the overall visual quality. Our results show that DeepWiVe can overcome the cliff-effect , which is prevalent in conventional separation-based digital communication schemes, and achieve graceful degradation with the mismatch between the estimated and actual channel qualities. DeepWiVe outperforms H.264 video compression followed by low-density parity check (LDPC) codes in all channel conditions by up to 0.0485 in terms of the multi-scale structural similarity index measure (MS-SSIM), and H.265+ LDPC by up to 0.0069 on average. We also illustrate the importance of optimizing bandwidth allocation in JSCC video transmission by showing that our optimal bandwidth allocation policy is superior to uniform allocation as well as a heuristic policy benchmark
- …