168 research outputs found

    An Experimental Analysis of Deep Learning Architectures for Supervised Speech Enhancement

    Recent speech enhancement research has shown that deep learning techniques are very effective in removing background noise. Many deep neural networks have been proposed, showing promising results for improving overall speech perception. The Deep Multilayer Perceptron, Convolutional Neural Networks, and the Denoising Autoencoder are well-established architectures for speech enhancement; however, choosing between different deep learning models has been mainly empirical. Consequently, a comparative analysis is needed between these three architecture types to identify the factors affecting their performance. In this paper, this analysis is presented by comparing seven deep learning models that belong to these three categories. The comparison includes evaluating the performance in terms of the overall quality of the output speech using five objective evaluation metrics and a subjective evaluation with 23 listeners; the ability to deal with challenging noise conditions; generalization ability; complexity; and processing time. Further analysis is then provided using two different approaches. The first investigates how performance is affected by changing network hyperparameters and the structure of the data, including the Lombard effect; the second interprets the results by visualizing the spectrogram of the output layer of all the investigated models, and the spectrograms of the hidden layers of the convolutional neural network architecture. Finally, a general evaluation of supervised deep learning-based speech enhancement is performed using SWOC analysis, to discuss the technique's Strengths, Weaknesses, Opportunities, and Challenges. The results of this paper contribute to the understanding of how different deep neural networks perform the speech enhancement task, highlight the strengths and weaknesses of each architecture, and provide recommendations for achieving better performance. This work facilitates the development of better deep neural networks for speech enhancement in the future.
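    A minimal sketch of one of the objective measures the abstract mentions, the segmental SNR (SSNR) increase, assuming the common frame-wise formulation with per-frame clipping to [-10, 35] dB (the paper's exact implementation details are not given here):

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, eps=1e-10):
    """Frame-wise (segmental) SNR in dB, averaged over frames.

    Per-frame values are clipped to [-10, 35] dB before averaging,
    a common convention for this measure.
    """
    n_frames = min(len(clean), len(enhanced)) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = enhanced[i * frame_len:(i + 1) * frame_len]
        noise_energy = np.sum((s - e) ** 2) + eps
        snr = 10 * np.log10(np.sum(s ** 2) / noise_energy + eps)
        snrs.append(np.clip(snr, -10.0, 35.0))
    return float(np.mean(snrs))

# Hypothetical signals: an enhancer is judged by how much it raises SSNR.
rng = np.random.default_rng(0)
clean = rng.standard_normal(4096)
noisy = clean + 0.3 * rng.standard_normal(4096)      # degraded input
enhanced = clean + 0.1 * rng.standard_normal(4096)   # hypothetical output
ssnr_increase = segmental_snr(clean, enhanced) - segmental_snr(clean, noisy)
```

The "SSNR increase" reported in such comparisons is exactly this difference between the enhanced and noisy scores.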

    A Mixed Reality Approach for dealing with the Video Fatigue of Online Meetings

    Much of the issue with video meetings is the lack of naturalistic cues, together with the feeling of being observed all the time. Video calls take away most body language cues, but because the person is still visible, the brain still tries to compute that non-verbal language. This means participants are working harder, trying to achieve the impossible, which impacts information retention and can leave participants feeling unnecessarily tired. This project aims to transform the way online meetings happen by turning off the camera and simplifying the information our brains need to process, thus preventing 'Zoom fatigue'. The immersive solution we are developing, iVXR, combines cutting-edge augmented reality technology, natural language processing, speech-to-text technologies, and sub-real-time hardware acceleration using high-performance computing.

    Mapping and Masking Targets Comparison using Different Deep Learning based Speech Enhancement Architectures

    Mapping and masking targets are both widely used in recent Deep Neural Network (DNN) based supervised speech enhancement. Masking targets have been shown to have a positive impact on the intelligibility of the output speech, while mapping targets have been found, in other studies, to generate speech with better quality. However, most studies compare the two approaches using the Multilayer Perceptron (MLP) architecture only. With the emergence of new architectures that outperform the MLP, a more general comparison between mapping and masking approaches is needed. In this paper, a complete comparison is conducted between mapping and masking targets using four different DNN-based speech enhancement architectures, to determine how the performance of the networks changes with the chosen training target. The results show that there is no single best training target across all the speech quality evaluation metrics, and that there is a trade-off between the denoising process and the intelligibility of the output speech. Furthermore, the generalization ability of the networks was evaluated, and it is concluded that the design of the architecture restricts the choice of the training target, because masking targets result in significant performance degradation for the deep convolutional autoencoder architecture.
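    A toy illustration of the two target types compared above, assuming magnitude spectra and additive noise (this is a generic formulation, not the paper's exact targets): a mapping network regresses the clean spectrum directly, while a masking network predicts a per-bin gain such as the Ideal Ratio Mask (IRM) that is then applied to the noisy input.

```python
import numpy as np

rng = np.random.default_rng(1)
clean_mag = np.abs(rng.standard_normal((257, 100)))   # |S(f, t)|, hypothetical
noise_mag = np.abs(rng.standard_normal((257, 100)))   # |N(f, t)|, hypothetical
noisy_mag = clean_mag + noise_mag                     # additivity assumed

# Mapping target: the network's desired output is the clean spectrum itself.
mapping_target = clean_mag

# Masking target: the Ideal Ratio Mask, a per-bin gain bounded in [0, 1].
irm = np.sqrt(clean_mag**2 / (clean_mag**2 + noise_mag**2 + 1e-12))

# At inference, a predicted mask is applied to the noisy input:
enhanced_mag = irm * noisy_mag
```

Because the mask is bounded, masking targets constrain the output to a scaled version of the input spectrum, whereas mapping targets let the network synthesize arbitrary spectra; that difference is what makes the choice interact with the architecture's design.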

    A Comparative Study of Time and Frequency Domain Approaches to Deep Learning based Speech Enhancement

    Deep learning has recently made a breakthrough in speech enhancement. Some architectures are based on a time-domain representation, while others operate in the frequency domain; however, a comparison of different networks working in the time and frequency domains has not been reported in the literature. In this paper, this comparison between time- and frequency-domain learning for five Deep Neural Network (DNN) based speech enhancement architectures is presented. The comparison covers the evaluation of the output speech using four objective evaluation metrics: PESQ, STOI, LSD, and SSNR increase. Furthermore, the complexity of the five networks was investigated by comparing the number of parameters and the processing time for each architecture. Finally, some of the factors that affect learning in the time and frequency domains are discussed. The primary results of this paper show that fully connected architectures generate speech with low overall perception when learning in the time domain. On the other hand, convolutional designs give acceptable performance in both the frequency and time domains; however, time-domain implementations show inferior generalization ability. Frequency-domain learning was shown to outperform time-domain learning when the complex spectrogram is used in the training process. Additionally, feature extraction also proves very effective in DNN-based supervised speech enhancement, whether it is performed at the beginning or implicitly through bottleneck-layer features. Finally, it is concluded that the choice of working domain is mainly restricted by the type and design of the architecture used.
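    The distinction between the two working domains can be made concrete with a minimal STFT front end (a generic sketch, not the paper's configuration): a time-domain network consumes the raw waveform, while a frequency-domain network consumes frames of the (magnitude) spectrogram.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=128):
    """Magnitude spectrogram via a Hann-windowed STFT (numpy only)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    # rfft keeps the n_fft//2 + 1 non-redundant bins of a real signal
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

rng = np.random.default_rng(2)
waveform = rng.standard_normal(16000)   # 1 s at 16 kHz: time-domain input
spec = stft_mag(waveform)               # (frames, bins): frequency-domain input
```

Discarding the phase (as the magnitude here does) is one reason complex-spectrogram training can behave differently from magnitude-only training.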

    Privacy preserving encrypted phonetic search of speech data

    This paper presents a strategy for enabling speech recognition to be performed in the cloud while preserving the privacy of users. The approach advocates a demarcation of responsibilities between the client- and server-side components for performing the speech recognition task. On the client side resides the acoustic model, which symbolically encodes the audio and encrypts the data before uploading it to the server. The server side then employs searchable encryption to enable phonetic search of the speech content. Some preliminary results for speech encoding and searchable encryption are presented.
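    A toy sketch of the client/server demarcation described above, using deterministic keyed-hash tokens as a stand-in for searchable encryption (an assumption for illustration; the paper's actual scheme and phoneme inventory are not reproduced here): the client maps audio to phoneme symbols and uploads only tokens, and the server matches token sequences without learning the phonemes.

```python
import hmac
import hashlib

CLIENT_KEY = b"client-secret"  # hypothetical key; never leaves the client

def token(phoneme: str) -> str:
    """Deterministic keyed token for one phoneme symbol."""
    return hmac.new(CLIENT_KEY, phoneme.encode(), hashlib.sha256).hexdigest()

# Client side: acoustic model output (a phoneme sequence), encoded and uploaded.
utterance = ["HH", "AH", "L", "OW"]          # hypothetical output for "hello"
uploaded = [token(p) for p in utterance]

# Server side: phonetic search over tokens only; the query is tokenized
# by the client with the same key before being sent.
query = [token(p) for p in ["L", "OW"]]
hits = [i for i in range(len(uploaded) - len(query) + 1)
        if uploaded[i:i + len(query)] == query]
```

Deterministic tokens leak match patterns, which is why real searchable-encryption constructions are stronger; the sketch only shows where the trust boundary sits.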

    A Framework for Augmented Reality Based Shared Experiences

    Meetings occupy 40% of the average working day. According to the Wall Street Journal, CEOs spend 18 hours, civil servants spend 22 hours, and the average office worker spends 16 hours per week in meetings. Meetings are where information is shared, discussions take place, and the most important decisions are made. The outcome of meetings should be clearly understood actions, but this is rarely the case, as comprehensive meeting minutes and action points are often not captured. Meetings become ineffective, time is wasted, and travel becomes the biggest obstacle and cost (both monetary and environmental). Video conferencing technology has been developed to provide a low-cost alternative to expensive, time-consuming meetings. However, the video conferencing user experience lacks naturalness, and this inhibits effective communication between the participants. The Augmented Reality (AR) shared-experience application proposed in this work will be the next form of video conferencing.

    A Roadmap for Privacy Preserving Speech Processing

    This paper presents an overview of a strategy for enabling speech recognition to be performed in the cloud while preserving the privacy of users. The strategy advocates a demarcation of responsibilities between the client- and server-side components for performing the speech recognition task. On the client side resides the acoustic model, which symbolically encodes the audio and encrypts the data before uploading it to the server. The server side then employs searchable encryption-based language modelling to perform the speech recognition task. The paper details the proposed client-side acoustic model components and the proposed server-side searchable encryption, which will form the basis of the language modelling. Some preliminary results are presented, and potential problems and their solutions regarding the encrypted communication between client and server are discussed. Preliminary benchmarking results for acceleration of the client and server operations with GPGPU computing are also presented.

    Establishing the safety of waterbirth for mothers and babies: a cohort study with nested qualitative component: the protocol for the POOL study

    Introduction Approximately 60 000 (9/100) infants are born into water annually in the UK, and this is likely to increase. Case reports have identified infants with water inhalation or sepsis following birth in water, and there is a concern that women giving birth in water may sustain more complex perineal trauma. No studies have been large enough to show whether waterbirth increases these poor outcomes. The POOL Study (ISRCTN13315580) aims to answer the question of the safety of waterbirth among women who are classified as appropriate for midwifery-led intrapartum care. Methods and analysis A cohort study with a nested qualitative component. Objectives will be addressed using retrospective and prospective data captured in electronic National Health Service (NHS) maternity and neonatal systems. The qualitative component aims to explore factors influencing pool use and waterbirth; data will be gathered via discussion groups, interviews and case studies of maternity units. Ethics and dissemination The protocol has been approved by the NHS Wales Research Ethics Committee (18/WA/0291), and the transfer of identifiable data has been approved by the Health Research Authority Confidentiality Advisory Group (18CAG0153). Study findings and innovative methodology will be disseminated through peer-reviewed journals, conferences and events. Results will be of interest to the general public and to clinical and policy stakeholders in the UK, and will be disseminated accordingly.

    Oral steroids for hearing loss associated with otitis media with effusion in children aged 2–8 years: the OSTRICH RCT

    Background Children with hearing loss associated with otitis media with effusion (OME) are commonly managed through surgical intervention, hearing aids or watchful waiting. A safe, inexpensive, effective medical treatment would enhance treatment options. Small, poorly conducted trials have found a short-term benefit from oral steroids. Objective To determine the clinical effectiveness and cost-effectiveness of a 7-day course of oral steroids in improving hearing at 5 weeks in children with persistent OME symptoms and current bilateral OME and hearing loss demonstrated by audiometry. Design Double-blind, individually randomised, placebo-controlled trial. Setting Ear, nose and throat outpatient or paediatric audiology and audiovestibular medicine clinics in Wales and England. Participants Children aged 2–8 years, with symptoms of hearing loss attributable to OME for at least 3 months, a diagnosis of bilateral OME made on the day of recruitment and audiometry-confirmed hearing loss. Interventions A 7-day course of oral soluble prednisolone, as a single daily dose of 20 mg for children aged 2–5 years or 30 mg for 6- to 8-year-olds, or matched placebo. Main outcome measures Acceptable hearing at 5 weeks from randomisation. Secondary outcomes comprised acceptable hearing at 6 and 12 months, tympanometry, otoscopic findings, health-care consultations related to OME and other resource use, the proportion of children who had ventilation tube (grommet) surgery at 6 and 12 months, adverse effects, symptoms, functional health status, health-related quality of life, and short- and longer-term cost-effectiveness. Results A total of 389 children were randomised. Satisfactory hearing at 5 weeks was achieved by 39.9% and 32.8% in the oral steroid and placebo groups, respectively (absolute difference of 7.1%, 95% confidence interval –2.8% to 16.8%; number needed to treat = 14). This difference was not statistically significant. The secondary outcomes were consistent with a small benefit or none, and we found no subgroup that achieved a meaningful benefit from oral steroids. The economic analysis showed that treatment with oral steroids was more expensive and accrued fewer quality-adjusted life-years than treatment as usual. However, the differences were small and not statistically significant, and the sensitivity analyses demonstrated large variation in the results. Conclusions OME in children with documented hearing loss and attributable symptoms for at least 3 months has a high rate of spontaneous resolution. Discussions about watchful waiting and other interventions will be enhanced by this evidence. The findings of this study suggest that any benefit from a short course of oral steroids for OME is likely to be small and of questionable clinical significance, and that the treatment is unlikely to be cost-effective; therefore, its use cannot be recommended.
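    The reported number needed to treat follows directly from the absolute risk reduction between the two arms:

```latex
\mathrm{NNT} = \frac{1}{\mathrm{ARR}} = \frac{1}{0.399 - 0.328} = \frac{1}{0.071} \approx 14
```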