478 research outputs found

    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.
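The additive and convolutional degradations discussed in this abstract can be simulated directly, which is how training data for such robust ASR systems is commonly generated. A minimal sketch (toy signals and a hypothetical `degrade` helper, not code from the paper) that convolves clean speech with a room impulse response and adds noise at a target SNR:

```python
import numpy as np

def degrade(speech, rir, noise, snr_db):
    """Apply convolutional (reverberation) and additive (noise) degradation."""
    # Convolutional degradation: filter the clean speech with a room impulse response.
    reverberant = np.convolve(speech, rir)[: len(speech)]
    # Scale the noise so the mixture reaches the requested signal-to-noise ratio.
    sig_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise[: len(reverberant)] ** 2)
    gain = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + gain * noise[: len(reverberant)]

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in for speech
rir = np.exp(-np.arange(800) / 200.0) * rng.standard_normal(800)  # toy impulse response
noise = rng.standard_normal(16000)
noisy = degrade(clean, rir, noise, snr_db=5.0)
```

The exponentially decaying random filter is a crude stand-in for a measured room impulse response; real pipelines draw RIRs and noise from recorded or simulated corpora.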

    Deep Learning for Audio Signal Processing

    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified. Comment: 15 pages, 2 PDF figures
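Of the feature representations this review highlights, the log-mel spectrogram is the most widely used. A minimal NumPy sketch (frame size, hop, and mel count are illustrative choices, not taken from the article):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(x, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Compute a log-mel spectrogram, a common input representation for audio nets."""
    # Short-time Fourier transform via framed FFTs with a Hann window.
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft//2 + 1)
    # Triangular mel filterbank between 0 Hz and the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    return np.log(power @ fbank.T + 1e-10)  # (n_frames, n_mels)

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s toy tone
S = log_mel_spectrogram(x)
```

Production systems usually delegate this to a library (e.g. librosa or torchaudio); the sketch just makes the representation concrete.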

    Wavelets, ridgelets and curvelets on the sphere

    We present in this paper new multiscale transforms on the sphere, namely the isotropic undecimated wavelet transform, the pyramidal wavelet transform, the ridgelet transform and the curvelet transform. All of these transforms can be inverted, i.e., we can exactly reconstruct the original data from its coefficients in either representation. Several applications are described. We show how these transforms can be used in denoising and especially in a Combined Filtering Method, which uses both the wavelet and the curvelet transforms, thus benefiting from the advantages of both transforms. An application to component separation from multichannel data mapped to the sphere is also described, in which we take advantage of moving to a wavelet representation. Comment: Accepted for publication in A&A. Manuscript with all figures can be downloaded at http://jstarck.free.fr/aa_sphere05.pd
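The undecimated (à trous) wavelet transform underlying the spherical construction is easiest to see in one dimension. A sketch of the flat 1-D analogue with the B3-spline smoothing kernel; `atrous_decompose` is a hypothetical helper name, and exact reconstruction holds by construction, mirroring the invertibility property stated in the abstract:

```python
import numpy as np

def atrous_decompose(signal, n_scales=4):
    """1-D a trous (undecimated) wavelet transform with a B3-spline kernel.
    Flat analogue of the isotropic undecimated wavelet transform on the sphere."""
    h = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0  # B3-spline smoothing kernel
    c = signal.astype(float)
    details = []
    for j in range(n_scales):
        # Insert 2^j - 1 zeros between kernel taps (the "holes" / trous).
        kernel = np.zeros(4 * 2 ** j + 1)
        kernel[:: 2 ** j] = h
        smooth = np.convolve(np.pad(c, len(kernel) // 2, mode="reflect"),
                             kernel, mode="valid")
        details.append(c - smooth)  # wavelet coefficients at scale j
        c = smooth
    # sum(details) + c reconstructs the input exactly.
    return details, c

x = (np.sin(np.linspace(0, 8 * np.pi, 256))
     + 0.1 * np.random.default_rng(1).standard_normal(256))
details, coarse = atrous_decompose(x)
```

Denoising then amounts to thresholding the `details` coefficients before summing back, which is the scalar version of what the Combined Filtering Method does with both wavelet and curvelet coefficients.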

    Recurrent neural networks for multi-microphone speech separation

    This thesis takes the classical signal processing problem of separating the speech of a target speaker from a real-world audio recording containing noise, background interference (from competing speech or other non-speech sources) and reverberation, and seeks data-driven solutions based on supervised learning methods, particularly recurrent neural networks (RNNs). Such speech separation methods can inject robustness into automatic speech recognition (ASR) systems and have been an active area of research for the past two decades. We particularly focus on applications where multi-channel recordings are available. Stand-alone beamformers cannot simultaneously suppress diffuse noise and protect the desired signal from distortion. Post-filters complement the beamformers in obtaining the minimum mean squared error (MMSE) estimate of the desired signal. Time-frequency (TF) masking, a method with roots in computational auditory scene analysis (CASA), is a suitable candidate for post-filtering, but the challenge lies in estimating the TF masks. The use of RNNs, in particular the bi-directional long short-term memory (BLSTM) architecture, is proposed as a post-filter that estimates TF masks for a delay-and-sum beamformer (DSB) from magnitude spectral and phase-based features. The data from the CHiME-3 challenge, recorded in four challenging realistic environments, are used. Two different TF masks, the Wiener filter and the log-ratio, are identified as suitable targets for learning. The separated speech is evaluated with objective speech intelligibility measures: short-term objective intelligibility (STOI) and frequency-weighted segmental SNR (fwSNR). The word error rates (WERs) reported by the previous state-of-the-art ASR back-end, when fed with the test data of the CHiME-3 challenge, are interpreted against the objective scores to understand the relationship between the two.
Overall, a consistent improvement in the objective scores brought by the RNNs is observed compared to feed-forward neural networks and a baseline MVDR beamformer.
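The TF-mask targets described in this abstract can be illustrated with an oracle Wiener-like ratio mask, computed here from known speech and noise components rather than estimated by a BLSTM as in the thesis (all signals below are toy stand-ins):

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Framed FFT (magnitude-and-phase TF representation)."""
    w = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * w for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(frames, axis=1)

# Toy signals standing in for beamformer-output speech and residual noise.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
speech = np.sin(2 * np.pi * 300 * t)      # stand-in for target speech
noise = 0.5 * rng.standard_normal(8000)   # stand-in for residual diffuse noise
S, N = stft(speech), stft(noise)
Y = S + N  # noisy mixture in the TF domain

# Oracle Wiener-like ratio mask: speech power / (speech power + noise power).
mask = np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12)
S_hat = mask * Y  # masked (enhanced) spectrogram
```

In the supervised setting, masks like this one serve as the training target; the network sees only the noisy features and learns to predict the mask.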

    Design of a Simulator for Neonatal Multichannel EEG: Application to Time-Frequency Approaches for Automatic Artifact Removal and Seizure Detection

    The electroencephalogram (EEG) is used to noninvasively monitor brain activity; it is the most widely used tool to detect abnormalities such as seizures. In recent studies, detection of neonatal EEG seizures has been automated to assist neurophysiologists, as manual detection is time-consuming and subjective; however, automated detection still lacks the robustness required for clinical implementation. Moreover, while the EEG is intended to record cerebral activity, activities external to the brain are also recorded; these are called "artifacts" and can seriously degrade the accuracy of seizure detection. Seizures are among the most common neurological problems managed by hospitals, occurring in 0.1%-0.5% of live births. Neonates with seizures are at higher risk of mortality and are reported to be 55-70 times more likely to have severe cerebral palsy. Therefore, early and accurate detection of neonatal seizures is important to prevent long-term neurological damage. Several attempts have been made to model neonatal EEG and artifacts, but most did not consider the multichannel case. Furthermore, these models were used to test artifact or seizure detection separately, but not together. This study aims to design synthetic models that generate clean or corrupted multichannel EEG to test the accuracy of available artifact and seizure detection algorithms in a controlled environment. In this thesis, a synthetic neonatal EEG model is constructed using single-channel EEG simulators, a head model, 21 electrodes and propagation equations to produce clean multichannel EEG. Furthermore, a neonatal EEG artifact model is designed using synthetic signals to corrupt EEG waveforms. After that, an automated EEG artifact detection and removal system is designed in both the time and time-frequency domains. Artifact detection is optimised and removal performance is evaluated.
Finally, an automated seizure detection technique is developed, utilising fused and extended multichannel features alongside a cross-validated SVM classifier. Results show that the synthetic EEG model mimics real neonatal EEG with an average correlation of 0.62, and that corrupted EEG can degrade average seizure detection accuracy from 100% to 70.9%. They also show that applying artifact detection and removal raises the average accuracy to 89.6%, and that utilising the extended features raises it further to 97.4% while strengthening its robustness.
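The cross-validated SVM stage of such a pipeline can be sketched with synthetic per-segment features; the feature construction below is purely illustrative and is not the thesis's actual feature set (it also assumes scikit-learn is available):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def segment_features(n_segments, seizure):
    """Toy per-segment, per-channel features (e.g. band power, line length)."""
    base = rng.normal(1.0, 0.3, size=(n_segments, 8))   # 8-channel background stats
    if seizure:
        base += rng.normal(1.5, 0.3, size=(n_segments, 8))  # seizures raise the features
    return base

# 100 background segments and 100 seizure segments.
X = np.vstack([segment_features(100, False), segment_features(100, True)])
y = np.array([0] * 100 + [1] * 100)

# RBF-kernel SVM with standardised features, scored by 5-fold cross-validation.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=5)
```

Real evaluations would draw the features from the synthetic multichannel EEG model described above, with and without artifact corruption, to reproduce the accuracy comparisons reported in the abstract.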

    Subspace Hybrid MVDR Beamforming for Augmented Hearing

    Signal-dependent beamformers are advantageous over signal-independent beamformers when the acoustic scenario, be it real-world or simulated, is straightforward in terms of the number of sound sources, the ambient sound field and their dynamics. However, in the context of augmented reality audio using head-worn microphone arrays, the acoustic scenarios encountered are often far from straightforward. The design of robust, high-performance, adaptive beamformers for such scenarios is an ongoing challenge, owing to violation of the typically required assumptions on the noise field caused by, for example, rapid variations resulting from complex acoustic environments and/or rotations of the listener's head. This work proposes a multi-channel speech enhancement algorithm which utilises the adaptability of signal-dependent beamformers while still benefiting from the computational efficiency and robust performance of signal-independent super-directive beamformers. The algorithm has two stages. (i) The first stage is a hybrid beamformer based on a dictionary of weights corresponding to a set of noise field models. (ii) The second stage is a wide-band subspace post-filter that removes any artifacts resulting from (i). The algorithm is evaluated using both real-world recordings and simulations of a cocktail-party scenario. Noise suppression, intelligibility and speech quality results show a significant performance improvement by the proposed algorithm compared to the baseline super-directive beamformer. A data-driven implementation of the noise field dictionary is shown to provide more noise suppression, with similar speech intelligibility and quality, compared to a parametric dictionary. Comment: 14 pages, 10 figures, submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing on 23-Nov-202
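The MVDR weights at the core of such signal-dependent beamformers follow the closed form w = R^{-1} d / (d^H R^{-1} d): unit gain toward the steering vector d, minimum output power under the noise covariance R. A sketch with a toy 4-microphone array (the geometry and noise covariance below are illustrative assumptions, not from the paper):

```python
import numpy as np

def mvdr_weights(R, d):
    """MVDR beamformer weights w = R^{-1} d / (d^H R^{-1} d)."""
    Rinv_d = np.linalg.solve(R, d)          # R^{-1} d without forming the inverse
    return Rinv_d / (d.conj() @ Rinv_d)     # normalise for a distortionless response

n_mics = 4
# Steering vector for a plane wave at 30 degrees on a half-wavelength-spaced line array.
d = np.exp(-1j * np.pi * np.arange(n_mics) * np.sin(np.deg2rad(30)))
# Toy noise covariance: uncorrelated sensor noise plus a weakly diffuse component.
R = np.eye(n_mics) + 0.1 * np.ones((n_mics, n_mics))
w = mvdr_weights(R, d)
```

In a dictionary-based hybrid scheme, a set of such weight vectors is precomputed for different noise field models R, and the best-matching entry is selected at run time rather than adapting R continuously.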