8 research outputs found

    Modeling Sub-Band Information Through Discrete Wavelet Transform to Improve Intelligibility Assessment of Dysarthric Speech

    The speech signal within a sub-band varies at a fine level depending on the type and level of dysarthria. The Mel-frequency filterbank used to compute cepstral coefficients smooths out this fine-level information in the higher-frequency regions because the filters there have larger bandwidths. To capture the sub-band information, this paper first applies a four-level discrete wavelet transform (DWT) to decompose the input speech signal into approximation and detail coefficients at each level. For each input speech signal, five speech signals representing different sub-bands are then reconstructed using the inverse DWT (IDWT). Log filterbank energies are computed from the short-term discrete Fourier transform magnitude spectra of each reconstructed signal using a 30-channel Mel filterbank. For each analysis frame, the log filterbank energies obtained across all reconstructed signals are pooled, and a discrete cosine transform is applied to produce the cepstral feature, here termed the discrete wavelet transform reconstructed (DWTR) Mel-frequency cepstral coefficient (MFCC). An i-vector based dysarthric level assessment system developed on the Universal Access speech corpus shows that the proposed DWTR-MFCC feature outperforms the conventional MFCC and several other cepstral features reported for a similar task. Using DWTR-MFCC improves the detection accuracy rate (DAR) of the dysarthric level assessment system in the text- and speaker-independent test case to 60.094% from the 56.646% MFCC baseline. Further analysis of the confusion matrices shows that the confusion among dysarthric classes differs considerably between the MFCC and DWTR-MFCC features. Motivated by this observation, a two-stage classification approach that exploits the discriminating power of both kinds of features is proposed to improve the overall performance of the assessment system. The two-stage classification scheme further improves the DAR to 65.813% in the text- and speaker-independent test case.
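
The decomposition and reconstruction step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Haar wavelet (the abstract does not name the wavelet family), runs on a toy 64-sample signal, and omits the Mel filterbank and DCT stages. All function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_dwt(x):
    """One Haar DWT level: return (approximation, detail) coefficients."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Invert one Haar DWT level."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out

def wavedec(x, levels=4):
    """Multi-level decomposition; returns [a_L, d_L, ..., d_1]."""
    if len(x) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    details = []
    approx = list(x)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    return [approx] + details[::-1]

def waverec(coeffs):
    """Rebuild a signal from [a_L, d_L, ..., d_1]."""
    approx = coeffs[0]
    for detail in coeffs[1:]:
        approx = haar_idwt(approx, detail)
    return approx

def subband_signals(x, levels=4):
    """Reconstruct one time-domain signal per coefficient set.

    Each signal is obtained by zeroing every coefficient set except one
    and inverting the transform, so levels=4 yields five signals.
    """
    coeffs = wavedec(x, levels)
    bands = []
    for keep in range(len(coeffs)):
        masked = [c if i == keep else [0.0] * len(c)
                  for i, c in enumerate(coeffs)]
        bands.append(waverec(masked))
    return bands

# Toy input: a low-frequency plus a high-frequency sinusoid (not speech).
signal = [math.sin(2 * math.pi * 3 * n / 64)
          + 0.5 * math.sin(2 * math.pi * 20 * n / 64) for n in range(64)]
bands = subband_signals(signal, levels=4)
resynth = [sum(values) for values in zip(*bands)]
```

Because the transform is linear, the five sub-band signals add back up to the original input, which is a convenient sanity check before any filterbank analysis is applied to each band.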

    Unattended acoustic sensor systems for noise monitoring in national parks

    2017 Spring. Includes bibliographical references. Detection and classification of transient acoustic signals is a difficult problem. It is often complicated by factors such as the variety of sources that may be encountered, the presence of strong interference, and substantial variations in the acoustic environment. Furthermore, most applications of transient detection and classification, such as speech recognition and environmental monitoring, require these transient events to be detected and classified online. This is even more crucial for environmental monitoring, which is often done at remote locations where it is infeasible to set up a large, general-purpose processing system. Instead, a custom-designed system is needed that is power efficient yet able to run the necessary signal processing algorithms in near real-time. In this thesis, we describe a custom-designed environmental monitoring system (EMS) built specifically for monitoring air traffic and other sources of interest in national parks. More specifically, this thesis focuses on the capabilities of the EMS and on how transient detection, classification, and tracking are implemented on it. The Sparse Coefficient State Tracking (SCST) transient detection and classification algorithm was implemented on the EMS board in order to detect and classify transient events. This algorithm was chosen because it was designed for this particular application and was shown to have superior performance compared to other algorithms commonly used for transient detection and classification. The SCST algorithm was implemented on an Artix 7 FPGA, with parts of the algorithm running as dedicated custom logic and other parts running sequentially on a soft-core processor. The thesis explains the partitioning and pipelining of this algorithm. Each partition was tested independently to verify its functionality with respect to the overall system. Furthermore, the entire SCST algorithm was tested in the field on actual acoustic data, and the performance of this implementation was evaluated using receiver operating characteristic (ROC) curves and confusion matrices. In this test, the FPGA implementation of SCST achieved acceptable source detection and classification results despite a difficult data set and limited training data. Acoustic sources are tracked through successive direction of arrival (DOA) angle estimation using a wideband extension of the Capon beamforming algorithm. This algorithm was also implemented on the EMS in order to provide real-time DOA estimates for the detected sources. It was partitioned into several stages, some implemented in custom logic and others as software running on the soft-core processor. Just as with SCST, each partition of the beamforming algorithm was verified independently, and a full system test was then conducted to evaluate whether it could track an airborne source. For the full system test, a model airplane was flown along various trajectories relative to the EMS, and the trajectories estimated by the system were compared to the ground truth. Although the accuracy of the DOA estimates could not be evaluated in this test, it was shown that the algorithm could approximately recover the general trajectory of a moving source, which is sufficient for this application because only a general heading of the acoustic sources is desired.

    An acoustic-phonetic approach in automatic Arabic speech recognition

    In a large vocabulary speech recognition system the broad phonetic classification technique is used instead of detailed phonetic analysis to overcome the variability in the acoustic realisation of utterances. The broad phonetic description of a word is used as a means of lexical access, where the lexicon is structured into sets of words sharing the same broad phonetic labelling. This approach has been applied to a large vocabulary isolated word Arabic speech recognition system. Statistical studies have been carried out on 10,000 Arabic words (converted to phonemic form) involving different combinations of broad phonetic classes. Some particular features of the Arabic language have been exploited. The results show that vowels represent about 43% of the total number of phonemes. They also show that about 38% of the words can uniquely be represented at this level by using eight broad phonetic classes. When introducing detailed vowel identification the percentage of uniquely specified words rises to 83%. These results suggest that a fully detailed phonetic analysis of the speech signal is perhaps unnecessary. In the adopted word recognition model, the consonants are classified into four broad phonetic classes, while the vowels are described by their phonemic form. A set of 100 words uttered by several speakers has been used to test the performance of the implemented approach. In the implemented recognition model, three procedures have been developed, namely voiced-unvoiced-silence segmentation, vowel detection and identification, and automatic spectral transition detection between phonemes within a word. The accuracy of both the V-UV-S and vowel recognition procedures is almost perfect. A broad phonetic segmentation procedure has been implemented, which exploits information from the above mentioned three procedures. Simple phonological constraints have been used to improve the accuracy of the segmentation process. 
The resulting sequence of labels is used for lexical access to retrieve the word, or a small set of words, sharing the same broad phonetic labelling. When more than one candidate word is retrieved, a verification procedure is used to choose the most likely one.
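
Lexical access by broad phonetic labelling, as described in this abstract, can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the abstract states that consonants fall into four broad classes and that vowels keep their phonemic identity, but it does not list the classes, so the `BROAD_CLASS` table, the vowel set, and the toy phoneme strings below are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict

# Hypothetical inventory: four broad consonant classes (assumed names).
BROAD_CLASS = {
    "b": "STOP", "t": "STOP", "d": "STOP", "k": "STOP",
    "s": "FRIC", "f": "FRIC", "h": "FRIC",
    "m": "NASAL", "n": "NASAL",
    "l": "LIQUID", "r": "LIQUID",
}
# Vowels are kept in phonemic form rather than collapsed into a class.
VOWELS = {"a", "i", "u"}

def broad_label(phonemes):
    """Map a phoneme sequence to its broad phonetic label sequence."""
    return tuple(p if p in VOWELS else BROAD_CLASS[p] for p in phonemes)

def build_lexicon(words):
    """Group words that share the same broad phonetic labelling."""
    lexicon = defaultdict(list)
    for word, phonemes in words.items():
        lexicon[broad_label(phonemes)].append(word)
    return lexicon

# Toy entries for illustration only.
WORDS = {
    "kataba": ["k", "a", "t", "a", "b", "a"],
    "takaba": ["t", "a", "k", "a", "b", "a"],
    "salima": ["s", "a", "l", "i", "m", "a"],
}
lexicon = build_lexicon(WORDS)

# A label sequence from the recognizer retrieves a small candidate set.
ambiguous = lexicon[broad_label(WORDS["kataba"])]
unique = lexicon[broad_label(WORDS["salima"])]
```

In this toy lexicon the first lookup returns two candidates, which is the case where a verification step would be needed, while the second lookup identifies its word uniquely.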

    Interactively skimming recorded speech

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Program in Media Arts & Sciences, 1994. Includes bibliographical references (p. 143-156). Barry Michael Arons. Ph.D.

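    Técnicas de análisis, caracterización y detección de señales de voz en entornos acústicos adversos [Techniques for the analysis, characterization, and detection of speech signals in adverse acoustic environments]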

    This thesis aims to make Voice Activity Detection more robust in adverse acoustic environments in order to improve the behavior of many speech applications, for example telephony applications based on automatic speech recognition, automatic transcription applications, and multichannel systems. In particular, although all types of noise have been taken into account, the research pays special attention to background voices (speech coming from far-field speakers), the main source of error in most activity detectors today. The starting point is a Voice Activity Detector based on Hidden Markov Models whose feature vector contains two components: normalized energy and delta energy. The key contributions of this thesis are: 1) extending the feature vector with spectral information, 2) adapting the Hidden Markov Models to the environment and studying different model topologies, and 3) studying and including new features, different from those in point 1, to reject the speech pulses that come from background (far-field) speakers. Detection results that take the three points above into account show the advances achieved and are significantly better than those obtained under the same conditions by other reference voice activity detectors.
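
The two-component starting feature vector mentioned in this abstract (normalized energy and delta energy) can be sketched as below. This is only an illustrative guess at how such features are commonly computed: the frame length, hop size, normalization by the utterance peak, and the first-difference delta are all assumptions, and the Hidden Markov Model stage is not shown.

```python
import math

def frame_log_energies(samples, frame_len=160, hop=80, floor=1e-10):
    """Log of the mean squared amplitude for each overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        energies.append(math.log(max(energy, floor)))
    return energies

def energy_features(samples, frame_len=160, hop=80):
    """Return one (normalized_energy, delta_energy) pair per frame.

    Assumptions: energy is normalized by subtracting the utterance peak,
    and delta energy is a simple first difference between frames.
    """
    log_e = frame_log_energies(samples, frame_len, hop)
    if not log_e:
        return []
    peak = max(log_e)
    normalized = [e - peak for e in log_e]
    delta = [0.0] + [normalized[i] - normalized[i - 1]
                     for i in range(1, len(normalized))]
    return list(zip(normalized, delta))

# Toy input: 800 near-silent samples followed by 800 louder samples.
samples = [(0.001 if n < 800 else 0.5) * math.sin(2 * math.pi * n / 16)
           for n in range(1600)]
features = energy_features(samples)
```

With this normalization the loudest frame has a normalized energy of zero and every other frame is negative, while the delta component peaks around the onset of the louder segment.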

    Adaptation of reference patterns in word-based speech recognition
