5 research outputs found

    Ultrasound based Silent Speech Interface using Deep Learning

    Silent Speech Interface (SSI) is a technology able to synthesize speech in the absence of any acoustic signal. It can be useful in cases such as laryngectomy patients, noisy environments, or silent calls. This thesis explores the particular case of SSI using ultrasound images of the tongue as input signals. A 'direct synthesis' approach based on Deep Neural Networks and Mel-generalized cepstral coefficients is proposed. This document is an extension of Csapó et al., "DNN-based Ultrasound-to-Speech Conversion for a Silent Speech Interface". Several deep learning models, such as basic feed-forward neural networks, convolutional neural networks, and recurrent neural networks, are presented and discussed. A denoising pre-processing step based on a deep convolutional autoencoder has also been studied. A considerable number of experiments using a set of different deep learning architectures, together with an extensive hyperparameter optimization study, have been carried out. The experiments were evaluated and rated using several objective and subjective quality measures. According to the experiments, an architecture based on CNN and bidirectional LSTM layers showed the best results in both objective and subjective terms.
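
    A minimal sketch of the CNN plus bidirectional-LSTM mapping described above, written with tf.keras. The frame size, sequence length, number of Mel-generalized cepstral (MGC) coefficients, and layer widths are illustrative assumptions, not values taken from the thesis:

        import tensorflow as tf
        from tensorflow.keras import layers, models

        # Assumed shapes (hypothetical): sequences of 64x64 single-channel
        # ultrasound frames, mapped frame-by-frame to 25 MGC coefficients.
        SEQ_LEN, H, W, N_MGC = 8, 64, 64, 25

        model = models.Sequential([
            layers.Input(shape=(SEQ_LEN, H, W, 1)),
            # CNN feature extractor applied to each frame independently
            layers.TimeDistributed(layers.Conv2D(16, 3, activation='relu')),
            layers.TimeDistributed(layers.MaxPooling2D(2)),
            layers.TimeDistributed(layers.Conv2D(32, 3, activation='relu')),
            layers.TimeDistributed(layers.MaxPooling2D(2)),
            layers.TimeDistributed(layers.Flatten()),
            # bidirectional LSTM adds temporal context across frames
            layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
            # linear regression head: one MGC vector per frame
            layers.TimeDistributed(layers.Dense(N_MGC)),
        ])
        model.compile(optimizer='adam', loss='mse')

    From the predicted MGC frames, a suitable vocoder would synthesize the final waveform; that step is outside this sketch.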

    Phonological and phonetic properties of nasal substitution in Sasak and Javanese

    Austronesian languages such as Sasak and Javanese have a pattern of morphological nasal substitution, in which nasals alternate with homorganic oral obstruents, except that [s] is described as alternating with [ɲ] rather than [n]. This suggests an abstract morphophonological relation between [s] and [ɲ], whereas other parts of the paradigm show a concrete homorganic relation. Articulatory ultrasound data on productions of [t, n, ʨ, ɲ], along with [s] and its nasal counterpart, were collected from 10 Sasak and 8 Javanese speakers. Comparisons of lingual contours using a root mean square analysis were evaluated with linear mixed-effects regression models, a method that proves reliable for testing questions of phonological neutralization. In both languages, [t, n, s] exhibited a high degree of articulatory similarity, whereas postalveolar [ʨ] and its nasal counterpart [ɲ] exhibited less similarity. The nasal counterpart of [s] was identical in articulation to [ɲ]. This indicates an abstract, rather than concrete, relationship between [s] and its morphophonological nasal counterpart, with the two sounds not sharing articulatory place in either Sasak or Javanese.
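
    The core comparison in this study, the root-mean-square distance between two traced lingual contours, can be sketched as follows; resampling onto a shared x grid is an assumption about how the contours are aligned, and the (x, y) column layout of the inputs is hypothetical:

        import numpy as np

        def contour_rms(contour_a, contour_b, n_points=100):
            """RMS distance between two tongue contours.

            Each contour is an (N, 2) array of (x, y) points with x
            increasing; both are resampled onto a shared x grid restricted
            to their overlapping horizontal extent.
            """
            xs = np.linspace(max(contour_a[:, 0].min(), contour_b[:, 0].min()),
                             min(contour_a[:, 0].max(), contour_b[:, 0].max()),
                             n_points)
            ya = np.interp(xs, contour_a[:, 0], contour_a[:, 1])
            yb = np.interp(xs, contour_b[:, 0], contour_b[:, 1])
            return np.sqrt(np.mean((ya - yb) ** 2))

    Per-pair RMS values could then feed a linear mixed-effects model (for example, statsmodels' mixedlm with speaker as a grouping factor) to test whether the [s]/nasal pair is as articulatorily similar as [t]/[n].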

    Fully-automated tongue detection in ultrasound images

    Tracking the tongue in ultrasound images provides information about its shape and kinematics during speech. In this thesis, we propose engineering solutions to better exploit existing frameworks and deploy them to convert a semi-automatic tongue contour tracking system into a fully automatic one. Current methods for detecting and tracking the tongue require manual initialization or training on large amounts of labeled images. This work introduces a new method for extracting tongue contours in ultrasound images that requires neither training nor manual intervention. The method consists of: (1) application of a phase symmetry filter to highlight regions possibly containing the tongue contour; (2) adaptive thresholding and rank ordering of grayscale intensities to select regions that include or are near the tongue contour; (3) skeletonization of these regions to extract a curve close to the tongue contour; and (4) initialization of an accurate active contour from this curve. Two novel quality measures were also developed that predict the reliability of the method, so that optimal frames can be chosen to confidently initialize fully automated tongue tracking. This is achieved by automatically generating and choosing a set of points that can replace the manually segmented points of a semi-automated tracking approach. To improve tracking accuracy, this work also incorporates two criteria for resetting the tracking from time to time, so that the overall result does not depend on human refinements. Experiments were run on 16 free-speech ultrasound recordings from healthy subjects and from subjects with articulatory impairments due to Steinert's disease. The fully automated and semi-automated methods yield mean sum-of-distances errors of 1.01 mm ± 0.57 mm and 1.05 mm ± 0.63 mm, respectively, showing that the proposed automatic initialization does not significantly alter accuracy. Moreover, the experiments show that accuracy improves with the proposed re-initialization (mean sum-of-distances error of 0.63 mm ± 0.35 mm).
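
    A compact sketch of the four-step initialization using scikit-image. The sato ridge filter here is a stand-in for the phase symmetry filter named in the abstract (both highlight bright curvilinear structure), and the threshold, percentile, and snake parameters are illustrative assumptions:

        import numpy as np
        from skimage import filters, morphology, segmentation

        def init_tongue_contour(img):
            """Initialize a tongue contour from one grayscale ultrasound frame."""
            # (1) highlight curvilinear regions likely to contain the contour
            ridges = filters.sato(img, black_ridges=False)
            # (2) adaptive threshold, keeping only the strongest responses
            mask = ridges > filters.threshold_local(ridges, block_size=51)
            mask &= ridges > np.percentile(ridges, 90)
            # (3) skeletonize the surviving regions into a thin curve
            ys, xs = np.nonzero(morphology.skeletonize(mask))
            order = np.argsort(xs)
            snake_init = np.column_stack([ys[order], xs[order]]).astype(float)
            # (4) refine with an active contour seeded from the skeleton
            return segmentation.active_contour(img, snake_init,
                                               alpha=0.01, beta=1.0)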

    Tongue tracking in ultrasound images with active appearance models

    Ultrasound imaging of the tongue is widely used for the analysis and modeling of human speech production. In this paper, we propose a novel method to automatically detect and track the tongue contour in ultrasound (US) videos. Our method is built on a variant of Active Appearance Modeling: it incorporates shape prior information and can estimate the entire tongue contour robustly and accurately across a sequence of US frames. Experimental evaluation demonstrates the effectiveness of our approach and its improved performance compared to previously proposed tongue tracking techniques.
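
    The shape-prior half of an Active Appearance Model can be sketched as a PCA shape model. The functions below are illustrative only and assume the training contours have already been aligned (e.g. via Procrustes analysis) and flattened into (x1, y1, ..., xN, yN) vectors:

        import numpy as np

        def build_shape_model(contours, var_kept=0.95):
            """PCA shape prior: mean contour plus principal modes of variation."""
            X = np.asarray(contours, dtype=float)
            mean = X.mean(axis=0)
            U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
            var = S ** 2 / (len(X) - 1)
            # keep enough modes to explain var_kept of the shape variance
            k = int(np.searchsorted(np.cumsum(var) / var.sum(), var_kept)) + 1
            return mean, Vt[:k], var[:k]

        def project_to_model(shape, mean, modes, var, n_std=3.0):
            """Constrain a candidate contour to plausible shapes under the prior."""
            b = modes @ (shape - mean)
            b = np.clip(b, -n_std * np.sqrt(var), n_std * np.sqrt(var))
            return mean + modes.T @ b

    During tracking, each candidate contour found by the appearance search would be passed through project_to_model; this is one way a shape prior keeps the estimate on a plausible tongue shape even where the ultrasound edge is weak.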