
    Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces

    Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings not only represent the linguistic content but are also highly specific to the actual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling. Here, we present multi-speaker experiments using the recently published TaL80 corpus. To model speaker characteristics, we adapted the x-vector framework, popular in speech processing, to operate with ultrasound tongue videos. Next, we performed speaker recognition experiments using 50 speakers from the corpus. Then, we created speaker embedding vectors and evaluated them on the remaining speakers. Finally, we examined how the embedding vector influences the accuracy of our ultrasound-to-speech conversion network in a multi-speaker scenario. In the experiments we attained speaker recognition error rates below 3%, and we also found that the embedding vectors generalize well to unseen speakers. Our first attempt to apply them in a multi-speaker silent speech framework brought a marginal reduction in the error rate of the spectral estimation step.
    Comment: 5 pages, 3 figures, 3 tables
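
    To illustrate the adapted x-vector idea, the minimal sketch below applies a frame-level encoder to each ultrasound frame and then pools statistics (mean and standard deviation) over time, which is the defining ingredient of x-vector systems. The 2D frame encoder, the layer sizes, and the class name UltrasoundXVector are illustrative assumptions, not the authors' exact architecture.

```python
# A minimal sketch of an x-vector-style network for ultrasound tongue
# video, assuming PyTorch. Layer sizes and the frame encoder are
# illustrative guesses; statistics pooling is the x-vector hallmark.
import torch
import torch.nn as nn

class UltrasoundXVector(nn.Module):
    def __init__(self, embed_dim=512, n_speakers=50):
        super().__init__()
        # Frame-level encoder: 2D convolutions applied to every video frame.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        feat_dim = 32 * 4 * 4
        # Statistics pooling concatenates the mean and std over time,
        # turning a variable-length video into one fixed-size vector.
        self.segment = nn.Linear(2 * feat_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, n_speakers)

    def forward(self, video):                    # video: (batch, time, H, W)
        b, t, h, w = video.shape
        x = self.frame_encoder(video.reshape(b * t, 1, h, w))
        x = x.reshape(b, t, -1)                  # (batch, time, feat_dim)
        stats = torch.cat([x.mean(dim=1), x.std(dim=1)], dim=1)
        embedding = self.segment(stats)          # the speaker embedding
        return self.classifier(embedding), embedding
```

    At test time the classification layer can be discarded and the pooled embedding used directly, which is how such embeddings can be evaluated on speakers not seen during training.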

    Automated Identification of Failure Cases in Organ at Risk Segmentation Using Distance Metrics: A Study on CT Data

    Automated organ at risk (OAR) segmentation is crucial for radiation therapy planning in CT scans, but the contours generated by automated models can be inaccurate, potentially leading to treatment planning issues. The reasons for these inaccuracies can vary, such as unclear organ boundaries or inaccurate ground truth due to annotation errors. To improve the model's performance, it is necessary to identify these failure cases during the training process and to correct them with potential post-processing techniques. However, this process can be time-consuming, as it traditionally requires manual inspection of the predicted output. This paper proposes a method to automatically identify failure cases by setting a threshold on a combination of the Dice and Hausdorff distances. This approach reduces the time-consuming task of visually inspecting predicted outputs, allowing for faster identification of failure case candidates. The method was evaluated on 20 cases of six different organs in CT images from clinical expert curated datasets. By setting thresholds on the Dice and Hausdorff distances, the study was able to differentiate between various states of failure cases and to evaluate over 12 cases visually. This thresholding approach could be extended to other organs, leading to faster identification of failure cases and thereby improving the quality of radiation therapy planning.
    Comment: 11 pages, 5 figures, 2 tables
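
    The screening idea can be sketched in a few lines: a predicted contour is flagged for visual inspection when its Dice score is too low or its Hausdorff distance too high. The threshold values below are illustrative assumptions, as the abstract does not state the exact cut-offs.

```python
# A sketch of failure-case screening by jointly thresholding Dice and
# Hausdorff distance. Thresholds are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_score(pred, gt):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0

def hausdorff_distance(pred, gt):
    """Symmetric Hausdorff distance between the voxels of two masks."""
    pred_pts, gt_pts = np.argwhere(pred), np.argwhere(gt)
    return max(directed_hausdorff(pred_pts, gt_pts)[0],
               directed_hausdorff(gt_pts, pred_pts)[0])

def is_failure_candidate(pred, gt, dice_thresh=0.8, hd_thresh=10.0):
    """Either metric being poor flags the case for visual inspection."""
    return (dice_score(pred, gt) < dice_thresh
            or hausdorff_distance(pred, gt) > hd_thresh)
```

    Combining the two metrics is the point of the method: Dice captures overall overlap, while the Hausdorff distance catches locally large contour deviations that Dice alone can miss.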

    Silent speech interface based on a 3D convolutional neural network and a neural vocoder

    Silent speech interfaces aim to generate a speech signal from a recording that captures the movement of the articulatory organs, for example an ultrasound video of tongue movement. Currently, solutions based on deep neural networks appear to be the most promising for this conversion. Convolutional neural networks have long been applied to image recognition, but the best results are obtained when the individual frames of the video are processed not separately but as a sequence. One possible solution is to combine the sequence of outputs of the image-processing convolutional network with a recurrent neural network. In this paper, however, we try a different approach, namely 3-dimensional convolutional networks, in which time forms the third axis alongside the two spatial dimensions of the images. We apply a special variant of 3D convolutional networks that performs the spatial and temporal convolution steps in a factored form; this type of network has already been used successfully in other video recognition tasks. In our experiments, the 3D neural network gave somewhat more accurate results than the combined convolutional+recurrent model, which indicates that this approach can be an alternative to recurrent-network-based models, which are generally slower and more cumbersome to train.
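
    A factored 3D convolution of the kind described above can be sketched as follows, assuming PyTorch: a spatial convolution within each frame followed by a temporal convolution across frames, instead of one full 3D kernel. The channel sizes and class name are illustrative, not the paper's exact block.

```python
# A sketch of a factored ("(2+1)D"-style) 3D convolution block: the
# spatial and temporal convolution steps are performed separately.
import torch.nn as nn

class Factored3DConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Spatial step: kernel (1, 3, 3) looks within a single frame.
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))
        # Temporal step: kernel (3, 1, 1) mixes neighbouring frames.
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.relu(self.temporal(self.relu(self.spatial(x))))
```

    The extra nonlinearity between the two steps is one reason such factored blocks can match or outperform a single full 3D kernel with a comparable parameter count.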

    Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging

    For articulatory-to-acoustic mapping, typically only limited parallel training data is available, making it impossible to apply fully end-to-end solutions like Tacotron2. In this paper, we experimented with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve the final synthesis quality of ultrasound-based articulatory-to-acoustic mapping with a limited database. We use a multi-speaker pre-trained Tacotron2 TTS model and a pre-trained WaveGlow neural vocoder. The articulatory-to-acoustic conversion consists of three steps: 1) from a sequence of ultrasound tongue image recordings, a 3D convolutional neural network predicts the inputs of the pre-trained Tacotron2 model, 2) the Tacotron2 model converts this intermediate representation to an 80-dimensional mel-spectrogram, and 3) the WaveGlow model is applied for final inference. The generated speech retains the timing of the original articulatory data from the ultrasound recording, but the F0 contour and the spectral information are predicted by the Tacotron2 model. The F0 values are independent of the original ultrasound images, but represent the target speaker, as they are inferred from the pre-trained Tacotron2 model. In our experiments, we demonstrated that the synthesized speech quality is more natural with the proposed solution than with our earlier model.
    Comment: accepted at SSW11. arXiv admin note: text overlap with arXiv:2008.0315
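
    The three-step data flow can be summarized in a short schematic, assuming PyTorch. The three modules are passed in as callables: `encoder` stands for the paper's 3D CNN, `tacotron2_decoder` for the part of Tacotron2 that maps the intermediate representation to a mel-spectrogram, and `vocoder` for WaveGlow inference; the function and parameter names are illustrative, not the authors' code.

```python
# A schematic of the three-step ultrasound-to-speech pipeline.
# The callables are placeholders for the pre-trained components.
import torch

def ultrasound_to_speech(video, encoder, tacotron2_decoder, vocoder):
    """video: (1, channels, time, H, W) tensor of ultrasound tongue images."""
    with torch.no_grad():
        intermediate = encoder(video)           # step 1: 3D CNN prediction
        mel = tacotron2_decoder(intermediate)   # step 2: 80-dim mel-spectrogram
        audio = vocoder(mel)                    # step 3: neural vocoding
    return audio
```

    Because the intermediate representation is predicted frame by frame from the ultrasound video, the output keeps the original articulatory timing, while the F0 contour and spectral detail come from the pre-trained models.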