    An overview & analysis of sequence-to-sequence emotional voice conversion

    Emotional voice conversion (EVC) focuses on converting a speech utterance from a source to a target emotion; it can thus be a key enabling technology for human-computer interaction applications and beyond. However, EVC remains an unsolved research problem with several challenges. In particular, as speech rate and rhythm are two key factors of emotional conversion, models have to generate output sequences of differing length. Sequence-to-sequence modelling has recently emerged as a competitive paradigm for models that can overcome those challenges. In an attempt to stimulate further research in this promising new direction, recent sequence-to-sequence EVC papers were systematically investigated and reviewed from six perspectives: their motivation, training strategies, model architectures, datasets, model inputs, and evaluation methods. This information is organised to provide the research community with an easily digestible overview of the current state of the art. Finally, we discuss existing challenges of sequence-to-sequence EVC.
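
    The central modelling point above, that the decoder must be free to choose its own output length so speech rate and rhythm can change, is easy to make concrete. Below is a minimal, illustrative PyTorch sketch of a generic attention-based sequence-to-sequence converter that stops via a learned stop token; all layer choices, names, and sizes are assumptions for illustration and do not correspond to any specific model in the survey.

```python
# A generic attention-based seq2seq converter over mel-spectrogram frames.
# Output length is decided by a stop token, not tied to the input length.
import torch
import torch.nn as nn

class Seq2SeqEVC(nn.Module):
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.enc_proj = nn.Linear(2 * hidden, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.decoder_cell = nn.GRUCell(n_mels + hidden, hidden)
        self.frame_out = nn.Linear(hidden, n_mels)   # next target-emotion frame
        self.stop_out = nn.Linear(hidden, 1)         # stop-token logit

    def forward(self, src_mels, max_len=1000):
        # src_mels: (batch, T_src, n_mels) frames of the source-emotion utterance
        memory = self.enc_proj(self.encoder(src_mels)[0])    # (batch, T_src, hidden)
        batch = src_mels.size(0)
        frame = src_mels.new_zeros(batch, src_mels.size(2))  # <GO> frame
        state = memory.new_zeros(batch, memory.size(2))
        outputs = []
        for _ in range(max_len):  # output length is NOT tied to input length
            context, _ = self.attn(state.unsqueeze(1), memory, memory)
            state = self.decoder_cell(
                torch.cat([frame, context.squeeze(1)], dim=-1), state)
            frame = self.frame_out(state)
            outputs.append(frame)
            if torch.sigmoid(self.stop_out(state)).mean() > 0.5:
                break  # decoder decides to stop -> differing output length
        return torch.stack(outputs, dim=1)

mels = torch.randn(1, 120, 80)      # 120 source frames
converted = Seq2SeqEVC()(mels)      # output length chosen by the stop token
```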

    Quantitative CBCT measurement based on a generative adversarial network for bone mineral density assessment

    Master's thesis -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Applied Bioengineering, August 2021. Author: Tae-Hoon Yong.
    Osteoporosis is a skeletal disease in which low bone density makes fractures likely; because it causes almost no symptoms by itself, it is often discovered only after a bone breaks. Measuring bone mineral density (BMD) is the direct method for diagnosing osteoporosis and predicting future fracture risk, so accurate density measurement is essential. In quantitative CT (QCT), a BMD calibration phantom is scanned together with the patient, and BMD is computed from the Hounsfield units (HU) of the CT image via the linear relationship between HU and BMD established by the phantom. Cone-beam CT (CBCT) is widely used in dental treatment and planning because it offers a lower radiation dose, shorter acquisition time, and higher resolution than MDCT; however, CBCT voxel values are arbitrary and do not correspond to accurate HU, so BMD cannot be assessed from them directly. Accurate BMD measurement from CBCT therefore requires uniform, accurate, high-quality images.
    The purpose of this study was to measure BMD directly and quantitatively from CBCT images by enhancing the linearity and uniformity of the bone intensities with a hybrid deep-learning model (QCBCT-NET) that combines a generative adversarial network (Cycle-GAN) with a U-Net, and to compare the bone images enhanced by QCBCT-NET with those produced by Cycle-GAN and U-Net alone. We used two phantoms of human skulls encased in acrylic: one without metal restorations for the training and validation datasets, and one containing metal restorations, which introduce image artifacts, for the test dataset. We propose QCBCT-NET, consisting of a Cycle-GAN with residual blocks and a multi-channel U-Net, trained on paired QCT and CBCT images. The BMD images produced by QCBCT-NET significantly outperformed those produced by Cycle-GAN or U-Net in mean absolute difference (MAD), peak signal-to-noise ratio (PSNR), normalized cross-correlation (NCC), structural similarity (SSIM), and linearity with respect to the original QCT image. QCBCT-NET improved the contrast of the bone images by locally reflecting the original BMD distribution of the QCT image through the Cycle-GAN, and improved their spatial uniformity by globally suppressing image artifacts and noise through the two-channel U-Net. QCBCT-NET substantially enhanced the linearity, uniformity, and contrast as well as the anatomical and quantitative accuracy of the bone images, and measured BMD in CBCT more accurately than either Cycle-GAN or U-Net.
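
    The two-stage design described above, a residual generator for local intensity restoration followed by a two-channel U-Net for global refinement, can be sketched compactly. The PyTorch sketch below shows only that data flow: the Cycle-GAN half appears as a bare residual generator (its cycle losses and discriminators are omitted), and all layer sizes are illustrative assumptions rather than the thesis configuration.

```python
# Stage 1: Cycle-GAN-style residual generator maps a CBCT slice toward
# QCT-like intensities. Stage 2: a two-channel U-Net refines (CBCT, stage-1
# output) for global uniformity.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))
    def forward(self, x):
        return x + self.body(x)   # residual connection

class ResidualGenerator(nn.Module):  # stand-in for the Cycle-GAN generator
    def __init__(self, ch=32, n_blocks=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            *[ResBlock(ch) for _ in range(n_blocks)],
            nn.Conv2d(ch, 1, 3, padding=1))
    def forward(self, cbct):
        return self.net(cbct)

class TwoChannelUNet(nn.Module):  # one down/up level, for brevity
    def __init__(self, ch=32):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(2, ch, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch + 2, 1, 3, padding=1))
    def forward(self, x):
        h = self.up[0:2](self.down(x))           # encode then decode
        return self.up[2](torch.cat([h, x], 1))  # skip connection to the input

cbct = torch.randn(1, 1, 128, 128)      # one CBCT slice
ct_like = ResidualGenerator()(cbct)     # local BMD-contrast restoration
bmd_map = TwoChannelUNet()(torch.cat([cbct, ct_like], dim=1))  # global refinement
```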

    Denoising diffusion-based MR to CT image translation enables whole spine vertebral segmentation in 2D and 3D without manual annotations

    Background: Automated segmentation of spinal MR images plays a vital role both scientifically and clinically. However, accurately delineating posterior spine structures presents challenges. Methods: This retrospective study, approved by the ethical committee, involved translating T1w and T2w MR image series into CT images in a total of n=263 pairs of CT/MR series. Landmark-based registration was performed to align image pairs. We compared 2D paired (Pix2Pix, denoising diffusion implicit models (DDIM) image mode, DDIM noise mode) and unpaired (contrastive unpaired translation, SynDiff) image-to-image translation, using peak signal-to-noise ratio (PSNR) as the quality measure. A publicly available segmentation network segmented the synthesized CT datasets, and Dice scores were evaluated on in-house test sets and the "MRSpineSeg Challenge" volumes. The 2D findings were extended to 3D Pix2Pix and DDIM. Results: The 2D paired methods and SynDiff exhibited similar translation performance and Dice scores on paired data. DDIM image mode achieved the highest image quality. SynDiff, Pix2Pix, and DDIM image mode demonstrated similar Dice scores (0.77). For craniocaudal axis rotations, at least two landmarks per vertebra were required for registration. The 3D translation outperformed the 2D approach, resulting in improved Dice scores (0.80) and anatomically accurate segmentations at a higher resolution than the original MR image. Conclusion: Registration with two landmarks per vertebra enabled paired image-to-image translation from MR to CT and outperformed all unpaired approaches. The 3D techniques provided anatomically correct segmentations, avoiding underprediction of small structures like the spinous process. Comment: 35 pages, 7 figures. Code and model weights are available at https://doi.org/10.5281/zenodo.8221159 and https://doi.org/10.5281/zenodo.819869
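
    For readers unfamiliar with DDIM-based translation, the sketch below shows the basic mechanism in PyTorch: a denoiser conditioned on the fixed MR slice iteratively refines a CT estimate with deterministic DDIM steps. The stand-in denoiser, toy schedule, and step count are placeholder assumptions, not the paper's configuration.

```python
# Deterministic DDIM sampling (eta = 0) for paired MR-to-CT translation:
# the MR slice is concatenated as a conditioning channel at every step.
import torch
import torch.nn as nn

eps_net = nn.Conv2d(2, 1, 3, padding=1)  # stand-in for a conditional U-Net denoiser

T = 50
alphas_bar = torch.linspace(0.999, 0.01, T)  # toy cumulative noise schedule

@torch.no_grad()
def ddim_mr_to_ct(mr, steps=T):
    x = torch.randn_like(mr)              # start the CT estimate from pure noise
    for t in range(steps - 1, 0, -1):
        a_t, a_prev = alphas_bar[t], alphas_bar[t - 1]
        eps = eps_net(torch.cat([x, mr], dim=1))          # noise prediction given MR
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()    # implied clean CT estimate
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM step
    return x0

mr_slice = torch.randn(1, 1, 64, 64)
ct_slice = ddim_mr_to_ct(mr_slice)
```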

    Personal recommender system via convolutional autoencoder with conditioning augmentation: Recommender system and representation learning with convolutional autoencoder

    Department of Industrial Engineering. Recently, the volume of unstructured data, including reviews, images, and videos with sound, has grown explosively. Although unstructured data is difficult to use without proper preprocessing, many applications, such as recommender systems, natural language processing, and computer vision, adopt it as a source of information and extract value from it. The recent success of deep learning has accelerated progress in this field by simplifying the preprocessing and feature-extraction steps, and the emergence of generative adversarial networks has improved the general performance of unsupervised learning models, enabling many applications to exploit diverse forms of data. In this context, this article focuses on (i) developing a recommender system based on a modified autoencoder, a deep learning technique commonly applied in this field that performs exceptionally well at feature extraction; (ii) applying data augmentation, which is frequently used to address data scarcity, one of the main challenges of recommender systems [3]; and (iii) showing that the proposed model can be applied to both collaborative filtering and content-based filtering with competitive performance. The modified convolutional-autoencoder-based recommender system learns features of samples represented by user reviews or user-item rating matrices. The model takes a vanilla autoencoder as its base structure, combined with convolutional layers that extract features from encoded vectors representing preprocessed reviews or ratings. A conditioning augmentation step, an augmentation technique for embedded vectors, is then applied, and a decoder produces the final predictions from the augmented encoded vector. The contributions of the paper can be summarized in three points: (i) conditioning augmentation, a data augmentation technique operating on the encoded vector, addresses the data scarcity problem that is a central concern in the recommender system field; (ii) the proposed model accepts both content-based and rating-based encoded vectors, where a content-based vector represents item features such as reviews, quality, and various categorical attributes of the corresponding product, and a rating-based vector represents numeric consumer evaluations; and (iii) the performance of the proposed model is competitive with the state of the art on several open benchmark datasets.
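
    The conditioning augmentation step named above is well known from text-to-image GANs (StackGAN): the encoded vector parameterizes a Gaussian, and the decoder receives a freshly resampled vector each pass, so scarce training data yields many slightly-varied embeddings. A minimal PyTorch sketch follows; dimensions and the KL weighting are illustrative assumptions, not the paper's values.

```python
# Conditioning augmentation via the reparameterization trick:
# encoded vector -> (mu, logvar) -> resampled vector for the decoder.
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    def __init__(self, enc_dim=128, cond_dim=64):
        super().__init__()
        self.mu = nn.Linear(enc_dim, cond_dim)
        self.logvar = nn.Linear(enc_dim, cond_dim)

    def forward(self, encoded):
        mu, logvar = self.mu(encoded), self.logvar(encoded)
        std = (0.5 * logvar).exp()
        eps = torch.randn_like(std)             # fresh noise on every pass
        # KL term keeps the augmented distribution close to N(0, I)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
        return mu + eps * std, kl               # augmented vector for the decoder

enc = torch.randn(32, 128)                      # batch of encoded reviews/ratings
aug, kl_loss = ConditioningAugmentation()(enc)  # feed `aug` to the decoder and
                                                # add `kl_loss` to the training loss
```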

    Efficient, end-to-end and self-supervised methods for speech processing and generation

    Deep learning has affected the speech processing and generation fields in many directions. First, end-to-end architectures allow the direct injection and synthesis of waveform samples. Secondly, the exploration of efficient solutions allows these systems to be implemented in computationally restricted environments, like smartphones. Finally, the latest trends exploit audio-visual data with minimal supervision. This thesis explores these three directions. Firstly, we propose the use of recent pseudo-recurrent structures, like self-attention models and quasi-recurrent networks, to build acoustic models for text-to-speech. The proposed system, QLAD, synthesizes faster on CPU and GPU than its recurrent counterpart whilst preserving the same good synthesis quality, which is competitive with state-of-the-art vocoder-based models. Then, a generative adversarial network is proposed for speech enhancement, named SEGAN. This model works as a speech-to-speech conversion system in the time domain, where a single inference operation over a fully convolutional structure processes all samples at once. This implies an increase in modelling efficiency with respect to other existing models, which are auto-regressive and also work in the time domain. SEGAN achieves prominent results in noise suppression and in preserving speech naturalness and intelligibility when compared to classic systems and deep regression-based systems. We also show that SEGAN transfers efficiently to new languages and noises: a SEGAN trained on English performs comparably on Catalan and Korean with only 24 seconds of adaptation data. Finally, we unveil the generative capacity of the model to recover signals from several distortions, and hence propose the concept of generalized speech enhancement. First, the model proves effective at recovering voiced speech from whispered speech. Then the model is scaled up to handle other distortions that require recomposing damaged parts of the signal, like extending the bandwidth or recovering lost temporal sections, among others. The model improves when additional acoustic losses are included in a multi-task setup to impose a relevant perceptual weighting on the generated result. Moreover, a two-step training schedule is proposed to stabilize the adversarial training after the addition of such losses, and both components boost SEGAN's performance across distortions.
    Finally, we propose a problem-agnostic speech encoder, named PASE, together with the framework to train it. PASE is a fully convolutional network that yields compact representations from speech waveforms. These representations contain abstract information like the speaker identity, the prosodic features, or the spoken contents. A self-supervised framework is also proposed to train this encoder, which represents a new step towards unsupervised learning for speech processing. Once the encoder is trained, it can be exported to solve different tasks that require speech as input. We first explore the performance of PASE codes for speaker recognition, emotion recognition, and speech recognition. PASE performs competitively against well-designed classic features in these tasks, especially after some supervised adaptation. Finally, PASE also provides good descriptors of identity for multi-speaker modeling in text-to-speech, which is advantageous for modeling novel identities without retraining the model.
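
    The key efficiency claim for SEGAN, that a single forward pass through a fully convolutional encoder-decoder enhances every sample at once, is illustrated by the PyTorch sketch below. The real SEGAN uses many more layers, a latent noise vector, and an adversarial discriminator; this tiny generator with placeholder sizes only demonstrates the non-autoregressive, time-domain data flow.

```python
# A tiny fully convolutional encoder-decoder over the raw waveform with a
# skip connection: one forward pass enhances the whole signal (no autoregression).
import torch
import torch.nn as nn

class TinySEGANGenerator(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.enc1 = nn.Conv1d(1, ch, 31, stride=2, padding=15)
        self.enc2 = nn.Conv1d(ch, 2 * ch, 31, stride=2, padding=15)
        self.dec2 = nn.ConvTranspose1d(2 * ch, ch, 32, stride=2, padding=15)
        self.dec1 = nn.ConvTranspose1d(2 * ch, 1, 32, stride=2, padding=15)
        self.act = nn.PReLU()

    def forward(self, noisy):
        # noisy: (batch, 1, samples) raw waveform
        h1 = self.act(self.enc1(noisy))
        h2 = self.act(self.enc2(h1))
        d2 = self.act(self.dec2(h2))
        d1 = self.dec1(torch.cat([d2, h1], dim=1))   # skip connection
        return torch.tanh(d1)                        # enhanced waveform

wav = torch.randn(1, 1, 16384)        # ~1 s of noisy speech at 16 kHz
clean = TinySEGANGenerator()(wav)     # single-pass, time-domain enhancement
```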


    The man_i said she_i would return: English pronominal gender in native Mandarin speaking learners, examined within a comprehensive theory of language acquisition

    The project that led to this honors thesis began in the Fall semester of 2010, in a graduate-level psycholinguistics course taught by Dr. T. Daniel Seely. At that time, I was intensively studying Mandarin and had been living with a native speaker who was also in the process of learning English. The types of speech errors in her English, particularly those that appeared to result from influence from her native Mandarin, interested me greatly. One of the most striking errors she tended to make, however, was mismatching English gender-marked pronouns with the gender of the referent. That is, she would frequently say things like "The man driving the bus said she could bring me to Ann Arbor," or "I love Lady Gaga, his style is so interesting."