
    Deep Learning Enabled Semantic Communications with Speech Recognition and Synthesis

    In this paper, we develop a deep learning based semantic communication system for speech transmission, named DeepSC-ST. We take speech recognition and speech synthesis as the transmission tasks of the communication system. First, the speech recognition-related semantic features are extracted for transmission by a joint semantic-channel encoder, and the text is recovered at the receiver from the received semantic features, which significantly reduces the amount of data that must be transmitted without degrading performance. Then, speech synthesis is performed at the receiver, re-generating the speech signals by feeding the recognized text and the speaker information into a neural network module. To make DeepSC-ST adaptive to dynamic channel environments, we identify a robust model that copes with different channel conditions. According to the simulation results, the proposed DeepSC-ST significantly outperforms conventional communication systems and existing DL-enabled communication systems, especially in the low signal-to-noise ratio (SNR) regime. A software demonstration is further developed as a proof of concept of DeepSC-ST.
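    The pipeline described above can be pictured as three learned stages around a noisy channel. The following is a minimal PyTorch-style sketch of that structure; the module names, dimensions, and the AWGN channel model are illustrative assumptions, not the authors' implementation.

```python
# Minimal PyTorch sketch of a DeepSC-ST-style pipeline (illustrative only).
# Module names, dimensions and the AWGN channel model are assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn

class SemanticChannelEncoder(nn.Module):
    """Maps speech features (e.g. filterbanks) to compact semantic symbols."""
    def __init__(self, feat_dim=80, sem_dim=16):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 128, batch_first=True)
        self.proj = nn.Linear(128, sem_dim)   # joint semantic-channel coding

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.proj(h)                   # (batch, time, sem_dim)

def awgn_channel(z, snr_db):
    """Additive white Gaussian noise channel at a given SNR."""
    power = z.pow(2).mean()
    noise_power = power / (10 ** (snr_db / 10))
    return z + torch.randn_like(z) * noise_power.sqrt()

class TextDecoder(nn.Module):
    """Recovers a character posteriorgram from received semantic symbols."""
    def __init__(self, sem_dim=16, vocab=30):
        super().__init__()
        self.rnn = nn.GRU(sem_dim, 128, batch_first=True)
        self.out = nn.Linear(128, vocab)

    def forward(self, z):
        h, _ = self.rnn(z)
        return self.out(h).log_softmax(-1)    # e.g. trained with a CTC loss

if __name__ == "__main__":
    feats = torch.randn(2, 100, 80)                  # dummy filterbank features
    enc, dec = SemanticChannelEncoder(), TextDecoder()
    received = awgn_channel(enc(feats), snr_db=0)    # low-SNR regime
    log_probs = dec(received)
    print(log_probs.shape)                           # torch.Size([2, 100, 30])
    # A separate TTS module (text + speaker ID -> waveform) would re-generate
    # speech from the decoded transcript at the receiver.
```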

    German-Arabic Speech-to-Speech Translation for Psychiatric Diagnosis

    In this paper, we present the Arabic-related natural language processing components of our German–Arabic speech-to-speech translation system, which is being deployed in the context of interpretation during psychiatric diagnostic interviews. For this purpose we have built a pipelined speech-to-speech translation system consisting of automatic speech recognition, machine translation, text post-processing, and speech synthesis systems. We have implemented two pipelines, from German to Arabic and vice versa, to conduct interpreted two-way dialogues between psychiatrists and potential patients. All systems in our pipeline have been realized as all-neural end-to-end systems, using different architectures suitable for the different components. The speech recognition systems use an encoder/decoder + attention architecture, the machine translation system is based on the Transformer architecture, the post-processing for Arabic employs a sequence tagger for diacritization, and for the speech synthesis systems we use Tacotron 2 for generating spectrograms and WaveGlow as a vocoder. The speech translation is deployed in a server-based speech translation application that implements turn-based translation between a German-speaking psychiatrist administering the Mini-International Neuropsychiatric Interview (M.I.N.I.) and an Arabic-speaking person answering the interview. As this is a very specific domain, in addition to the linguistic challenges posed by translating between Arabic and German, we also focus in this paper on the methods we implemented for adapting our speech-to-speech translation system to the domain of this psychiatric interview.
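    The cascade described above (ASR, MT, Arabic post-processing, TTS) can be sketched as a simple composition of components. The snippet below shows only the data flow, with placeholder callables standing in for the actual models; none of the authors' components are reproduced here.

```python
# Illustrative cascade for one German -> Arabic speech-to-speech turn.
# The callables passed in are placeholders (assumptions), not the authors'
# actual components; only the pipeline structure follows the paper.
from dataclasses import dataclass
from typing import Callable

@dataclass
class S2STPipeline:
    asr: Callable[[bytes], str]          # encoder/decoder + attention ASR
    mt: Callable[[str], str]             # Transformer-based MT
    postprocess: Callable[[str], str]    # e.g. sequence-tagging diacritizer
    tts: Callable[[str], bytes]          # Tacotron 2 + WaveGlow vocoder

    def translate_turn(self, audio_in: bytes) -> bytes:
        text_src = self.asr(audio_in)            # German transcript
        text_tgt = self.mt(text_src)             # raw Arabic translation
        text_tgt = self.postprocess(text_tgt)    # add diacritics for TTS
        return self.tts(text_tgt)                # Arabic speech for playback

# Usage with trivial stand-ins, just to show the data flow:
pipeline = S2STPipeline(
    asr=lambda audio: "Wie fühlen Sie sich heute?",
    mt=lambda de: "كيف تشعر اليوم؟",
    postprocess=lambda ar: ar,
    tts=lambda ar: b"\x00\x01",  # would be a waveform in practice
)
print(pipeline.translate_turn(b"...pcm bytes..."))
```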

    Voice Based Email System

    As technology advances, applications can be made more user-friendly. This Voice Based Email System is developed for people who are physically challenged and for those who simply want more convenient access. With the advent of technology, many solutions have been implemented so that people can benefit from them. Taking this as the key idea, we propose to develop an application, the Voice Based Email System, which allows any person to access email functionality in a hassle-free manner. We use a 'Text to Speech' converter and a 'Speech to Text' converter, named 'Speech Recognition Anywhere', to facilitate sending and reading emails. The speech synthesis can read any written text aloud, avoiding eye strain and saving time spent reading on a computer. The existing email system, its drawbacks, and our proposed methodology to overcome them are discussed in this paper. Related work that has already been done is referenced and taken as a guideline for completing our system.
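    As a rough illustration of the voice-driven email flow, the sketch below uses common Python libraries (speech_recognition, pyttsx3, smtplib) as stand-ins for the browser-based 'Speech Recognition Anywhere' tool used in the paper; all addresses, hosts, and credentials are placeholders.

```python
# Minimal sketch of a voice-driven email flow. The libraries used here are
# common stand-ins, not the tool described in the paper; every address and
# credential below is a placeholder.
import smtplib
from email.message import EmailMessage

import pyttsx3                 # offline text-to-speech
import speech_recognition as sr

def speak(text: str) -> None:
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

def listen(prompt: str) -> str:
    speak(prompt)
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)   # speech-to-text

def send_voice_email() -> None:
    msg = EmailMessage()
    msg["From"] = "user@example.com"            # placeholder sender
    msg["To"] = listen("Who is the recipient?")
    msg["Subject"] = listen("What is the subject?")
    msg.set_content(listen("Please dictate your message."))
    with smtplib.SMTP("smtp.example.com", 587) as server:   # placeholder host
        server.starttls()
        server.login("user@example.com", "app-password")    # placeholder login
        server.send_message(msg)
    speak("Your email has been sent.")

if __name__ == "__main__":
    send_voice_email()
```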

    Statistical text-to-speech synthesis of Spanish subtitles

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-13623-3_5

    Online multimedia repositories are growing rapidly. However, language barriers are often difficult to overcome for many of the current and potential users. In this paper we describe a Spanish TTS system and apply it to the synthesis of transcribed and translated video lectures. A statistical parametric speech synthesis system, in which the acoustic mapping is performed with either HMM-based or DNN-based acoustic models, has been developed. To the best of our knowledge, this is the first time that a DNN-based TTS system has been implemented for the synthesis of Spanish. A comparative objective evaluation between both models has been carried out. Our results show that DNN-based systems can reconstruct speech waveforms more accurately.

    The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755 (transLectures), from the ICT Policy Support Programme (ICT PSP/2007-2013), as part of the Competitiveness and Innovation Framework Programme (CIP), under grant agreement no. 621030 (EMMA), and from the Spanish MINECO Active2Trans (TIN2012-31723) research project.

    Piqueras Gozalbes, S.R.; Del Agua Teba, M.A.; Giménez Pastor, A.; Civera Saiz, J.; Juan Císcar, A. (2014). Statistical text-to-speech synthesis of Spanish subtitles. In: Advances in Speech and Language Technologies for Iberian Languages: Second International Conference, IberSPEECH 2014, Las Palmas de Gran Canaria, Spain, November 19-21, 2014, Proceedings. Springer International Publishing, pp. 40-48. https://doi.org/10.1007/978-3-319-13623-3_5
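    The DNN-based variant evaluated in this paper replaces the HMM acoustic models with a network that maps frame-level linguistic context features to vocoder parameters. The sketch below shows that acoustic mapping in PyTorch under assumed feature dimensions and layer sizes; it is not the paper's configuration.

```python
# Sketch of the DNN acoustic mapping in statistical parametric TTS:
# frame-level linguistic/contextual features in, vocoder parameters
# (mel-cepstrum, log-F0, aperiodicity, voicing) out. Layer sizes and
# feature dimensions are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

LING_DIM = 400                    # binary/numeric context features per frame
ACOUSTIC_DIM = 40 + 1 + 5 + 1     # mcep + lf0 + aperiodicity + voiced flag

dnn = nn.Sequential(
    nn.Linear(LING_DIM, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, ACOUSTIC_DIM),
)

def train_step(batch_ling, batch_acoustic, opt, loss_fn=nn.MSELoss()):
    """One minimum-mean-squared-error update, as in standard DNN-TTS recipes."""
    opt.zero_grad()
    loss = loss_fn(dnn(batch_ling), batch_acoustic)
    loss.backward()
    opt.step()
    return loss.item()

if __name__ == "__main__":
    opt = torch.optim.SGD(dnn.parameters(), lr=0.01)   # stochastic gradient descent
    x = torch.randn(256, LING_DIM)                     # dummy frames
    y = torch.randn(256, ACOUSTIC_DIM)
    print(train_step(x, y, opt))
    # At synthesis time the predicted parameter trajectories are smoothed
    # (e.g. using dynamic features) and passed to a vocoder to produce audio.
```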

    Measuring the gap between HMM-based ASR and TTS

    The EMIME European project is conducting research in the development of technologies for mobile, personalised speech-to-speech translation systems. The hidden Markov model is being used as the underlying technology in both the automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components; thus, the investigation of unified statistical modelling approaches has become an implicit goal of our research. As one of the first steps towards this goal, we have been investigating commonalities and differences between HMM-based ASR and TTS. In this paper we present results and analysis of a series of experiments conducted on English ASR and TTS systems, measuring their performance with respect to phone set and lexicon, acoustic feature type and dimensionality, and HMM topology. Our results show that, although the fundamental statistical model may be essentially the same, optimal ASR and TTS performance often demands diametrically opposed system designs. This represents a major challenge to be addressed in the investigation of such unified modelling approaches.
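    As a rough illustration of why "diametrically opposed system designs" arise, the snippet below contrasts textbook-style HMM configurations for ASR and TTS along the dimensions studied in the paper (features, dimensionality, topology); the concrete values are common defaults, not the optima measured in these experiments.

```python
# Typical (textbook-style) HMM configurations for ASR vs. TTS, illustrating
# the design dimensions compared in the paper. The concrete values are
# common defaults, not the optima measured in the experiments.
ASR_CONFIG = {
    "acoustic_features": "MFCC + deltas + delta-deltas",
    "feature_dim": 39,
    "hmm_states_per_phone": 3,
    "context": "cross-word triphones, decision-tree clustered",
    "output": "word sequence via lexicon + language model",
}

TTS_CONFIG = {
    "acoustic_features": "mel-cepstrum + log-F0 + aperiodicity (+ dynamics)",
    "feature_dim": 120,            # much higher-dimensional than ASR front-ends
    "hmm_states_per_phone": 5,
    "context": "full-context labels (syllable, stress, phrase position, ...)",
    "output": "parameter trajectories for a vocoder",
}

for key in ASR_CONFIG:
    print(f"{key:>22}: ASR={ASR_CONFIG[key]!r}  |  TTS={TTS_CONFIG[key]!r}")
```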

    Speech Processes for Brain-Computer Interfaces

    Speech interfaces have become widely used and are integrated in many applications and devices. However, speech interfaces require the user to produce intelligible speech, which might be hindered by loud environments, concern about bothering bystanders, or a general inability to produce speech due to disabilities. Decoding a user's imagined speech instead of actual speech would solve this problem. Such a Brain-Computer Interface (BCI) based on imagined speech would enable fast and natural communication without the need to actually speak out loud. These interfaces could provide a voice to otherwise mute people. This dissertation investigates BCIs based on speech processes using functional Near Infrared Spectroscopy (fNIRS) and Electrocorticography (ECoG), two brain activity imaging modalities on opposing ends of the invasiveness scale. Brain activity data have a low signal-to-noise ratio and complex spatio-temporal and spectral coherence. To analyze these data, techniques from the areas of machine learning, neuroscience, and Automatic Speech Recognition are combined in this dissertation to facilitate robust classification of detailed speech processes while simultaneously illustrating the underlying neural processes. fNIRS is an imaging modality based on cerebral blood flow. It requires only affordable hardware and can be set up within minutes in a day-to-day environment; it is therefore ideally suited for convenient user interfaces. However, the hemodynamic processes measured by fNIRS are slow in nature and the technology therefore offers poor temporal resolution. We investigate speech in fNIRS and demonstrate classification of speech processes for BCIs based on fNIRS. ECoG provides ideal signal properties by invasively measuring electrical potentials artifact-free directly on the brain surface. High spatial resolution and temporal resolution down to millisecond sampling provide localized information with accurate enough timing to capture the fast processes underlying speech production. This dissertation presents the Brain-to-Text system, which harnesses automatic speech recognition technology to decode a textual representation of continuous speech from ECoG. This could allow users to compose messages or to issue commands through a BCI. While decoding a textual representation is unparalleled for device control and typing, direct communication is even more natural if the full expressive power of speech, including emphasis and prosody, could be provided. For this purpose, a second system is presented which directly synthesizes neural signals into audible speech and could enable conversation with friends and family through a BCI. Up to now, both the Brain-to-Text and the synthesis system operate on audibly produced speech. To bridge the gap to the final frontier of neural prostheses based on imagined speech processes, we investigate the differences between audibly produced and imagined speech and present first results towards BCIs based on imagined speech processes. This dissertation demonstrates the usage of speech processes as a paradigm for BCI for the first time. Speech processes offer a fast and natural interaction paradigm which will help patients and healthy users alike to communicate with computers and with friends and family efficiently through BCIs.
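    As a toy illustration of the frame-wise decoding idea behind Brain-to-Text, the sketch below classifies synthetic "high-gamma" feature frames into phone-like units with a linear discriminant; the real system uses ECoG features and full ASR-style decoding, so everything here is a simplified stand-in on fabricated data.

```python
# Toy sketch of frame-wise neural-signal decoding in a Brain-to-Text style
# pipeline: broadband-gamma-like features per frame -> phone class posteriors,
# which a dictionary and language model would then turn into text. All data
# here are synthetic; the real system works on recorded ECoG.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_frames, n_channels, n_phones = 2000, 64, 10

# Synthetic "high-gamma power" features with a weak label-dependent shift.
labels = rng.integers(0, n_phones, size=n_frames)
features = rng.normal(size=(n_frames, n_channels)) + labels[:, None] * 0.1

clf = LinearDiscriminantAnalysis()
clf.fit(features[:1500], labels[:1500])          # train on early frames
posteriors = clf.predict_proba(features[1500:])  # phone posteriorgram
print("held-out frame accuracy:", clf.score(features[1500:], labels[1500:]))
# In the full system such posteriors feed a Viterbi/beam-search decoder with
# a pronunciation dictionary and language model to produce a word sequence.
```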

    Deep learning based neural network speech synthesis for the Finnish language and its performance in an embedded system

    Speech synthesis is an integral part of modern TTS systems, which are more widely used today than ever before. Modern TTS systems can produce natural and varied speech for a growing number of purposes, such as accessibility features. When a deep learning based neural network for Finnish speech synthesis is deployed as part of a social robot, synthesis should be fast enough and the synthesized speech should be natural and of high quality, so that interaction between the user and the robot is as good as possible. The speech synthesis system should therefore function as part of the social robot alongside possible conversational-logic and speech recognition modules. The developed speech synthesis module takes text as input, synthesizes it into a raw audio waveform written to a file, and plays the resulting audio through a speaker. Traditional methods of speech synthesis, which include formant synthesis, HMM-based synthesis, and concatenative synthesis, have been supplanted by modern neural network based approaches. To achieve high-quality synthesis, the neural network should be trained on as extensive a dataset as possible, covering the phonemes and word stresses of the Finnish language. The end-to-end deep learning based speech synthesis model VITS was chosen for this work. The model was trained in the Google Colab environment, which provides the resources needed for training. A neural network typically consists of many simple interconnected processing units, also called neurons; learning means finding suitable weights for the connections between these neurons so that the network behaves as desired. In this work, the trained network learned to produce waveform speech from the text input given to it. The developed and trained speech synthesis model was evaluated on a subjective MOS scale. The model was able to run on a Raspberry Pi 4 Model B single-board computer, but not fast enough for natural real-time conversation. Since the performance of the Raspberry Pi 4 Model B was insufficient for real-time conversation, the TTS system was instead implemented in the robot's own ROS2 environment, which runs on more powerful hardware and where it can be used together with potential conversational AI and speech recognition modules.
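    Whether a model is fast enough for real-time conversation is usually judged by its real-time factor (RTF), the ratio of synthesis time to the duration of the generated audio. The sketch below measures RTF with a placeholder `synthesize` function standing in for the trained VITS model; the sine wave merely imitates generated speech.

```python
# Sketch of measuring the real-time factor (RTF = synthesis time / audio
# duration) used to decide whether a TTS model can serve a real-time dialogue.
# `synthesize` is a placeholder for the trained VITS model.
import time
import numpy as np

SAMPLE_RATE = 22050

def synthesize(text: str) -> np.ndarray:
    """Placeholder: the real module would run VITS inference on the text."""
    duration_s = 0.08 * len(text)                     # rough speech-length guess
    t = np.linspace(0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    return 0.1 * np.sin(2 * np.pi * 220 * t)          # dummy waveform

def real_time_factor(text: str) -> float:
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / SAMPLE_RATE)

rtf = real_time_factor("Hyvää päivää, kuinka voin auttaa?")
print(f"RTF = {rtf:.3f}  ({'real-time capable' if rtf < 1.0 else 'too slow'})")
# RTF < 1 means the system generates speech faster than it plays back,
# which is the practical threshold for conversational use.
```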

    Phonetic Event-based Whole-Word Modeling Approaches for Speech Recognition

    Speech is composed of basic speech sounds called phonemes, and these subword units are the foundation of most speech recognition systems. While detailed acoustic models of phones (and phone sequences) are common, most recognizers model words themselves as a simple concatenation of phonemes and do not closely model the temporal relationships between phonemes within words. Human speech production is constrained by the movement of the speech articulators, and there is abundant evidence that human speech recognition is inextricably linked to the temporal patterns of speech sounds. Structures such as the hidden Markov model (HMM) have proved extremely useful and effective because they offer a convenient framework for combining acoustic modeling of phones with powerful probabilistic language models. However, this convenience masks deficiencies in temporal modeling. Additionally, robust recognition requires complex automatic speech recognition (ASR) systems and entails non-trivial computational costs. As an alternative, we extend previous work on the point process model (PPM) for keyword spotting, an approach to speech recognition expressly based on whole-word modeling of the temporal relations of phonetic events. In our research, we have investigated and advanced a number of major components of this system. First, we have considered alternative methods of determining phonetic events from phone posteriorgrams. We have introduced several parametric approaches to modeling intra-word phonetic timing distributions, which allow us to cope with data sparsity issues. We have substantially improved the algorithms used to compute keyword detections, capitalizing on the sparse nature of the phonetic input, which permits the system to be scaled to large data sets. We have considered enhanced CART-based modeling of phonetic timing distributions based on related text-to-speech synthesis work. Lastly, we have developed a point process based spoken term detection system and applied it to the conversational telephone speech task of the 2006 NIST Spoken Term Detection evaluation. We demonstrate the PPM system to be competitive with state-of-the-art phonetic search systems while requiring significantly fewer computational resources.
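    Two of the components mentioned above, extracting sparse phonetic events from a phone posteriorgram and scoring a keyword from the timing of those events, can be sketched as follows; the peak-picking rule, the rate model, and all numbers are simplified assumptions rather than the system's actual formulation.

```python
# Minimal sketch: (1) turn a phone posteriorgram into sparse phonetic events
# by peak-picking, and (2) score a detection window with a simplified
# inhomogeneous-Poisson-style log-likelihood ratio over normalized event
# times. Thresholds and rates are illustrative assumptions.
import numpy as np

def phonetic_events(posteriorgram: np.ndarray, threshold: float = 0.5):
    """posteriorgram: (frames, phones). Returns a list of (frame, phone) peaks."""
    events = []
    for p in range(posteriorgram.shape[1]):
        track = posteriorgram[:, p]
        for t in range(1, len(track) - 1):
            if track[t] > threshold and track[t] >= track[t - 1] and track[t] >= track[t + 1]:
                events.append((t, p))
    return events

def keyword_score(events, window, word_rates, background_rates, n_bins=10):
    """Simplified PPM-style score for one detection window.

    word_rates: (n_bins, phones) expected event rates at normalized positions
    background_rates: (phones,) expected event rates under the background model
    """
    start, end = window
    score = 0.0
    for t, p in events:
        if start <= t < end:
            pos = int(n_bins * (t - start) / (end - start))   # normalized time bin
            score += np.log(word_rates[pos, p] + 1e-8) - np.log(background_rates[p] + 1e-8)
    return score

# Tiny synthetic example: a 3-phone "keyword" whose events occur in order.
rng = np.random.default_rng(1)
post = rng.uniform(0, 0.3, size=(60, 3))
post[10, 0] = post[30, 1] = post[50, 2] = 0.9        # clear phonetic events
rates = np.full((10, 3), 0.1)
rates[1, 0] = rates[5, 1] = rates[8, 2] = 3.0        # keyword timing model
print(keyword_score(phonetic_events(post), (0, 60), rates, np.full(3, 0.5)))
```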