Croatian HMM-based Speech Synthesis
The paper describes the development of a trainable speech synthesis system based on hidden Markov models. An approach to speech signal generation using a source-filter model is presented. The inputs to the synthesis system are speech utterances and their phone-level transcriptions. A method for speech synthesis using context-dependent acoustic models and Croatian phonetic rules is proposed. Croatian HMM-based speech synthesis experiments are presented and the generated speech is discussed.
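The source-filter model mentioned above generates speech by driving a vocal-tract filter with a glottal excitation signal. A minimal sketch of the idea (all parameter values here are illustrative, not the paper's): an impulse train at the pitch period is shaped by an all-pole (LPC-style) filter.

```python
import numpy as np

def source_filter_synthesize(f0, fs, lpc_coeffs, n_samples):
    """Minimal source-filter sketch: an impulse-train excitation at pitch
    frequency f0 is shaped by an all-pole (LPC) vocal-tract filter."""
    period = int(fs / f0)
    excitation = np.zeros(n_samples)
    excitation[::period] = 1.0  # one glottal pulse per pitch period

    # All-pole filter: y[n] = x[n] - sum_k a_k * y[n-k]
    y = np.zeros(n_samples)
    for n in range(n_samples):
        acc = excitation[n]
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                acc -= a * y[n - k]
        y[n] = acc
    return y

# Hypothetical 2nd-order filter with a stable resonance, fs = 8 kHz, f0 = 100 Hz.
speech = source_filter_synthesize(f0=100.0, fs=8000, lpc_coeffs=[-1.6, 0.9], n_samples=800)
```

In an HMM-based synthesizer the filter coefficients and f0 are not fixed as here but predicted frame by frame from the trained context-dependent models.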
Automatsko raspoznavanje hrvatskoga govora velikoga vokabulara (Croatian Large-Vocabulary Automatic Speech Recognition)
This paper presents the procedures used to develop a Croatian large-vocabulary automatic speech recognition (LVASR) system. The proposed acoustic model is based on context-dependent triphone hidden Markov models and Croatian phonetic rules. Different acoustic and language models, developed using a large collection of Croatian speech, are discussed and compared. The paper proposes the feature vectors and acoustic modeling procedures with which the lowest word error rates for Croatian speech are achieved. In addition, Croatian language modeling procedures are evaluated and adapted for speaker-independent spontaneous speech recognition. The presented experiments and results show that the proposed approach, context-dependent acoustic modeling based on Croatian phonetic rules combined with a parameter-tying procedure, yields efficient Croatian large-vocabulary speech recognition with word error rates below 5%.
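The word error rate reported above is the standard ASR metric: the word-level Levenshtein distance between the recognizer output and a reference transcript, divided by the reference length. A self-contained sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / N, computed with a
    word-level Levenshtein (edit-distance) dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words, giving 0.5; a WER below 5% means fewer than one error per twenty reference words.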
Razvoj akustičkog modela hrvatskog jezika pomoću alata HTK (Development of an Acoustic Model for Croatian Using the HTK Toolkit)
This paper presents the development of an acoustic model of Croatian for automatic speech recognition (ASR). Continuous speech recognition is performed with hidden Markov models (HMMs) implemented in the HMM Toolkit (HTK). To adapt the HTK to the target language, a novel algorithm for Croatian language transcription (CLT) was developed. It is based on phonetic assimilation rules applied within uttered words. Phonetic questions for state tying of different triphone models were also developed, and an automated system for training and evaluating acoustic models was built and integrated with a new graphical user interface (GUI). The targeted applications of this ASR system are stress inoculation training (SIT) and virtual reality exposure therapy (VRET). Adaptability of the model to a closed set of speakers is important for such applications, and this paper investigates the applicability of the HTK tool in typical scenarios. The robustness of the tool to a new language was tested in matched conditions by parallel training of an English model used as a baseline. Ten native Croatian speakers participated in the experiments. Encouraging results were achieved with the developed model for Croatian.
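Rule-based transcription with phonetic assimilation, as in the CLT algorithm described above, can be sketched as a rewrite over a phone string. The rule table below is a hypothetical fragment illustrating Croatian regressive voicing assimilation, not the paper's actual rule set.

```python
# Hypothetical fragment of Croatian voicing-assimilation rules (illustrative only).
VOICED_TO_VOICELESS = {"b": "p", "d": "t", "g": "k", "z": "s", "ž": "š"}
VOICELESS = set("ptksšcčfh")

def assimilate(word):
    """Regressive voicing assimilation: a voiced obstruent devoices when it
    immediately precedes a voiceless obstruent. Scanning right to left lets
    the rule propagate through consonant clusters."""
    phones = list(word)
    for i in reversed(range(len(phones) - 1)):
        if phones[i] in VOICED_TO_VOICELESS and phones[i + 1] in VOICELESS:
            phones[i] = VOICED_TO_VOICELESS[phones[i]]
    return "".join(phones)
```

For instance, `assimilate("vrabca")` devoices the `b` before the voiceless `c`, yielding `"vrapca"`, which matches the pronounced form.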
Automatic Intonation Event Detection Using Tilt Model for Croatian Speech Synthesis
Text-to-speech (TTS) systems convert text into speech. Synthesized speech without prosody sounds unnatural and monotonous, so prosodic elements have to be modeled for the output to sound natural. Generating prosodic elements directly from text is a rather demanding task. Our final goals are building a complete prosodic model for Croatian and implementing it in our TTS system. In this work, we present one of the steps in implementing prosody in TTS: detection of intonation events using the Tilt intonation model. We propose a training procedure composed of several subtasks. First, we hand-labelled a set of utterances, marking four types of prosodic events in each. Then we trained HMMs and used them to mark prosodic events in a larger set of utterances. We estimated the parameters of each intonation event and generated f0 contours from them. Finally, we evaluated the obtained f0 contours.
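The Tilt model parameterizes each intonation event by amplitude, duration, and a single "tilt" value that summarizes the shape of its rise and fall. A minimal sketch of the tilt computation from the rise/fall amplitudes and durations (the formula follows Taylor's published Tilt model; the example values are invented):

```python
def tilt_parameters(a_rise, a_fall, d_rise, d_fall):
    """Tilt intonation model: collapse a rise/fall f0 event into one value
    in [-1, 1]; +1 is a pure rise, -1 a pure fall, 0 a symmetric rise-fall."""
    tilt_amp = (abs(a_rise) - abs(a_fall)) / (abs(a_rise) + abs(a_fall))
    tilt_dur = (d_rise - d_fall) / (d_rise + d_fall)
    return 0.5 * tilt_amp + 0.5 * tilt_dur

# A pure rise (no fall portion) has tilt exactly 1.0:
pure_rise = tilt_parameters(a_rise=30.0, a_fall=0.0, d_rise=0.2, d_fall=0.0)
# A symmetric rise-fall accent has tilt 0.0:
symmetric = tilt_parameters(a_rise=20.0, a_fall=20.0, d_rise=0.1, d_fall=0.1)
```

Generating an f0 contour from an event then amounts to inverting this parameterization back into rise/fall segments, which is the step the abstract refers to as generating f0 contours from the estimated parameters.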
Primjena automatskog međujezičnog akustičnog modeliranja na HMM sintezu govora za oskudne jezične baze (Applying Automatic Cross-Lingual Acoustic Modeling to HMM Speech Synthesis for Under-Resourced Language Databases)
Nowadays, human-computer interaction (HCI) can also be achieved with voice user interfaces (VUIs). To enable devices to communicate with humans by speech in the user's own language, low-cost language portability is often discussed and analysed. One of the most time-consuming parts of the language-adaptation process for VUI-capable applications is the acquisition of target-language speech data. Such data is used in the development of VUI subsystems, especially speech-recognition and speech-production systems. A tempting way to bypass this long-term data acquisition process is to design automatic algorithms that can extract similar target-language acoustics from speech databases of other languages. This paper focuses on cross-lingual phoneme mapping between an under-resourced and a well-resourced language. It proposes a novel automatic phoneme-mapping technique adopted from the speaker-verification field. The phoneme mapping is then used in the development of an HMM-based speech-synthesis system for the under-resourced language. The synthesised utterances are evaluated in a subjective listening test and compared, using an expert-knowledge cross-language method, against a baseline synthesis built from the under-resourced data alone. The results reveal that combining data from the well-resourced and under-resourced languages with the proposed phoneme-mapping technique can improve the quality of under-resourced-language speech synthesis.
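At its core, cross-lingual phoneme mapping assigns each phone of the under-resourced language to the acoustically closest phone of the donor language. The sketch below is a deliberately simple stand-in (nearest mean acoustic vector by Euclidean distance) for the speaker-verification-derived technique the paper actually proposes; all phone names and vectors are toy values.

```python
import numpy as np

def map_phonemes(target_stats, source_stats):
    """For each target-language phone, pick the donor-language phone whose
    mean acoustic feature vector is closest in Euclidean distance."""
    mapping = {}
    for t_phone, t_vec in target_stats.items():
        best = min(source_stats,
                   key=lambda s: np.linalg.norm(t_vec - source_stats[s]))
        mapping[t_phone] = best
    return mapping

# Toy example: mean feature vectors per phone (real systems would use
# statistics such as mean MFCCs estimated from the speech corpora).
mapping = map_phonemes(
    {"x": np.array([0.0, 0.0])},
    {"a": np.array([1.0, 0.0]), "b": np.array([5.0, 5.0])},
)
```

The resulting mapping lets donor-language recordings stand in for missing target-language phones when training the HMM synthesis models.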
Macedonian Speech Synthesis for Assistive Technology Applications
Speech technology is becoming ever more ubiquitous with the advance of speech-enabled devices and services. The use of speech synthesis in Augmentative and Alternative Communication (AAC) tools has facilitated the inclusion of individuals with speech impediments, allowing them to communicate with their surroundings using speech. Although there are numerous speech synthesis systems for the most spoken world languages, the offer for smaller languages is still limited. We propose and compare three models for Macedonian, built using parametric and deep learning techniques and trained on a newly recorded corpus. We target low-resource edge deployment for AAC and assistive technologies, such as communication boards and screen readers. The listening test results show that parametric speech synthesis performs comparably to the more advanced deep learning models. Since it also requires fewer resources and offers full speech-rate and pitch control, it is the preferred choice for building a Macedonian TTS system for this application scenario.
Comment: 5 pages, 1 figure, EUSIPCO conference 202
Cross-Lingual Neural Network Speech Synthesis Based on Multiple Embeddings
The paper presents a novel architecture and method for speech synthesis in multiple languages, in the voices of multiple speakers and in multiple speaking styles, even when speech from a particular speaker in the target language was not present in the training data. The method is based on applying neural network embeddings to combinations of speaker and style IDs, and also to phones in particular phonetic contexts, without any prior linguistic knowledge of their phonetic properties. This enables the network not only to efficiently capture similarities and differences between speakers and speaking styles, but also to establish appropriate relationships between phones belonging to different languages, and ultimately to produce synthetic speech in the voice of a certain speaker in a language that he or she has never spoken. The validity of the proposed approach has been confirmed through experiments with models trained on speech corpora of American English and Mexican Spanish. It has also been shown that the proposed approach supports the use of neural vocoders, i.e. they are able to produce synthesized speech of good quality even in languages they were not trained on.
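The multiple-embedding idea can be sketched as follows: separate learned lookup tables map speaker IDs, style IDs, and context-dependent phones to dense vectors, and these vectors are combined into one conditioning input for the acoustic decoder. All IDs, dimensions, and values below are hypothetical; random vectors stand in for the embeddings a real network would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding size; real systems use larger dimensions

# Hypothetical learned lookup tables: IDs -> dense vectors.
speaker_emb = {sid: rng.normal(size=DIM) for sid in ["spk_en_01", "spk_es_07"]}
style_emb   = {st:  rng.normal(size=DIM) for st in ["neutral", "expressive"]}
phone_emb   = {ph:  rng.normal(size=DIM) for ph in ["a", "e", "tʃ"]}

def frame_condition(speaker, style, phone):
    """Concatenate the three embeddings into one conditioning vector for the
    acoustic decoder. Because speaker, style, and phone are factored into
    independent tables, any speaker can be paired with any language's phones."""
    return np.concatenate([speaker_emb[speaker],
                           style_emb[style],
                           phone_emb[phone]])

# Cross-lingual combination: an English speaker's voice with a Spanish phone.
cond = frame_condition("spk_en_01", "neutral", "tʃ")
```

This factorization is what allows synthesis in a voice-language pairing never seen during training: the speaker vector is reused unchanged while the phone vectors come from the other language.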