10 research outputs found
An evolutionary approach for determining Hidden Markov Model for medical image analysis
Hidden Markov Model (HMM) is a technique highly capable of modelling the structure of an observation sequence. In this paper, HMM is used to provide the contextual information for detecting clinical signs present in diabetic retinopathy screening images. However, there is a need to determine a feature set that best represents the complexity of the data, as well as to determine an optimal HMM. This paper addresses these problems by automatically selecting the best feature set while evolving the structure and obtaining the parameters of a Hidden Markov Model. This novel algorithm not only selects the best feature set, but also identifies the topology of the HMM, the optimal number of states, and the initial transition probabilities. © 2012 IEEE
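The core of any such structure search is scoring a candidate HMM against the data: each candidate (topology, state count, parameters) is evaluated by the likelihood it assigns to the observation sequence, typically with a complexity penalty so larger models do not win by default. A minimal sketch of that scoring step, using the standard scaled forward algorithm and a BIC-style penalty (this is illustrative; it is not the paper's evolutionary algorithm, and the toy parameters below are invented):

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log P(obs | HMM) via the scaled forward algorithm.

    obs: int array of observed symbols; pi: initial state distribution;
    A[i, j]: transition prob i -> j; B[i, k]: prob state i emits symbol k.
    """
    alpha = pi * B[:, obs[0]]            # forward probs after first symbol
    c = alpha.sum()                      # scaling factor avoids underflow
    log_lik = np.log(c)
    alpha = alpha / c
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]
        c = alpha.sum()
        log_lik += np.log(c)
        alpha = alpha / c
    return log_lik

def bic_score(log_lik, n_states, n_symbols, seq_len):
    """Penalised likelihood for comparing candidate state counts."""
    # Free parameters: transitions, emissions, initial distribution.
    k = (n_states * (n_states - 1)
         + n_states * (n_symbols - 1)
         + (n_states - 1))
    return log_lik - 0.5 * k * np.log(seq_len)

# Toy two-state, two-symbol model (hypothetical numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
obs = np.array([0, 0, 1])
ll = forward_log_likelihood(obs, pi, A, B)
score = bic_score(ll, n_states=2, n_symbols=2, seq_len=len(obs))
```

An evolutionary search would generate many candidate (pi, A, B) structures, rank them by a score such as `bic_score`, and mutate or recombine the best, which is the general shape of the selection problem the paper addresses.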
Rapid Generation of Pronunciation Dictionaries for new Domains and Languages
This dissertation presents innovative strategies and methods for the rapid generation of pronunciation dictionaries for new domains and languages. Solutions are proposed and developed for a range of conditions, starting from the straightforward scenario, in which the target language is present in written form on the Internet and the mapping between speech and written language is close, up to the difficult scenario in which no written form for the target language exists.
Predicting user acceptance of Tamil speech to text by native Tamil Brahmans
This thesis investigates and predicts the user acceptance of a speech to text application in Tamil, taking the view that a user acceptance model needs to take into account the cultural constraints that apply in the context, and underlining the need for their more explicit recognition. User acceptance models such as the Technology Acceptance Model (TAM) predominantly focus on technological aspects to determine acceptance. They treat cultural variables as external, while at the same time acknowledging that external variables influence user acceptance. The contribution to knowledge is an empirical link between Tamil usage at a social level and the ability to use and accept a Tamil speech to text application. The economic value of Tamil does not seem to warrant technology use, and speech to text in Tamil was therefore found to be less acceptable in the study samples.
In order to predict the user acceptance of speech to text in Tamil by native Tamil speaking Brahmans, the researcher designed and evaluated a paper prototype of an iPhone iOS mobile application built on the idea of 'what you speak is what you get'. Drawing on the researcher's insider position, the idea was to convert speech as spoken by the person into Tamil orthography without any technological interference such as autocorrect, word prediction or spell check. Owing to the syllabic nature of the language and the cultural tendency to code-mix and code-switch, the investigation focused on three key areas: code-mixing, pronunciation and choice of script. This thesis looks at the complexities involved in accommodating these areas. The user's choice of script was increasingly important, as it cannot be assumed that all native Tamil speakers are able to read and write Tamil.
In order to bring in rich data, the researcher used insider and outsider positionality alongside phenomenology, which also served to mitigate potential bias in analysis and interpretation. A multidisciplinary approach to the research question was inevitable owing to cultural variables such as the value and usage of language and the social perception of language use, specifically code-switching, pronunciation and orthography in the native space.
Data gathering was done using a quantitative study of transliteration and qualitative interviews of Tamil speaking Brahmans. The findings point to the Vedic philosophical texts and practices that influenced the respondents' attitudes towards how words must be pronounced and how they ought to appear in text. The development of the speech to text application could be enriched by using a native approach that embeds cultural and philosophical values.
Based on the findings, this thesis has identified areas for further research: to test the user acceptance model proposed in this thesis more widely in order to aid the development of speech to text; to further investigate the native perspective in the wider diaspora; and to investigate the cultural and philosophical relevance of speech to text in other languages where the technology is at a developing stage.
Pronunciation modelling in end-to-end text-to-speech synthesis
Sequence-to-sequence (S2S) models in text-to-speech synthesis (TTS) can achieve high naturalness scores without extensive processing of the text input. Since S2S models have been proposed for multiple stages of the TTS pipeline, the field has moved toward End-to-End TTS (E2E-TTS), where a waveform is predicted directly from a sequence of text or phone characters. Early work on E2E-TTS in English, such as Char2Wav [1] and Tacotron [2], suggested that phonetisation (lexicon lookup and/or G2P modelling) could be learnt implicitly in a text encoder during training. The benefits of a learned text encoding include improved modelling of phonetic context, which makes the contextual linguistic features traditionally used in TTS pipelines redundant [3]. Subsequent work on E2E-TTS has since shown similar naturalness scores with text or phone input (e.g. [4]). Successful modelling of phonetic context has led some to question the benefit of using phone input instead of text input altogether (see [5]).
The use of text input brings into question the value of the pronunciation lexicon in E2E-TTS. Without phone input, an S2S encoder learns an implicit grapheme-to-phoneme (G2P) model from text-audio pairs during training. With common datasets for E2E-TTS in English, I simulated implicit G2P models, finding increased error rates compared to a traditional, lexicon-based G2P model. Ultimately, successful G2P generalisation is difficult for some words (e.g. foreign words and proper names), since the knowledge needed to disambiguate their pronunciations may not be provided by the local grapheme context and may lie beyond that contained in sentence-level text-audio sequences. When test stimuli were selected according to G2P difficulty, increased mispronunciations were observed in E2E-TTS with text input. Following the proposed benefits of subword decomposition in S2S modelling for other language tasks (e.g. neural machine translation), the effects of morphological decomposition on pronunciation modelling were investigated. Learning of the French post-lexical phenomenon liaison was also evaluated.
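The contrast at issue is between an explicit lexicon lookup and rule- or model-based G2P generalisation. A minimal sketch of a lexicon-first G2P with a naive fallback illustrates why opaque spellings defeat generalisation; the mini-lexicon and letter-to-phone rules below are hypothetical stand-ins for a real pronunciation dictionary such as CMUdict, not the thesis's models:

```python
# Hypothetical mini-lexicon (ARPAbet-style phones with stress digits).
LEXICON = {
    "cat": ["K", "AE1", "T"],
    "colonel": ["K", "ER1", "N", "AH0", "L"],  # spelling gives little clue
}

# Naive one-letter-one-phone rules: fine for regular words,
# badly wrong for opaque ones like "colonel".
RULES = {"c": "K", "a": "AE1", "t": "T", "o": "OW1",
         "l": "L", "n": "N", "e": "EH1", "r": "R"}

def g2p(word):
    """Return phones from the lexicon when available, else from naive rules."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return [RULES[ch] for ch in word if ch in RULES]
```

For `"colonel"` the lexicon returns the correct pronunciation, whereas the rule fallback alone would produce a spelling pronunciation; an implicit G2P model trained only on sentence-level text-audio pairs faces the same risk for rare or foreign words it has not seen.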
With the goal of an inexpensive, large-scale evaluation of pronunciation modelling, the reliability of automatic speech recognition (ASR) as a measure of TTS intelligibility was investigated. A re-evaluation of six years of results from the Blizzard Challenge was conducted. In controlled conditions in English, ASR reliably found the same significant differences between systems as paid listeners did. An analysis of transcriptions of words exhibiting difficult-to-predict G2P relations was also conducted. The E2E-ASR Transformer model used was found to be unreliable in its transcription of difficult G2P relations, owing to homophonic transcriptions and incorrect transcriptions of words with such relations. A further evaluation of representation mixing in Tacotron found that pronunciation correction is possible when mixing text and phone inputs. The thesis concludes that there is still a place for the pronunciation lexicon in E2E-TTS as a pronunciation guide, since it can provide assurances that G2P generalisation cannot.
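ASR-based intelligibility evaluations of this kind typically score the recogniser's transcript of synthetic speech against the input text with word error rate (WER). A self-contained sketch of the metric (illustrative only, not the thesis's evaluation pipeline):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = min edits to turn the first i ref words into the first j hyp words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub,           # substitution or match
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[-1][-1] / len(r)
```

Note that plain WER cannot distinguish a mispronunciation from a homophonic transcription, which is one reason the analysis above had to inspect transcriptions of G2P-difficult words directly.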