10 research outputs found

    Evolving the structure of Hidden Markov models for micro aneurysms detection


    An evolutionary approach for determining Hidden Markov Model for medical image analysis

    A Hidden Markov Model (HMM) is a technique highly capable of modelling the structure of an observation sequence. In this paper, an HMM is used to provide contextual information for detecting clinical signs present in diabetic retinopathy screening images. However, there is a need to determine a feature set that best represents the complexity of the data, as well as to determine an optimal HMM. This paper addresses these problems by automatically selecting the best feature set while evolving the structure and obtaining the parameters of a Hidden Markov Model. This novel algorithm not only selects the best feature set, but also identifies the topology of the HMM, the optimal number of states, and the initial transition probabilities. © 2012 IEEE
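
    To make the idea concrete, the sketch below is a minimal, hypothetical illustration (not the paper's implementation; the observation alphabet, chromosome layout and genetic-algorithm settings are all assumptions): a chromosome encodes a feature mask, a state count and a transition topology, fitness is the forward-algorithm log-likelihood of discrete training sequences, and a small population is evolved by truncation selection and mutation.

# Illustrative sketch only: evolving HMM structure with a simple genetic
# algorithm over discrete observation sequences. Emission probabilities
# are random placeholders, and the synthetic symbol sequences stand in
# for quantised retinal-image features selected by the feature mask.
import numpy as np

rng = np.random.default_rng(0)
N_SYMBOLS = 8          # size of the discrete observation alphabet (assumed)
MAX_STATES = 6         # upper bound on the number of HMM states (assumed)

def random_chromosome(n_features):
    n = int(rng.integers(2, MAX_STATES + 1))
    return {"features": rng.integers(0, 2, n_features).astype(bool),
            "n_states": n,
            # Boolean mask of allowed transitions, i.e. the HMM topology.
            "topology": rng.integers(0, 2, (n, n)).astype(bool)}

def build_hmm(chrom):
    """Turn a chromosome into (initial, transition, emission) probabilities."""
    n = chrom["n_states"]
    trans = chrom["topology"].astype(float)
    trans[trans.sum(axis=1) == 0] = 1.0            # avoid dead-end states
    trans /= trans.sum(axis=1, keepdims=True)      # row-normalise
    emit = rng.dirichlet(np.ones(N_SYMBOLS), size=n)  # placeholder emissions
    init = np.full(n, 1.0 / n)
    return init, trans, emit

def log_likelihood(seq, init, trans, emit):
    """Scaled forward algorithm for one discrete observation sequence."""
    alpha = init * emit[:, seq[0]]
    ll = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in seq[1:]:
        alpha = (alpha @ trans) * emit[:, o]
        ll += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return ll

def fitness(chrom, sequences):
    init, trans, emit = build_hmm(chrom)
    return sum(log_likelihood(s, init, trans, emit) for s in sequences)

def mutate(chrom):
    child = {"features": chrom["features"].copy(),
             "n_states": chrom["n_states"],
             "topology": chrom["topology"].copy()}
    child["features"][rng.integers(len(child["features"]))] ^= True  # flip a feature bit
    r, c = rng.integers(child["n_states"], size=2)
    child["topology"][r, c] ^= True                                  # flip a transition
    return child

# Toy run on synthetic symbol sequences.
sequences = [rng.integers(0, N_SYMBOLS, 30) for _ in range(20)]
population = [random_chromosome(n_features=12) for _ in range(16)]
for _ in range(10):
    population.sort(key=lambda c: fitness(c, sequences), reverse=True)
    population = population[:8] + [mutate(c) for c in population[:8]]
print("best number of states:", population[0]["n_states"])

    In the paper's setting the feature mask would first select and quantise image features before scoring, and emissions would be estimated from data rather than drawn at random; the sketch only shows the shape of the evolutionary loop.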

    Rapid Generation of Pronunciation Dictionaries for new Domains and Languages

    This dissertation presents innovative strategies and methods for the rapid generation of pronunciation dictionaries for new domains and languages. Solutions are proposed and developed for a range of conditions, from the straightforward scenario in which the target language is present in written form on the Internet and the mapping between speech and written language is close, to the difficult scenario in which no written form of the target language exists.
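
    As a hedged illustration of the straightforward scenario only, the sketch below generates a first-pass pronunciation dictionary from a crawled word list using greedy longest-match grapheme-to-phoneme rules; the rule table, phone symbols and word list are invented examples, not the dissertation's rules.

# Illustrative sketch: rule-based bootstrap of a pronunciation dictionary
# for a language whose spelling is close to its pronunciation.
G2P_RULES = [            # longest-match-first grapheme -> phoneme rules (invented)
    ("sch", "S"), ("ch", "x"), ("ei", "aI"),
    ("a", "a"), ("e", "e"), ("i", "i"), ("o", "o"), ("u", "u"),
    ("b", "b"), ("d", "d"), ("f", "f"), ("g", "g"), ("h", "h"),
    ("k", "k"), ("l", "l"), ("m", "m"), ("n", "n"),
    ("p", "p"), ("r", "r"), ("s", "s"), ("t", "t"), ("w", "v"),
]

def grapheme_to_phoneme(word):
    """Greedy longest-match conversion of one written word."""
    phones, i = [], 0
    while i < len(word):
        for graph, phone in G2P_RULES:
            if word[i:].startswith(graph):
                phones.append(phone)
                i += len(graph)
                break
        else:                      # unknown grapheme: keep it as-is
            phones.append(word[i])
            i += 1
    return " ".join(phones)

# Build a dictionary from a (web-crawled) word list.
word_list = ["schein", "dach", "kind"]
pron_dict = {w: grapheme_to_phoneme(w) for w in word_list}
for w, p in pron_dict.items():
    print(f"{w}\t{p}")

    Such a rule-based first pass would normally be followed by manual correction or data-driven refinement; the harder scenarios described in the dissertation cannot be handled this simply.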

    Predicting user acceptance of Tamil speech to text by native Tamil Brahmans

    This thesis investigates and predicts the user acceptance of a speech-to-text application in Tamil, taking the view that a user acceptance model needs to take into account the cultural constraints that apply in the context, and underlining the need for their more explicit recognition. User acceptance models such as the Technology Acceptance Model (TAM) predominantly focus on technological aspects to determine acceptance. Cultural variables are treated as external, even though such models acknowledge that external variables influence user acceptance. The contribution to knowledge is an empirical link between Tamil usage at a social level and the ability to use and accept a Tamil speech-to-text application. The economic value of Tamil does not seem to warrant technology use, and speech to text in Tamil was therefore found to be less acceptable in the study samples. In order to achieve the objective of predicting the user acceptance of speech to text in Tamil by native Tamil-speaking Brahmans, the researcher designed and evaluated a paper prototype of an iPhone iOS mobile application based on the idea of 'what you speak is what you get'. Informed by the researcher's insider position, the idea was to convert speech as spoken by the person into Tamil orthography without any technological interference such as auto-correct, word prediction and spell check. Owing to the syllabic nature of the language and the cultural tendency to code-mix and code-switch, the investigation focused on three key areas: code mixing, pronunciation and choice of script. This thesis looks at the complexities involved in accommodating these areas. The user's choice of script was particularly important, as it cannot be assumed that all native Tamil speakers are able to read and write Tamil. In order to bring in rich data, the researcher used insider and outsider positionality alongside phenomenology, which also helped to overcome potential bias in analysis and interpretation. A multidisciplinary approach to answering the research question was unavoidable owing to cultural variables such as the value and usage of the language, and the social perception of language and its usage, specifically code-switching, pronunciation and orthography in the native space. Data gathering was done using a quantitative study of transliteration and qualitative interviews of Tamil-speaking Brahmans. The findings point to the Vedic philosophical texts and practices that influenced the respondents' attitudes on how words must be pronounced and how they ought to appear in text. The development of the speech-to-text application could be enriched by using a native approach that embeds cultural and philosophical values. Based on the findings, this thesis identifies areas for further research: widely testing the user acceptance model proposed in this thesis to aid the development of speech to text, further investigating the native perspective in the wider diaspora, and investigating the cultural and philosophical relevance of speech to text in other languages where the technology is at a developing stage.

    Pronunciation modelling in end-to-end text-to-speech synthesis

    Sequence-to-sequence (S2S) models in text-to-speech synthesis (TTS) can achieve high naturalness scores without extensive processing of the text input. Since S2S models have been proposed for multiple aspects of the TTS pipeline, the field has focused on moving the pipeline toward End-to-End (E2E-) TTS, where a waveform is predicted directly from a sequence of text or phone characters. Early work on E2E-TTS in English, such as Char2Wav [1] and Tacotron [2], suggested that phonetisation (lexicon lookup and/or G2P modelling) could be implicitly learnt by a text encoder during training. The benefits of a learned text encoding include improved modelling of phonetic context, which makes the contextual linguistic features traditionally used in TTS pipelines redundant [3]. Subsequent work on E2E-TTS has since shown similar naturalness scores with text or phone input (e.g. as in [4]). Successful modelling of phonetic context has led some to question the benefit of using phone instead of text input altogether (see [5]). The use of text input brings into question the value of the pronunciation lexicon in E2E-TTS. Without phone input, a S2S encoder learns an implicit grapheme-to-phoneme (G2P) model from text-audio pairs during training. With common datasets for E2E-TTS in English, I simulated implicit G2P models, finding increased error rates compared to a traditional, lexicon-based G2P model. Ultimately, successful G2P generalisation is difficult for some words (e.g. foreign words and proper names), since the knowledge needed to disambiguate their pronunciations may not be provided by the local grapheme context and may require knowledge beyond that contained in sentence-level text-audio sequences. When test stimuli were selected according to G2P difficulty, increased mispronunciations in E2E-TTS with text input were observed. Following the proposed benefits of subword decomposition in S2S modelling for other language tasks (e.g. neural machine translation), the effects of morphological decomposition on pronunciation modelling were investigated. Learning of the French post-lexical phenomenon liaison was also evaluated. With the goal of an inexpensive, large-scale evaluation of pronunciation modelling, the reliability of automatic speech recognition (ASR) as a measure of TTS intelligibility was investigated. A re-evaluation of 6 years of results from the Blizzard Challenge was conducted. In controlled conditions in English, ASR reliably found significant differences between systems similar to those found by paid listeners. An analysis of transcriptions for words exhibiting difficult-to-predict G2P relations was also conducted. The E2E-ASR Transformer model used was found to be unreliable in its transcription of such words, owing to homophonic and otherwise incorrect transcriptions. A further evaluation of representation mixing in Tacotron finds that pronunciation correction is possible when mixing text and phone inputs. The thesis concludes that there is still a place for the pronunciation lexicon in E2E-TTS as a pronunciation guide, since it can provide assurances that G2P generalisation cannot.
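
    As a rough illustration of the ASR-based evaluation (a generic word-error-rate computation, not the thesis code; the example sentences and the function name are invented), synthesised stimuli would be transcribed by an ASR system and scored against the input text in the same way that listener transcriptions are scored:

# Illustrative sketch: word error rate (WER) between a reference text and
# an ASR transcription of the synthesised speech, via word-level
# Levenshtein distance.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Homophonic ASR errors ("their" vs "there") count against the TTS system
# even when the pronunciation was correct, which is one source of the
# unreliability noted above for words with difficult G2P relations.
print(word_error_rate("the cellist played their encore",
                      "the cellist played there on call"))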