154 research outputs found
A Review of Accent-Based Automatic Speech Recognition Models for E-Learning Environment
The adoption of electronic learning (e-learning) as a method of disseminating knowledge in the global educational system is growing at a rapid rate, and has created a shift in knowledge acquisition methods from conventional classrooms and tutors to the distributed e-learning technique, which enables access to various learning resources much more conveniently and flexibly. However, notwithstanding the adaptive advantages of the learner-centric contents of e-learning programmes, the distributed e-learning environment has unconsciously adopted a few international languages as the languages of communication among the participants, despite the various accents (mother language influence) among these participants. Adjusting to and accommodating these various accents has brought about the introduction of accent-based automatic speech recognition into e-learning to resolve the effects of the accent differences. This paper reviews over 50 research papers to determine the progress made so far in the design and implementation of accent-based automatic speech recognition models for e-learning between 2001 and 2021. The analysis of the review shows that 50% of the models reviewed adopted the English language, 46.50% adopted the major Chinese and Indian languages and 3.50% adopted the Swedish language as the mode of communication. It is therefore found that the majority of the ASR models are centred on European, American and Asian accents, while unconsciously excluding the various accent peculiarities associated with the less technologically resourced continents.
Analysis Of Variation In The Number Of MFCC Features In Contrast To LSTM In The Classification Of English Accent Sounds
Various studies have been carried out to classify English accents using traditional classifiers and modern classifiers. In general, previous research on voice classification and voice recognition uses the MFCC method for voice feature extraction. The stages in this study began with importing datasets and preprocessing the data, followed by MFCC feature extraction, model training, testing of model accuracy and display of a confusion matrix for model accuracy. After that, an analysis of the classification was carried out. The overall results of the 10 tests on the test set show the highest accuracy, 64.96%, for 17 MFCC features. The test results also yield some important information. The results for MFCC coefficient counts of twelve to twenty show overfitting: the model training process repeatedly produces high accuracy, but the classification testing process produces low accuracy. Assigning a higher number of MFCC features produces a very large sound feature dimension. Given the large number of features obtained, a weakness of the MFCC method lies in determining the number of features.
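The evaluation step described in this abstract (testing accuracy and displaying a confusion matrix) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code; the function names and the accent labels in the usage note are hypothetical.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) label pairs; rows are true classes, columns predicted."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0
```

Calling `confusion_matrix(["us", "uk", "uk", "in"], ["us", "uk", "us", "in"])` gives one row per true accent class. Comparing `accuracy` on the training set with `accuracy` on the test set is one simple way to spot the overfitting the abstract reports.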
Representation Learning for Spoken term Detection
Spoken Term Detection (STD) is the task of searching for a given spoken query word in a large speech database. Applications of STD include speech data indexing, voice dialling, telephone monitoring and data mining. The performance of STD depends mainly on the representation of the speech signal and the matching of the represented signal. This work investigates methods for robust representation of the speech signal, invariant to speaker variability, in the context of the STD task. Here the representation is in the form of templates, each a sequence of feature vectors. The typical representation in the speech community, Mel-Frequency Cepstral Coefficients (MFCC), carries both speech-specific and speaker-specific information, hence the need for a better representation. Searching is done by matching the sequences of feature vectors of the query and reference utterances using Subsequence Dynamic Time Warping (DTW). The performance of the proposed representation is evaluated on Telugu broadcast news data. In the absence of labelled data, i.e., in an unsupervised setting, we propose to capture the joint density of the acoustic space spanned by MFCCs using Gaussian Mixture Models (GMM) and Gaussian-Bernoulli Restricted Boltzmann Machines (GBRBM). Posterior features extracted from the trained models are used to search for the query word. Improvements of 8% and 12% in STD performance over MFCC are observed when using GMM and GBRBM posterior features respectively. As transcribed data is not required, this approach is a suitable solution for low-resource languages. However, due to its intermediate performance, this method cannot be an immediate solution for high-resource languages.
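The matching step named in this abstract, Subsequence DTW, can be sketched as below. This is a minimal textbook-style version under stated assumptions, not the authors' implementation: frames are plain tuples of floats, the local distance is Euclidean, and the allowed steps are the three standard ones. The function name is hypothetical.

```python
import math


def subsequence_dtw(query, reference):
    """Find the reference span that best matches the whole query.

    Returns (cost, start, end), where start and end are inclusive
    frame indices into reference.
    """
    n, m = len(query), len(reference)

    def dist(a, b):
        # Euclidean distance between two feature vectors (an assumption).
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    start = [[0] * m for _ in range(n)]

    # The match may begin at any reference frame at no extra cost.
    for j in range(m):
        cost[0][j] = dist(query[0], reference[j])
        start[0][j] = j

    for i in range(1, n):
        for j in range(m):
            candidates = [(cost[i - 1][j], start[i - 1][j])]
            if j > 0:
                candidates.append((cost[i][j - 1], start[i][j - 1]))
                candidates.append((cost[i - 1][j - 1], start[i - 1][j - 1]))
            best_cost, best_start = min(candidates)
            cost[i][j] = best_cost + dist(query[i], reference[j])
            start[i][j] = best_start

    # The match may also end at any reference frame.
    end = min(range(m), key=lambda j: cost[n - 1][j])
    return cost[n - 1][end], start[n - 1][end], end
```

Unlike ordinary DTW, the free start and free end let a short query align to any stretch of a long utterance. The same routine would apply whether the frames are MFCC vectors or the posterior features the abstract proposes.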
Deep Learning for Automatic Assessment and Feedback of Spoken English
Growing global demand for learning a second language (L2), particularly English, has led to
considerable interest in automatic spoken language assessment, whether for use in computer-assisted language learning (CALL) tools or for grading candidates for formal qualifications.
This thesis presents research conducted into the automatic assessment of spontaneous non-native English speech, with a view to being able to provide meaningful feedback to learners. One
of the challenges in automatic spoken language assessment is giving candidates feedback on
particular aspects, or views, of their spoken language proficiency, in addition to the overall
holistic score normally provided. Another is detecting pronunciation and other types of errors
at the word or utterance level and feeding them back to the learner in a useful way.
It is usually difficult to obtain accurate training data with separate scores for different
views and, as examiners are often trained to give holistic grades, single-view scores can
suffer issues of consistency. Conversely, holistic scores are available for various standard
assessment tasks such as Linguaskill. An investigation is thus conducted into whether
assessment scores linked to particular views of the speaker’s ability can be obtained from
systems trained using only holistic scores.
End-to-end neural systems are designed with structures and forms of input tuned to single
views, specifically each of pronunciation, rhythm, intonation and text. By training each
system on large quantities of candidate data, individual-view information should be possible
to extract. The relationships between the predictions of each system are evaluated to examine
whether they are, in fact, extracting different information about the speaker. Three methods
of combining the systems to predict holistic score are investigated, namely averaging their
predictions and concatenating and attending over their intermediate representations. The
combined graders are compared to each other and to baseline approaches.
The tasks of error detection and error tendency diagnosis become particularly challenging
when the speech in question is spontaneous and particularly given the challenges posed by
the inconsistency of human annotation of pronunciation errors. An approach to these tasks is
presented by distinguishing between lexical errors, wherein the speaker does not know how a
particular word is pronounced, and accent errors, wherein the candidate’s speech exhibits
consistent patterns of phone substitution, deletion and insertion. Three annotated corpora
of non-native English speech by speakers of multiple L1s are analysed, the consistency of
human annotation investigated and a method presented for detecting individual accent and
lexical errors and diagnosing accent error tendencies at the speaker level
Consequences of bi-literacy in bilingual individuals: in the healthy and neurologically impaired
Background. In the current global, cross-cultural scenario, being bilingual or multilingual is a norm rather than an exception. In such an environment an individual may be actively involved in reading and writing in all their languages in addition to speaking them. Regular use of two or more languages is termed bilingualism, and being able to read and write in both of them is referred to as bi-literacy. Research indicates that bilingualism has an impact on language production and cognition, specifically executive functions. Given the impact of literacy and bilingualism, the reasonable question that arises is whether bi-literacy would offer an additional impact on language production and cognition. This becomes even more relevant in a multilingual, multi-cultural society such as India. We examined the impact of bi-literacy on oral language production (at word and connected speech level), comprehension and on non-verbal executive function measures in bi-literate bilingual healthy adults in an immigrant diaspora living in the UK. In addition to English, they were speakers of one of the South Indian languages (Kannada, Malayalam, Tamil and Telugu). The significance of bi-literacy among bilinguals assumes further importance in aphasia (language impairment due to brain damage). For those who have aphasia in one or more languages due to brain damage, the severity of impairment may be different in the two languages, and the modalities of language may also be differentially affected. In particular, reading and writing may be impaired differently in the languages used by a bi/multilingual. The manifestation of reading impairments is also dependent on the nature of the script of the language being read [e.g., Raman & Weekes (2005) report differential dyslexia in a Turkish-English speaker who exhibited surface dyslexia in English and deep dysgraphia in Turkish].
Our study contributes to the field of bilingual aphasia by focusing specifically on reading differing from the existing literature of aphasia in bilinguals, where the focus has predominantly been on language production and comprehension. Studying reading impairments provides a better understanding of how the reading impairments are manifested in the two languages, which will aid appropriate assessment and intervention. This research investigated the impact of bi-literacy in both populations (healthy adults and neurologically impaired) in two phases: Phase I (in UK) and Phase II (in India).
Aim. Phase I investigated the impact of bi-literacy on oral language production (at word level and connected speech), comprehension and non-verbal executive function in bi-literate bilingual healthy adults. Phase II examined the reading impairments in two languages of bilingual persons with aphasia (BPWA).
Methods. For Phase I, participants were thirty-four bi-literate bilingual healthy adults with English as their L2 and one of the Dravidian languages (Kannada, Malayalam, Tamil and Telugu) as their L1. We have used the term ‘print exposure’ as a proxy for literacy. They were divided into a high print exposure (HPE, n=22) and a low print exposure (LPE, n=12) group based on their performance on two tasks measuring L2 print exposure- grammaticality judgement task and sentence verification task. We also quantified their bilingual characteristics- proficiency, reading and writing characteristics and dominance. The groups were matched on years of education, age and gender. Participants completed a set of oral language production tasks in L2 (at word level) namely -verbal fluency, word and non-word repetition; comprehension tasks in L2 namely synonymy triplets task and sentence comprehension task (Chapter 2); oral narrative task in L2 (at connected speech level) (Chapter 3) followed by non-verbal executive function tasks tapping into inhibitory control (Spatial Stroop and Flanker tasks), working memory (visual n-back and auditory n-back) and task switching (colour-shape task) (Chapter 4).
For Phase II, we characterized the reading abilities of four BPWA who spoke one of the Dravidian languages (Kannada, Tamil, Telugu) (alpha-syllabic) as their L1 and English (alphabetic) as their L2. We quantified their bilingual characteristics- proficiency, reading and writing characteristics and dominance. Subtests from the Psycholinguistic Assessment of Language Processing in Aphasia (PALPA; Kay, Lesser & Coltheart, 1992) were used to document the reading profile of BPWA in English and reading subtests from Reading Acquisition Profile (RAP-K; Rao, 1997) and words from Bilingual Aphasia test -Hindi (BAT; Paradis & Libben, 1987) were used to document the reading profile of BPWA in Kannada and Hindi respectively.
Findings. In Phase I (i.e., results from Chapters 2-4), we found prominent differences between HPE and LPE on comprehension measures (synonymy triplets and sentence comprehension tasks). This is in contrast to the results observed in monolingual adults, where semantics is less impacted by print exposure. Moreover, our predictions that HPE would result in better oral language production skills were borne out in specific conditions: the semantic fluency and non-word repetition tasks (at word level), and a higher number of words in the narrative, more verbs per utterance and fewer repetitions (at connected speech level). For the non-verbal executive functions, we found no direct link between print exposure (in L2) and non-verbal executive functions in bi-literate bilinguals, except for working memory (auditory N-back task). Additionally, there appears to be a strong and consistent link between print exposure and semantic processing in our research. The findings on the semantic tasks have been consistent across comprehension (synonymy triplets task and sentence comprehension task) and production (semantic fluency), favouring HPE. The findings from Phase II (Chapter 5) reveal differences in reading characteristics between the two languages (with different scripts) of the four BPWA. This research provides preliminary evidence that a script-related difference exists in the manifestation of dyslexia in bi-scriptal BPWA speaking a combination of alphabetic and alpha-syllabic languages.
Conclusions. Our research contributes to the existing literature by highlighting the relationship between bi-literacy and language production, comprehension and non-verbal cognition, where bi-literacy seems to have a higher impact on language than on cognition. The findings, which run contrary to the literature on monolinguals and children, highlight the importance of considering the nuances of bilingual research and specifically challenge the notion that semantic comprehension is not significantly affected by literacy. In the neurologically impaired population, our research provides a comprehensive profiling of reading abilities in BPWA in the Indian population with languages having different scripts. Using this profiling and classification, we are able to affirm previous findings in the literature emphasizing the importance of script in the assessment of reading abilities in BPWA. Such profiling and classification assist in the development of bilingual models of reading aloud and in classifying different types of reading impairments.
Genealogies of the Citizen-Devotee: Popular Cinema, Religion and Politics in South India
This dissertation is a genealogical study of the intersections between popular cinema, popular religion and politics in South India. It proceeds with a particular focus on the discursive field of Telugu cinema as well as religion and politics in the state of Andhra Pradesh from roughly the 1950s to the 2000s. By discursive field of cinema, I refer to not only filmic texts, but also disciplines of film making, practices of publicity, modes of film criticism as well as practices of viewership all of which are an inalienable part of the institution of cinema. Telugu cinema continued to produce mythological and devotional films based mostly on Hindu myths and legends many decades after they ceased to be major genres in Hindi and many other Indian languages. This was initially seen simply as an example of the insufficiently modernized and secularized nature of the South Indian public, and of the enduring nature of Indian religiosity. However, these films acquired an even greater notoriety later. In 1982, N.T. Rama Rao, a film star who starred in the roles of Hindu gods like Rama and Krishna in many mythologicals set up a political party, contested and won elections, and became the Chief Minister of the state, all in the space of a year. For many political and social commentators this whirlwind success could only be explained by the power of his cinematic image as god and hero! The films thus came to be seen as major contributing factors in the unusual and undesirable alliance between cinema, religion and politics. This dissertation does not seek to refute the links between these different fields; on the contrary it argues that the cinema is a highly influential and popular cultural institution in India and as such plays a very significant role in mediating both popular religion and politics. Hence, we need a fuller critical exploration of the intersections and overlaps between these realms that we normally think ought to exist in independent spheres. 
This dissertation contributes to such an exploration. A central argument this dissertation makes is about the production of the figure of the citizen-devotee through cinema and other media discourses. Through the use of this hyphenated word, citizen-devotee, this study points to the mutual and fundamental imbrication of the two ideas and concepts. In our times, the citizen and devotee do not and cannot exist as independent figures but necessarily contaminate each other. On the one hand, the citizen-devotee formulation indicates that the citizen ideal is always traversed by, and shot through with other formations of subjectivity that inflect it in significant ways. On the other hand, it points to the incontrovertible fact that in modern liberal democracies, it is impossible to simply be a devotee (bhakta) where one's allegiance is only to a particular faith or mode of being. On the contrary, willingly or unwillingly one is enmeshed in the discourse of rights and duties, subjected to the governance of the state, the politics of identity and the logics of majority and minority and so on. Religion as we know it today is itself the product of an encounter with modern rationalities of power and the modern media. Hence, we cannot simply talk about the citizen or the devotee, but only of the modern hybrid formation, the citizen-devotee. The first full length study of the Telugu mythological and devotional films, this dissertation combines a historical account of Telugu cinema with an anthropology of film making and viewership practices. It draws on film and media theory to foreground the specificity of these technologies and the new kind of publics they create. Anthropological theories of religion, secularism and the formation of embodied and affective subjects are combined with political theories of citizenship and governmentality to complicate our understanding of the overlapping formations of film spectators, citizens and devotees
Foreigner-directed speech and L2 speech learning in an understudied interactional setting: the case of foreign-domestic helpers in Oman
Ph. D. (Integrated) Thesis. Set in Arabic-speaking Oman, the present study investigates whether speech directed to foreign domestic helpers (FDH-directed speech) is modified when compared with speech addressed to native Arabic speakers. It also explores the FDHs’ ability to learn the sound system of their L2 in a near-naturalistic setting. In relation to input, the study explores whether there are any adaptations in native speakers’ realizations of complex Arabic consonants, consonant clusters, and vowels in FDH-directed speech. By doing so, it compares the phonetic features of FDH-directed speech in relation to other speech registers such as foreigner-directed speech (FDS), infant-directed speech (IDS) and clear speech. The study also investigates whether foreign accentedness, religion and Arabic language experience, as indexed by length of residence (LoR), play a role in the extent of adaptations present in FDH-directed speech. In relation to L2 speech learning, the study investigates the extent to which FDHs are sensitive to the phonemic contrasts of Arabic and whether their production of complex Arabic consonants and consonant clusters is target-like. It also examines the social and linguistic factors (LoR, first and second language literacy) that play a role in the learnability of these sounds.
Speech recordings were collected from 22 Omani female native Arabic speakers who interacted 1) with their FDHs and 2) with a native-speaking adult (the order was reversed for half of the participants), in both instances using a spot-the-difference task. A picture naming task was then used to collect production data from the same FDHs, while perception data were collected with an AX forced-choice task.
Results demonstrate the distinctiveness of FDH-directed speech from other speech registers. Neither simplification of complex sounds nor hyperarticulation of consonant contrasts was attested in FDH-directed speech, despite both being reported in other studies on FDS and IDS. We attribute this to the familiarity of the native speakers with their FDHs and the formulaic nature of their daily interactions. Expansion of vowel space was evident in this study, conforming with other FDS studies. Results from perception and production tasks revealed that FDHs fell short of native-like performance, despite the more naturalistic setting and regardless of LoR. L1 and L2 literacy played varying roles in FDHs’ phonological sensitivity and production of certain contrasts. The study is original in terms of showing that FDS is not an
automatic outcome of interactions with L2 speakers, and it links these results with the unusual social setting.
Code switching, language mixing and fused lects : language alternation phenomena in multilingual Mauritius
Focusing on a series of multiparty recordings carried out between the months of October and March 2012 and drawing on a theoretical framework based on work of linguists such as Auer (1999), Backus (2005), Bakker (2000), Maschler (2000) and Matras (2000a and 2000b), this thesis traces the evolution of a continuum of language alternation phenomena, ranging from simple code-switching to more complex forms of 'language alloying' (Alvarez- Càccamo 1998) such as mixed codes and fused lects in multilingual Mauritius. Following Auer (2001), the different conversational loci of code-switching are identified. Particular emphasis has been placed upon, amongst others, the conversational locus of playfulness where, for instance, participants' spontaneous lapses into song and dance sequences as they inspire themselves from Bollywood pop songs and creatively embed segments in Hindustani within a predominantly Kreol matrix are noted. Furthermore, in line with Auer (1999), Backus (2005) and Muysken (2000), emerging forms of language mixing such as changes in the way possessive marking is carried in Kreol and instances of semantic shift in Bhojpuri/ Hindustani words like nasha and daan have been highlighted and their pragmatic significance explained with specific reference to the Mauritian context. Finally, in the fused lect stage, specific attention has been provided to one key feature namely phonological blending which has resulted in the coinage of the discourse marker ashe and its eventual use in the process of discourse marker switching. In the light of the above findings, this thesis firstly critiques the strengths and weaknesses of the notion of the code switching (CS) continuum (Auer 1999) itself by revealing the difficulties encountered, at the empirical level, in assigning the correct label to the different types of language alternation phenomena evidenced in this thesis. 
In the second instance, it considers the impact of such shifts along the language alternation continuum upon language policy and planning in contemporary Mauritius and advocates for a move away from colonial language policies such as the 1957 Education Act in favour of updated ones that are responsive to the language practices of speakers. Linguistics and Modern Languages. D. Litt. et Phil. (Linguistics)
Identifying Speaker State from Multimodal Cues
Automatic identification of speaker state is essential for spoken language understanding, with broad potential in various real-world applications. However, most existing work has focused on recognizing a limited set of emotional states using cues from a single modality. This thesis describes my research that addresses these limitations and challenges associated with speaker state identification by studying a wide range of speaker states, including emotion and sentiment, humor, and charisma, using features from speech, text, and visual modalities.
The first part of this thesis focuses on emotion and sentiment recognition in speech. Emotion and sentiment recognition is one of the most studied topics in speaker state identification and has gained increasing attention in speech research recently, with extensive emotional speech models and datasets published every year. However, most work focuses only on recognizing a set of discrete emotions in high-resource languages such as English, while in real-life conversations, emotion is changing continuously and exists in all spoken languages. To address the mismatch, we propose a deep neural network model to recognize continuous emotion by combining inputs from raw waveform signals and spectrograms. Experimental results on two datasets show that the proposed model achieves state-of-the-art results by exploiting both waveforms and spectrograms as input. Due to the higher number of existing textual sentiment models than speech models in low-resource languages, we also propose a method to bootstrap sentiment labels from text transcripts and use these labels to train a sentiment classifier in speech. Utilizing the speaker state information shared across modalities, we extend speech sentiment recognition from high-resource languages to low-resource languages. Moreover, using the natural verse-level alignment in the audio Bibles across different languages, we also explore cross-lingual and cross-modality sentiment transfer.
In the second part of the thesis, we focus on recognizing humor, whose expression is related to emotion and sentiment but has very different characteristics. Unlike emotion and sentiment, which can be identified by crowdsourced annotators, humorous expressions are highly individualistic and culture-specific, making it hard to obtain reliable labels. This results in a lack of data annotated for humor, and thus we propose two different methods to automatically and reliably label humor. First, we develop a framework for generating humor labels on videos by learning from extensive user-generated comments. We collect and analyze 100 videos, building multimodal humor detection models using speech, text, and visual features, which achieve an F1-score of 0.76. In addition to humorous videos, we also develop another framework for generating humor labels on social media posts by learning from user reactions to Facebook posts. We collect 785K posts with humor and non-humor scores and build models to detect humor with performance comparable to human labelers.
The third part of the thesis focuses on charisma, a commonly found but less studied speaker state with unique challenges -- the definition of charisma varies a lot among perceivers, and the perception of charisma also varies with speakers' and perceivers' different demographic backgrounds. To better understand charisma, we conduct the first gender-balanced study of charismatic speech, including speakers and raters from diverse backgrounds. We collect personality and demographic information from the rater as well as their own speech, and examine individual differences in the perception and production of charismatic speech. We also extend the work to politicians' speech by collecting speaker trait ratings on representative speech segments of politicians and study how the genre, gender, and the rater's political stance influence the charisma ratings of the segments
Exploring traditional and metropolitan Indian arts using the Muggu tradition as a case study
The past century has witnessed fervent debates about dichotomies in Indian art,
articulated variously as high and low art, art and craft, and fine and decorative art. The current avatar of such dichotomies is expressed as a divide between metropolitan and traditional art. The former is understood to be that which is displayed and marketed in urban art institutions and associated with individualism; the latter is generally qualified by terms like folk, religious, ritual, rural or tribal, displayed and sold in non-institutional contexts and associated with a collective identity. Despite frequent attempts to resolve the above-mentioned dichotomies, such hierarchies persist. Indian art is currently experiencing a resurgence, which some see more as a by-product of a rapidly growing economy, rather than as an explicitly artistic maturing. Notwithstanding this recent boom, many writers and artists lament the state of Indian cultural institutions. One such critic is Rustom Bharucha, whose essay on Indian museums provides one of the starting points for this study.
The difficulty of reconciling the modern and the traditional appears to lie at the heart of these issues – a problem that both metropolitan and traditional artists face. In this project, I consider myself as an example of a metropolitan Indian artist and the issues I encountered as possibly characteristic of those that other metropolitan artists face. As a case study of traditional arts, I look at muggus, floor-drawings made by women in Andhra Pradesh, south India. Their ephemerality, ritualism and aesthetics furnish relevant instances for a discussion on metropolitan and traditional arts, challenging existing stereotypes and prejudices in the display, production and discourse of traditional arts. This study crosses the academic boundaries of anthropology, art-practice, art history, cultural theory, ethnography and visual culture to allow for a more layered exploration of Indian metropolitan and traditional arts