Search CORE

34 research outputs found

Recommended from our members

Deep Learning for Automatic Assessment and Feedback of Spoken English

Author: Kyriakopoulos Konstantinos
Publication venue: University of Cambridge
Publication date: 12/03/2022
Field of study

Growing global demand for learning a second language (L2), particularly English, has led to considerable interest in automatic spoken language assessment, whether for use in computerassisted language learning (CALL) tools or for grading candidates for formal qualifications. This thesis presents research conducted into the automatic assessment of spontaneous nonnative English speech, with a view to be able to provide meaningful feedback to learners. One of the challenges in automatic spoken language assessment is giving candidates feedback on particular aspects, or views, of their spoken language proficiency, in addition to the overall holistic score normally provided. Another is detecting pronunciation and other types of errors at the word or utterance level and feeding them back to the learner in a useful way. It is usually difficult to obtain accurate training data with separate scores for different views and, as examiners are often trained to give holistic grades, single-view scores can suffer issues of consistency. Conversely, holistic scores are available for various standard assessment tasks such as Linguaskill. An investigation is thus conducted into whether assessment scores linked to particular views of the speaker’s ability can be obtained from systems trained using only holistic scores. End-to-end neural systems are designed with structures and forms of input tuned to single views, specifically each of pronunciation, rhythm, intonation and text. By training each system on large quantities of candidate data, individual-view information should be possible to extract. The relationships between the predictions of each system are evaluated to examine whether they are, in fact, extracting different information about the speaker. Three methods of combining the systems to predict holistic score are investigated, namely averaging their predictions and concatenating and attending over their intermediate representations. The combined graders are compared to each other and to baseline approaches. The tasks of error detection and error tendency diagnosis become particularly challenging when the speech in question is spontaneous and particularly given the challenges posed by the inconsistency of human annotation of pronunciation errors. An approach to these tasks is presented by distinguishing between lexical errors, wherein the speaker does not know how a particular word is pronounced, and accent errors, wherein the candidate’s speech exhibits consistent patterns of phone substitution, deletion and insertion. Three annotated corpora x of non-native English speech by speakers of multiple L1s are analysed, the consistency of human annotation investigated and a method presented for detecting individual accent and lexical errors and diagnosing accent error tendencies at the speaker level

Apollo (Cambridge)

Identifying prosodic prominence patterns for English text-to-speech synthesis

Author: Badino Leonardo
Publication venue: The University of Edinburgh
Publication date: 01/01/2010
Field of study

This thesis proposes to improve and enrich the expressiveness of English Text-to-Speech (TTS) synthesis by identifying and generating natural patterns of prosodic prominence. In most state-of-the-art TTS systems the prediction from text of prosodic prominence relations between words in an utterance relies on features that very loosely account for the combined effects of syntax, semantics, word informativeness and salience, on prosodic prominence. To improve prosodic prominence prediction we first follow up the classic approach in which prosodic prominence patterns are flattened into binary sequences of pitch accented and pitch unaccented words. We propose and motivate statistic and syntactic dependency based features that are complementary to the most predictive features proposed in previous works on automatic pitch accent prediction and show their utility on both read and spontaneous speech. Different accentuation patterns can be associated to the same sentence. Such variability rises the question on how evaluating pitch accent predictors when more patterns are allowed. We carry out a study on prosodic symbols variability on a speech corpus where different speakers read the same text and propose an information-theoretic definition of optionality of symbolic prosodic events that leads to a novel evaluation metric in which prosodic variability is incorporated as a factor affecting prediction accuracy. We additionally propose a method to take advantage of the optionality of prosodic events in unit-selection speech synthesis. To better account for the tight links between the prosodic prominence of a word and the discourse/sentence context, part of this thesis goes beyond the accent/no-accent dichotomy and is devoted to a novel task, the automatic detection of contrast, where contrast is meant as a (Information Structure’s) relation that ties two words that explicitly contrast with each other. This task is mainly motivated by the fact that contrastive words tend to be prosodically marked with particularly prominent pitch accents. The identification of contrastive word pairs is achieved by combining lexical information, syntactic information (which mainly aims to identify the syntactic parallelism that often activates contrast) and semantic information (mainly drawn from the Word- Net semantic lexicon), within a Support Vector Machines classifier. Once we have identified patterns of prosodic prominence we propose methods to incorporate such information in TTS synthesis and test its impact on synthetic speech naturalness trough some large scale perceptual experiments. The results of these experiments cast some doubts on the utility of a simple accent/no-accent distinction in Hidden Markov Model based speech synthesis while highlight the importance of contrastive accents

Edinburgh Research Archive

Speech Synthesis Based on Hidden Markov Models

Author: Nankaku Y.
Oura K.
Toda T.
Tokuda K.
Yamagishi J.
Zen H.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/05/2013
Field of study

Edinburgh Research Explorer

Computational modelling of coreference and bridging resolution

Author: Rösiger Ina
Publication venue
Publication date: 01/01/2019
Field of study

Unsupervised learning for text-to-speech synthesis

Author: Watts Oliver Samuel
Publication venue: The University of Edinburgh
Publication date: 02/07/2013
Field of study

This thesis introduces a general method for incorporating the distributional analysis of textual and linguistic objects into text-to-speech (TTS) conversion systems. Conventional TTS conversion uses intermediate layers of representation to bridge the gap between text and speech. Collecting the annotated data needed to produce these intermediate layers is a far from trivial task, possibly prohibitively so for languages in which no such resources are in existence. Distributional analysis, in contrast, proceeds in an unsupervised manner, and so enables the creation of systems using textual data that are not annotated. The method therefore aids the building of systems for languages in which conventional linguistic resources are scarce, but is not restricted to these languages. The distributional analysis proposed here places the textual objects analysed in a continuous-valued space, rather than specifying a hard categorisation of those objects. This space is then partitioned during the training of acoustic models for synthesis, so that the models generalise over objects' surface forms in a way that is acoustically relevant. The method is applied to three levels of textual analysis: to the characterisation of sub-syllabic units, word units and utterances. Entire systems for three languages (English, Finnish and Romanian) are built with no reliance on manually labelled data or language-specific expertise. Results of a subjective evaluation are presented

Edinburgh Research Archive

Subsidia: Tools and Resources for Speech Sciences

Author: Lahoz-Bengoechea José María (Ed)
Pérez Ramón Rubén (Ed)
Publication venue
Publication date: 01/01/2019
Field of study

Este libro, resultado de la colaboración de investigadores expertos en sus respectivas áreas, pretende ser una ayuda a la comunidad científica en tanto en cuanto recopila y describe una serie de materiales de gran utilidad para seguir avanzando en la investigació

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Institucional Universidad de Málaga

Recommended from our members

Turn-Taking and Affirmative Cue Words in Task-Oriented Dialogue

Author: Gravano Agustin
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 15/12/2001
Field of study

As interactive voice response systems spread at a rapid pace, providing an increasingly more complex functionality, it is becoming clear that the challenges of such systems are not solely associated to their synthesis and recognition capabilities. Rather, issues such as the coordination of turn exchanges between system and user, or the correct generation and understanding of words that may convey multiple meanings, appear to play an important role in system usability. This thesis explores those two issues in the Columbia Games Corpus, a collection of spontaneous task-oriented dialogues in Standard American English. We provide evidence of the existence of seven turn-yielding cues -- prosodic, acoustic and syntactic events strongly associated with conversational turn endings -- and show that the likelihood of a turn-taking attempt from the interlocutor increases linearly with the number of cues conjointly displayed by the speaker. We present similar results related to six backchannel-inviting cues -- events that invite the interlocutor to produce a short utterance conveying continued attention. Additionally, we describe a series of studies of affirmative cue words -- a family of cue words such as 'okay' or 'alright' that speakers use frequently in conversation for several purposes: for acknowledging what the interlocutor has said, or for cueing the start of a new topic, among others. We find differences in the acoustic/prosodic realization of such functions, but observe that contextual information figures prominently in human disambiguation of these words. We also conduct machine learning experiments to explore the automatic classification of affirmative cue words. Finally, we examine a novel measure of speaker entrainment related to the usage of these words, showing its association with task success and dialogue coordination

Columbia University Academic Commons

TamPub Julkaisuarkisto - TamPub Institutional Repository

Trepo - Institutional Repository of Tampere University

Information structure and the prosodic structure of English : a probabilistic relationship

Author: Calhoun Sasha
Publication venue: The University of Edinburgh
Publication date: 27/06/2007
Field of study

This work concerns how information structure is signalled prosodically in English, that is, how prosodic prominence and phrasing are used to indicate the salience and organisation of information in relation to a discourse model. It has been standardly held that information structure is primarily signalled by the distribution of pitch accents within syntax structure, as well as intonation event type. However, we argue that these claims underestimate the importance, and richness, of metrical prosodic structure and its role in signalling information structure. We advance a new theory, that information structure is a strong constraint on the mapping of words onto metrical prosodic structure. We show that focus (kontrast) aligns with nuclear prominence, while other accents are not usually directly 'meaningful'. Information units (theme/rheme) try to align with prosodic phrases. This mapping is probabilistic, so it is also influenced by lexical and syntactic effects, as well as rhythmical constraints and other features including emphasis. Rather than being directly signalled by the prosody, the likelihood of each information structure interpretation is mediated by all these properties. We demonstrate that this theory resolves problematic facts about accent distribution in earlier accounts and makes syntactic focus projection rules unnecessary. Previous theories have claimed that contrastive accents are marked by a categorically distinct accent type to other focal accents (e.g. L+H* v H*). We show this distinction in fact involves two separate semantic properties: contrastiveness and theme/rheme status. Contrastiveness is marked by increased prominence in general. Themes are distinguished from rhemes by relative prominence, i.e. the rheme kontrast aligns with nuclear prominence at the level of phrasing that includes both theme and rheme units. In a series of production and perception experiments, we directly test our theory against previous accounts, showing that the only consistent cue to the distinction between theme and rheme nuclear accents is relative pitch height. This height difference accords with our understanding of the marking of nuclear prominence: theme peaks are only lower than rheme peaks in rheme-theme order, consistent with post-nuclear lowering; in theme-rheme order, the last of equal peaks is perceived as nuclear. The rest of the thesis involves analysis of a portion of the Switchboard corpus which we have annotated with substantial new layers of semantic (kontrast) and prosodic features, which are described. This work is an essentially novel approach to testing discourse semantics theories in speech. Using multiple regression analysis, we demonstrate distributional properties of the corpus consistent with our claims. Plain and nuclear accents are best distinguished by phrasal features, showing the strong constraint of phrase structure on the perception of prominence. Nuclear accents can be reliably predicted by semantic/syntactic features, particularly kontrast, while other accents cannot. Plain accents can only be identified well by acoustic features, showing their appearance is linked to rhythmical and low-level semantic features. We further show that kontrast is not only more likely in nuclear position, but also if a word is more structurally or acoustically prominent than expected given its syntactic/information status properties. Consistent with our claim that nuclear accents are distinctive, we show that pre-, post- and nuclear accents have different acoustic profiles; and that the acoustic correlates of increased prominence vary by accent type, i.e. pre-nuclear or nuclear. Finally, we demonstrate the efficacy of our theory compared to previous accounts using examples from the corpus

Edinburgh Research Archive

Recommended from our members

Turn-Taking and Affirmative Cue Words in Task-Oriented Dialogue

Author: Gravano Agustin
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2009
Field of study

Columbia University Academic Commons

ProQuest OAI Repository

Prosodic prominence in English

Author: Im Suyeon
Publication venue
Publication date: 01/12/2018
Field of study

In English, certain words are perceptually more salient than other neighboring words. The perceptual salience is signaled by acoustic cues. Prominent words are higher, longer, or louder than nonprominent words in English. Perceptual prominence is associated with meaning of a word in discourse context. Prominent words are usually new or contrastive information, while nonprominent words are given or noncontrastive information. This dissertation addresses English prominence in two separate studies. The first study investigates the prosodic prominence in relation to pitch accents, acoustic cues, and discourse meaning of a word in a public speech. The second study examines the cognitive representation of prosodic contour in a corpus of imitation. Linguists claim that the information status of a word determines the types of pitch accents in English. Prior research informs us about prominence (1) in relation to the binary given-new distinction of lexical givenness, and (2) in minimally contextualized utterances such as question-answer prompts or excerpts from a corpus. The assignment of prominence, however, can vary in relation to referential meaning as well as lexical meaning of a word in natural, more contextualized speech. This study examines the prosodic prominence as a function of pitch accents, acoustic cues, and information status in a complete public speech. Information status is considered in relation to referential, lexical givenness and alternative-based contrastive focus. The results show that accent type is probabilistically associated with information status in this speech. The accent assignment differs between referentially vs. lexically given words. Despite the weak relationship between information status and pitch accents in the speech of the speaker, non-expert listeners perceive prominence as expected: they are more likely to perceive prominence on words carrying new or contrastive information or words with high or bitonal pitch accents. Surprisingly, the listeners perceive acoustic cues differently depending on the information status or accent types of a word. Based on these results, the first study suggests that (1) the relationship between information status and accent type is not deterministic in English, (2) lexical givenness differs from referential givenness in production and perception of prominence, and (3) perceived prominence is influenced by information status, pitch accents, acoustic cues, and their interaction. The second study examines how an intonational contour is represented in the mental lexicon of English speakers. Some linguists find that speakers are able to reproduce the phonetic details of intonational features, while in other research speakers are better at reproducing intonational features than imitating phonetic details of an utterance. This study investigates the domain of intonational encoding by comparing several prosodic domains in imitated utterances. I hypothesize that the domain which best captures the similarity of intonational contour between the model speaker and imitators is the target of imitation, and that imitation can be considered as the domain of intonational encoding in cognitive representation. The results show that the f0 distance between the model speaker and imitators is best explained over an intermediate phrase. Based on these results, the second study proposes that speakers encode a time-varying f0 contour over a prosodic phrase in their mental lexicon and supports the exemplar encoding of intonational contour

Illinois Digital Environment for Access to Learning and Scholarship Repository