
    Developing Sparse Representations for Anchor-Based Voice Conversion

    Voice conversion is the task of transforming speech from one speaker to sound as if it were produced by another speaker, changing the identity while retaining the linguistic content. There are many methods for performing voice conversion, but oftentimes these methods have onerous training requirements or fail in instances where one speaker has a nonnative accent. To address these issues, this dissertation presents and evaluates a novel “anchor-based” representation of speech that separates speaker content from speaker identity by modeling how speakers form English phonemes. We call the proposed method Sparse, Anchor-Based Representation of Speech (SABR), and explore methods for optimizing the parameters of this model in native-to-native and native-to-nonnative voice conversion contexts. We begin the dissertation by demonstrating how sparse coding in combination with a compact, phoneme-based dictionary can be used to separate speaker identity from content in objective and subjective tests. The formulation of the representation then presents several research questions. First, we propose a method for improving the synthesis quality by using the sparse coding residual in combination with a frequency warping algorithm to convert the residual from the source to the target speaker’s space, and add it to the target speaker’s estimated spectrum. Experimentally, we find that synthesis quality is significantly improved via this transform. Second, we propose and evaluate two methods for selecting and optimizing SABR anchors in native-to-native and native-to-nonnative voice conversion. We find that synthesis quality is significantly improved by the proposed methods, especially in native-to-nonnative voice conversion over baseline algorithms. In a detailed analysis of the algorithms, we find they focus on phonemes that are difficult for nonnative speakers of English or naturally have multiple acoustic states. Following this, we examine methods for adding temporal constraints to SABR via the Fused Lasso. The proposed method significantly reduces the inter-frame variance in the sparse codes over other methods that incorporate temporal features into sparse coding representations. Finally, in a case study, we examine the use of the SABR methods and optimizations in the context of a computer-aided pronunciation training system for building “Golden Speakers”, or ideal models for nonnative speakers of a second language to learn correct pronunciation. Under the hypothesis that the optimal “Golden Speaker” is the learner’s voice synthesized with a native accent, we used SABR to build voice models for nonnative speakers and evaluated the resulting synthesis in terms of quality, identity, and accentedness. We found that even when deployed in the field, the SABR method generated synthesis with low accentedness and similar acoustic identity to the target speaker, validating the use of the method for building “Golden Speakers”.
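
    The anchor-based separation described above can be illustrated with a small, hedged sketch: each source spectral frame is encoded sparsely over a source-speaker phoneme ("anchor") dictionary, and the same sparse code is then applied to the target speaker's anchors to swap identity while keeping content. The dictionary sizes, lasso penalty, and random data below are illustrative assumptions, not the dissertation's actual parameters.

    import numpy as np
    from sklearn.decomposition import sparse_encode

    rng = np.random.default_rng(0)
    n_anchors, n_bins = 40, 257                               # roughly one anchor per phoneme, STFT bins
    D_source = np.abs(rng.normal(size=(n_anchors, n_bins)))   # source-speaker anchor spectra
    D_target = np.abs(rng.normal(size=(n_anchors, n_bins)))   # target-speaker anchor spectra
    frames = np.abs(rng.normal(size=(100, n_bins)))           # stand-in for source magnitude spectra

    # Sparse codes capture the "content": which anchors are active in each frame.
    codes = sparse_encode(frames, D_source, algorithm="lasso_lars", alpha=0.1)

    # Reconstructing with the target dictionary swaps in the target "identity".
    converted = codes @ D_target
    print(converted.shape)                                    # (100, 257)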

    Robust speech recognition with spectrogram factorisation

    Communication by speech is intrinsic for humans. Since the breakthrough of mobile devices and wireless communication, digital transmission of speech has become ubiquitous. Similarly, the distribution and storage of audio and video data has increased rapidly. However, despite being technically capable of recording and processing audio signals, only a fraction of digital systems and services are actually able to work with spoken input, that is, to operate on the lexical content of speech. One persistent obstacle to the practical deployment of automatic speech recognition systems is inadequate robustness against noise and other interferences, which regularly corrupt signals recorded in real-world environments. Speech and diverse noises are both complex signals, which are not trivially separable. Despite decades of research and a multitude of different approaches, the problem has not been solved to a sufficient extent. In particular, the mathematically ill-posed problem of separating multiple sources from a single-channel input requires advanced models and algorithms to be solvable. One promising path is using a composite model of long-context atoms to represent a mixture of non-stationary sources based on their spectro-temporal behaviour. Algorithms derived from the family of non-negative matrix factorisations have been applied to such problems to separate and recognise individual sources like speech. This thesis describes a set of tools developed for non-negative modelling of audio spectrograms, especially involving speech and real-world noise sources. An overview is provided of the complete framework, starting from model and feature definitions, advancing to factorisation algorithms, and finally describing different routes for separation, enhancement, and recognition tasks. Current issues and their potential solutions are discussed both theoretically and from a practical point of view. The included publications describe factorisation-based recognition systems, which have been evaluated on publicly available speech corpora in order to determine the efficiency of various separation and recognition algorithms. Several variants and system combinations that have been proposed in the literature are also discussed. The work covers a broad span of factorisation-based system components, which together aim at providing a practically viable solution to robust processing and recognition of speech in everyday situations.
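
    As a rough sketch of the factorisation at the core of this framework, the snippet below runs plain multiplicative-update NMF (KL divergence) on a stand-in magnitude spectrogram, approximating it as a product of spectral atoms and per-frame activations. The atom count, iteration budget, and random data are assumptions for illustration; a separation system in the spirit described above would keep pre-trained speech and noise atoms fixed, update only the activations, and build a soft mask from the speech reconstruction.

    import numpy as np

    rng = np.random.default_rng(1)
    V = np.abs(rng.normal(size=(257, 200))) + 1e-6   # stand-in magnitude spectrogram (bins x frames)
    n_atoms = 40

    W = rng.random((257, n_atoms)) + 1e-6            # spectral atoms (bases)
    H = rng.random((n_atoms, 200)) + 1e-6            # per-frame activations

    for _ in range(100):                             # KL-divergence multiplicative updates
        WH = W @ H + 1e-12
        H *= (W.T @ (V / WH)) / W.sum(axis=0, keepdims=True).T
        WH = W @ H + 1e-12
        W *= ((V / WH) @ H.T) / H.sum(axis=1, keepdims=True).T

    print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))   # relative reconstruction error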

    The construction of a linguistic linked data framework for bilingual lexicographic resources

    Little-known lexicographic resources can be of tremendous value to users once digitised. By extending the digitisation efforts for a lexicographic resource, converting the human-readable digital object to a state that is also machine-readable, structured data can be created that is semantically interoperable, thereby enabling the lexicographic resource to access, and be accessed by, other semantically interoperable resources. The purpose of this study is to formulate a process for converting a lexicographic resource in print form into a machine-readable bilingual lexicographic resource by applying linguistic linked data principles, using the English-Xhosa Dictionary for Nurses as a case study. This is accomplished by creating a linked data framework, in which data are expressed in the form of RDF triples and URIs, in a manner which allows for extensibility to a multilingual resource. Click languages with characters not typically represented by the Roman alphabet are also considered. The purpose of this linked data framework is to define each lexical entry as “historically dynamic”, instead of “ontologically static” (Rafferty, 2016:5). For a framework whose instances are in constant evolution, focus is thus given to provenance management and the generation of linked data. The output is an implementation framework which provides methodological guidelines for similar language resources in the interdisciplinary field of Library and Information Science.
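
    A minimal sketch of what such RDF triples can look like in practice is given below, assuming the OntoLex-Lemon vocabulary and rdflib; the study's actual model, base URIs, property choices, and the Xhosa form shown are illustrative assumptions rather than the framework itself.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
    VARTRANS = Namespace("http://www.w3.org/ns/lemon/vartrans#")
    EX = Namespace("http://example.org/engxho/")              # hypothetical base URI

    g = Graph()
    g.bind("ontolex", ONTOLEX)
    g.bind("vartrans", VARTRANS)

    # English entry and its written form
    g.add((EX["entry/nurse-en"], RDF.type, ONTOLEX.LexicalEntry))
    g.add((EX["entry/nurse-en"], ONTOLEX.canonicalForm, EX["form/nurse-en"]))
    g.add((EX["form/nurse-en"], ONTOLEX.writtenRep, Literal("nurse", lang="en")))

    # Xhosa entry and its written form (illustrative equivalent)
    g.add((EX["entry/umongikazi-xh"], RDF.type, ONTOLEX.LexicalEntry))
    g.add((EX["entry/umongikazi-xh"], ONTOLEX.canonicalForm, EX["form/umongikazi-xh"]))
    g.add((EX["form/umongikazi-xh"], ONTOLEX.writtenRep, Literal("umongikazi", lang="xh")))

    # Cross-lingual link between the two entries
    g.add((EX["entry/nurse-en"], VARTRANS.translatableAs, EX["entry/umongikazi-xh"]))

    print(g.serialize(format="turtle"))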

    Improving reading: a handbook for improving reading in key stages 3 and 4 (National Strategies: secondary)

    "This handbook explores what it means to be a reader and some core challenges and skills that need to be addressed in the teaching of reading. The handbook outlines a route to improvement that can be followed to ensure that all pupils make expected levels of progress so that they can become skilled and independent readers. Detailed guidance is provided for each stage of the improvement process: gathering and analysing information; writing the improvement plan; evaluating planning, approaches to teaching and learning and the assessment of reading. Subject leaders can decide which stages of the process their department is confident with and which areas need to be developed further. Each section provides relevant resources and tools to guide and support this work." - National Strategies website

    Recognizing Speech in a Novel Accent: The Motor Theory of Speech Perception Reframed

    The motor theory of speech perception holds that we perceive the speech of another in terms of a motor representation of that speech. However, when we have learned to recognize a foreign accent, it seems plausible that recognition of a word rarely involves reconstruction of the speech gestures of the speaker rather than the listener. To better assess the motor theory and this observation, we proceed in three stages. Part 1 places the motor theory of speech perception in a larger framework based on our earlier models of the adaptive formation of mirror neurons for grasping, and for viewing extensions of that mirror system as part of a larger system for neuro-linguistic processing, augmented by the present consideration of recognizing speech in a novel accent. Part 2 then offers a novel computational model of how a listener comes to understand the speech of someone speaking the listener's native language with a foreign accent. The core tenet of the model is that the listener uses hypotheses about the word the speaker is currently uttering to update probabilities linking the sound produced by the speaker to phonemes in the native language repertoire of the listener. This, on average, improves the recognition of later words. The model is neutral regarding the nature of the representations it uses (motor vs. auditory). It serves as a reference point for the discussion in Part 3, which proposes a dual-stream neuro-linguistic architecture to revisit claims for and against the motor theory of speech perception and the relevance of mirror neurons, and extracts some implications for the reframing of the motor theory.
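
    The core probability-updating tenet of the model lends itself to a toy sketch: once the listener accepts a word hypothesis, counts linking the speaker's accented sounds to the listener's native phonemes are strengthened, so later words are recognized more reliably on average. The phoneme labels, word example, and count-based update rule below are illustrative assumptions, not the paper's actual model.

    # Tiny illustrative inventories: sounds the speaker produces vs. the listener's native phonemes.
    speaker_sounds = ["i:", "I", "e"]
    native_phonemes = ["i:", "I", "e"]

    # Uniform prior pseudo-counts linking each heard sound to each native phoneme.
    counts = {s: {p: 1.0 for p in native_phonemes} for s in speaker_sounds}

    def update(heard_sounds, hypothesized_phonemes):
        """Strengthen the sound-to-phoneme links implied by an accepted word hypothesis."""
        for sound, phoneme in zip(heard_sounds, hypothesized_phonemes):
            counts[sound][phoneme] += 1.0

    def p_phoneme_given_sound(sound, phoneme):
        total = sum(counts[sound].values())
        return counts[sound][phoneme] / total

    # The accented speaker produces something like "sheep" where context implies "ship":
    # the heard long /i:/ is linked to the listener's expected short /I/.
    update(heard_sounds=["i:"], hypothesized_phonemes=["I"])
    print(p_phoneme_given_sound("i:", "I"))   # rises above the uniform 1/3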

    Open-source resources and standards for Arabic word structure analysis: Fine grained morphological analysis of Arabic text corpora

    Morphological analyzers are preprocessors for text analysis. Many text analytics applications need them to perform their tasks. The aim of this thesis is to develop standards, tools and resources that widen the scope of Arabic word structure analysis, particularly morphological analysis, to process Arabic text corpora of different domains, formats and genres, of both vowelized and non-vowelized text. We want to morphologically tag our Arabic Corpus, but evaluation of existing morphological analyzers has highlighted shortcomings and shown that more research is required. Tag assignment is significantly more complex for Arabic than for many languages. The morphological analyzer should add the appropriate linguistic information to each part or morpheme of the word (proclitic, prefix, stem, suffix and enclitic); in effect, instead of a tag for a word, we need a sub-tag for each part. Very fine-grained distinctions may cause problems for automatic morphosyntactic analysis – particularly for probabilistic taggers, which require training data – if some words can change grammatical tag depending on function and context; on the other hand, fine-grained distinctions may actually help to disambiguate other words in the local context. The SALMA – Tagger is a fine-grained morphological analyzer which mainly depends on linguistic information extracted from traditional Arabic grammar books and on a broad-coverage prior-knowledge lexical resource, the SALMA – ABCLexicon. More fine-grained tag sets may be more appropriate for some tasks. The SALMA – Tag Set is a theory standard for encoding which captures long-established, traditional fine-grained morphological features of Arabic in a notation format intended to be compact yet transparent. The SALMA – Tagger has been used to lemmatize the 176-million-word Arabic Internet Corpus. It has been proposed as a language-engineering toolkit for Arabic lexicography and for phonetically annotating the Qur’an with syllable and primary stress information, as well as for fine-grained morphological tagging.
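
    The per-morpheme sub-tagging idea (a sub-tag for each part rather than one tag per word) can be pictured with a small, hedged data-structure sketch: one Arabic word split into proclitic, prefix, stem, suffix and enclitic, each carrying its own label. The transliteration, segmentation and tag labels below are simplified assumptions and do not use the actual SALMA – Tag Set notation.

    from dataclasses import dataclass

    @dataclass
    class MorphSegment:
        surface: str   # the morpheme (transliterated)
        part: str      # proclitic | prefix | stem | suffix | enclitic
        tag: str       # simplified morphological sub-tag

    # e.g. wa-yaktubuuna-haa, roughly "and they write it"
    analysis = [
        MorphSegment("wa",   "proclitic", "conjunction"),
        MorphSegment("ya",   "prefix",    "imperfect-subject-marker"),
        MorphSegment("ktub", "stem",      "verb-imperfect"),
        MorphSegment("uuna", "suffix",    "subject-3rd-masc-plural"),
        MorphSegment("haa",  "enclitic",  "object-pronoun-3rd-fem-sing"),
    ]

    for seg in analysis:
        print(f"{seg.surface}\t{seg.part}\t{seg.tag}")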

    Large Scale Generative AI Text Applied to Sports and Music

    We address the problem of scaling up the production of media content, including commentary and personalized news stories, for large-scale sports and music events worldwide. Our approach relies on generative AI models to transform a large volume of multimodal data (e.g., videos, articles, real-time scoring feeds, statistics, and fact sheets) into coherent and fluent text. Based on this approach, we introduce, for the first time, an AI commentary system, which was deployed to produce automated narrations for highlight packages at the 2023 US Open, Wimbledon, and Masters tournaments. In the same vein, our solution was extended to create personalized content for ESPN Fantasy Football and stories about music artists for the Grammy Awards. These applications were built using a common software architecture that achieved a 15x speed improvement, with an average Rouge-L of 82.00 and a perplexity of 6.6. Our work was successfully deployed at the aforementioned events, supporting 90 million fans around the world with 8 billion page views, continuously pushing the bounds on what is possible at the intersection of sports, entertainment, and AI.
    Comment: 9 pages, 8 figures, 5 tables
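
    For readers unfamiliar with the reported metric, the snippet below shows how a Rouge-L score of the kind quoted above can be computed with the rouge-score package; the reference and generated sentences are invented, and the authors' actual evaluation pipeline is not described in the abstract.

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

    reference = "Alcaraz wins the opening set after breaking serve in the tenth game."
    generated = "Alcaraz takes the first set with a break in game ten."

    scores = scorer.score(reference, generated)      # argument order: (target, prediction)
    print(scores["rougeL"].fmeasure)                 # longest-common-subsequence F-measure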

    Gradient Metaphoricity of the Preposition in: A Corpus-based Approach to Chinese Academic Writing in English

    Get PDF
    In Cognitive Linguistics, a conceptual metaphor is a systematic set of correspondences between two domains of experience (Kövecses 2020: 2). In order to gain an extensive understanding of metaphors, metaphoricity (Müller and Tag 2010; Dunn 2011; Jensen and Cuffari 2014; Nacey and Jensen 2017) has been emphasized to address one of the properties of metaphors in language usage: gradience (Hanks 2006; Dunn 2011, 2014), which indicates that metaphorical expressions can be measured. Despite many noteworthy contributions, studies of metaphoricity are often accused of subjectivity (Müller 2008; Jensen and Cuffari 2014; Jensen 2017), which is why this study uses a large corpus as its database. The main aim of this dissertation is therefore to measure the gradient senses of the preposition in in an objective way, thus mapping its highly systematic semantic extension. Based on these gradient senses, the semantic and syntactic features of the preposition in produced by advanced Chinese English-major learners are investigated, combining quantitative and qualitative research methods. A quantitative analysis of the literal sense and the ten metaphorical senses of the preposition in is made first. By accounting for the five factors influencing the image schemata of each sense (“scale of Landmark”, “visibility”, “path”, “inclusion” and “boundary”), a formula for measuring the degree of metaphoricity is deduced: Metaphoricity = ([#Visibility] + [#Path] + [#Inclusion] + [#Boundary]) * [#Scale of Landmark]. The result is that the primary sense has the highest value (12), and the extended senses have values ranging down to zero. The more features a sense shares with the proto-scene, the higher its value and the less metaphorical the sense. EVENT and PERSON are “least metaphoric” (value = 9-11); SITUATION, NUMBER, CONTENT and FIELD are “weak metaphoric” (value = 6-8); SEGMENTATION, TIME and MANNER are “strong metaphoric” (value = 3-5); PURPOSE shares the fewest features with the proto-scene and has the lowest value, so it is “most metaphoric” (value = 0-2). Then a corpus-based approach is employed, offering a model for corpus-based work in Cognitive Linguistics. It compares two compiled sub-corpora: the Chinese Master Academic Writing Corpus and the Chinese Doctorate Academic Writing Corpus. The findings show that, on the semantic level, Chinese English-major students overuse in with a low level of metaphoricity; even advanced learners rarely use the most metaphorical senses of in. In terms of syntactic behaviour, the most frequent nouns in the [in + noun] construction are weakly metaphoric, whilst the nouns in the [in the noun of] construction carry the EVENT sense, which is the least metaphorical. Moreover, action verbs tend to be used in the [verb + in] and [in doing sth.] constructions in both the master’s and doctorate groups. In the qualitative study, the divergent usages of the preposition in are explored. The preposition in is often substituted with other prepositions, such as on and at. The fundamental reason for the Chinese learners’ weakness is negative transfer from their mother tongue (Wang 2001; Gong 2007; Zhang 2010). Although in and its Chinese equivalent zai...li (在...里) share the same proto-scene, there are discrepancies: the metaphorical senses of the preposition in are TIME, PURPOSE, NUMBER, CONTENT, FIELD, EVENT, SITUATION, SEGMENTATION, MANNER and PERSON, while zai...li (在...里) has only five: TIME, CONTENT, EVENT, SITUATION and PERSON. Thus the image schemata of the individual senses cannot be mapped directly onto each other across the two languages. This study also provides evidence for the universality and variation of spatial metaphors on the grounds of cultural models. Philosophically, it supports the standpoint of Embodiment philosophy that abstract concepts are constructed on the basis of spatial metaphors grounded in physical and cultural experience.
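
    A small sketch of the metaphoricity score defined above: treating the four shared features as 0/1 indicators and the scale of Landmark as an integer from 0 to 3 is one coding that reproduces the reported maximum of 12 for the primary sense; the exact per-sense coding is an assumption here, not taken from the dissertation.

    def metaphoricity(visibility: int, path: int, inclusion: int,
                      boundary: int, scale_of_landmark: int) -> int:
        """Metaphoricity = (#Visibility + #Path + #Inclusion + #Boundary) * #Scale of Landmark."""
        return (visibility + path + inclusion + boundary) * scale_of_landmark

    # Primary (proto-scene) sense: all features shared, largest landmark scale -> least metaphoric.
    print(metaphoricity(1, 1, 1, 1, 3))   # 12
    # A sense sharing few features with the proto-scene scores near zero -> most metaphoric.
    print(metaphoricity(0, 1, 0, 0, 1))   # 1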