    Augmented auditory representation of e-texts for text-to-speech systems

    No full text
    Emerging electronic text formats include hierarchical structure and visualization-related information that current Text-to-Speech (TtS) systems ignore. In this paper we present a novel approach for composing detailed auditory representations of e-texts using speech and audio. Furthermore, we provide a scripting language (CAD scripts) for defining specific customizations of the operation of a TtS. CAD scripts can also be assigned to specific text meta-data to enable their distinct auditory representation. This approach can serve as a means for a detailed exchange of functionality across different TtS implementations. Moreover, it can be hosted in current TtS systems with minor (or major) modifications. Finally, we briefly present the implementation of DEMOSTHeNES Composer for augmented auditory generation of meta-text using the above methodology. © Springer-Verlag Berlin Heidelberg 2001
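
    The abstract does not show what a CAD script looks like, so the following Python fragment is a purely hypothetical sketch of the underlying idea: registering auditory customizations (voice, rate, pitch, audio cues) against specific text meta-data tags so that each tag gets a distinct auditory representation. All names and values here (AuditoryStyle, bindings, the tag names, the earcon file) are invented for illustration.

```python
# Hypothetical sketch of binding auditory customizations to text meta-data,
# in the spirit of the CAD-script idea described in the abstract above.
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class AuditoryStyle:
    """A bundle of TtS customizations for one meta-data tag (invented)."""
    voice: str = "default"
    rate: float = 1.0             # speaking-rate multiplier
    pitch_shift: int = 0          # semitones relative to the base voice
    earcon: Optional[str] = None  # audio cue played before the element

# A registry mapping text meta-data tags to their auditory representation.
bindings: Dict[str, AuditoryStyle] = {
    "h1":       AuditoryStyle(rate=0.9, pitch_shift=-2, earcon="chime.wav"),
    "em":       AuditoryStyle(pitch_shift=2),
    "footnote": AuditoryStyle(voice="secondary", rate=1.1),
}

def render(tag: str, text: str) -> str:
    """Return a human-readable trace of how the element would be spoken."""
    style = bindings.get(tag, AuditoryStyle())
    cue = f"[play {style.earcon}] " if style.earcon else ""
    return (f"{cue}speak {text!r} with voice={style.voice}, "
            f"rate={style.rate}, pitch={style.pitch_shift:+d} st")

print(render("h1", "Chapter 1"))
print(render("em", "must"))
```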

    Tone-Group F0 selection for modeling focus prominence in small-footprint speech synthesis

    No full text
    This work aims to improve the naturalness of synthetic intonation contours in Text-to-Speech synthesis through the provision of prominence, a major expressive device of human speech. Focusing on the tonal dimension of emphasis, we present a robust unit-selection methodology for generating realistic F0 curves in cases where focus prominence is required. The proposed approach selects Tone-Group units from commonly used prosodic corpora that are automatically transcribed as patterns of syllables. In contrast to related approaches, the patterns represent only the most perceivable sections of the sampled curves and are encoded to serve morphologically different sequences of syllables. This minimizes the number of units required to achieve sufficient coverage within the database; in turn, the optimization makes high-quality F0 generation applicable to small-footprint text-to-speech synthesis. For generic F0 selection we query the database with sequences of ToBI labels, though other intonational frameworks can be used as well. To realize focus prominence on specific Tone-Groups, the selection also incorporates an emphasis-level indicator. We set up a series of listening tests using a database built from a 482-utterance corpus that featured partially purpose-uttered emphasis. The results showed a clear subjective preference for the proposed model over a linear regression model in 75% of the cases when used in generic synthesis. Furthermore, this model provided an ambiguous percept of emphasis in an experiment featuring major and minor degrees of prominence. © 2006 Elsevier B.V. All rights reserved
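
    The paper's actual unit representation and matching costs are not given in this abstract; the sketch below only illustrates the general shape of the selection step it describes, i.e. querying a small database of Tone-Group F0 patterns by a ToBI label sequence plus an emphasis-level indicator. The database contents and the nearest-level matching rule are invented.

```python
# Illustrative sketch of Tone-Group unit selection keyed on ToBI labels
# and an emphasis level, loosely following the abstract's description.
from typing import List, Tuple

# Each unit: (ToBI label sequence, emphasis level, sampled F0 pattern in Hz).
DATABASE: List[Tuple[Tuple[str, ...], int, List[float]]] = [
    (("H*", "L-L%"), 0, [180.0, 210.0, 160.0, 120.0]),
    (("H*", "L-L%"), 2, [185.0, 250.0, 170.0, 115.0]),   # focused variant
    (("L+H*", "H-H%"), 0, [150.0, 220.0, 230.0, 240.0]),
]

def select_unit(labels: Tuple[str, ...], emphasis: int) -> List[float]:
    """Pick the stored F0 pattern whose ToBI labels match exactly and
    whose emphasis level is closest to the requested one."""
    candidates = [(level, f0) for lab, level, f0 in DATABASE if lab == labels]
    if not candidates:
        raise LookupError(f"no Tone-Group unit for labels {labels}")
    _, best = min(candidates, key=lambda c: abs(c[0] - emphasis))
    return best

print(select_unit(("H*", "L-L%"), emphasis=2))   # -> the focused variant
```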

    Diction based prosody modeling in table-to-speech synthesis

    No full text
    Transferring a structure from the visual modality to the aural one presents a difficult challenge. In this work we experiment with prosody modeling for the synthesized speech representation of tabulated structures. This is achieved by analyzing naturally spoken descriptions of data tables, followed by feedback from blind and sighted users. The derived placement and values of prosodic phrase accents and pause breaks are examined in terms of how successfully they convey semantically important visual information through prosody control in Table-to-Speech synthesis. Finally, the quality of the information provision of synthesized tables utilizing the proposed prosody specification is studied against plain synthesis. © Springer-Verlag Berlin Heidelberg 2005
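
    As a hypothetical illustration of the kind of prosody specification the abstract describes, the sketch below attaches phrase accents and pause breaks while linearizing a table for speech. The break durations and accent labels are invented values, not the ones derived in the paper.

```python
# Invented pause values; the paper derives such values from spoken data.
HEADER_PAUSE_MS = 400   # assumed longer break after the header row
CELL_PAUSE_MS = 250     # assumed break between data cells
ROW_PAUSE_MS = 600      # assumed break at the end of each row

def speak_table(headers, rows):
    """Yield (text, accent, pause_ms) triples for a simple table reading."""
    for h in headers:
        yield (h, "H*", CELL_PAUSE_MS)        # accent each header cell
    yield ("", None, HEADER_PAUSE_MS)         # boundary after the header row
    for row in rows:
        for header, cell in zip(headers, row):
            # Pair each cell with its header so the structure stays audible.
            yield (f"{header}: {cell}", "L+H*", CELL_PAUSE_MS)
        yield ("", None, ROW_PAUSE_MS)        # row boundary

for item in speak_table(["City", "Population"], [["Athens", "3.1 million"]]):
    print(item)
```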

    Modeling improved prosody generation from high-level linguistically annotated corpora

    No full text
    Synthetic speech usually suffers from a poor F0 contour surface. Robust prediction of the underlying pitch targets relies on the quality of the predicted prosodic structures, i.e. the corresponding sequences of tones and breaks. In the present work, we have utilized a linguistically enriched annotated corpus to build data-driven models that predict prosodic structures with increased accuracy, and we have then used a linear regression approach for F0 modeling. An appropriate XML annotation scheme has been introduced to encode syntax, grammar, new or already-given information, phrase subject/object information, as well as rhetorical elements in the corpus, by exploiting a Natural Language Generator (NLG) system. To demonstrate the benefits of the enriched input meta-information, we first show that although the tone and break CART predictors have high accuracy when standing alone (92.35% for breaks, 87.76% for accents and 99.03% for endtones), their application in the TtS chain degrades the linear regression pitch-target model. The enriched linguistic meta-information, on the other hand, minimizes the errors of the models, leading to a more natural F0 surface. Both objective and subjective evaluations of the intonation contours were carried out, taking into account the errors propagated by each model in the synthesis chain. Copyright © 2005 The Institute of Electronics, Information and Communication Engineers
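
    To make the pipeline concrete, here is a minimal sketch of a linear-regression pitch-target model driven by enriched linguistic features of the sort the XML scheme encodes (accent status, givenness, subject/object role). The feature set, coefficients and baseline are invented for illustration, not the paper's trained model.

```python
# Invented feature set standing in for the enriched linguistic annotation.
FEATURES = ["is_accented", "is_phrase_final", "is_new_info", "is_subject"]

# Invented coefficients (Hz contribution per active feature) and baseline.
COEF = {"is_accented": 35.0, "is_phrase_final": -25.0,
        "is_new_info": 12.0, "is_subject": 6.0}
BASELINE_HZ = 120.0   # assumed speaker baseline

def pitch_target(feats: dict) -> float:
    """Linear-regression estimate of a syllable's pitch target in Hz."""
    return BASELINE_HZ + sum(COEF[f] * float(feats.get(f, 0))
                             for f in FEATURES)

# An accented syllable carrying new information in subject position:
print(pitch_target({"is_accented": 1, "is_new_info": 1, "is_subject": 1}))
# -> 173.0
```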

    Auditory accessibility of metadata in books: A design for all approach

    No full text
    Two issues are challenging in the life-cycle of Digital Talking Books (DTBs): the automatic labeling of text-formatting meta-data in documents and the multimodal representation of the text-formatting semantics. We propose an augmented design-for-all approach for both the production and the reading processes of DAISY-compliant DTBs. This approach incorporates a methodology for the real-time extraction and semantic labeling of text-formatting meta-data. Furthermore, it includes a unified approach to the multimodal rendering of text formatting, structure and layout meta-data, utilizing a Document-to-Audio platform to render the acoustic modality. © Springer-Verlag Berlin Heidelberg 2007
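
    As a hypothetical sketch of the semantic-labeling step the abstract mentions, the fragment below maps raw formatting attributes of a text fragment to semantic labels that a multimodal renderer could act on. The mapping rules and label names are assumptions made for illustration.

```python
# Hypothetical semantic labeling of text-formatting meta-data for a
# talking-book pipeline, loosely following the abstract above.
def label_formatting(fragment: dict) -> str:
    """Map raw formatting attributes of a text fragment to a semantic label."""
    if fragment.get("font_size", 0) >= 18 and fragment.get("bold"):
        return "heading"
    if fragment.get("italic"):
        return "emphasis"
    if fragment.get("monospace"):
        return "code-or-reference"
    return "body"

fragments = [
    {"text": "Chapter 2", "font_size": 20, "bold": True},
    {"text": "very", "italic": True},
    {"text": "Plain paragraph text."},
]

for f in fragments:
    # Each semantic label would drive a distinct acoustic rendering
    # (different voice, earcon, or prosody) in the audio modality.
    print(label_formatting(f), "->", f["text"])
```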

    Augmented Auditory Representation of e-Texts for Text-to-Speech Systems

    No full text
    1 Introduction: In the near past, the research focus of Text-to-Speech (TtS) systems concerning e-text handling was mainly the extraction of linguistic information from rather plain e-texts in order to render prosody. During the last years the format and the visual representation of e-texts have changed, introducing emerging issues in speech generation. Various e-text types (like HTML, MS-Word and XML) provide on the one hand text information in well-formed hierarchical structures and on the other optional information about the way the text will be visualized. Furthermore, in many other cases e-text does not have a continuous flow with a proper syntactical structure, as the use of a written telegraphic style has increased within documents. For example, an HTML document contains titles and legends in buttons and photos. Finally, a huge amount of e-text is stored in databases. Queries to these databases return different types of texts that have well-defined..