127 research outputs found

    Study on phonetic context of Malay syllables towards the development of Malay speech synthesizer [TK7882.S65 H233 2007 f rb].

    Malay speech synthesis has evolved from parametric techniques (articulatory modelling and formant-based synthesis) to non-parametric techniques (concatenative synthesis). Recently, the concatenative approach has been moving towards corpus-based, or unit selection, techniques

    Prosody in text-to-speech synthesis using fuzzy logic

    For over a thousand years, inventors, scientists and researchers have tried to reproduce human speech. Today, the quality of synthesized speech is still not equivalent to the quality of real speech. Most research on speech synthesis focuses on improving the quality of the speech produced by Text-to-Speech (TTS) systems. The best TTS systems use unit selection-based concatenation to synthesize speech. However, this method is very time-consuming and requires a very large speech database. Diphone-concatenated synthesized speech requires less memory, but sounds robotic. This thesis explores the use of fuzzy logic to make diphone-concatenated speech sound more natural. A TTS system is built using both neural networks and fuzzy logic. Text is converted into phonemes using neural networks. Fuzzy logic is used to control the fundamental frequency (f0) for three types of sentences. In conclusion, the fuzzy system produces f0 contours that make the diphone-concatenated speech sound more natural
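
As a rough illustration of how fuzzy rules could shape an f0 contour, the sketch below maps a position within a sentence to an f0 target. Everything specific in it is an assumption made for the example: the abstract does not name the three sentence types, the fuzzy sets, the rules, or any f0 values, so the names `RULES`, `f0_at`, the sentence labels and the numbers in Hz are all hypothetical. Defuzzification uses a simple weighted average of rule outputs.

```python
def triangular(x, left, peak, right):
    """Triangular membership; a shoulder is made by repeating a bound as the peak."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets over relative position in the sentence (0 = start, 1 = end).
POSITION_SETS = {
    "start": (0.0, 0.0, 0.5),
    "middle": (0.0, 0.5, 1.0),
    "end": (0.5, 1.0, 1.0),
}

# Hypothetical rule outputs (f0 targets in Hz) for three assumed sentence types.
RULES = {
    "statement": {"start": 130.0, "middle": 115.0, "end": 95.0},
    "question": {"start": 120.0, "middle": 115.0, "end": 150.0},
    "exclamation": {"start": 150.0, "middle": 125.0, "end": 100.0},
}


def f0_at(position, sentence_type):
    """Weighted-average defuzzification of the rules that fire at this position."""
    position = min(1.0, max(0.0, position))  # clamp so at least one rule fires
    targets = RULES[sentence_type]
    weights = {name: triangular(position, *abc) for name, abc in POSITION_SETS.items()}
    total = sum(weights.values())
    return sum(weights[name] * targets[name] for name in targets) / total


# Sample a contour at five points across a sentence of the assumed "question" type.
contour = [f0_at(i / 4, "question") for i in range(5)]
```

A real system would derive the sets and rule outputs from recorded speech rather than from hand-picked values like these.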

    Concatenative speech synthesis: a Framework for Reducing Perceived Distortion when using the TD-PSOLA Algorithm

    This thesis presents the design and evaluation of an approach to concatenative speech synthesis using the Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA) signal processing algorithm. Concatenative synthesis systems make use of pre-recorded speech segments stored in a speech corpus. At synthesis time, the 'best' segments available to synthesise the new utterances are chosen from the corpus using a process known as unit selection. During the synthesis process, the pitch and duration of these segments may be modified to generate the desired prosody. The TD-PSOLA algorithm provides an efficient and essentially successful solution to perform these modifications, although some perceptible distortion, in the form of 'buzzyness', may be introduced into the speech signal. Despite the popularity of the TD-PSOLA algorithm, little formal research has been undertaken to address this recognised problem of distortion. The approach in the thesis has been developed towards reducing the perceived distortion that is introduced when TD-PSOLA is applied to speech. To investigate the occurrence of this distortion, a psychoacoustic evaluation of the effect of pitch modification using the TD-PSOLA algorithm is presented. Subjective experiments in the form of a set of listening tests were undertaken using word-level stimuli that had been manipulated using TD-PSOLA. The data collected from these experiments were analysed for patterns of co-occurrence or correlations to investigate where this distortion may occur. From this, parameters were identified which may have contributed to increased distortion. These parameters were concerned with the relationship between the spectral content of individual phonemes, the extent of pitch manipulation, and aspects of the original recordings. Based on these results, a framework was designed for use in conjunction with TD-PSOLA to minimise the possible causes of distortion.
The framework consisted of a novel speech corpus design, a signal processing distortion measure, and a selection process for especially problematic phonemes. Rather than phonetically balanced, the corpus is balanced to the needs of the signal processing algorithm, containing more of the adversely affected phonemes. The aim is to reduce the potential extent of pitch modification of such segments, and hence produce synthetic speech with less perceptible distortion. The signal processing distortion measure was developed to allow the prediction of perceptible distortion in pitch-modified speech. Different weightings were estimated for individual phonemes, trained using the experimental data collected during the listening tests. The potential benefit of such a measure for existing unit selection processes in a corpus-based system using TD-PSOLA is illustrated. Finally, the special-case selection process was developed for highly problematic voiced fricative phonemes to minimise the occurrence of perceived distortion in these segments. The success of the framework, in terms of generating synthetic speech with reduced distortion, was evaluated. A listening test showed that the TD-PSOLA balanced speech corpus may be capable of generating pitch-modified synthetic sentences with significantly less distortion than those generated using a typical phonetically balanced corpus. The voiced fricative selection process was also shown to produce pitch-modified versions of these phonemes with less perceived distortion than a standard selection process. The listening test then indicated that the signal processing distortion measure was able to predict the resulting amount of distortion at the sentence-level after the application of TD-PSOLA, suggesting that it may be beneficial to include such a measure in existing unit selection processes.
The framework was found to be capable of producing speech with reduced perceptible distortion in certain situations, although the effects seen at the sentence-level were less than those seen in the previous investigative experiments that made use of word-level stimuli. This suggests that the effect of the TD-PSOLA algorithm cannot always be easily anticipated due to the highly dynamic nature of speech, and that the reduction of perceptible distortion in TD-PSOLA-modified speech remains a challenge to the speech community
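
For readers unfamiliar with the pitch modification step discussed in this abstract, the following is a minimal textbook-style sketch of TD-PSOLA pitch scaling, not the thesis's implementation. It assumes that pitch marks (one sample index per pitch period) are already available, handles voiced speech only, leaves duration roughly unchanged, and applies no amplitude normalisation. The function names `hann` and `td_psola` and the `pitch_factor` parameter are choices made for this example.

```python
import math


def hann(n):
    """Symmetric Hann window of length n."""
    if n < 2:
        return [1.0] * n
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def td_psola(signal, pitch_marks, pitch_factor):
    """Scale pitch by pitch_factor (new f0 / old f0) using overlap-add.

    signal: list of float samples.
    pitch_marks: strictly increasing sample indices, one per pitch period
                 (assumed to be given; estimating them is a separate problem).
    """
    if len(pitch_marks) < 2 or pitch_factor <= 0:
        raise ValueError("need at least two pitch marks and a positive factor")
    output = [0.0] * len(signal)
    t = float(pitch_marks[0])  # position of the current synthesis pitch mark
    while t <= pitch_marks[-1]:
        # Pick the analysis pitch mark closest to the synthesis position.
        k = min(range(len(pitch_marks)), key=lambda i: abs(pitch_marks[i] - t))
        # Local pitch period around that mark, in samples.
        if k + 1 < len(pitch_marks):
            period = pitch_marks[k + 1] - pitch_marks[k]
        else:
            period = pitch_marks[k] - pitch_marks[k - 1]
        # Copy a two-period Hann-windowed grain to the new position.
        window = hann(2 * period + 1)
        centre = round(t)
        for offset in range(-period, period + 1):
            src = pitch_marks[k] + offset
            dst = centre + offset
            if 0 <= src < len(signal) and 0 <= dst < len(output):
                output[dst] += signal[src] * window[offset + period]
        # Synthesis marks are spaced by the scaled period.
        t += period / pitch_factor
    return output
```

With `pitch_factor` equal to 1 and evenly spaced marks, the overlapping windows sum to one and the signal between the first and last marks is reproduced. For other factors the grains overlap more or less densely, which is one place where audible artefacts of the kind the thesis studies can arise.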

    Time-domain concatenative text-to-speech synthesis.

    A concatenation framework for time-domain concatenative speech synthesis (TDCSS) is presented and evaluated. In this framework, speech segments are extracted from CV, VC, CVC and CC waveforms, and abutted. Speech rhythm is controlled via a single duration parameter, which specifies the initial portion of each stored waveform to be output. An appropriate choice of segmental durations reduces spectral discontinuity problems at points of concatenation, thus reducing reliance upon smoothing procedures. For text-to-speech considerations, a segmental timing system is described, which predicts segmental durations at the word level, using a timing database and a pattern matching look-up algorithm. The timing database contains segmented words with associated duration values, and is specific to an actual inventory of concatenative units. Segmental duration prediction accuracy improves as the timing database size increases. The problem of incomplete timing data has been addressed by using 'default duration' entries in the database, which are created by re-categorising existing timing data according to articulation manner. If segmental duration data are incomplete, a default duration procedure automatically categorises the missing speech segments according to segment class. The look-up algorithm then searches the timing database for duration data corresponding to these re-categorised segments. The timing database is constructed using an iterative synthesis/adjustment technique, in which a 'judge' listens to synthetic speech and adjusts segmental durations to improve naturalness. This manual technique for constructing the timing database has been evaluated. Since the timing data is linked to an expert judge's perception, an investigation examined whether the expert judge's perception of speech naturalness is representative of people in general.
Listening experiments revealed marked similarities between an expert judge's perception of naturalness and that of the experimental subjects. It was also found that the expert judge's perception remains stable over time. A synthesis/adjustment experiment found a positive linear correlation between segmental durations chosen by an experienced expert judge and duration values chosen by subjects acting as expert judges. A listening test confirmed that between 70% and 100% intelligibility can be achieved with words synthesised using TDCSS. In a further test, a TDCSS synthesiser was compared with five well-known text-to-speech synthesisers, and was ranked fifth most natural out of six. An alternative concatenation framework (TDCSS2) was also evaluated, in which duration parameters specify both the start point and the end point of the speech to be extracted from a stored waveform and concatenated. In a similar listening experiment, TDCSS2 stimuli were compared with five well-known text-to-speech synthesisers, and were ranked fifth most natural out of six
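
The two concatenation schemes described in this abstract can be sketched as follows. This is a minimal illustration under assumptions, not the author's code: waveforms are plain lists of samples, durations are given in seconds, and the function names and argument layout are hypothetical. The first function keeps the initial portion of each stored waveform (one duration parameter per unit); the second keeps the span between a start point and an end point, as in the TDCSS2 variant.

```python
def concatenate_initial_portions(units, sample_rate):
    """Abut the initial portion of each stored waveform.

    units: list of (waveform, duration_s) pairs, where duration_s says how much
           of the start of the waveform to output.
    """
    output = []
    for waveform, duration_s in units:
        n = min(len(waveform), max(0, round(duration_s * sample_rate)))
        output.extend(waveform[:n])
    return output


def concatenate_spans(units, sample_rate):
    """Abut the span between a start point and an end point of each waveform.

    units: list of (waveform, start_s, end_s) triples.
    """
    output = []
    for waveform, start_s, end_s in units:
        start = max(0, round(start_s * sample_rate))
        end = min(len(waveform), round(end_s * sample_rate))
        output.extend(waveform[start:end])
    return output


# Toy example with a 4 Hz "sample rate" so the arithmetic is easy to follow.
cv = [1, 2, 3, 4]
vc = [5, 6, 7, 8]
word = concatenate_initial_portions([(cv, 0.5), (vc, 0.25)], sample_rate=4)
```

No smoothing is applied at the joins, in keeping with the abstract's point that well-chosen durations reduce the need for it.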

    Acoustic correlates of linguistic rhythm: Perspectives

    The empirical grounding of a typology of languages' rhythm is again a hot issue. The currently popular approach is based on the durations of vocalic and intervocalic intervals and their variability. Despite some successes, many questions remain. The main findings still need to be generalised to much larger corpora including many more languages. But a straightforward continuation of the current work faces many difficulties. Perspectives are outlined for future work, including proposals for the cross-linguistic control of speech rate, improvements on the statistical analyses, and prospects raised by automatic speech processing
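
To make the interval-based approach concrete, the sketch below computes a few measures of the kind often used in this line of work: the proportion of vocalic duration, the standard deviation of vocalic and intervocalic interval durations, and a normalised pairwise variability index. The abstract does not say which measures it has in mind, so the choice of these four, the input format, and the function names are assumptions made for illustration.

```python
from statistics import pstdev


def rhythm_metrics(intervals):
    """Summarise a sequence of labelled intervals.

    intervals: list of (kind, duration_s), with kind "V" for a vocalic interval
               and "C" for an intervocalic (consonantal) one.
    """
    vocalic = [d for kind, d in intervals if kind == "V"]
    consonantal = [d for kind, d in intervals if kind == "C"]
    total = sum(vocalic) + sum(consonantal)
    return {
        "percent_v": 100.0 * sum(vocalic) / total,  # share of vocalic time
        "delta_v": pstdev(vocalic),                 # variability of vocalic intervals
        "delta_c": pstdev(consonantal),             # variability of intervocalic intervals
    }


def npvi(durations):
    """Normalised pairwise variability index over successive durations."""
    pairs = list(zip(durations, durations[1:]))
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in pairs]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical segmentation of a short utterance (durations in seconds).
utterance = [("C", 0.08), ("V", 0.12), ("C", 0.15), ("V", 0.06), ("C", 0.09), ("V", 0.14)]
summary = rhythm_metrics(utterance)
vocalic_npvi = npvi([d for kind, d in utterance if kind == "V"])
```

Raw standard deviations depend on speech rate, which is one reason the abstract calls for cross-linguistic control of rate; the pairwise index normalises each difference by the local mean duration.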

    Segmental and prosodic improvements to speech generation


    The "Tiepstem" : an experimental Dutch keyboard-to-speech system for the speech impaired

    An experimental Dutch keyboard-to-speech system has been developed to explore the possibilities and limitations of Dutch speech synthesis in a communication aid for the speech impaired. The system uses diphones and a formant synthesizer chip for speech synthesis. Input to the system is in pseudo-phonetic notation. Intonation contours using a declination line and various rises and falls are generated starting from an input consisting of punctuation and accent marks. The hardware design has resulted in a small, portable and battery-powered device. A short evaluation with users has been carried out, which has shown possibilities for such a device but has also indicated some problems with the current pseudo-phonetic input

    Articulatory-based Speech Processing Methods for Foreign Accent Conversion

    The objective of this dissertation is to develop speech processing methods that enable the foreign accent of non-native speakers to be converted without altering their identity. We envision accent conversion primarily as a tool for pronunciation training, allowing non-native speakers to hear their native-accented selves. With this application in mind, we present two methods of accent conversion. The first assumes that the voice quality/identity of speech resides in the glottal excitation, while the linguistic content is contained in the vocal tract transfer function. Accent conversion is achieved by convolving the glottal excitation of a non-native speaker with the vocal tract transfer function of a native speaker. The result is perceived as 60 percent less accented, but it is no longer identified as the same individual. The second method of accent conversion selects segments of speech from a corpus of non-native speech based on their acoustic or articulatory similarity to segments from a native speaker. We predict that articulatory features provide a more speaker-independent representation of speech and are therefore better gauges of linguistic similarity across speakers. To test this hypothesis, we collected a custom database containing simultaneous recordings of speech and the positions of important articulators (e.g. lips, jaw, tongue) for a native and non-native speaker. Resequencing speech from a non-native speaker based on articulatory similarity with a native speaker achieved a 20 percent reduction in accent. The approach is particularly appealing for applications in pronunciation training because it modifies speech in a way that produces realistically achievable changes in accent (i.e., since the technique uses sounds already produced by the non-native speaker). A second contribution of this dissertation is the development of subjective and objective measures to assess the performance of accent conversion systems. This is a difficult problem because, in most cases, no ground truth exists.
Subjective evaluation is further complicated by the interconnected relationship between accent and identity, but modifications of the stimuli (i.e. reverse speech and voice disguises) allow the two components to be separated. Algorithms to objectively measure accent, quality, and identity are shown to correlate well with their subjective counterparts
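
The first method in this abstract rests on the source-filter view of speech, in which an excitation signal is convolved with the response of the vocal tract. The sketch below shows only that final combination step, as direct time-domain convolution. It assumes the two components have already been estimated (for example by some form of inverse filtering, which is not shown) and that the vocal tract transfer function is available as an impulse response; the function name and the toy values are hypothetical.

```python
def convolve(excitation, impulse_response):
    """Direct linear convolution of an excitation with a filter impulse response."""
    if not excitation or not impulse_response:
        return []
    out = [0.0] * (len(excitation) + len(impulse_response) - 1)
    for i, x in enumerate(excitation):
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out


# Toy stand-ins: a pulse train for one speaker's glottal excitation and a short
# decaying response for another speaker's vocal tract (both values are made up).
non_native_excitation = [1.0, 0.0, 0.0, 1.0]
native_vocal_tract = [1.0, 0.5]
converted = convolve(non_native_excitation, native_vocal_tract)
```

In practice this would be applied frame by frame, since both the excitation and the vocal tract response change over the course of an utterance.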
