
    A database and digital signal processing framework for the perceptual analysis of voice quality

    Bermúdez de Alvear RM, Corral J, Tardón LJ, Barbancho AM, Fernández Contreras E, Rando Márquez S, Martínez-Arquero AG, Barbancho I. A database and digital signal processing framework for the perceptual analysis of voice quality. Pan European Voice Conference: PEVOC 11 Abstract Book, Aug. 31-Sept. 2, 2015.

    Introduction. Clinical assessment of dysphonia relies on perceptual as much as on instrumental methods of analysis [1]. Perceptual auditory analysis is potentially subject to several internal and external sources of bias [2]. Furthermore, the acoustic analyses that have been used to objectively characterize pathological voices are likely to be affected by confounding variables such as the signal processing or the hardware and software specifications [3]. For these reasons, the poor correlation between perceptual ratings and acoustic measures remains a controversial matter [4]. The availability of annotated databases of voice samples is therefore of major importance for clinical and research purposes. Databases for digital processing of the vocal signal are usually built from English-speaking subjects' sustained vowels [5]. However, phonemes vary from one language to another and, to the best of our knowledge, there are no annotated databases of Spanish sustained vowels from healthy or dysphonic voices. This work shows our first steps to fill this gap. To aid clinicians and researchers in the perceptual assessment of voice quality, a two-fold objective was pursued: on the one hand, a database of healthy and disordered Spanish voices was developed; on the other, an automatic analysis scheme was built on the basis of signal processing algorithms and supervised machine learning techniques.

    Material and methods. A preliminary annotated database was created with 119 recordings of the sustained Spanish /a/, perceptually labeled by three experienced experts in vocal quality analysis. It is freely available under "Links" on the ATIC website (www.atic.uma.es). Voice signals were recorded with a headset condenser cardioid microphone (AKG C-544 L) positioned 5 cm from the speaker's mouth commissure. Speakers were instructed to sustain the Spanish vowel /a/ for 4 seconds. The microphone was connected to an Edirol R-09HR digital recorder, and voice signals were digitized at 16 bits with a 44100 Hz sampling rate. Afterwards, the initial and final 0.5 s segments were cut and the 3 s mid portion was selected for acoustic analysis. Judges used Sennheiser HD219 headphones to evaluate the voice samples perceptually. To label the recordings, raters used the Grade-Roughness-Breathiness (GRB) perceptual scale, a version of Hirano's original GRBAS scale as later modified by Dejonckere et al. [6]. To improve intra- and inter-rater agreement, two modifications were introduced in the rating procedure: the resolution of the 0-3 point scale was increased by adding subintervals to the standard 0-3 intervals, and judges were provided with a written protocol giving explicit definitions of the subinterval boundaries. In this way judges could compensate for the potential instability that may occur in their internal representations due to the influence of the perceptual context [7]. Raters' perceptual evaluations were performed simultaneously by connecting the Sennheiser HD219 headphones to a Behringer HA4700 Powerplay Pro-XL multi-channel headphone preamp.

    The YIN algorithm [8] was selected as the initial front end to identify voiced frames and extract their fundamental frequency. For the digital processing of the voice signals, some conventional acoustic parameters [6] were selected. To complete the analysis, Mel-Frequency Cepstral Coefficients (MFCC) were also calculated, because they are based on an auditory model and are thus closer to the auditory system's response than conventional features.

    Results. In the perceptual evaluation, excellent intra-rater and very good inter-rater agreement were achieved. During the supervised machine learning stage, some conventional features attained unexpectedly low performance in the selected classification scheme. Mel-Frequency Cepstral Coefficients were promising for sorting samples with normal or quasi-normal voice quality.

    Discussion and conclusions. Although it is still small and unbalanced, the present annotated database of voice samples can provide a basis for the development of other databases and automatic classification tools. Other authors [9, 10, 11] have also found that modeling the non-linear auditory response during signal processing can help develop objective measures that correspond better with perceptual data. However, the classification of highly disordered voices remains a challenge for this set of features, since such voices cannot be correctly sorted by either the conventional variables or the auditory-model-based measures. The current results warrant further research to determine the usefulness of other types of voice samples and features for automatic classification schemes. Different digital processing steps could be used to improve the classifiers' performance, and other types of classifiers could be considered in future studies.

    Acknowledgment. This work was funded by the Spanish Ministerio de Economía y Competitividad, Project No. TIN2013-47276-C6-2-R, and was carried out in the Campus de Excelencia Internacional Andalucía Tech, Universidad de Málaga.

    References. [1] Carding PN, Wilson JA, MacKenzie K, Deary IJ. Measuring voice outcomes: state of the science review. J Laryngol Otol 2009;123(8):823-829. [2] Oates J. Auditory-perceptual evaluation of disordered voice quality: pros, cons and future directions. Folia Phoniatr Logop 2009;61(1):49-56. [3] Maryn et al. Meta-analysis on acoustic voice quality measures. J Acoust Soc Am 2009;126(5):2619-2634. [4] Vaz Freitas et al. Correlation between acoustic and audio-perceptual measures. J Voice 2015;29(3):390.e1. [5] Multi-Dimensional Voice Program (MDVP) Model 5105: Software Instruction Manual. Kay PENTAX, A Division of PENTAX Medical Company, Lincoln Park, NJ, USA, November 2007. [6] Dejonckere PH, Bradley P, Clemente P, Cornut G, Crevier-Buchman L, Friedrich G, Van De Heyning P, Remacle M, Woisard V. A basic protocol for functional assessment of voice pathology, especially for investigating the efficacy of (phonosurgical) treatments and evaluating new assessment techniques. Guideline elaborated by the Committee on Phoniatrics of the European Laryngological Society (ELS). Eur Arch Otorhinolaryngol 2001;258:77-82. [7] Kreiman et al. Voice quality perception. J Speech Hear Res 1993;36:21-4. [8] De Cheveigné A, Kawahara H. YIN, a fundamental frequency estimator for speech and music. J Acoust Soc Am 2002;111(4):1917. [9] Shrivastav et al. Measuring breathiness. J Acoust Soc Am 2003;114(4):2217-2224. [10] Saenz-Lechon et al. Automatic assessment of voice quality according to the GRBAS scale. Conf Proc IEEE Eng Med Biol Soc 2006;1:2478-2481. [11] Fredouille et al. Back-and-forth methodology for objective voice quality assessment: from/to expert knowledge to/from automatic classification of dysphonia. EURASIP J Appl Signal Process 2009.
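As a concrete illustration of the front end described above (YIN-based F0 extraction plus MFCCs computed on the 3 s mid portion of each 4 s recording), the following Python sketch uses librosa. It is only a sketch of the general pipeline, not the authors' code: the file name, pitch-search range, and number of coefficients are illustrative assumptions.

```python
import librosa
import numpy as np

SR = 44100  # recordings were digitized at 16 bit / 44100 Hz

# Load one sustained /a/ recording (hypothetical file name).
y, sr = librosa.load("sustained_a.wav", sr=SR)

# Keep only the 3 s mid portion: drop the first and last 0.5 s
# of the 4 s utterance, as in the recording protocol.
y = y[int(0.5 * sr):int(3.5 * sr)]

# YIN fundamental-frequency track; frames with implausible F0 could be
# treated as unvoiced and discarded before computing statistics.
f0 = librosa.yin(y, fmin=60, fmax=500, sr=sr)

# 13 Mel-frequency cepstral coefficients per frame (a common default).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(f"median F0: {np.median(f0):.1f} Hz, MFCC shape: {mfcc.shape}")
```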

    F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder

    Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Many style-transfer-inspired methods, such as generative adversarial networks (GANs) and variational autoencoders (VAEs), have been proposed. Recently AutoVC, a method based on conditional autoencoders (CAEs), achieved state-of-the-art results by disentangling speaker identity from speech content using information-constraining bottlenecks; it achieves zero-shot conversion by swapping in a different speaker's identity embedding to synthesize a new voice. However, we found that while speaker identity is disentangled from speech content, a significant amount of prosodic information, such as the source F0, leaks through the bottleneck, causing the target F0 to fluctuate unnaturally. Furthermore, AutoVC has no control over the converted F0 and is thus unsuitable for many applications. In this paper, we modify and improve autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time. As a result, we can control the F0 contour, generate speech with an F0 consistent with the target speaker, and significantly improve quality and similarity. We support our improvement through quantitative and qualitative analysis.
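The conditioning idea can be sketched compactly: the decoder consumes the content bottleneck together with a speaker embedding and an explicit F0 contour, so F0 can be specified at conversion time. The PyTorch sketch below is a minimal, hypothetical rendering of that scheme, not the paper's AutoVC architecture; all module types, sizes, and names are invented for illustration.

```python
import torch
import torch.nn as nn

class CondAutoencoder(nn.Module):
    def __init__(self, n_mels=80, d_content=16, d_spk=64, d_f0=1):
        super().__init__()
        # Narrow bottleneck: the information constraint meant to squeeze
        # speaker- and prosody-specific detail out of the content code.
        self.encoder = nn.GRU(n_mels, d_content, batch_first=True)
        self.decoder = nn.GRU(d_content + d_spk + d_f0, 256, batch_first=True)
        self.out = nn.Linear(256, n_mels)

    def forward(self, mel, spk_emb, f0):
        # mel: (B, T, n_mels); spk_emb: (B, d_spk); f0: (B, T, 1)
        content, _ = self.encoder(mel)
        T = mel.size(1)
        # Decoder sees content + speaker identity + explicit F0 contour.
        cond = torch.cat([content,
                          spk_emb.unsqueeze(1).expand(-1, T, -1),
                          f0], dim=-1)
        h, _ = self.decoder(cond)
        return self.out(h)  # reconstructed mel-spectrogram

# Conversion: encode the source mel, then decode with the target
# speaker's embedding and a target-consistent F0 contour.
```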

    A novel framework for high-quality voice source analysis and synthesis

    The analysis, parameterization, and modeling of voice source estimates obtained via inverse filtering of recorded speech are among the most challenging areas of speech processing, owing to the fact that humans produce a wide range of voice source realizations and that voice source estimates commonly contain artifacts due to the non-linear, time-varying source-filter coupling. Currently the most widely adopted representation of the voice source signal is the Liljencrants-Fant (LF) model, developed in 1985. Due to its overly simplistic interpretation of voice source dynamics, the LF model cannot represent the fine temporal structure of glottal flow derivative realizations, nor can it carry sufficient spectral richness to facilitate truly natural-sounding speech synthesis. In this thesis we have introduced Characteristic Glottal Pulse Waveform Parameterization and Modeling (CGPWPM), an entirely novel framework for voice source analysis, parameterization, and reconstruction. In a comparative evaluation of CGPWPM and the LF model we have demonstrated that the proposed method preserves higher levels of speaker-dependent information from the voice source estimates and realizes more natural-sounding speech synthesis. In general, we have shown that CGPWPM-based speech synthesis rates highly on the scale of absolute perceptual acceptability and that speech signals are faithfully reconstructed on a consistent basis, across speakers and genders. We have applied CGPWPM to voice quality profiling and to a text-independent voice quality conversion method. The proposed voice conversion method is able to achieve the desired perceptual effects, and the modified speech remains as natural-sounding and intelligible as natural speech. In this thesis we have also developed an optimal wavelet thresholding strategy for voice source signals which is able to suppress aspiration noise while retaining both the slow and the rapid variations in the voice source estimate.
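For reference, the LF model discussed above describes one period of the glottal flow derivative as an exponentially growing sinusoid (open phase) followed by an exponential return phase. The Python sketch below is a minimal, illustrative implementation under common textbook conventions: the timing parameters and the growth constant alpha are illustrative values, and the area-balance constraint that a full implementation would solve for alpha is skipped here.

```python
import numpy as np

def lf_pulse(fs=16000, f0=110.0, Tp=0.45, Te=0.55, Ta=0.03, Ee=1.0, alpha=600.0):
    """One LF glottal-flow-derivative period; Tp, Te, Ta given as fractions of T0."""
    T0 = 1.0 / f0
    t = np.arange(int(fs * T0)) / fs
    Tp, Te, Ta, Tc = Tp * T0, Te * T0, Ta * T0, T0

    # Solve eps * Ta = 1 - exp(-eps * (Tc - Te)) by fixed-point iteration.
    eps = 1.0 / Ta
    for _ in range(20):
        eps = (1.0 - np.exp(-eps * (Tc - Te))) / Ta

    wg = np.pi / Tp                                     # "glottal" angular frequency
    E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))   # enforce E(Te) = -Ee

    e = np.where(
        t <= Te,
        E0 * np.exp(alpha * t) * np.sin(wg * t),                  # open phase
        -(Ee / (eps * Ta)) * (np.exp(-eps * (t - Te))
                              - np.exp(-eps * (Tc - Te))),        # return phase
    )
    return t, e

t, e = lf_pulse()  # integrating e over the period gives the glottal flow pulse
```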

    Voice source characterization for prosodic and spectral manipulation

    The objective of this dissertation is to study and develop techniques to decompose the speech signal into its two main components: the voice source and the vocal tract. Our main efforts are on glottal pulse analysis and characterization, and we explore the utility of this model in different areas of speech processing: speech synthesis, voice conversion, and emotion detection, among others. We therefore study different techniques for prosodic and spectral manipulation. One of our requirements is that the methods should be robust enough to work with the large databases typical of speech synthesis. We use a speech production model in which the glottal flow produced by the vibrating vocal folds passes through the vocal (and nasal) tract cavities and is radiated by the lips. Removing the effect of the vocal tract from the speech signal to obtain the glottal pulse is known as inverse filtering. We use a parametric model of the glottal pulse directly in the source-filter decomposition phase.

    In order to validate the accuracy of the parameterization algorithm, we designed a synthetic corpus using LF glottal parameters reported in the literature, complemented with our own results from the vowel database. The results show that our method performs satisfactorily over a wide range of glottal configurations and at different levels of SNR. Our method using the whitened residual compared favorably to the reference method, achieving high quality ratings (Good-Excellent). Our fully parameterized system scored lower than the other two, ranking third, but still above the acceptance threshold (Fair-Good). We then proposed two methods for prosody modification, one for each of the residual representations described above. The first method uses our full parameterization system and frame interpolation to perform the desired changes in pitch and duration. The second method uses resampling of the residual waveform and a frame selection technique to generate a new sequence of frames to be synthesized. The results showed that both methods are rated similarly (Fair-Good) and that more work is needed to achieve quality levels similar to those of the reference methods.

    As part of this dissertation, we have studied the application of our models in three different areas: voice conversion, voice quality analysis, and emotion recognition. We included our speech production model in a reference voice conversion system to evaluate the impact of our parameterization on this task. The results showed that the evaluators preferred our method over the original one, rating it with a higher score on the MOS scale. To study voice quality, we recorded a small database of isolated, sustained Spanish vowels in four different phonations (modal, rough, creaky, and falsetto). Comparing the results with those reported in the literature, we found them to generally agree with previous findings; some differences existed, but they could be attributed to the difficulty of comparing voice qualities produced by different speakers. At the same time we conducted experiments in voice quality identification, with very good results. Finally, we evaluated the performance of an automatic emotion classifier based on GMMs using glottal measures. For each emotion, we trained a specific model using different features, comparing our parameterization to a baseline system using spectral and prosodic characteristics. The results of the test were very satisfactory, showing a relative error reduction of more than 20% with respect to the baseline system. The accuracy of detection of the different emotions was also high, improving on the results previously reported with the same database. Overall, we can conclude that the glottal source parameters extracted using our algorithm have a positive impact on automatic emotion classification.
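A per-emotion GMM classifier of the kind described can be sketched in a few lines with scikit-learn. This is a hedged illustration of the general technique, not the dissertation's system: the feature arrays, number of mixture components, and all names below are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(features_by_emotion, n_components=8):
    """features_by_emotion: dict emotion -> (n_frames, n_features) array
    of glottal (or spectral/prosodic baseline) feature vectors."""
    return {emo: GaussianMixture(n_components, covariance_type="diag",
                                 random_state=0).fit(X)
            for emo, X in features_by_emotion.items()}

def classify(gmms, utterance_features):
    """Pick the emotion whose model gives the highest mean log-likelihood
    over the utterance's frames."""
    scores = {emo: gmm.score(utterance_features) for emo, gmm in gmms.items()}
    return max(scores, key=scores.get)
```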

    Factors Affecting the Performance of Automated Speaker Verification in Alzheimer's Disease Clinical Trials

    Detecting duplicate patient participation in clinical trials is a major challenge because repeated patients can undermine the credibility and accuracy of the trial's findings and result in significant health and financial risks. Developing accurate automated speaker verification (ASV) models is crucial to verify the identity of enrolled individuals and remove duplicates, but the size and quality of the data influence ASV performance. However, there has been limited investigation into the factors that can affect ASV capabilities in clinical environments. In this paper, we bridge the gap by conducting an analysis of how participant demographic characteristics, audio quality criteria, and severity level of Alzheimer's disease (AD) impact the performance of ASV, using a dataset of speech recordings from 659 participants with varying levels of AD obtained through multiple speech tasks. Our results indicate that ASV performance: 1) is slightly better on male speakers than on female speakers; 2) degrades for individuals who are above 70 years old; 3) is comparatively better for non-native English speakers than for native English speakers; 4) is negatively affected by clinician interference, noisy background, and unclear participant speech; 5) tends to decrease with an increase in the severity level of AD. Our study finds that voice biometrics raise fairness concerns, as certain subgroups exhibit different ASV performance owing to their inherent voice characteristics. Moreover, the performance of ASV is influenced by the quality of the speech recordings, which underscores the importance of improving the data-collection settings in clinical trials. Accepted to the 5th Clinical Natural Language Processing Workshop (ClinicalNLP) at ACL 2023.
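The abstract does not say which metric underlies these subgroup comparisons; a common choice for ASV is the equal error rate (EER), and a per-subgroup breakdown like the sketch below is one plausible way to quantify such fairness gaps. The score arrays, labels, and group variable are assumed inputs.

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer(labels, scores):
    """labels: 1 = same-speaker trial, 0 = different-speaker trial;
    scores: verification similarity scores (higher = more similar)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # threshold where FAR ~= FRR
    return (fpr[idx] + fnr[idx]) / 2

def eer_by_group(labels, scores, groups):
    """Compute EER separately for each demographic subgroup.
    labels/scores/groups: aligned 1-D numpy arrays over trials."""
    return {g: eer(labels[groups == g], scores[groups == g])
            for g in np.unique(groups)}
```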

    Development of a Two-Level Warping Algorithm and Its Application to Speech Signal Processing

    In many different fields there are signals that need to be aligned or "warped" in order to measure the similarity between them. When two time signals are compared, or when a pattern is sought in a larger stream of data, it may be necessary to warp one of the signals in a nonlinear way, compressing or stretching it to fit the other. Simple point-to-point comparison may give inadequate results, because one part of a signal might be compared against a different relative part of the other signal or pattern; such cases require some sort of alignment before the comparison. Dynamic Time Warping (DTW) is a powerful and widely used technique of time series analysis which performs such nonlinear warping in the temporal domain.

    The work in this dissertation develops in two directions. The first direction extends dynamic time warping to produce a two-level dynamic warping algorithm, with warping in both the temporal and spectral domains. While hundreds of research efforts in the last two decades have applied the one-dimensional warping idea to time series, extending the DTW method to two or more dimensions poses a more involved problem. The two-dimensional dynamic warping algorithm developed here is ideally suited to a variety of speech signal processing tasks.

    The second direction focuses on two speech signal applications. The first application is the evaluation of dysarthric speech. Dysarthria is a neurological motor speech disorder characterized by spectral and temporal degradation in speech production. Dysarthria management has focused primarily on teaching patients to improve their ability to produce speech, or on strategies to compensate for their deficits. However, many individuals with dysarthria are not well suited to traditional speaker-oriented intervention. Recent studies have shown that speech intelligibility can be improved by training the listener to better understand the degraded speech signal. A computer-based training tool was developed using a two-level dynamic warping algorithm, to eventually be incorporated into a program that trains listeners to imitate dysarthric speech by providing subjects with feedback about the accuracy of their imitation attempts during training. The second application is voice transformation. Voice transformation techniques aim to modify a subject's voice characteristics to make them sound like someone else, for example converting a male speaker's voice to a female speaker's. The approach taken here avoids the need to find acoustic parameters, as many voice transformation methods do, and instead deals directly with spectral information. Based on the two-level DW, it is straightforward to map the source speech to the target speech when both are available. The spectrally warped signal produced in this way introduces significant processing artifacts, so phase reconstruction was applied to the transformed signal to improve the quality of the final sound, and neural networks were trained to perform the voice transformation.
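The classic one-dimensional DTW recurrence that the two-level algorithm builds on can be written compactly: fill a cumulative-cost matrix with the standard three-way recurrence, and the bottom-right cell gives the alignment cost. The sketch below is a minimal textbook implementation (a real system would also backtrack to recover the warping path, and would use a more suitable local distance for speech frames).

```python
import numpy as np

def dtw(x, y):
    """Return the DTW alignment cost between 1-D sequences x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])        # local distance
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

print(dtw([0, 1, 2, 3], [0, 0, 1, 2, 2, 3]))  # small warp -> cost 0.0
```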

    Secure VoIP Performance Measurement

    This project presents a mechanism for the instrumentation of secure VoIP calls. The experiments were run under different network conditions and security systems, with VoIP services such as Google Talk, Express Talk, and Skype under test. The project allowed analysis of the voice quality of the VoIP services based on the Mean Opinion Score (MOS) values generated by Perceptual Evaluation of Speech Quality (PESQ). The audio streams produced were subjected to end-to-end delay, jitter, packet loss, and extra processing in the networking hardware and end devices due to Internetworking Layer security or Transport Layer security implementations. The MOS values were mapped to Perceptual Evaluation of Speech Quality for wideband (PESQ-WB) scores, and from these scores the graphs of the mean of 10 runs and box-and-whisker plots for each parameter were drawn. Analysis of the graphs was performed in order to deduce the quality of each VoIP service. The E-model was used to predict network readiness, and the Common Vulnerability Scoring System (CVSS) was used to predict network vulnerabilities. The project also provided a mechanism to measure the throughput for each test case. The overall performance of each VoIP service was determined by the PESQ-WB scores, the CVSS scores, and the throughput. The experiment demonstrated the relationship among VoIP performance, VoIP security, and VoIP service type. It also suggested that, compared to an unsecured IPIP tunnel, Internetworking Layer security such as IPsec ESP or Transport Layer security such as OpenVPN TLS would improve VoIP security by reducing the vulnerabilities of the media part of the VoIP signal. Moreover, adding a security layer has little impact on VoIP voice quality.
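As background for the E-model mentioned above: ITU-T G.107 summarizes a connection in a scalar rating factor R and defines a standard mapping from R to an estimated MOS. The helper below implements that published mapping; computing R itself (from delay, loss, and codec impairment terms) is outside the scope of this sketch.

```python
def r_to_mos(r: float) -> float:
    """ITU-T G.107 mapping from the E-model rating factor R to estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)

print(r_to_mos(93.2))  # default R of an unimpaired narrowband call -> ~4.41
```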

    Synthetic voice design and implementation.

    The limitations of speech output technology emphasise the need for exploratory psychological research to maximise the effectiveness of speech as a display medium in human-computer interaction. Stage 1 of this study reviewed speech implementation research, focusing on general issues for tasks, users, and environments. An analysis of design issues was conducted, related to the differing methodologies for synthesised and digitised message production, and a selection of ergonomic guidelines was developed to enhance effective speech interface design.

    Stage 2 addressed the negative reactions of users to synthetic speech in spite of elegant dialogue structure and appropriate functional assignment. Synthetic speech interfaces have been consistently rejected by their users in a wide variety of application domains because of their poor quality; indeed, the literature repeatedly emphasises quality as the most important contributor to implementation acceptance. In order to investigate this, a converging-operations approach was adopted, consisting of a series of five experiments (and associated pilot studies) which homed in on the specific characteristics of synthetic speech that determine listeners' varying perceptions of its qualities, and how these might be manipulated to improve its aesthetics. A flexible and reliable ratings interface was designed to display DECtalk speech variations and record listeners' perceptions.

    In experiment one, 40 participants used this interface to evaluate synthetic speech variations on a wide range of perceptual scales. Factor analysis revealed two main factors: "listenability", accounting for 44.7% of the variance and correlating with the DECtalk "smoothness" parameter at 0.57 (p<0.005) and with "richness" at 0.53 (p<0.005); and "assurance", accounting for 12.6% of the variance and correlating with "average pitch" at 0.42 (p<0.005) and "head size" at 0.42 (p<0.005). Complementary experiments were then required to address appropriate voice design for enhanced listenability and assurance perceptions. With a standard male voice set, 20 participants rated enhanced smoothness and attenuated richness as contributing significantly to speech listenability (p<0.001). Experiment three, using a female voice set, yielded comparable results, suggesting that further refinements of the technique were necessary in order to develop an effective methodology for speech quality optimization. At this stage it became essential to focus directly on the parameter modifications associated with the aesthetically pleasing characteristics of synthetic speech. If a reliable technique could be developed to enhance perceived speech quality, then synthesis systems based on the commonly used DECtalk model might assume some of their considerable yet unfulfilled potential. In experiment four, 20 subjects rated a wide range of voices modified across the two main parameters associated with perceived listenability, smoothness and richness. The results clearly revealed a linear relationship between enhanced smoothness and attenuated richness and significant improvements in perceived listenability (p<0.001 in both cases). Planned comparisons were conducted between the different levels of the parameters and revealed significant listenability enhancements as smoothness was increased, and a similar pattern as richness decreased. Statistical analysis also revealed a significant interaction between the two parameters (p<0.001), and a more comprehensive picture was constructed.

    In order to expand the focus and enhance the generality of the research, it was then necessary to assess the effects of synthetic speech modifications whilst subjects were undertaking a more realistic task. Passively rating the voices independently of processing for meaning is arguably an artificial task which rarely, if ever, would occur in 'real-world' settings. To investigate perceived quality in a more realistic task scenario, experiment five introduced two levels of information-processing load. The purpose of this experiment was firstly to see whether a comprehension load modified the pattern of listenability enhancements, and secondly to see whether that pattern differed between high and low load. Techniques for introducing cognitive load were investigated, and comprehension load was selected as the most appropriate method in this case. A pilot study distinguished two levels of comprehension load from a set of 150 true/false sentences, and these were recorded across the full range of parameter modifications. Twenty subjects then rated the voices using the established listenability scales as before, while also performing the additional task of processing each spoken stimulus for meaning and determining the authenticity of the statements. Results indicated that listenability enhancements did indeed occur at both levels of processing, although at the higher level variations in the pattern occurred. A significant difference was revealed between optimal parameter modifications for conditions of high and low cognitive load (p<0.05): subjects perceived the synthetic voices in the high cognitive load condition to be significantly less listenable than the same voices in the low cognitive load condition. The analysis also revealed that this effect was independent of the number of errors made. This result may be of general value, because the conclusions drawn from these findings are independent of any particular parameter modifications that may be exclusively available to DECtalk users.

    Overall, the study presents a detailed analysis of the research domain combined with a systematic experimental programme of synthetic speech quality assessment. The experiments reported establish a reliable and replicable procedure for optimising the aesthetically pleasing characteristics of DECtalk speech, but the implications of the research extend beyond the boundaries of a particular synthesiser. The results lead to a number of conclusions, the most salient being that the synthetic speech designer not only has to overcome the general rejection of synthetic voices, based on their poor quality, by sophisticated customisation of synthetic voice parameters, but also needs to take into account the cognitive load of the task being undertaken. The interaction between cognitive load and optimal settings for synthesis requires direct consideration if synthetic speech systems are to realise and maximise their potential in human-computer interaction.
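For readers unfamiliar with the factor-analysis step in experiment one, the sketch below shows the general shape of such an analysis in Python with scikit-learn: a listeners-by-scales ratings matrix is reduced to two latent factors whose loadings indicate which perceptual scales each factor explains. The ratings matrix here is a random placeholder, not the study's data, and the dimensions are illustrative.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
ratings = rng.normal(size=(40, 12))   # placeholder: 40 listeners x 12 perceptual scales

# Fit a two-factor model (cf. "listenability" and "assurance" in the study).
fa = FactorAnalysis(n_components=2, random_state=0).fit(ratings)

loadings = fa.components_             # shape (2, 12): per-factor loadings on each scale
print(loadings.shape)
```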