
    Sampling-based speech parameter generation using moment-matching networks

    This paper presents sampling-based speech parameter generation using moment-matching networks for Deep Neural Network (DNN)-based speech synthesis. Although people never produce exactly the same speech twice, even when expressing the same linguistic and para-linguistic information, typical statistical speech synthesis produces identical speech every time; that is, there is no inter-utterance variation in synthetic speech. To give synthetic speech natural inter-utterance variation, this paper builds DNN acoustic models from which speech parameters can be randomly sampled. The DNNs are trained so that the moments of the generated speech parameters are close to those of natural speech parameters. Since the variation of the speech parameters is compressed into a low-dimensional, simple prior noise vector, the algorithm has a lower computational cost than direct sampling of speech parameters. As a first step towards generating synthetic speech with natural inter-utterance variation, this paper investigates whether or not the proposed sampling-based generation deteriorates synthetic speech quality. In the evaluation, we compare the speech quality of conventional maximum-likelihood-based generation and the proposed sampling-based generation. The result demonstrates that the proposed generation causes no degradation in speech quality.
    Comment: Submitted to INTERSPEECH 201
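    The moment-matching idea can be illustrated with a kernel-based maximum mean discrepancy (MMD) loss, the standard training criterion for moment-matching networks. The sketch below is an illustrative assumption rather than the paper's exact setup: the network architecture, layer sizes, kernel bandwidth, and dimensions are all hypothetical placeholders.

```python
import torch

def mmd_loss(generated, natural, bandwidth=1.0):
    """Squared MMD between two batches of parameter vectors (Gaussian kernel)."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2          # pairwise squared distances
        return torch.exp(-d2 / (2.0 * bandwidth ** 2)).mean()
    return k(generated, generated) + k(natural, natural) - 2.0 * k(generated, natural)

# Toy training loop: a small net maps a low-dimensional prior noise vector to
# speech-parameter vectors; minimising the MMD pulls the moments of generated
# samples towards those of (stand-in) natural samples.
net = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.Tanh(), torch.nn.Linear(64, 40))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
natural = torch.randn(32, 40)                # placeholder natural speech parameters
for _ in range(100):
    noise = torch.randn(32, 8)               # simple prior noise vector
    loss = mmd_loss(net(noise), natural)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

    At generation time, drawing fresh noise vectors yields different but plausible parameter sequences, which is what gives the synthetic speech inter-utterance variation.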

    Optimization Scheme of Joint Noise Suppression and Dereverberation Based on Higher-Order Statistics

    APSIPA ASC 2012: Asia-Pacific Signal and Information Processing Association 2012 Annual Summit and Conference, December 3-6, 2012, Hollywood, California, USA.
    In this paper, we apply a higher-order-statistics parameter to automatically improve the performance of blind speech enhancement. Recently, a method that suppresses both diffuse background noise and the late-reverberation part of speech has been proposed, combining blind signal extraction and Wiener filtering. However, this method requires a good strategy for choosing its parameter set in order to achieve the optimum result and to control the amount of musical noise, a common problem in non-linear signal processing. We present an optimization scheme that controls the Wiener filter coefficients used in this method based on the amount of musical noise generated, as measured by higher-order statistics. The noise reduction rate and cepstral distortion are also evaluated to confirm the effectiveness of this scheme.
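    One common higher-order-statistics criterion for musical noise is the kurtosis ratio of the spectral power before and after enhancement. The sketch below shows that idea together with a parametric Wiener gain whose strength is swept against the measured ratio; the function names, the `beta` parameter, and the toy data are illustrative assumptions, not the paper's actual optimization scheme.

```python
import numpy as np
from scipy.stats import kurtosis

def kurtosis_ratio(before, after):
    # Musical noise correlates with the growth in kurtosis of the power
    # spectrum caused by non-linear processing (ratio >> 1 => more artifacts).
    k_before = kurtosis(np.abs(before) ** 2, fisher=False)
    k_after = kurtosis(np.abs(after) ** 2, fisher=False)
    return k_after / k_before

def parametric_wiener_gain(noisy_power, noise_power, beta=1.0):
    # Larger beta suppresses more noise but typically raises the kurtosis
    # ratio, i.e. generates more musical noise.
    snr = np.maximum(noisy_power / noise_power - 1.0, 1e-6)
    return snr / (snr + beta)

rng = np.random.default_rng(0)
noisy = rng.normal(size=16000)               # stand-in spectral samples
noise_power = np.ones_like(noisy)
for beta in (0.5, 1.0, 2.0):
    enhanced = parametric_wiener_gain(noisy ** 2, noise_power, beta) * noisy
    print(f"beta={beta}: kurtosis ratio={kurtosis_ratio(noisy, enhanced):.2f}")
```

    An optimizer of the kind the paper describes would search the filter-parameter space for the best noise reduction subject to keeping such a musical-noise measure acceptably low.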

    Overlapped speech and music segmentation using singular spectrum analysis and random forests

    Recent years have seen ever-increasing volumes of digital media archives and an enormous amount of user-contributed content. As demand for indexing and searching these resources has increased, and new technologies such as multimedia content management systems, enhanced digital broadcasting, and the semantic web have emerged, audio information mining and automated metadata generation have received much attention. Manual indexing and metadata tagging are time-consuming and subject to the biases of individual workers. An automated architecture able to extract information from audio signals, generate content-related text descriptors or metadata, and enable further information mining and searching would be a tangible and valuable solution.

    In the field of audio classification, audio signals may be broadly divided into speech or music. Most studies, however, neglect the fact that real audio soundtracks may contain speech, music, or a combination of the two. This overlap is considered the major hurdle to achieving high performance in automatic audio classification, since it can contaminate relevant characteristics and features, causing incorrect classification or information loss.

    This research undertakes an extensive review of the state of the art, outlining the well-established audio features and machine learning techniques that have been applied in a broad range of audio segmentation and recognition areas. Audio classification systems and the suggested solutions for the mixed-soundtracks problem are presented. The suggested solutions are: developing augmented and modified features for recognising audio classes even in the presence of overlaps between them; robust segmentation of a given overlapped soundtrack stream based on an innovative method of audio decomposition using Singular Spectrum Analysis (SSA), which has been studied extensively and has received increasing attention over the past two decades as a time-series decomposition method with many applications; adoption and development of driven classification methods; and, finally, a technique for continuous time-series tasks.

    In this study, SSA has been investigated and found to be an efficient way to discriminate speech/music in mixed soundtracks by two different methods, each of which has been developed and validated in this research. The first method mitigates the overlapping ratio between speech and music in the mixed soundtracks by generating two new soundtracks with a lower level of overlapping. Next, a feature space is calculated for the output audio streams, and these are classified into either speech or music using random forests. One distinct characteristic of this method is the separation of the key speech/music features, which improves classification performance. Nevertheless, it encounters a few obstructions, including excessively long processing time and increased storage requirements (each frame is represented by two outputs), all of which lead to a greater computational load than before. Meanwhile, the second method employs the SSA technique to decompose a given audio signal into a series of Principal Components (PCs), where each PC corresponds to a particular pattern of oscillation. The transformed well-established features are then measured for each PC in order to classify it into either speech or music using a baseline classification system based on a Random Forest (RF) machine learning technique.

    The classification performance on real-world soundtracks is effectively improved, as demonstrated by comparing speech/music recognition using conventional classification methods and the proposed SSA method. The second proposed method can detect pure speech, pure music, and mixed segments with a much lower complexity level.
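    As a rough illustration of the second method's building blocks, the sketch below performs a basic SSA decomposition (trajectory matrix, SVD, diagonal averaging) and feeds simple per-PC features to a random forest. The window length, number of components, feature choice, and dummy labels are assumptions made for illustration only; the thesis's actual features and pipeline differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def ssa_components(x, window=50, n_components=5):
    """Decompose a 1-D signal into SSA principal components."""
    n, k = len(x), len(x) - window + 1
    traj = np.column_stack([x[i:i + window] for i in range(k)])   # Hankel matrix
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    comps = []
    for i in range(min(n_components, len(s))):
        elem = s[i] * np.outer(u[:, i], vt[i])   # rank-1 elementary matrix
        # Diagonal averaging (Hankelisation) back to a length-n time series.
        comps.append(np.array([elem[::-1].diagonal(t - window + 1).mean()
                               for t in range(n)]))
    return comps

def pc_features(comp):
    # Stand-in features: log energy and zero-crossing rate of the component.
    zcr = np.mean(np.abs(np.diff(np.sign(comp)))) / 2.0
    return [np.log(np.mean(comp ** 2) + 1e-12), zcr]

rng = np.random.default_rng(0)
x = rng.normal(size=400)                          # placeholder audio frame
feats = np.array([pc_features(c) for c in ssa_components(x)])
labels = rng.integers(0, 2, size=len(feats))      # dummy speech/music labels
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(feats, labels)
print(clf.predict(feats))
```

    Because each PC isolates one oscillatory pattern, classifying PCs individually sidesteps the feature contamination that overlapped speech and music cause in a single mixed frame, which is the intuition behind the method's lower-complexity handling of mixed segments.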