6 research outputs found

    Development of a Speech Database for Indonesian Natural Speech Synthesis Based on the Hidden Markov Model (HMM)

    One speech synthesis technique is statistical parametric speech synthesis using Hidden Markov Models (HMMs). Speech synthesis for Indonesian using HTS had not previously been developed; the language remains under-resourced. This research began with the construction of an Indonesian speech database through a recording process, followed by segmentation into phonetic symbols and labeling. The resulting Indonesian database contains 1,529 phonetically balanced sentences, covering all 33 phoneme types. In addition, a segmented and labeled dataset of 100 sentences recorded by a male speaker and 100 sentences recorded by a female speaker was produced. Software for running an English HMM-based speech synthesis system was prepared by applying HTS. In a subjective voice-quality test involving 20 respondents, naturalness reached a Mean Opinion Score (MOS) of 3.4 for the speaker-dependent (SD) training demo and 3.2 for the speaker-adaptation/adaptive (SAD) training demo. The resulting synthetic speech can therefore be rated as good, and the software can be used to design an Indonesian speech synthesis system.
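    The naturalness figures above are Mean Opinion Scores, i.e. the arithmetic mean of listener ratings on a 1-5 scale. A minimal sketch of that calculation follows; the individual ratings are hypothetical placeholders, not data from the paper.

```python
# Minimal sketch of a Mean Opinion Score (MOS) calculation.
# The ratings below are hypothetical placeholders, not data from the paper.
from statistics import mean, stdev

# Each of 20 listeners rates naturalness on a 1-5 scale (1 = bad, 5 = excellent).
sd_ratings  = [4, 3, 4, 3, 4, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 3, 4, 3, 3, 4]  # speaker-dependent demo
sad_ratings = [3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 4, 3, 3, 3, 3]  # speaker-adaptive demo

def mos(ratings):
    """MOS is simply the arithmetic mean of the opinion scores."""
    return mean(ratings)

print(f"SD  MOS: {mos(sd_ratings):.1f} (sd {stdev(sd_ratings):.2f})")
print(f"SAD MOS: {mos(sad_ratings):.1f} (sd {stdev(sad_ratings):.2f})")
```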

    HMM-Based Emotional Speech Synthesis Using Average Emotion Model

    This paper presents a technique for synthesizing emotional speech based on an emotion-independent model called an “average emotion” model. The average emotion model is trained on a multi-emotion speech database. By applying an MLLR-based model adaptation method, we can transform the average emotion model to represent a target emotion that is not included in the training data. A multi-emotion speech database covering four emotions, “neutral”, “happiness”, “sadness”, and “anger”, is used in our experiment. The results of subjective tests show that the average emotion model can effectively synthesize neutral speech and can be adapted to the target emotion model using very limited training data.
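    The adaptation step described above relies on MLLR, which maps every Gaussian mean of the average-emotion model through one shared affine transform estimated from a small amount of target-emotion data. The sketch below illustrates that idea with a least-squares stand-in for the transform estimation; the real MLLR estimator maximizes likelihood with state-occupancy weights, and the dimensions and statistics here are invented.

```python
# Sketch of the MLLR idea used for average-emotion-to-target-emotion adaptation:
# every Gaussian mean of the average model is mapped through one shared affine
# transform mu' = A @ mu + b estimated from a little target-emotion data.
# This is an illustrative least-squares stand-in, not the paper's EM-based estimator.
import numpy as np

rng = np.random.default_rng(0)
dim, n_states = 40, 200                                   # hypothetical feature dim / state count

avg_means = rng.normal(size=(n_states, dim))              # means of the "average emotion" model
target_means = 0.9 * avg_means + 0.3 + 0.05 * rng.normal(size=(n_states, dim))  # toy target statistics

# Estimate the shared transform W = [A | b] by least squares on extended means [mu, 1].
ext = np.hstack([avg_means, np.ones((n_states, 1))])      # shape (n_states, dim + 1)
W, *_ = np.linalg.lstsq(ext, target_means, rcond=None)    # shape (dim + 1, dim)

adapted_means = ext @ W                                    # apply mu' = A mu + b to every state
print("mean squared adaptation error:", np.mean((adapted_means - target_means) ** 2))
```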

    Design and development of speech synthesis software for Colombian Spanish applied to communication through mobile devices

    In many scenarios of everyday life, people need to communicate orally with others. However, technological solutions such as mobile phones cannot be used in places such as meetings, classrooms, or conference rooms without disrupting the activities of the people around the speaker. This research develops a tool that lets people hold a conversation in a public place without disturbing the surrounding environment. To this end, a speech synthesizer is implemented on a personal computer connected to a cell phone, which allows a mobile call to be made without using the human voice. The speech synthesizer uses the diphone concatenation technique and is developed specifically for Colombian Spanish. A mathematical description of the synthesizer shows its decomposition into several mutually independent processes. Several user-acceptance tests and quality tests of the generated signal were performed to evaluate the tool. The results show a high signal-to-noise ratio of the generated signals and high intelligibility of the tool.
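    Diphone concatenation, as used by the synthesizer above, covers the target phone sequence with recorded phone-to-phone transition units and splices them with short cross-fades. The following sketch shows the splicing step only; the diphone inventory is synthetic noise and the unit names are made up for illustration.

```python
# Minimal sketch of diphone concatenation: the target phone sequence is covered by
# diphone units (phone-to-phone transitions) that are spliced with a short linear
# cross-fade. The inventory here is synthetic noise, purely illustrative.
import numpy as np

SR = 16000
XFADE = int(0.005 * SR)                       # 5 ms cross-fade at each join

# Hypothetical diphone inventory: one waveform per transition, e.g. "s-o", "o-l".
rng = np.random.default_rng(1)
inventory = {d: rng.normal(scale=0.1, size=int(0.12 * SR))
             for d in ["_-s", "s-o", "o-l", "l-_"]}       # "sol" with silence boundaries

def concatenate(diphones):
    out = inventory[diphones[0]].copy()
    ramp = np.linspace(0.0, 1.0, XFADE)
    for name in diphones[1:]:
        nxt = inventory[name]
        out[-XFADE:] = out[-XFADE:] * (1 - ramp) + nxt[:XFADE] * ramp  # overlap-add join
        out = np.concatenate([out, nxt[XFADE:]])
    return out

signal = concatenate(["_-s", "s-o", "o-l", "l-_"])
print(f"synthesized {len(signal) / SR:.2f} s of audio")
```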

    Extraction of characteristic speech parameters for a Vietnamese speech synthesis system based on Hidden Markov Models

    Recently, the statistical framework based on Hidden Markov Models (HMMs) has come to play an important role in speech synthesis. Such a system can be built without requiring a very large speech corpus for training. In this method, statistical modeling is applied to learn the distributions of context-dependent acoustic vectors extracted from speech signals; each vector contains a parametric representation of one speech frame, and Vietnamese phonetic rules are used to synthesize the speech. The overall performance of such systems is often limited by the accuracy of the underlying speech parameterization and reconstruction method. The method proposed in this paper allows accurate MFCC, F0, and tone extraction and high-quality reconstruction of speech signals using a Mel Log Spectral Approximation (MLSA) filter. Its suitability for high-quality HMM-based speech synthesis is shown through subjective evaluations.
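    The analysis side described above (per-frame MFCC and F0 extraction) can be sketched as follows. librosa is assumed purely for illustration, since the paper does not name a toolkit, and the MLSA-filter reconstruction step is omitted because it requires a dedicated vocoder implementation (e.g. SPTK).

```python
# Sketch of the analysis step described in the abstract: per-frame MFCC and F0
# extraction from a speech waveform. librosa and the input file are assumptions
# for illustration only; MLSA-filter resynthesis (the reconstruction step) would
# additionally need a dedicated vocoder such as SPTK.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)    # hypothetical input file

# 13 MFCCs per 5 ms frame shift (a typical choice for HMM-based synthesis).
hop = int(0.005 * sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)

# F0 track with the probabilistic YIN estimator; unvoiced frames come back as NaN.
f0, voiced_flag, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)
f0 = np.where(voiced_flag, f0, 0.0)                # mark unvoiced frames with 0, HTS-style

print("MFCC frames:", mfcc.shape[1], "F0 frames:", len(f0))
```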

    Experimental evaluation of a statistical speech synthesis system, HTS, for French

    The work presented in this thesis concerns text-to-speech (TTS) synthesis and, more specifically, statistical parametric synthesis for French. We analyze the impact of the linguistic descriptors used to characterize a speech signal on the modeling performed by the HTS statistical speech synthesis system. To conduct the experiments, two objective evaluation protocols are proposed. The first uses Gaussian mixture models (GMMs) to represent the acoustic space produced by HTS for a given contextual feature set; by using a constant reference set of natural speech stimuli, the GMMs, and therefore the acoustic spaces generated by the different HTS configurations, can be compared with one another. The second protocol is based on pairwise distances between matched acoustic frames of natural speech and speech generated by HTS, allowing the modeling to be evaluated more locally. The results obtained with both protocols, and confirmed by subjective evaluations, show that using a large set of linguistic descriptors does not necessarily lead to better modeling and can be counter-productive for the quality of the synthesized signal.
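    The first objective protocol above can be sketched as follows: fit one GMM on the acoustic frames generated by each HTS configuration, then score a fixed set of natural reference frames under each GMM and compare the resulting average log-likelihoods. scikit-learn and the random feature matrices below are illustrative assumptions, not the thesis's actual tooling or data.

```python
# Sketch of the first objective protocol: one GMM per HTS configuration, fitted on
# the frames that configuration generates, then scored on a fixed natural reference
# set. scikit-learn and the random feature matrices are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
dim = 39                                            # e.g. MFCC + deltas + delta-deltas

# Stand-ins for frames generated by two HTS configurations and for natural reference frames.
frames_config_a = rng.normal(loc=0.0, scale=1.0, size=(5000, dim))
frames_config_b = rng.normal(loc=0.2, scale=1.1, size=(5000, dim))
reference_frames = rng.normal(loc=0.0, scale=1.0, size=(2000, dim))

def acoustic_space_score(generated, reference, n_components=16):
    """Higher mean log-likelihood means the reference frames fit this configuration's space better."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag", random_state=0)
    gmm.fit(generated)
    return gmm.score(reference)                     # mean per-frame log-likelihood

for name, frames in [("config A", frames_config_a), ("config B", frames_config_b)]:
    print(name, acoustic_space_score(frames, reference_frames))
```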