168,783 research outputs found

    ๋”ฅ๋Ÿฌ๋‹์„ ํ™œ์šฉํ•œ ์Šคํƒ€์ผ ์ ์‘ํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•

    Get PDF
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: Nam Soo Kim. Neural network-based speech synthesis techniques have been developed over the years. Although neural speech synthesis achieves remarkable generated speech quality, problems remain: limited modeling power in neural statistical parametric speech synthesis, and limited style expressiveness and the lack of a robust attention model in end-to-end speech synthesis. In this thesis, novel alternatives are proposed to resolve these drawbacks of conventional neural speech synthesis systems. In the first approach, we propose an adversarially trained variational recurrent neural network (AdVRNN), which applies a variational recurrent neural network (VRNN) to represent the variability of natural speech for acoustic modeling in neural statistical parametric speech synthesis. We also apply an adversarial learning scheme when training the AdVRNN to overcome the oversmoothing problem. Experimental results show that the proposed AdVRNN-based method outperforms conventional RNN-based techniques. In the second approach, we propose a novel style modeling method employing a mutual information neural estimator (MINE) in a style-adaptive end-to-end speech synthesis system. MINE is used to increase target-style information and suppress text information in the style embedding by adding a MINE term to the loss function. The experimental results show that the MINE-based method achieves promising performance in both speech quality and style similarity for the global style token (GST) Tacotron. In the third approach, we propose a novel attention method for end-to-end speech synthesis called memory attention, inspired by the gating mechanism of long short-term memory (LSTM). Leveraging the sequence modeling power of the LSTM gating technique, memory attention obtains a stable alignment from content-based and location-based features. We evaluate memory attention and compare its performance with various conventional attention techniques in single-speaker and emotional speech synthesis scenarios. From the results, we conclude that memory attention can robustly generate speech with large variability. In the last approach, we propose selective multi-attention (SMA) for style-adaptive end-to-end speech synthesis systems. A conventional single attention model may limit the expressivity needed to represent the numerous alignment paths that depend on style. To achieve variation in attention alignment, we propose a multi-attention model with a selection network: the multi-attention generates candidate alignments for the target style, and the selection network chooses the most appropriate attention among them. The experimental results show that selective multi-attention outperforms conventional single attention techniques in multi-speaker and emotional speech synthesis.
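As a rough illustration of the second approach described above, the sketch below adds a mutual-information term, estimated with a MINE statistics network, to a TTS training loss. It is a minimal sketch under assumed conventions (PyTorch-style code; the tensor names `style_emb` and `target_style_emb` and the loss weight are hypothetical), not the thesis implementation.

```python
# Minimal sketch: Donsker-Varadhan MI lower bound (MINE) added to a TTS loss.
import math
import torch
import torch.nn as nn

class MINE(nn.Module):
    """Statistics network T(x, z) for the Donsker-Varadhan lower bound on I(X; Z)."""
    def __init__(self, dim_x, dim_z, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_z, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def mi_lower_bound(self, x, z):
        joint = self.net(torch.cat([x, z], dim=-1)).mean()        # E over paired samples
        z_perm = z[torch.randperm(z.size(0))]                     # shuffle to break pairing
        marg = self.net(torch.cat([x, z_perm], dim=-1))
        log_mean_exp = torch.logsumexp(marg, dim=0) - math.log(marg.size(0))
        return joint - log_mean_exp.squeeze()

def style_loss(tts_loss, style_emb, target_style_emb, mine, weight=0.1):
    # Subtracting the MI estimate pushes target-style information into style_emb;
    # both the TTS model and the statistics network receive gradients from this term.
    return tts_loss - weight * mine.mi_lower_bound(style_emb, target_style_emb)
```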
๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ํ†ต๊ณ„์  ํŒŒ๋ผ๋ฏธํ„ฐ ๊ธฐ๋ฒ•์˜ ๊ฒฝ์šฐ ์Œํ–ฅ ๋ชจ๋ธ์˜ deterministicํ•œ ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋ธ๋ง ๋Šฅ๋ ฅ์˜ ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฉฐ, ์ข…๋‹จํ˜• ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ์Šคํƒ€์ผ์„ ํ‘œํ˜„ํ•˜๋Š” ๋Šฅ๋ ฅ๊ณผ ๊ฐ•์ธํ•œ ์–ดํ…์…˜(attention)์— ๋Œ€ํ•œ ์ด์Šˆ๊ฐ€ ๋Š์ž„์—†์ด ์žฌ๊ธฐ๋˜๊ณ  ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๊ธฐ์กด์˜ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์Œ์„ฑ ํ•ฉ์„ฑ ์‹œ์Šคํ…œ์˜ ๋‹จ์ ์„ ํ•ด๊ฒฐํ•  ์ƒˆ๋กœ์šด ๋Œ€์•ˆ์„ ์ œ์•ˆํ•œ๋‹ค. ์ฒซ ๋ฒˆ์งธ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ๋‰ด๋Ÿด ํ†ต๊ณ„์  ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฐฉ์‹์˜ ์Œํ–ฅ ๋ชจ๋ธ๋ง์„ ๊ณ ๋„ํ™”ํ•˜๊ธฐ ์œ„ํ•œ adversarially trained variational recurrent neural network (AdVRNN) ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. AdVRNN ๊ธฐ๋ฒ•์€ VRNN์„ ์Œ์„ฑ ํ•ฉ์„ฑ์— ์ ์šฉํ•˜์—ฌ ์Œ์„ฑ์˜ ๋ณ€ํ™”๋ฅผ stochastic ํ•˜๊ณ  ์ž์„ธํ•˜๊ฒŒ ๋ชจ๋ธ๋งํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์˜€๋‹ค. ๋˜ํ•œ, ์ ๋Œ€์  ํ•™์Šต์ (adversarial learning) ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•˜์—ฌ oversmoothing ๋ฌธ์ œ๋ฅผ ์ตœ์†Œํ™” ์‹œํ‚ค๋„๋ก ํ•˜์˜€๋‹ค. ์ด๋Ÿฌํ•œ ์ œ์•ˆ๋œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ธฐ์กด์˜ ์ˆœํ™˜ ์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜์˜ ์Œํ–ฅ ๋ชจ๋ธ๊ณผ ๋น„๊ตํ•˜์—ฌ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋จ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋‘ ๋ฒˆ์งธ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ์Šคํƒ€์ผ ์ ์‘ํ˜• ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•์„ ์œ„ํ•œ ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰ ๊ธฐ๋ฐ˜์˜ ์ƒˆ๋กœ์šด ํ•™์Šต ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๊ธฐ์กด์˜ global style token(GST) ๊ธฐ๋ฐ˜์˜ ์Šคํƒ€์ผ ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•์˜ ๊ฒฝ์šฐ, ๋น„์ง€๋„ ํ•™์Šต์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ ์›ํ•˜๋Š” ๋ชฉํ‘œ ์Šคํƒ€์ผ์ด ์žˆ์–ด๋„ ์ด๋ฅผ ์ค‘์ ์ ์œผ๋กœ ํ•™์Šต์‹œํ‚ค๊ธฐ ์–ด๋ ค์› ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด GST์˜ ์ถœ๋ ฅ๊ณผ ๋ชฉํ‘œ ์Šคํƒ€์ผ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ์˜ ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์„ ์ตœ๋Œ€ํ™” ํ•˜๋„๋ก ํ•™์Šต ์‹œํ‚ค๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ƒํ˜ธ ์ •๋ณด๋Ÿ‰์„ ์ข…๋‹จํ˜• ๋ชจ๋ธ์˜ ์†์‹คํ•จ์ˆ˜์— ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ mutual information neural estimator(MINE) ๊ธฐ๋ฒ•์„ ๋„์ž…ํ•˜์˜€๊ณ  ๋‹คํ™”์ž ๋ชจ๋ธ์„ ํ†ตํ•ด ๊ธฐ์กด์˜ GST ๊ธฐ๋ฒ•์— ๋น„ํ•ด ๋ชฉํ‘œ ์Šคํƒ€์ผ์„ ๋ณด๋‹ค ์ค‘์ ์ ์œผ๋กœ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค. ์„ธ๋ฒˆ์งธ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, ๊ฐ•์ธํ•œ ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ์˜ ์–ดํ…์…˜์ธ memory attention์„ ์ œ์•ˆํ•œ๋‹ค. Long-short term memory(LSTM)์˜ gating ๊ธฐ์ˆ ์€ sequence๋ฅผ ๋ชจ๋ธ๋งํ•˜๋Š”๋ฐ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์™”๋‹ค. ์ด๋Ÿฌํ•œ ๊ธฐ์ˆ ์„ ์–ดํ…์…˜์— ์ ์šฉํ•˜์—ฌ ๋‹ค์–‘ํ•œ ์Šคํƒ€์ผ์„ ๊ฐ€์ง„ ์Œ์„ฑ์—์„œ๋„ ์–ดํ…์…˜์˜ ๋Š๊น€, ๋ฐ˜๋ณต ๋“ฑ์„ ์ตœ์†Œํ™”ํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๋‹จ์ผ ํ™”์ž์™€ ๊ฐ์ • ์Œ์„ฑ ํ•ฉ์„ฑ ๊ธฐ๋ฒ•์„ ํ† ๋Œ€๋กœ memory attention์˜ ์„ฑ๋Šฅ์„ ํ™•์ธํ•˜์˜€์œผ๋ฉฐ ๊ธฐ์กด ๊ธฐ๋ฒ• ๋Œ€๋น„ ๋ณด๋‹ค ์•ˆ์ •์ ์ธ ์–ดํ…์…˜ ๊ณก์„ ์„ ์–ป์„ ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋งˆ์ง€๋ง‰ ์ ‘๊ทผ๋ฒ•์œผ๋กœ์„œ, selective multi-attention (SMA)์„ ํ™œ์šฉํ•œ ์Šคํƒ€์ผ ์ ์‘ํ˜• ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ ์–ดํ…์…˜ ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ๊ธฐ์กด์˜ ์Šคํƒ€์ผ ์ ์‘ํ˜• ์ข…๋‹จํ˜• ์Œ์„ฑ ํ•ฉ์„ฑ์˜ ์—ฐ๊ตฌ์—์„œ๋Š” ๋‚ญ๋…์ฒด ๋‹จ์ผํ™”์ž์˜ ๊ฒฝ์šฐ์™€ ๊ฐ™์€ ๋‹จ์ผ ์–ดํ…์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ์™”๋‹ค. ํ•˜์ง€๋งŒ ์Šคํƒ€์ผ ์Œ์„ฑ์˜ ๊ฒฝ์šฐ ๋ณด๋‹ค ๋‹ค์–‘ํ•œ ์–ดํ…์…˜ ํ‘œํ˜„์„ ์š”๊ตฌํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋‹ค์ค‘ ์–ดํ…์…˜์„ ํ™œ์šฉํ•˜์—ฌ ํ›„๋ณด๋“ค์„ ์ƒ์„ฑํ•˜๊ณ  ์ด๋ฅผ ์„ ํƒ ๋„คํŠธ์›Œํฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ตœ์ ์˜ ์–ดํ…์…˜์„ ์„ ํƒํ•˜๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. 
Table of contents:
1 Introduction: 1.1 Background; 1.2 Scope of thesis
2 Neural Speech Synthesis System: 2.1 Overview of a Neural Statistical Parametric Speech Synthesis System; 2.2 Overview of End-to-End Speech Synthesis System; 2.3 Tacotron2; 2.4 Attention Mechanism (2.4.1 Location Sensitive Attention; 2.4.2 Forward Attention; 2.4.3 Dynamic Convolution Attention)
3 Neural Statistical Parametric Speech Synthesis using AdVRNN: 3.1 Introduction; 3.2 Background (3.2.1 Variational Autoencoder; 3.2.2 Variational Recurrent Neural Network); 3.3 Speech Synthesis Using AdVRNN (3.3.1 AdVRNN-based Acoustic Modeling; 3.3.2 Training Procedure); 3.4 Experiments (3.4.1 Objective performance evaluation; 3.4.2 Subjective performance evaluation); 3.5 Summary
4 Speech Style Modeling Method using Mutual Information for End-to-End Speech Synthesis: 4.1 Introduction; 4.2 Background (4.2.1 Mutual Information; 4.2.2 Mutual Information Neural Estimator; 4.2.3 Global Style Token); 4.3 Style Token End-to-End Speech Synthesis using MINE; 4.4 Experiments; 4.5 Summary
5 Memory Attention: Robust Alignment using Gating Mechanism for End-to-End Speech Synthesis: 5.1 Introduction; 5.2 Background; 5.3 Memory Attention; 5.4 Experiments (5.4.1 Experiments on Single Speaker Speech Synthesis; 5.4.2 Experiments on Emotional Speech Synthesis); 5.5 Summary
6 Selective Multi-Attention for Style-Adaptive End-to-End Speech Synthesis: 6.1 Introduction; 6.2 Background; 6.3 Selective Multi-Attention Model; 6.4 Experiments (6.4.1 Multi-speaker speech synthesis experiments; 6.4.2 Experiments on Emotional Speech Synthesis); 6.5 Summary
7 Conclusions
Bibliography; Abstract in Korean; Acknowledgments
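The selective multi-attention described in the abstract above can be pictured with a short sketch: several attention modules propose candidate context vectors and a small selection network weights them. This is an assumed, simplified reading of the mechanism, not the thesis code; the interface attn(query, memory) -> (context, alignment) is hypothetical.

```python
# Minimal sketch of selective multi-attention: K candidate attentions, one selector.
import torch
import torch.nn as nn

class SelectiveMultiAttention(nn.Module):
    def __init__(self, attentions, query_dim):
        super().__init__()
        self.attentions = nn.ModuleList(attentions)            # e.g. location-sensitive, forward, ...
        self.selector = nn.Linear(query_dim, len(attentions))  # one score per candidate

    def forward(self, query, memory):
        # Each candidate attention returns (context, alignment) for this decoder step.
        candidates = [attn(query, memory) for attn in self.attentions]
        contexts = torch.stack([c for c, _ in candidates], dim=1)  # (B, K, D)
        weights = torch.softmax(self.selector(query), dim=-1)      # (B, K)
        # Soft selection; a hard argmax pick is another option at inference time.
        context = (weights.unsqueeze(-1) * contexts).sum(dim=1)    # (B, D)
        return context, weights
```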

    DESIGN ENHANCEMENT AND INTEGRATION OF A PROCESSOR-MEMORY INTERCONNECT NETWORK INTO A SINGLE-CHIP MULTIPROCESSOR ARCHITECTURE

    Get PDF
This thesis involves the modeling, design, Hardware Description Language (HDL) design capture, synthesis, implementation, and HDL virtual-prototype simulation validation of an interconnect network for a Hybrid Data/Command Driven Computer Architecture (HDCA) system. The HDCA is a single-chip, shared-memory multiprocessor architecture. Various candidate processor-memory interconnect topologies that may meet the requirements of the HDCA system are studied and evaluated for use within the HDCA system. The crossbar network topology is determined to best meet the HDCA system requirements, and it is therefore used as the processor-memory interconnect network of the HDCA system. Design capture, synthesis, implementation, and HDL simulation are done in VHDL using the XILINX ISE 6.2.3i and ModelSim 5.7g CAD tools. The design is first validated individually against several test cases, then integrated into the HDCA system and validated against two different applications. Including the crossbar switch in the HDCA architecture involved major modifications to the HDCA system and some minor changes to the design of the switch. Virtual-prototype testing of the HDCA executing applications over the crossbar interconnect showed that both the interconnect and the HDCA function properly. Inclusion of the interconnect into the HDCA now allows it to implement dynamic node-level reconfigurability and multiple forking functionality.
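For intuition, a behavioral model of a crossbar-style processor-memory interconnect can be sketched in a few lines. This is a generic illustration of crossbar arbitration, not the HDCA VHDL design; the fixed-priority conflict resolution and the port counts shown here are assumptions.

```python
# Behavioral sketch of a crossbar: any processor may reach any memory bank,
# with per-bank arbitration when two requests collide in the same cycle.
class Crossbar:
    def __init__(self, n_procs, n_banks):
        self.n_procs = n_procs
        self.n_banks = n_banks

    def arbitrate(self, requests):
        """requests: {proc_id: bank_id}. Returns {proc_id: bank_id} of granted
        connections; the lowest-numbered processor wins on a bank conflict."""
        grants, busy = {}, set()
        for proc in sorted(requests):
            bank = requests[proc]
            if bank not in busy:      # bank port still free this cycle
                grants[proc] = bank
                busy.add(bank)
        return grants

xbar = Crossbar(n_procs=4, n_banks=4)
print(xbar.arbitrate({0: 2, 1: 2, 2: 0, 3: 3}))  # {0: 2, 2: 0, 3: 3}
```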

    Individualized HRTFs From Few Measurements: a Statistical Learning Approach

    No full text
Virtual Auditory Space (VAS) refers to the synthesis and simulation of spatial hearing using earphones and/or a speaker system. High-fidelity VAS requires the use of individualized head-related transfer functions (HRTFs), which describe the acoustic filtering properties of the listener's external auditory periphery. HRTFs play an increasingly dominant role in implementing 3-D audio systems, which have been realized in some commercial applications. However, the cost of a 3-D audio system remains high because computational efficiency, memory requirements, and the synthesis of unmeasured HRTFs all still need to be improved. Because HRTFs are unique to each user, depending on their morphology, economically realistic synthesis of individualized HRTFs has to rely on some measurements. This paper presents a way to reduce the cost of a 3-D audio system using a statistical model that requires only a few measurements for each user.
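One common statistical-learning formulation of this problem is to learn, over a population HRTF database, a mapping from HRTFs measured at a few directions to the full set of directions, and then apply it to a new listener. The sketch below illustrates that idea with ridge regression on synthetic data; it is an assumption-level illustration, not the paper's exact model, and the database shape and chosen directions are invented.

```python
# Sketch: ridge regression from a few measured HRTF directions to all directions.
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_dir, n_freq = 50, 200, 64           # hypothetical database dimensions
db = rng.standard_normal((n_subj, n_dir, n_freq))

measured_dirs = [0, 40, 80, 120, 160]         # the "few measurements" per listener
X = db[:, measured_dirs, :].reshape(n_subj, -1)   # inputs: HRTFs at a few directions
Y = db.reshape(n_subj, -1)                        # targets: HRTFs at all directions

lam = 1e-2                                    # ridge regularization strength
W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def individualize(few_measurements):
    """few_measurements: (len(measured_dirs), n_freq) array for a new listener."""
    return (few_measurements.reshape(1, -1) @ W).reshape(n_dir, n_freq)

new_listener = rng.standard_normal((len(measured_dirs), n_freq))
print(individualize(new_listener).shape)      # (200, 64): full individualized set
```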

    From FPGA to ASIC: A RISC-V processor experience

    Get PDF
This work documents a correct design flow using these tools for the Lagarto RISC-V processor, together with the RTL design considerations that must be taken into account when moving from an FPGA design to an ASIC design.

    High-level synthesis under I/O Timing and Memory constraints

    Full text link
The design of complex systems-on-chip requires taking communication and memory-access constraints into account when integrating dedicated hardware accelerators. In this paper, we present a methodology and a tool that allow high-level synthesis of DSP algorithms under both I/O timing and memory constraints. Based on formal models and a generic architecture, the tool helps the designer find a reasonable trade-off between the required I/O timing behavior and the circuit's internal memory-access parallelism. The interest of our approach is demonstrated on a case study of an FFT algorithm.
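To illustrate the kind of constraint such a scheduler must respect, the sketch below performs a naive list scheduling in which memory operations compete for a limited number of memory ports per cycle. It is a generic, simplified illustration of scheduling under memory constraints, not the paper's methodology or tool; the operation set and port count are invented.

```python
# Naive list scheduling with a memory-port constraint (ASAP with port arbitration).
def schedule(ops, deps, mem_ports=1):
    """ops: {name: 'mem' or 'alu'}; deps: {name: [predecessor names]}.
    Returns {name: cycle}; each op takes one cycle, dependents start the next cycle."""
    cycle_of, cycle, remaining = {}, 0, dict(ops)
    while remaining:
        ready = [o for o in remaining
                 if all(p in cycle_of and cycle_of[p] < cycle for p in deps.get(o, []))]
        ports = mem_ports
        for op in ready:
            if ops[op] == 'mem':
                if ports == 0:
                    continue              # no memory port left this cycle; retry later
                ports -= 1
            cycle_of[op] = cycle
            del remaining[op]
        cycle += 1
    return cycle_of

# Tiny butterfly-like example: two loads, an add/sub pair, two stores.
ops = {'ld0': 'mem', 'ld1': 'mem', 'add': 'alu', 'sub': 'alu', 'st0': 'mem', 'st1': 'mem'}
deps = {'add': ['ld0', 'ld1'], 'sub': ['ld0', 'ld1'], 'st0': ['add'], 'st1': ['sub']}
print(schedule(ops, deps, mem_ports=1))
# {'ld0': 0, 'ld1': 1, 'add': 2, 'sub': 2, 'st0': 3, 'st1': 4}
```

With two memory ports instead of one, the same code schedules both loads (and both stores) in the same cycle, which is exactly the latency-versus-memory-parallelism trade-off the abstract refers to.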

    Qubit Data Structures for Analyzing Computing Systems

    Full text link
Qubit models and methods are proposed for improving the performance of software and hardware for analyzing digital devices by increasing the dimension of the data structures and memory. The basic concepts, terminology, and definitions necessary for implementing quantum computing when analyzing virtual computers are introduced. Investigation results concerning the design and modeling of computer systems in cyberspace based on a two-component structure are presented. Comment: 9 pages, 4 figures, Proceedings of the Third International Conference on Data Mining & Knowledge Management Process (CDKP 2014).
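As a loose illustration of the general idea of widening data structures for device analysis, the sketch below packs the behavior of a small logic function into a single integer so that comparisons between models become bitwise operations. This is a generic, assumption-level example, not the paper's formalism.

```python
# Encode the minterm set of an n-variable function as a 2**n-bit vector, so that
# set operations on behaviors become single bitwise operations.
def encode(minterms, n):
    """Pack a set of minterm indices of an n-variable function into one integer."""
    vec = 0
    for m in minterms:
        vec |= 1 << m
    return vec

n = 2
f_and = encode({0b11}, n)                 # AND(a, b): true only for input 11
f_or = encode({0b01, 0b10, 0b11}, n)      # OR(a, b)

union = f_and | f_or                      # behaviors allowed by either model
intersection = f_and & f_or               # behaviors common to both models
print(bin(union), bin(intersection))      # 0b1110 0b1000
```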

    PRESENCE: A human-inspired architecture for speech-based human-machine interaction

    No full text
Recent years have seen steady improvements in the quality and performance of speech-based human-machine interaction, driven by a significant convergence in the methods and techniques employed. However, the quantity of training data required to improve state-of-the-art systems seems to be growing exponentially, and performance appears to be asymptotic to a level that may be inadequate for many real-world applications. This suggests that there may be a fundamental flaw in the underlying architecture of contemporary systems, as well as a failure to capitalize on the combinatorial properties of human spoken language. This paper addresses these issues and presents a novel architecture for speech-based human-machine interaction inspired by recent findings in the neurobiology of living systems. Called PRESENCE ("PREdictive SENsorimotor Control and Emulation"), this new architecture blurs the distinction between the core components of a traditional spoken language dialogue system and instead focuses on a recursive hierarchical feedback control structure. Cooperative and communicative behavior emerges as a by-product of an architecture that is founded on a model of interaction in which the system has in mind the needs and intentions of the user, and the user has in mind the needs and intentions of the system.
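In the spirit of the recursive hierarchical feedback control structure described above, a toy sketch might look as follows. It is purely illustrative and not the PRESENCE implementation; the layer names, gains, and numbers are invented.

```python
# Toy recursive predictive-control hierarchy: each layer emulates the sensory
# consequence of its action and passes only the prediction error back up.
class PredictiveLayer:
    def __init__(self, name, gain=0.5):
        self.name, self.gain, self.prediction = name, gain, 0.0

    def act(self, goal):
        # Emulation: predict the sensory consequence of pursuing the goal.
        self.prediction = self.gain * goal
        return self.prediction

    def perceive(self, observation):
        # Feedback: only the mismatch between prediction and reality propagates.
        return observation - self.prediction

upper, lower = PredictiveLayer("dialogue"), PredictiveLayer("speech")
sub_goal = upper.act(goal=1.0)            # upper layer sets a goal for the lower layer
lower.act(sub_goal)                       # lower layer acts and forms its own prediction
error = lower.perceive(observation=0.4)   # residual error drives adaptation
print(sub_goal, round(error, 2))          # 0.5 0.15
```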
    • โ€ฆ
    corecore