Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis.
Deep neural networks (DNNs) use a cascade of hidden representations to enable the learning of complex mappings from input to output features. They are able to learn the complex mapping from text-based linguistic features to speech acoustic features, and so perform text-to-speech synthesis. Recent results suggest that DNNs can produce more natural synthetic speech than conventional HMM-based statistical parametric systems. In this paper, we show that the hidden representation used within a DNN can be improved through the use of Multi-Task Learning, and that stacking multiple frames of hidden-layer activations (stacked bottleneck features) also leads to improvements. Experimental results confirmed the effectiveness of the proposed methods, and in listening tests we find that stacked bottleneck features in particular offer a significant improvement over both a baseline DNN and a benchmark HMM system.
Index Terms — Speech synthesis, acoustic model, multi-task learning, deep neural network, bottleneck features
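The two ideas in this abstract can be sketched in code: a shared bottleneck layer feeding both a primary acoustic head and an auxiliary multi-task head, and the stacking of bottleneck activations from neighbouring frames for a second-stage network. The following is a minimal numpy sketch; all layer sizes and weights are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- not from the paper.
n_frames, d_ling, d_bn, d_acoustic, d_aux = 20, 30, 8, 12, 5

# Shared bottleneck layer, plus two task heads: a primary
# acoustic-feature head and an auxiliary (secondary-task) head.
W_bn = rng.standard_normal((d_ling, d_bn)) * 0.1
W_main = rng.standard_normal((d_bn, d_acoustic)) * 0.1
W_aux = rng.standard_normal((d_bn, d_aux)) * 0.1

X = rng.standard_normal((n_frames, d_ling))   # linguistic input features
H = np.tanh(X @ W_bn)                         # bottleneck activations
Y_main = H @ W_main                           # acoustic features (task 1)
Y_aux = H @ W_aux                             # auxiliary targets (task 2)

# "Stacked bottleneck features": concatenate activations from a window
# of neighbouring frames (here +/-1) to feed a second-stage network.
def stack_frames(H, context=1):
    padded = np.pad(H, ((context, context), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i:i + len(H)] for i in range(2 * context + 1)], axis=1)

H_stacked = stack_frames(H, context=1)
print(H_stacked.shape)   # (20, 24): 3 frames x 8 bottleneck dims
```

In practice both heads would be trained jointly by backpropagation; the sketch only shows the forward shapes involved.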
Sampling-based speech parameter generation using moment-matching networks
This paper presents sampling-based speech parameter generation using
moment-matching networks for Deep Neural Network (DNN)-based speech synthesis.
Although people never produce exactly the same speech even if we try to express
the same linguistic and para-linguistic information, typical statistical speech
synthesis produces completely the same speech, i.e., there is no
inter-utterance variation in synthetic speech. To give synthetic speech natural
inter-utterance variation, this paper builds DNN acoustic models that make it
possible to randomly sample speech parameters. The DNNs are trained so that
they make the moments of generated speech parameters close to those of natural
speech parameters. Since the variation of speech parameters is compressed into
a low-dimensional simple prior noise vector, our algorithm has lower
computation cost than direct sampling of speech parameters. As the first step
towards generating synthetic speech that has natural inter-utterance variation,
this paper investigates whether or not the proposed sampling-based generation
deteriorates synthetic speech quality. In evaluation, we compare speech quality
of conventional maximum likelihood-based generation and proposed sampling-based
generation. The result demonstrates that the proposed generation causes no
degradation in speech quality.
Comment: Submitted to INTERSPEECH 201
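The moment-matching idea can be illustrated with a maximum mean discrepancy (MMD) criterion of the kind moment-matching networks minimise, so that generated samples share their moments with natural data. A minimal numpy sketch follows; the sample sizes and kernel bandwidth are chosen arbitrarily for illustration.

```python
import numpy as np

def gaussian_mmd2(X, Y, sigma=1.0):
    """Squared maximum mean discrepancy with a Gaussian kernel.
    Small values mean the two samples have similar distributions."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(1)
natural = rng.normal(0.0, 1.0, size=(200, 4))   # "natural" parameters
matched = rng.normal(0.0, 1.0, size=(200, 4))   # same distribution
shifted = rng.normal(3.0, 1.0, size=(200, 4))   # mismatched samples

# Matched samples yield a much smaller MMD than mismatched ones, so
# minimising MMD drives generated parameters toward natural statistics.
assert gaussian_mmd2(natural, matched) < gaussian_mmd2(natural, shifted)
```

A generator network trained against such a criterion can then be driven by a low-dimensional prior noise vector, which is what makes sampling cheap.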
Transfer Learning for Speech and Language Processing
Transfer learning is a vital technique that generalizes models trained for
one setting or task to other settings or tasks. For example in speech
recognition, an acoustic model trained for one language can be used to
recognize speech in another language, with little or no re-training data.
Transfer learning is closely related to multi-task learning (cross-lingual vs.
multilingual), and has traditionally been studied under the name of 'model
adaptation'. Recent advances in deep learning show that transfer learning becomes much
easier and more effective with high-level abstract features learned by deep
models, and the `transfer' can be conducted not only between data distributions
and data types, but also between model structures (e.g., shallow nets and deep
nets) or even model types (e.g., Bayesian models and neural models). This
review paper summarizes some recent prominent research towards this direction,
particularly for speech and language processing. We also report some results
from our group and highlight the potential of this very interesting research
field.
Comment: 13 pages, APSIPA 201
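The feature-transfer setting the review describes can be sketched as reusing a frozen hidden layer from a "source" model and fitting only a small head on a data-poor target task. A minimal numpy illustration; the model, sizes, and data are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend "source" model: a hidden layer learned on a source task,
# kept frozen when transferred to the target task.
d_in, d_hidden, d_out = 10, 16, 3
W_src = rng.standard_normal((d_in, d_hidden)) * 0.5

def features(X):
    return np.tanh(X @ W_src)   # high-level features to transfer

# Target task with little data: fit only a small linear head on top of
# the transferred features (closed-form least squares), W_src frozen.
X_tgt = rng.standard_normal((40, d_in))
Y_tgt = rng.standard_normal((40, d_out))
H = features(X_tgt)
W_head, *_ = np.linalg.lstsq(H, Y_tgt, rcond=None)

pred = H @ W_head
print(pred.shape)   # (40, 3)
```

Only the small head is estimated from target data, which is why transfer works with little or no re-training data for the shared layers.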
Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs
Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity in both industry and academia. Special interest surrounds Convolutional Neural Networks (CNNs), which take inspiration from the hierarchical structure of the visual cortex to form deep layers of convolutional operations along with fully connected classifiers. Hardware implementations of these deep CNN architectures are challenged by memory bottlenecks: the many convolutional and fully connected layers demand a large amount of communication for parallel computation. Multi-core CPU-based solutions have proven inadequate for this problem due to the memory wall and low parallelism. Many-core GPU architectures show superior performance, but they consume high power and also have memory constraints due to inconsistencies between cache and main memory. FPGA design solutions are also actively being explored, as they allow implementing the memory hierarchy using embedded BlockRAM. This boosts the parallel use of shared memory elements between multiple processing units, avoiding data replication and inconsistencies, and makes FPGAs potentially powerful solutions for real-time classification with CNNs. Both Altera and Xilinx have adopted the OpenCL co-design framework from GPU development as a pseudo-automatic development solution for FPGA designs. In this paper, a comprehensive evaluation and comparison of the Altera and Xilinx OpenCL frameworks for a 5-layer deep CNN is presented. Hardware resources, temporal performance, and the OpenCL architecture for CNNs are discussed. Xilinx demonstrates faster synthesis, better FPGA resource utilization, and more compact boards. Altera provides multi-platform tools, a mature design community, and better execution times.
Improving Sequence-to-Sequence Acoustic Modeling by Adding Text-Supervision
This paper presents methods of making use of text supervision to improve
the performance of sequence-to-sequence (seq2seq) voice conversion. Compared
with conventional frame-to-frame voice conversion approaches, the seq2seq
acoustic modeling method proposed in our previous work achieved higher
naturalness and similarity. In this paper, we further improve its performance
by utilizing the text transcriptions of parallel training data. First, a
multi-task learning structure is designed which adds auxiliary classifiers to
the middle layers of the seq2seq model and predicts linguistic labels as a
secondary task. Second, a data-augmentation method is proposed which utilizes
text alignment to produce extra parallel sequences for model training.
Experiments are conducted to evaluate our proposed methods with training sets of
different sizes. Experimental results show that multi-task learning with
linguistic labels is effective at reducing the errors of seq2seq voice
conversion. The data-augmentation method can further improve the performance of
seq2seq voice conversion when only 50 or 100 training utterances are available.
Comment: 5 pages, 4 figures, 2 tables. Submitted to IEEE ICASSP 201
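The multi-task objective described above, a primary acoustic loss plus an auxiliary linguistic-label classifier attached to a middle layer, can be sketched as a weighted loss sum. A minimal numpy sketch; the layer sizes, task weight, and data are all hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(3)
T, d_hid, d_acoustic, n_labels = 6, 8, 5, 4

H_mid = rng.standard_normal((T, d_hid))         # middle-layer activations
W_dec = rng.standard_normal((d_hid, d_acoustic))
W_cls = rng.standard_normal((d_hid, n_labels))  # auxiliary classifier

acoustic_target = rng.standard_normal((T, d_acoustic))
labels = rng.integers(0, n_labels, size=T)      # linguistic labels per step

# Primary loss: MSE on the acoustic features.
mse = ((H_mid @ W_dec - acoustic_target) ** 2).mean()
# Secondary loss: cross-entropy of the auxiliary linguistic classifier.
probs = softmax(H_mid @ W_cls)
ce = -np.log(probs[np.arange(T), labels]).mean()

lam = 0.1                    # hypothetical auxiliary-task weight
total = mse + lam * ce
```

Backpropagating `total` pushes the middle-layer representation to encode linguistic content as well as acoustics, which is the intended regularisation effect.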