    Studies on the System Sulphuric Acid-Water-Tri-n-Butyl Phosphate

    It is intended in this study of the system H₂SO₄-H₂O-TBP to determine the species formed in the equilibrated organic phase and to clarify the mechanism by which aqueous sulphuric acid is extracted into the organic phase. As in the previous report published in this Memoir, physico-chemical measurements of volume swelling, density, viscosity and electrical conductivity were carried out on the equilibrated organic phase, in addition to the conventional distribution measurements of sulphuric acid and water. It was found that the extracting species is [TBP·H₂O] at equilibrated acid concentrations in the aqueous phase below 2.0 M, and that three other species exist above 2.0 M: the one formed at lower acid concentration has the general formula [(TBP)₃·H₃O⁺(x+2)H₂O···HSO₄⁻] (x was determined as 2.5), the one formed at medium acid concentration is [TBP·H₃O⁺(x+2y/3)H₂O···HSO₄⁻] (y was determined as 0.25), and the one formed at higher acid concentration is [TBP·2{H₃O⁺(x+5y-3/6)H₂O···HSO₄⁻}]. These species dissociate partly. The activities and activity coefficients of the two species [TBP·H₂O] and [(TBP)₃·H₃O⁺4.5H₂O···HSO₄⁻], which are stable at lower acid concentrations, and the equilibrium constant between them were determined using Redlich-Kister equations.
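
    The Redlich-Kister treatment cited above represents the excess Gibbs energy of a binary mixture as a polynomial in the mole-fraction difference, from which activity coefficients follow by differentiation. The display below is a generic textbook sketch of that expansion, not the specific fit used in the study; the coefficients A_k are empirical parameters.

```latex
% Redlich-Kister expansion for a binary mixture (components 1 and 2);
% x_i are mole fractions, A_k are empirical coefficients fitted to data.
\frac{G^{E}}{RT} = x_1 x_2 \sum_{k=0}^{n} A_k \,(x_1 - x_2)^{k}
% Activity coefficients follow by partial differentiation; the
% one-parameter case (n = 0) reduces to
\ln\gamma_1 = A_0\, x_2^{2}, \qquad \ln\gamma_2 = A_0\, x_1^{2}
```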

    Response type selection for chat-like spoken dialog systems based on LSTM and multi-task learning

    We propose a method of automatically selecting appropriate responses in conversational spoken dialog systems by first explicitly determining the type of response that is needed, based on a comparison of the user’s input utterance with many other utterances. Response utterances are then generated based on this response type designation (back channel, changing the topic, expanding the topic, etc.). This allows the generation of more appropriate responses than conventional end-to-end approaches, which only use the user’s input to directly generate response utterances. As a response type selector, we propose an LSTM-based encoder–decoder framework utilizing acoustic and linguistic features extracted from input utterances. In order to extract these features more accurately, we utilize not only input utterances but also response utterances in the training corpus. To do so, multi-task learning using multiple decoders is also investigated. To evaluate our proposed method, we conducted experiments using a corpus of dialogs between elderly people and an interviewer. Our proposed method outperformed conventional methods using either a point-wise classifier based on Support Vector Machines, or a single-task learning LSTM. The best performance was achieved when our two response type selectors (one trained using acoustic features, and the other trained using linguistic features) were combined, and multi-task learning was also performed.
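
    As a rough illustration of the multi-task idea described above, the sketch below pairs a shared LSTM encoder with two heads: one classifying the response type and one predicting features of the response utterance, so that response utterances in the corpus also shape the encoder. This is a simplified stand-in (linear heads instead of the paper's decoders), and all layer sizes, feature dimensions, and the four response-type labels are invented placeholders.

```python
import torch
import torch.nn as nn

class MultiTaskResponseTypeSelector(nn.Module):
    """Shared LSTM encoder with two task heads (a rough sketch).

    Hypothetical sizes: 40-dim input features (acoustic or linguistic),
    4 response types (back channel, change topic, expand topic, other).
    """

    def __init__(self, feat_dim=40, hidden=128, n_types=4):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Main task: classify the response type from the final state.
        self.type_head = nn.Linear(hidden, n_types)
        # Auxiliary task: predict a feature vector of the response
        # utterance, so the encoder also learns from responses.
        self.resp_head = nn.Linear(hidden, feat_dim)

    def forward(self, x):
        _, (h, _) = self.encoder(x)        # h: (1, batch, hidden)
        h = h.squeeze(0)
        return self.type_head(h), self.resp_head(h)

model = MultiTaskResponseTypeSelector()
x = torch.randn(8, 50, 40)                 # batch of 50-frame utterances
type_logits, resp_pred = model(x)
# Both losses backpropagate into the shared encoder (multi-task learning).
loss = (nn.functional.cross_entropy(type_logits, torch.zeros(8, dtype=torch.long))
        + 0.5 * nn.functional.mse_loss(resp_pred, torch.randn(8, 40)))
loss.backward()
```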

    E2E SPEECH RECOGNITION WITH CTC AND LOCAL ATTENTION

    Many end-to-end, large vocabulary, continuous speech recognition systems are now able to achieve better speech recognition performance than conventional systems. However, most of these approaches are based on bidirectional networks and sequence-to-sequence modeling, so automatic speech recognition (ASR) systems using such techniques must wait for an entire segment of voice input to be entered before they can begin processing the data, resulting in a lengthy time lag that can be a serious drawback in some applications. An obvious solution to this problem is to develop a speech recognition algorithm capable of processing streaming data. Therefore, in this paper we explore the possibility of a streaming, online ASR system for Japanese using a model based on unidirectional LSTMs trained with the connectionist temporal classification (CTC) criterion, with local attention. Such an approach has not been well investigated for Japanese, as most Japanese-language ASR systems employ bidirectional networks. The best result achieved by our proposed system during experimental evaluation was a character error rate of 9.87%.
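
    To make the streaming setup concrete, here is a minimal sketch of a unidirectional LSTM acoustic model trained with PyTorch's CTC loss. The feature and vocabulary sizes are invented placeholders, and the local attention mechanism is omitted for brevity; only the causal (streamable) LSTM plus the CTC training step is shown.

```python
import torch
import torch.nn as nn

class StreamingCTCModel(nn.Module):
    """Unidirectional LSTM + CTC: frames can be processed as they arrive."""

    def __init__(self, feat_dim=80, hidden=256, vocab=100):
        super().__init__()
        # bidirectional=False (the default) keeps the model causal.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, vocab + 1)   # +1 for the CTC blank

    def forward(self, x):
        y, _ = self.lstm(x)
        return self.out(y).log_softmax(dim=-1)

model = StreamingCTCModel()
ctc = nn.CTCLoss(blank=0)

x = torch.randn(4, 120, 80)                # 4 utterances, 120 frames each
log_probs = model(x).transpose(0, 1)       # CTCLoss expects (T, N, C)
targets = torch.randint(1, 101, (4, 20))   # label ids; 0 is the blank
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 120),
           target_lengths=torch.full((4,), 20))
loss.backward()
```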

    Stability of boundary element methods for the two dimensional wave equation in time domain revisited

    This study considers the stability of time domain BEMs for the wave equation in 2D. We show that the question of stability of time domain BEMs reduces to a nonlinear eigenvalue problem related to frequency domain integral equations. We propose to solve this nonlinear eigenvalue problem numerically with the Sakurai-Sugiura method. After validating this approach numerically on the exterior Dirichlet problem, we proceed to transmission problems, in which we find that some time domain counterparts of “resonance-free” integral equations in the frequency domain lead to instability. We finally show that the proposed stability analysis helps to reformulate these equations to obtain stable numerical schemes.
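
    Schematically, and as a generic sketch rather than the paper's exact formulation: with A(s) denoting the frequency-domain boundary integral operator obtained by Laplace transform, stability hinges on whether the characteristic values s of A lie in the amplifying half-plane, and the Sakurai-Sugiura method locates them via contour integrals.

```latex
% Nonlinear eigenvalue problem (generic sketch): find s and v with
A(s)\, v = 0, \qquad v \neq 0 .
% A growing mode, e.g. one with \operatorname{Re} s > 0, signals an
% unstable time domain scheme.
% The Sakurai-Sugiura method recovers all eigenvalues inside a closed
% contour \Gamma from the moments (u, v are probing vectors)
\mu_k = \frac{1}{2\pi i} \oint_{\Gamma} z^{k}\, u^{H} A(z)^{-1} v \,\mathrm{d}z,
\qquad k = 0, 1, \dots, 2m-1,
% by solving a small eigenproblem built from Hankel matrices of the \mu_k.
```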

    Web-based environment for user generation of spoken dialog for virtual assistants

    In this paper, we develop a web-based spoken dialog generation environment that enables users to edit dialogs with a video virtual assistant and also to select the assistant’s 3D motions and tone of voice. In our proposed system, “anyone” can “easily” post and edit dialog content for the dialog system. The dialog type supported by the system is limited to question-and-answer dialogs, in order to avoid editing conflicts caused by multiple users editing simultaneously. The spoken dialog sharing service and FST generator generate spoken dialog content for the MMDAgent spoken dialog system toolkit, which includes a speech recognizer, a dialog control unit, a speech synthesizer, and a virtual agent. For dialog content creation, question-and-answer dialogs posted by users and FST templates are used. The proposed system was operated for more than a year in a student lounge at the Nagoya Institute of Technology, where users added more than 500 dialogs during the experiment. Images were registered with 65% of the postings, and the most common posting category was related to “animation, video games, manga.” The system was also subjected to open examination by tourist information staff who had no prior experience with spoken dialog systems. Based on their impressions of how tourists would use the dialog system, they shortened some of the system’s responses and added pauses to the longer responses to make them easier to understand.
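
    As an illustration of expanding posted question-and-answer pairs into dialog content through a template, here is a hypothetical sketch. The four-column transition layout (state, next state, input event, output command) and the event/command keywords are simplified stand-ins and do not reproduce MMDAgent's actual FST scenario syntax.

```python
# Hypothetical sketch: expand user-posted Q&A pairs into FST-style
# transition lines via a template. The keywords below are invented
# placeholders, not MMDAgent's real scenario format.
qa_pairs = [
    ("where is the cafeteria", "It is on the first floor."),
    ("what time do you close", "The lounge closes at 9 pm."),
]

def generate_fst(pairs, start_state=0):
    lines = []
    next_free = start_state + 1
    for question, answer in pairs:
        # Branch from the start state on each recognized question, speak
        # the answer, then return to the start state for the next turn.
        lines.append(f"{start_state} {next_free} RECOG|{question} SYNTH|{answer}")
        lines.append(f"{next_free} {start_state} SYNTH_DONE NONE")
        next_free += 1
    return "\n".join(lines)

print(generate_fst(qa_pairs))
```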

    MMDAE : Dialog scenario editor for MMDAgent on the web browser

    We have developed MMDAgent, a fully open-source toolkit for voice interaction systems, which runs on a variety of platforms such as personal computers and smartphones. Consequently, the environment for editing dialog scenarios also needs to run on various platforms, so we developed a scenario editor implemented in a Web browser. A further aim of this work is to make scenarios easier to edit. Experiments were conducted with subjects using the proposed scenario editor, and it was found that our proposed system provides better readability of scenarios and allows easier editing.

    Relationship between the effective duration of monosyllables and speech intelligibility in listeners with sensorineural hearing loss

    Among the temporal elements in the autocorrelation function, the effective duration (τe) is a useful indicator of speech recognition for patients with sensorineural hearing impairment. We assessed the influence of speech recognition performance on the relationship between the percentage of accurately perceived articulation and the median τe (τe-med), and on the relationship between monosyllabic confusion and the τe-med. Significant correlations were observed between the articulation percentage and the average τe-med in the high-, middle-, and low-speech-recognition-score (SRS) groups. Two-factor mixed analysis of variance revealed a significant main effect of condition (presentation vs. response). There was no significant main effect of group (high-, middle-, or low-SRS) and no significant interaction between condition and group. The average τe-med of the responses was significantly longer than that of the presentations in all three groups. Monosyllables with short τe-med values tended to be misheard as monosyllables with a long τe-med when confusion occurred. The τe-med was useful for estimating which monosyllables patients with sensorineural hearing impairment find easy to listen to, independent of speech recognition performance.

    Rights information: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
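
    For readers unfamiliar with the measure, the effective duration of the running autocorrelation function (ACF) is commonly defined as the delay at which the normalized ACF envelope decays to one tenth of its zero-lag value (-10 dB). The sketch below is a generic illustration of that definition using a crude envelope proxy; it is not the authors' analysis pipeline.

```python
import numpy as np

def effective_duration(signal, fs):
    """Estimate tau_e in ms: the lag at which the envelope of the
    normalized ACF first decays to 0.1 (-10 dB). A generic sketch of
    the common definition, not the authors' analysis code."""
    acf = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    acf = np.abs(acf) / acf[0]                 # normalized |ACF|, acf[0] == 1
    # Crude upper-envelope proxy: the max of |ACF| from each lag onward.
    env = np.maximum.accumulate(acf[::-1])[::-1]
    below = np.nonzero(env < 0.1)[0]
    return 1000.0 * below[0] / fs if below.size else None

fs = 16000
t = np.arange(0, 0.1, 1 / fs)
syllable_like = np.sin(2 * np.pi * 440 * t) * np.exp(-t / 0.01)  # decaying tone
print(effective_duration(syllable_like, fs))   # roughly the decay time in ms
```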

    Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation

    Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We evaluate three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). We also used 30 min of target language data for training in all three approaches, and for generating the augmented data used for training in methods 2 and 3. We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech output. We also compare single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data together with 30 min of target language data, and one using the entire 12 h of the original target language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target language dataset. Overall, we found that our proposed TTS system, consisting of a spectrogram prediction network and a PWG neural vocoder, was able to achieve reasonable performance using only 30 min of target language training data. We also found that, when 3 h of target language data was used both for training the model and for generating augmented data, our proposed TTS model achieved performance very similar to that of the baseline model, which was trained with 12 h of target language data.
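
    The cross-lingual transfer approach above amounts to pretraining the spectrogram predictor on a high-resource language and then fine-tuning it on the small target-language set. The sketch below illustrates the assumed shape of that procedure with an invented toy model; the class, the sizes, and the freshly initialized stand-in for a pretrained checkpoint are all illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

# Toy stand-in for a Tacotron2-style spectrogram predictor:
# phoneme ids in, 80-bin mel-spectrogram frames out.
class SpectrogramPredictor(nn.Module):
    def __init__(self, n_phonemes=60, hidden=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, phonemes):
        h, _ = self.rnn(self.embed(phonemes))
        return self.proj(h)

model = SpectrogramPredictor()

# Step 1: transfer. In practice you would torch.load() a checkpoint
# pretrained on a high-resource language (e.g. 24 h of English); a fresh
# model's state_dict stands in here so the sketch runs. strict=False
# tolerates a swapped-out phoneme embedding when inventories differ.
pretrained_state = SpectrogramPredictor().state_dict()
model.load_state_dict(pretrained_state, strict=False)

# Step 2: fine-tune on the small target-language set (e.g. 30 min),
# with a low learning rate to limit forgetting of the pretrained weights.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
phonemes = torch.randint(0, 60, (2, 30))       # toy batch of phoneme ids
target_mels = torch.randn(2, 30, 80)           # toy target spectrograms
loss = nn.functional.mse_loss(model(phonemes), target_mels)
loss.backward()
optimizer.step()
```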