Search CORE

161 research outputs found

Embedding-Based Speaker Adaptive Training of Deep Neural Networks

Author: Cui Xiaodong
Goel Vaibhava
Saon George
Publication date: 17/10/2017

An embedding-based speaker adaptive training (SAT) approach is proposed and investigated in this paper for deep neural network acoustic modeling. In this approach, speaker embedding vectors, which are constant for a given speaker, are mapped through a control network to layer-dependent element-wise affine transformations that canonicalize the internal feature representations at the output of the hidden layers of a main network. The control network that generates the speaker-dependent mappings is jointly estimated with the main network for the overall speaker adaptive acoustic modeling. Experiments on large vocabulary continuous speech recognition (LVCSR) tasks show that the proposed SAT scheme can yield superior performance over the widely used speaker-aware training using i-vectors with speaker-adapted input features.

arXiv.org e-Print Archive

Building competitive direct acoustics-to-word models for English conversational speech recognition

Author: Audhkhasi Kartik
Kingsbury Brian
Picheny Michael
Ramabhadran Bhuvana
Saon George
Publication date: 08/12/2017

Direct acoustics-to-word (A2W) models in the end-to-end paradigm have received increasing attention compared to conventional sub-word based automatic speech recognition models using phones, characters, or context-dependent hidden Markov model states. This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally trained language model, making training and decoding with such models simple. Prior work has shown that A2W models require orders of magnitude more training data in order to perform comparably to conventional models. Our work also showed this accuracy gap when using the English Switchboard-Fisher data set. This paper describes a recipe to train an A2W model that closes this gap and is on par with state-of-the-art sub-word based models. We achieve a word error rate of 8.8%/13.9% on the Hub5-2000 Switchboard/CallHome test sets without any decoder or language model. We find that model initialization, training data order, and regularization have the most impact on the A2W model performance. Next, we present a joint word-character A2W model that learns to first spell the word and then recognize it. This model provides a rich output to the user instead of simple word hypotheses, making it especially useful in the case of words unseen or rarely seen during training. Comment: Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 201

arXiv.org e-Print Archive

Direct Acoustics-to-Word Models for English Conversational Speech Recognition

Author: Audhkhasi Kartik
Nahamoo David
Picheny Michael
Ramabhadran Bhuvana
Saon George
Publication date: 22/03/2017

Recent work on end-to-end automatic speech recognition (ASR) has shown that the connectionist temporal classification (CTC) loss can be used to convert acoustics to phone or character sequences. Such systems are used with a dictionary and a separately trained language model (LM) to produce word sequences. However, they are not truly end-to-end in the sense of mapping acoustics directly to words without an intermediate phone representation. In this paper, we present the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks: Switchboard and CallHome. These models do not require an LM or even a decoder at run-time and hence recognize speech with minimal complexity. However, due to the large number of word output units, CTC word models require orders of magnitude more data to train reliably compared to traditional systems. We present some techniques to mitigate this issue. Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder, compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM. We also present rescoring results on CTC word model lattices to quantify the performance benefits of an LM, and contrast the performance of word and phone CTC models. Comment: Submitted to Interspeech-201

arXiv.org e-Print Archive

Start your engines: automobile exports, comparing India and China

Author: Miglani Smita
Ray Saon
Publication venue: International Growth Centre
Publication date: 12/07/2016

Relying much more heavily on domestically grown lead firms, India’s car manufacturing industry, in contrast to China’s, has benefited at a slower pace from global best practices.

LSE Research Online

GEMINI: A Generic Multi-Modal Natural Interface Framework for Videogames

Author: G. Saon
H. Sakoe
J. Lockman
L.A. Schwarz
M. Arantes
P.Y. Shih
T. Yamada
Publication date: 01/01/2013

In recent years, videogame companies have recognized the role of player engagement as a major factor in user experience and enjoyment. This encouraged a greater investment in new types of game controllers such as the WiiMote, Rock Band instruments and the Kinect. However, the native software of these controllers was not originally designed to be used in other game applications. This work addresses this issue by building a middleware framework which maps body poses or voice commands to actions in any game. This not only ensures a more natural and customized user experience but also defines an interoperable virtual controller. In this version of the framework, body poses and voice commands are recognized through the Kinect's built-in cameras and microphones, respectively. The acquired data is then translated into the native interaction scheme in real time using a lightweight method based on spatial restrictions. The system is also prepared to use Nintendo's Wiimote as an auxiliary and unobtrusive gamepad for physically or verbally impractical commands. System validation was performed by analyzing the performance of certain tasks and examining user reports. Both confirmed this approach as a practical and alluring alternative to the game's native interaction scheme. In sum, this framework provides a game-controlling tool that is totally customizable and very flexible, thus expanding the market of game consumers. Comment: WorldCIST'13 Internacional Conferenc

arXiv.org e-Print Archive

What explains India’s poor performance in garments exports: evidence from five clusters?

Author: Ray Saon
Publication date: 01/05/2019

In this paper, we study the Indian apparel industry to examine the effect of clusters on the sales of this industry. The data were collected through a primary survey in five garment clusters in India. The variable that is significant in explaining sales in most equations is technology, proxied by imported machinery. It has been argued that inter-firm linkages and linkages between firms, service providers and institutions are crucial for competitiveness, and that this is best achieved through a cluster. Studies on clusters have shown that some clusters have been able to deepen their inter-firm division of labour, raise their competitiveness and break into international markets. Agglomeration may arise from the specialization of a region in a particular industry where firms share common inputs or knowledge. We argue that the main reason for India’s poor performance in garments (compared to other South Asian countries such as Bangladesh) is the lack of proper clusters. The development of clusters in India has followed the ‘top down’ approach, and the natural process through which linkages are developed is yet to occur in most clusters.

Munich RePEc Personal Archive

Determining Program Study Using AHP with Dynamic Criterias and Weights Based on GIS-Mobile

Author: Kurniawan Mohammad Rizky
Kurniawati Ayuningtyas
Saon Sharifah
Publication venue: Universitas Negeri Malang
Publication date: 02/07/2020

This research aims to develop a decision support system based on GIS-Mobile Apps using the Analytical Hierarchy Process (AHP) algorithm and the softmax function for dynamic weights. The stages of AHP with dynamic criteria in this system are the preparation of a hierarchy, prioritization, consistency checking, and computation of the priority weights. The use of AHP in this system involves four criteria, which are keywords, department accreditation, college accreditation and college location distance, and which can be set dynamically by the user. Extreme Programming (XP) is the development model chosen by the authors for developing the system. The steps begin with planning, followed by design, coding, and testing. The result of this research is a GIS-Mobile App that determines a list of recommended study programs with the greatest weight based on the criteria entered by the user.

Portal Jurnal Elektronik Universitas Negeri Malang

Part I: The Construction of a Model-Locked Nd³⁺: Glass Laser and Non-Linear Optical Techniques. Part II: Applications of Picosecond Laser Pulses in Chemistry: Vibrational Relaxation Times in Liquid Alkanes and Alkenes

Author: Patumtevapibal Saon
Publication date: 01/01/1975

PART I. The construction and qualitative explanation of the pulsed, mode-locked laser are described: the generation of a train of picosecond 1.06 μ pulses is achieved by properly aligning a saturable absorber in the Nd³⁺: glass laser cavity. The pulsewidth, being on a picosecond time scale, has to be measured by a special two-photon method. In order to make the laser more chemically useful, second harmonic generation of the fundamental (1.06 μ) pulses is necessary. A phase-matched KDP crystal is employed in this process. Some non-linear optical techniques, such as stimulated Raman scattering and self-phase modulation, which generates continuum light from monochromatic pulses, also enrich the usage of the laser. An azulene experiment is attempted with our laser set-up. PART II. The dephasing times and vibrational lifetimes of C-H stretching vibrations are studied systematically in a series of liquid alkanes and alkenes, using the Raman effect. The results indicate that the vibrational energy loss takes place primarily through the methyl groups in these molecules. A preliminary result for the methylene C-H stretch vibrational lifetime is obtained in liquid CD3-CH2-CH2-CD3.

Caltech Theses and Dissertations