A Fast Learning Method for Multilayer Perceptrons in Automatic Speech Recognition Systems
We propose a fast learning method for multilayer perceptrons (MLPs) on large vocabulary continuous speech recognition (LVCSR) tasks. A pre-adjusting strategy, based on separation of the training data and a dynamic learning rate following a cosine function, is used to increase the accuracy of a stochastically initialized MLP. The weight matrices of the pre-adjusted MLP are then restructured by a method based on singular value decomposition (SVD), reducing the dimensionality of the MLP. A back-propagation (BP) algorithm adapted to the unfolded weight matrices is used to train the restructured MLP, reducing the time complexity of the learning process. Experimental results on LVCSR tasks indicate that, in comparison with the conventional learning method, this fast learning method achieves a speedup of around 2.0 times with improvements in both cross-entropy loss and frame accuracy. Moreover, it achieves a speedup of approximately 3.5 times with only a slight degradation in cross-entropy loss and frame accuracy. Since this method consumes less time and space than the conventional method, it is better suited to robots with hardware limitations.
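The SVD-based restructuring described above can be illustrated with a minimal sketch (not the authors' code, and with hypothetical layer sizes): a trained weight matrix is factored by truncated SVD into two smaller matrices, which replace the original dense layer and cut the parameter count from m*n to k*(m+n).

```python
import numpy as np

# Sketch: replace a dense layer W (m x n) with two smaller layers
# U_k (m x k) and V_k (k x n) obtained from a truncated SVD.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 2048))    # hypothetical trained weights

U, s, Vt = np.linalg.svd(W, full_matrices=False)

k = 256                                  # retained rank (a tuning choice)
U_k = U[:, :k] * s[:k]                   # first new layer:  (1024, 256)
V_k = Vt[:k, :]                          # second new layer: (256, 2048)

W_approx = U_k @ V_k                     # low-rank approximation of W
original_params = W.size                 # 1024 * 2048 = 2097152
restructured_params = U_k.size + V_k.size  # 262144 + 524288 = 786432
print(original_params, restructured_params)
```

Training then continues on the two factored matrices, which is the source of the reported reduction in time complexity.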
Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection
It has become urgent to design effective anti-spoofing algorithms for vulnerable automatic speaker verification systems, given the advancement of high-quality playback devices. Current studies mainly treat anti-spoofing as a binary classification problem between bona fide and spoofed utterances, but a lack of indistinguishable samples makes it difficult to train a robust spoofing detector. In this paper, we argue that anti-spoofing modeling should pay more attention to indistinguishable samples than to easily classified ones, making correct discrimination a top priority. Therefore, to mitigate the data discrepancy between training and inference, we propose to leverage a balanced focal loss function as the training objective, dynamically scaling the loss based on the traits of each sample. In addition, in the experiments we select three kinds of features that contain both magnitude-based and phase-based information to form complementary, informative inputs. Experimental results on the ASVspoof2019 dataset demonstrate the superiority of the proposed methods through comparison between our systems and top-performing ones. Systems trained with the balanced focal loss perform significantly better than those trained with conventional cross-entropy loss. With complementary features, our fusion system with only three kinds of features outperforms other systems containing five or more complex single models by 22.5% in min-tDCF and 7% in EER, achieving a min-tDCF of 0.0124 and an EER of 0.55%. Furthermore, we present and discuss evaluation results on real replay data, in addition to the simulated ASVspoof2019 data, indicating that anti-spoofing research still has a long way to go.
Comment: This work has been accepted by the 25th International Conference on Pattern Recognition (ICPR 2020).
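The dynamic scaling described in this abstract follows the standard focal-loss form: a modulating factor down-weights easily classified samples so hard (indistinguishable) ones dominate training. A minimal sketch for the binary bona fide vs. spoofed case follows; the alpha and gamma values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def balanced_focal_loss(probs, labels, alpha=0.25, gamma=2.0):
    """probs: predicted probability of the positive class, in (0, 1).
    labels: 0/1 ground truth. The (1 - p_t)^gamma factor shrinks the
    loss of confident predictions; alpha_t balances the two classes."""
    probs = np.clip(probs, 1e-7, 1 - 1e-7)
    p_t = np.where(labels == 1, probs, 1 - probs)
    alpha_t = np.where(labels == 1, alpha, 1 - alpha)
    return -(alpha_t * (1 - p_t) ** gamma * np.log(p_t)).mean()

# An easily classified sample contributes far less than a hard one:
easy = balanced_focal_loss(np.array([0.99]), np.array([1]))
hard = balanced_focal_loss(np.array([0.55]), np.array([1]))
print(easy < hard)  # → True
```

With gamma = 0 and alpha = 0.5 this reduces (up to a constant) to ordinary cross-entropy, which is the baseline the paper compares against.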
Comparison of Resting-State Brain Activation Detected by BOLD, Blood Volume and Blood Flow
Resting-state brain activity has been widely investigated using blood oxygenation level dependent (BOLD) contrast techniques. However, BOLD signal changes reflect a combination of the effects of cerebral blood flow (CBF), cerebral blood volume (CBV), and the cerebral metabolic rate of oxygen (CMRO2). In this study, resting-state brain activation was detected and compared using the following techniques: (a) BOLD, using a gradient-echo echo planar imaging (GE-EPI) sequence; (b) CBV-weighted signal, acquired using gradient and spin echo (GRASE) based vascular space occupancy (VASO); and (c) CBF, using pseudo-continuous arterial spin labeling (pCASL). Reliable brain networks were detected using VASO and ASL, including the sensorimotor, auditory, primary visual, higher visual, default mode, salience and left/right executive control networks. Differences between the resting-state activation detected with ASL, VASO and BOLD could potentially be due to the different temporal signal-to-noise ratio (tSNR) and the short post-labeling delay (PLD) in ASL, along with differences in the spin-echo readout of VASO. It is also possible that the dynamics of spontaneous fluctuations in BOLD, CBV and CBF differ for biological reasons, depending on their location within the brain.
Long Short-Term Memory Projection Recurrent Neural Network Architectures for Piano’s Continuous Note Recognition
Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) suited to time-series data, which has achieved good performance in speech recognition and image recognition. Long Short-Term Memory Projection (LSTMP) is a variant of LSTM that further optimizes the speed and performance of LSTM by adding a projection layer. As LSTM and LSTMP have performed well in pattern recognition, in this paper we combine them with Connectionist Temporal Classification (CTC) to study continuous piano-note recognition for robotics. Based on the Beijing Forestry University music library, we conduct experiments to measure the recognition rates and numbers of iterations of single-layer LSTM, single-layer LSTMP, and Deep LSTM (DLSTM, LSTM with multiple layers). The single-layer LSTMP proves to perform much better than the single-layer LSTM in both training time and recognition rate: LSTMP has fewer parameters, which reduces training time, and, benefiting from the projection layer, it also achieves better performance. The best recognition rate of LSTMP is 99.8%. As for DLSTM, the recognition rate reaches 100% thanks to the effectiveness of the deep structure, but compared with single-layer LSTMP, DLSTM needs more training time.
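The projection layer that distinguishes LSTMP from plain LSTM can be sketched as follows (an illustrative single time step with hypothetical sizes, not the paper's configuration): the n-dimensional hidden state is mapped by a projection matrix W_p to a smaller p-dimensional recurrent state r, so the recurrent weight matrix shrinks from (4n x n) to (4n x p).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_step(x, r_prev, c_prev, W, U, b, W_p):
    z = W @ x + U @ r_prev + b            # pre-activations for 4 gates
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)           # full hidden state (size n)
    r = W_p @ h                           # projected state (size p < n)
    return r, c

rng = np.random.default_rng(0)
n, p, d = 64, 16, 8                       # cell, projection, input sizes
W = rng.standard_normal((4 * n, d)) * 0.1
U = rng.standard_normal((4 * n, p)) * 0.1  # recurrence uses r, not h
b = np.zeros(4 * n)
W_p = rng.standard_normal((p, n)) * 0.1

r, c = lstmp_step(rng.standard_normal(d), np.zeros(p), np.zeros(n),
                  W, U, b, W_p)
print(r.shape, c.shape)  # (16,) (64,)
```

The smaller recurrent state is the source of the parameter and training-time savings the abstract reports.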
A Dual-Branch Speech Enhancement Model with Harmonic Repair
Recent speech enhancement studies have mostly focused on completely separating noise from human voices. Because previous studies lack structures specifically designed for harmonic fitting, and because of the limited receptive field of traditional convolutions, the auditory quality of the enhanced speech inevitably declines, degrading subsequent tasks such as speech recognition and speaker identification. To address these problems, this paper proposes a Harmonic Repair Large Frame enhancement model, called HRLF-Net, which uses a harmonic repair network for denoising, followed by a real-imaginary dual-branch structure for restoration. This approach fully utilizes harmonic overtones to match the original harmonic distribution of the speech. The subsequent dual-branch process restores the speech while specifically optimizing its perceptual quality for the human ear. Experiments show that with HRLF-Net, the intelligibility and quality of speech are significantly improved, and harmonic information is effectively restored.
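The real-imaginary dual-branch idea can be illustrated generically (this is not HRLF-Net's actual architecture): the complex short-time Fourier transform of the noisy signal is split into real and imaginary components, each handled by its own branch, and the branch outputs are recombined into a complex spectrum for inversion back to a waveform.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    # Simple Hann-windowed STFT for illustration.
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

x = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)  # toy 440 Hz tone
spec = stft(x)

real_branch_in = spec.real        # input to the real branch
imag_branch_in = spec.imag        # input to the imaginary branch

# Identity "branches" stand in for the learned enhancement networks:
enhanced = real_branch_in + 1j * imag_branch_in
print(np.allclose(enhanced, spec))  # → True
```

Operating on real and imaginary parts separately lets the model modify phase as well as magnitude, which is what allows harmonic structure to be restored rather than only suppressed noise energy.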