
    A Fast Learning Method for Multilayer Perceptrons in Automatic Speech Recognition Systems

    We propose a fast learning method for multilayer perceptrons (MLPs) on large vocabulary continuous speech recognition (LVCSR) tasks. A preadjusting strategy based on separation of training data and a dynamic learning rate following a cosine function is used to increase the accuracy of a stochastically initialized MLP. Weight matrices of the preadjusted MLP are then restructured by a method based on singular value decomposition (SVD), reducing the dimensionality of the MLP. A back propagation (BP) algorithm adapted to the unfolded weight matrices is used to train the restructured MLP, reducing the time complexity of the learning process. Experimental results indicate that on LVCSR tasks, in comparison with the conventional learning method, this fast learning method achieves a speedup of around 2.0 times while improving both cross-entropy loss and frame accuracy. Moreover, it achieves a speedup of approximately 3.5 times with only a slight degradation in cross-entropy loss and frame accuracy. Since this method consumes less time and space than the conventional method, it is better suited to robots with hardware limitations.
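    The SVD step above can be sketched in a few lines: a large weight matrix is factored and truncated to rank k, so one wide layer becomes two thin ones. The layer sizes and rank here are illustrative, not the paper's configuration.

    ```python
    import numpy as np

    # Hypothetical hidden-layer weight matrix W (m x n); k << min(m, n)
    rng = np.random.default_rng(0)
    m, n, k = 512, 512, 64
    W = rng.standard_normal((m, n))

    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = U[:, :k] * s[:k]   # m x k: output-side matrix of the new thin layer
    W2 = Vt[:k, :]          # k x n: input-side matrix of the new thin layer
    W_approx = W1 @ W2      # rank-k approximation of the original W

    params_before = m * n
    params_after = m * k + k * n
    print(params_before, params_after)  # the factored form has far fewer parameters
    ```

    Training then continues on W1 and W2 in place of W, which is where the reduced time complexity comes from.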

    Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection

    It has become urgent to design effective anti-spoofing algorithms for vulnerable automatic speaker verification systems, given the advancement of high-quality playback devices. Current studies mainly treat anti-spoofing as a binary classification problem between bonafide and spoofed utterances, while a lack of indistinguishable samples makes it difficult to train a robust spoofing detector. In this paper, we argue that for anti-spoofing, indistinguishable samples deserve more attention than easily-classified ones in the modeling process, so that correct discrimination becomes the top priority. Therefore, to mitigate the data discrepancy between training and inference, we propose to leverage a balanced focal loss function as the training objective, which dynamically scales the loss based on the traits of each sample. In addition, in the experiments, we select three kinds of features that contain both magnitude-based and phase-based information to form complementary and informative inputs. Experimental results on the ASVspoof2019 dataset demonstrate the superiority of the proposed methods in comparison with top-performing systems. Systems trained with the balanced focal loss perform significantly better than those trained with conventional cross-entropy loss. With complementary features, our fusion system with only three kinds of features outperforms other systems containing five or more complex single models by 22.5% for min-tDCF and 7% for EER, achieving a min-tDCF of 0.0124 and an EER of 0.55%. Furthermore, we present and discuss evaluation results on real replay data apart from the simulated ASVspoof2019 data, indicating that research on anti-spoofing still has a long way to go. (This work has been accepted by the 25th International Conference on Pattern Recognition, ICPR2020.)
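    The dynamic scaling described above can be illustrated with a minimal sketch of a class-balanced focal loss, assuming `p_true` is the predicted probability of the true class; the `alpha` and `gamma` values below are illustrative defaults, not the paper's tuned settings.

    ```python
    import numpy as np

    def balanced_focal_loss(p_true, alpha=0.25, gamma=2.0):
        p_true = np.clip(p_true, 1e-7, 1.0 - 1e-7)
        # (1 - p_t)^gamma down-weights easy, well-classified samples;
        # alpha rebalances the bonafide/spoofed class sizes.
        return -alpha * (1.0 - p_true) ** gamma * np.log(p_true)

    easy = balanced_focal_loss(np.array([0.95]))  # well-classified sample
    hard = balanced_focal_loss(np.array([0.30]))  # indistinguishable sample
    print(easy, hard)  # the hard sample contributes far more loss
    ```

    With `gamma = 0` and `alpha = 1` this reduces to ordinary cross-entropy, which is why the focal loss can be seen as a reweighted generalization of it.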

    Comparison of Resting-State Brain Activation Detected by BOLD, Blood Volume and Blood Flow

    Resting-state brain activity has been widely investigated using blood oxygenation level dependent (BOLD) contrast techniques. However, BOLD signal changes reflect a combination of the effects of cerebral blood flow (CBF), cerebral blood volume (CBV), and the cerebral metabolic rate of oxygen (CMRO2). In this study, resting-state brain activation was detected and compared using the following techniques: (a) BOLD, using a gradient-echo echo planar imaging (GE-EPI) sequence; (b) CBV-weighted signal, acquired using gradient and spin echo (GRASE) based vascular space occupancy (VASO); and (c) CBF, using pseudo-continuous arterial spin labeling (pCASL). Reliable brain networks were detected using VASO and ASL, including sensorimotor, auditory, primary visual, higher visual, default mode, salience and left/right executive control networks. Differences between the resting-state activation detected with ASL, VASO and BOLD could potentially be due to the different temporal signal-to-noise ratio (tSNR) and the short post-labeling delay (PLD) in ASL, along with differences in the spin-echo readout of VASO. It is also possible that the dynamics of spontaneous fluctuations in BOLD, CBV and CBF differ for biological reasons, depending on their location within the brain.
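    The tSNR comparison invoked above is simple to state: for each voxel's time series, tSNR is the temporal mean divided by the temporal standard deviation. The sketch below uses synthetic data with arbitrary noise levels purely to illustrate the computation, not real BOLD or ASL statistics.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    n_voxels, n_timepoints = 1000, 200
    baseline = 100.0

    # Synthetic time series: one "low-noise" and one "noisy" modality
    bold_like = baseline + rng.normal(0.0, 1.0, (n_voxels, n_timepoints))
    asl_like = baseline + rng.normal(0.0, 5.0, (n_voxels, n_timepoints))

    def tsnr(ts):
        # per-voxel temporal mean / temporal standard deviation
        return ts.mean(axis=1) / ts.std(axis=1)

    print(tsnr(bold_like).mean(), tsnr(asl_like).mean())
    ```

    A lower tSNR makes spontaneous fluctuations harder to separate from noise, which is one candidate explanation for the weaker network detection in ASL.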

    Long Short-Term Memory Projection Recurrent Neural Network Architectures for Piano’s Continuous Note Recognition

    Long Short-Term Memory (LSTM) is a kind of Recurrent Neural Network (RNN) for time series, which has achieved good performance in speech recognition and image recognition. Long Short-Term Memory Projection (LSTMP) is a variant of LSTM that further optimizes speed and performance by adding a projection layer. As LSTM and LSTMP have performed well in pattern recognition, in this paper we combine them with Connectionist Temporal Classification (CTC) to study continuous piano note recognition for robotics. Based on the Beijing Forestry University music library, we conduct experiments to compare recognition rates and numbers of iterations for single-layer LSTM, single-layer LSTMP, and Deep LSTM (DLSTM, LSTM with multiple layers). As a result, the single-layer LSTMP performs much better than the single-layer LSTM in both training time and recognition rate; that is, LSTMP has fewer parameters and therefore reduces training time, and, benefiting from the projection layer, it also achieves better performance. The best recognition rate of LSTMP is 99.8%. As for DLSTM, the recognition rate can reach 100% thanks to the effectiveness of the deep structure, but compared with the single-layer LSTMP, DLSTM needs more training time.
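    Why the projection layer shrinks the model can be seen from the parameter counts: the recurrent connections of an LSTMP cell come from the smaller projected state rather than the full cell state. The cell and projection sizes below are illustrative, not the paper's configuration.

    ```python
    def lstm_params(n_in, n_cell):
        # 4 gates, each with input weights, recurrent weights, and a bias
        return 4 * (n_cell * (n_in + n_cell) + n_cell)

    def lstmp_params(n_in, n_cell, n_proj):
        # recurrent weights now act on the projected state (n_proj < n_cell),
        # plus the cell-to-projection matrix itself
        return 4 * (n_cell * (n_in + n_proj) + n_cell) + n_cell * n_proj

    full = lstm_params(n_in=128, n_cell=1024)
    projected = lstmp_params(n_in=128, n_cell=1024, n_proj=256)
    print(full, projected)  # 4722688 vs 1839104: far fewer parameters with projection
    ```

    Fewer parameters per step is what cuts the training time, while the extra projection matrix gives the model a separate knob for the size of its recurrent state.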

    A Dual-Branch Speech Enhancement Model with Harmonic Repair

    Recent speech enhancement studies have mostly focused on completely separating noise from human voices. Due to the lack of specific structures for harmonic fitting in previous studies and the limitations of the traditional convolutional receptive field, there is an inevitable decline in the auditory quality of the enhanced speech, leading to decreased performance in subsequent tasks such as speech recognition and speaker identification. To address these problems, this paper proposes a Harmonic Repair Large Frame enhancement model, called HRLF-Net, that uses a harmonic repair network for denoising, followed by a real-imaginary dual-branch structure for restoration. This approach fully utilizes the harmonic overtones to match the original harmonic distribution of speech. In the subsequent branch process, it restores the speech, specifically optimizing its auditory quality for the human ear. Experiments show that with HRLF-Net, the intelligibility and quality of speech are significantly improved, and harmonic information is effectively restored.
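    The real-imaginary split behind such a dual-branch structure can be sketched as follows: a complex spectrum is processed as two real-valued branches and recombined, losing nothing in the split itself. This is only a representation sketch with an identity "enhancement", not HRLF-Net itself, and the frame size is arbitrary.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    frame = rng.standard_normal(512)       # one windowed speech frame (synthetic)
    spectrum = np.fft.rfft(frame)          # complex spectrum of the frame

    real_branch = spectrum.real            # branch 1: real part
    imag_branch = spectrum.imag            # branch 2: imaginary part

    # each branch would be enhanced by its own network; here identity pass-through
    enhanced = real_branch + 1j * imag_branch
    reconstructed = np.fft.irfft(enhanced, n=512)
    print(np.allclose(reconstructed, frame))  # True: the split itself is lossless
    ```

    Operating on real and imaginary parts (rather than magnitude alone) lets a model repair phase as well as magnitude, which matters for perceived auditory quality.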
