A study on features for speaker recognition by ASAM model

Abstract

For multi-speaker recognition, deep-learning-based frameworks have made significant progress in separating multi-speaker mixed speech, but they still fail to provide satisfactory solutions in complex auditory scenes. A unified auditory selection framework with attention and memory (ASAM) can address this problem. First, the sound characteristics of a specific speaker are accumulated into a life-long memory during the training phase, while a speech perceptor is trained to extract temporal sound features and to update the memory online when new speech is perceived. The learned memory is then used to interact with the mixture input, attending to and filtering the target speaker's frequencies out of the mixture stream. Finally, the network is trained to minimize the reconstruction error of the attended speech. In this study, a single speaker's voice was extracted from a speech segment containing multiple speakers using the ASAM model, and speaker recognition was then performed with an LSTM neural network. In the LSTM network, three kinds of features, MFCC, GFCC, and GBFB, are used for speaker identification and the resulting recognition performance is compared.
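As a minimal illustration of the recognition stage described above (not taken from the paper itself), the sketch below extracts MFCC features with librosa and feeds them to a small LSTM classifier in PyTorch; the file name, speaker count, and hyperparameters are placeholder assumptions, and GFCC or GBFB features would replace the MFCC step in the same pipeline.

    # Illustrative sketch only: MFCC extraction + LSTM speaker classifier.
    # The ASAM separation stage is assumed to have produced a single-speaker
    # utterance already; GFCC/GBFB extraction would plug in at the same point.
    import librosa
    import torch
    import torch.nn as nn

    def mfcc_features(wav_path, n_mfcc=20):
        """Load a separated single-speaker utterance and return MFCC frames (T, n_mfcc)."""
        y, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T)
        return torch.tensor(mfcc.T, dtype=torch.float32)        # (T, n_mfcc)

    class SpeakerLSTM(nn.Module):
        """LSTM speaker classifier: feature frames -> speaker logits."""
        def __init__(self, n_feats=20, hidden=128, n_speakers=10):
            super().__init__()
            self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_speakers)

        def forward(self, x):            # x: (batch, T, n_feats)
            _, (h, _) = self.lstm(x)     # h: (1, batch, hidden)
            return self.out(h[-1])       # (batch, n_speakers)

    # Usage (hypothetical file): feats = mfcc_features("speaker_utt.wav").unsqueeze(0)
    #                            logits = SpeakerLSTM()(feats)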
