Current Foreign Accent Conversion (FAC) models rely on deep neural network
architectures and on ensembles of neural networks for speech recognition and
speech generation. Their use is limited by architectural constraints: they do
not allow flexible changes to the timbre of the generated speech, and they must
accumulate context, which increases generation latency and makes these systems
unsuitable for real-time multi-user communication scenarios. We have developed
a non-autoregressive model for real-time accent conversion with voice cloning.
The model generates native-sounding L1 speech with minimal latency from
L2-accented input speech. It consists of interconnected modules for
extracting accent, gender, and speaker embeddings, converting speech,
generating spectrograms, and decoding the resulting spectrogram into an audio
signal. The model can preserve, clone, and modify the timbre, gender, and
accent of the speaker's voice in real time. Objective evaluation shows that the
model improves speech quality, which in turn improves recognition performance
in existing ASR systems. The results of subjective
tests show that the proposed accent and gender encoder improves the generation
quality. The developed model demonstrates high-quality low-latency accent
conversion, voice cloning, and speech enhancement capabilities, making it
suitable for real-time multi-user communication scenarios.

Comment: 8 pages, 6 figures, 3 tables