Target speaker extraction aims to extract the target speaker's speech from a
mixture of multiple speakers by exploiting auxiliary information about the
target speaker. In this paper, we consider a complete time-domain target speaker
extraction system consisting of a speaker embedder network and a speaker
separator network which are jointly trained in an end-to-end learning process.
We propose two different architectures for the speaker separator network which
are based on the convolution-augmented transformer (Conformer). The first
architecture uses stacks of conformer and external feed-forward blocks
(Conformer-FFN), while the second architecture uses stacks of temporal
convolutional network (TCN) and conformer blocks (TCN-Conformer). Experimental
results for 2-speaker mixtures, 3-speaker mixtures, and noisy mixtures of
2-speakers show that among the proposed separator networks, the TCN-Conformer
significantly improves the target speaker extraction performance compared to
the Conformer-FFN and a TCN-based baseline system.

Comment: submitted to IWAENC 202
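The two separator architectures differ only in which blocks are stacked and in what order. A minimal sketch of this stacking pattern is given below; the block classes are hypothetical placeholders (they only record which block ran), not implementations of the actual Conformer, feed-forward, or TCN layers described in the paper.

```python
# Hypothetical sketch of the two separator stacking patterns:
# Conformer-FFN interleaves conformer and external feed-forward blocks,
# while TCN-Conformer interleaves TCN and conformer blocks.
# Block is a placeholder that tags a trace instead of transforming features.

class Block:
    def __init__(self, name):
        self.name = name

    def __call__(self, trace):
        # A real block would transform a feature tensor; here we just
        # record the block type to make the stacking order visible.
        trace.append(self.name)
        return trace


def build_stack(pattern, repeats):
    """Repeat a block-type pattern, e.g. ("conformer", "ffn") x repeats."""
    return [Block(name) for _ in range(repeats) for name in pattern]


def run(stack, trace):
    for block in stack:
        trace = block(trace)
    return trace


# Conformer-FFN: stacks of conformer and external feed-forward blocks.
conformer_ffn = build_stack(("conformer", "ffn"), repeats=2)
# TCN-Conformer: stacks of TCN and conformer blocks.
tcn_conformer = build_stack(("tcn", "conformer"), repeats=2)

print(run(conformer_ffn, []))  # ['conformer', 'ffn', 'conformer', 'ffn']
print(run(tcn_conformer, []))  # ['tcn', 'conformer', 'tcn', 'conformer']
```

The number of repeats and the internal structure of each block are design choices of the respective architectures; this sketch only illustrates the composition order that distinguishes the two proposed separators.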