Simultaneous speech translation (SimulST) is a challenging task aiming to
translate streaming speech before the complete input is observed. A SimulST
system generally includes two components: the pre-decision that aggregates the
speech information and the policy that decides to read or write. While recent
works had proposed various strategies to improve the pre-decision, they mainly
adopt the fixed wait-k policy, leaving the adaptive policies rarely explored.
This paper proposes to model the adaptive policy by adapting the Continuous
Integrate-and-Fire (CIF). Compared with monotonic multihead attention (MMA),
our method has the advantage of simpler computation, superior quality at low
latency, and better generalization to long utterances. We conduct experiments
on the MuST-C V2 dataset and show the effectiveness of our approach.Comment: Submitted to INTERSPEECH 202