End-to-End real time tracking of children's reading with pointer network
In this work, we explore how a real-time reading tracker can be built
efficiently for children's voices. While previously proposed reading trackers
focused on ASR-based cascaded approaches, we propose a fully end-to-end model
that is less prone to lags in voice tracking. We employ a pointer network
that directly learns to predict positions in the ground truth text conditioned
on the streaming speech. To train this pointer network, we generate ground
truth training signals by using forced alignment between the read speech and
the text being read on the training set. Exploring different forced alignment
models, we find that a neural attention-based model is at least comparable in
alignment accuracy to the Montreal Forced Aligner and, surprisingly, provides a
better training signal for the pointer network. We report results on one adult
speech dataset (TIMIT) and two children's speech datasets (CMU Kids and Reading
Races). Our best model can accurately track adult speech with 87.8% accuracy
and the much harder, disfluent children's speech with 77.1% accuracy on the CMU
Kids dataset and 65.3% accuracy on the Reading Races dataset.
Comment: 5 pages, 3 figures
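
As an illustration of how forced-alignment output could be turned into a training signal for a pointer network, the following minimal sketch maps word end times to one target word index per audio frame, then scores a sequence of predicted positions against those targets. The function names, the 10 ms frame shift, and the frame-level definition of accuracy are assumptions made for this example; the abstract does not specify how targets or accuracy are computed.

```python
from bisect import bisect_right


def frame_targets(word_end_times, num_frames, frame_shift=0.01):
    """Return, for each frame, the index of the word being read.

    word_end_times: ascending end times (seconds) of each word, e.g. taken
        from a forced aligner's output (hypothetical input format).
    frame_shift: seconds between consecutive frames (assumed 10 ms).
    """
    if not word_end_times:
        raise ValueError("at least one aligned word is required")
    last_word = len(word_end_times) - 1
    targets = []
    for frame in range(num_frames):
        t = frame * frame_shift
        # Words whose end time is <= t are finished, so the count of
        # finished words is the index of the word currently being read.
        index = bisect_right(word_end_times, t)
        # Frames after the final word keep pointing at the final word.
        targets.append(min(index, last_word))
    return targets


def tracking_accuracy(predicted, targets):
    """Fraction of frames whose predicted position equals the target."""
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted, targets))
    return correct / len(targets)


if __name__ == "__main__":
    ends = [0.025, 0.055]  # two words, hypothetical alignment
    gold = frame_targets(ends, num_frames=6)
    print(gold, tracking_accuracy([0, 0, 1, 1, 1, 1], gold))
```

In this sketch a streaming model would emit one predicted index per frame, so tracking lag appears directly as frames where the prediction trails the target.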