Taking long-term spectral and temporal dependencies into account is essential for automatic piano transcription, especially for determining the precise onset and offset of each note in polyphonic piano content. To this end, we rely on the capability of the self-attention mechanism in Transformers to capture these long-term dependencies along the frequency and time axes. In this work, we propose hFT-Transformer, an automatic music
transcription method that uses a two-level hierarchical frequency-time
Transformer architecture. The first hierarchy consists of a convolutional block along the time axis, a Transformer encoder along the frequency axis, and a Transformer decoder that converts the dimension along the frequency axis. Its output is then fed into the second hierarchy, which consists of another Transformer encoder along
the time axis. We evaluated our method on the widely used MAPS and MAESTRO v3.0.0 datasets, and it achieved state-of-the-art F1 scores on all four metrics: Frame, Note, Note with Offset, and Note with Offset and Velocity estimation.
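
To make the two-level hierarchy concrete, below is a minimal PyTorch sketch of the data flow, assuming a spectrogram input of shape (batch, time, freq). All module names, sizes, and hyperparameters (HFTSketch, d_model, and so on) are illustrative assumptions rather than the authors' implementation; positional encodings and the paper's full set of outputs (onset, offset, frame, velocity) are omitted, and the learned-query decoder is one reading of "converts the dimension along the frequency axis" as a mapping from frequency bins to the 88 piano pitches.

import torch
import torch.nn as nn

class HFTSketch(nn.Module):
    def __init__(self, n_pitch=88, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        # Convolutional block along the time axis, applied per frequency bin.
        self.conv = nn.Conv1d(1, d_model, kernel_size=5, padding=2)
        # First hierarchy: self-attention along the frequency axis.
        self.freq_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Learned queries let the decoder convert frequency bins to pitches.
        self.pitch_queries = nn.Parameter(torch.randn(n_pitch, d_model))
        self.freq_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Second hierarchy: self-attention along the time axis, per pitch.
        self.time_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        self.head = nn.Linear(d_model, 1)  # single frame-activation head
        self.n_pitch = n_pitch

    def forward(self, spec):
        # spec: (batch B, time T, freq F) spectrogram.
        B, T, F = spec.shape
        # Convolve along time, folding the frequency axis into the batch.
        x = spec.permute(0, 2, 1).reshape(B * F, 1, T)
        x = self.conv(x)                                 # (B*F, d_model, T)
        x = x.reshape(B, F, -1, T).permute(0, 3, 1, 2)   # (B, T, F, d_model)
        # Frequency-axis attention, folding the time axis into the batch.
        x = self.freq_encoder(x.reshape(B * T, F, -1))   # (B*T, F, d_model)
        # Pitch queries attend over frequency bins: F bins -> n_pitch pitches.
        q = self.pitch_queries.unsqueeze(0).expand(B * T, -1, -1)
        x = self.freq_decoder(q, x)                      # (B*T, P, d_model)
        # Time-axis attention, folding the pitch axis into the batch.
        P = self.n_pitch
        x = x.reshape(B, T, P, -1).permute(0, 2, 1, 3).reshape(B * P, T, -1)
        x = self.time_encoder(x)                         # (B*P, T, d_model)
        return self.head(x).reshape(B, P, T).permute(0, 2, 1)  # (B, T, P)

model = HFTSketch()
logits = model(torch.randn(2, 64, 256))  # -> torch.Size([2, 64, 88])

Folding the non-attended axis into the batch dimension is what keeps each attention strictly one-dimensional, over frequency in the first hierarchy and over time in the second, which is the core idea the abstract describes.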