3 research outputs found

    Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

    Taking long-term spectral and temporal dependencies into account is essential for automatic piano transcription, and is especially helpful for determining the precise onset and offset of each note in polyphonic piano content. To capture these long-term dependencies along the frequency and time axes, we rely on the self-attention mechanism of Transformers. In this work, we propose hFT-Transformer, an automatic music transcription method that uses a two-level hierarchical frequency-time Transformer architecture. The first hierarchy consists of a convolutional block along the time axis, a Transformer encoder along the frequency axis, and a Transformer decoder that converts the dimension along the frequency axis. Its output is then fed into the second hierarchy, which consists of another Transformer encoder along the time axis. We evaluated our method on the widely used MAPS and MAESTRO v3.0.0 datasets, and it achieved state-of-the-art F1 scores on all metrics: Frame, Note, Note with Offset, and Note with Offset and Velocity estimation.
    Comment: 8 pages, 6 figures, to be published in ISMIR202
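The two-level hierarchy in the abstract can be sketched as a shape-level data flow. The following PyTorch sketch is an illustration under assumed layer sizes and module names, not the paper's actual configuration: frames are treated as a batch so attention runs across frequency bins, a learned query set lets the decoder convert the frequency dimension (e.g. spectrogram bins to 88 piano keys), and the second hierarchy attends across time.

```python
import torch
import torch.nn as nn

class HFTSketch(nn.Module):
    """Illustrative two-level frequency-time Transformer (shapes only;
    layer counts and sizes are assumptions, not the paper's)."""

    def __init__(self, n_bins=256, n_out=88, d_model=64, n_heads=4):
        super().__init__()
        # 1st hierarchy: convolutional block along the time axis
        self.conv = nn.Conv2d(1, d_model, kernel_size=(1, 5), padding=(0, 2))
        enc = nn.TransformerEncoderLayer(d_model, n_heads,
                                         dim_feedforward=128, batch_first=True)
        # Transformer encoder along the frequency axis
        self.freq_enc = nn.TransformerEncoder(enc, num_layers=1)
        dec = nn.TransformerDecoderLayer(d_model, n_heads,
                                         dim_feedforward=128, batch_first=True)
        # decoder converting n_bins frequency positions to n_out outputs
        self.freq_dec = nn.TransformerDecoder(dec, num_layers=1)
        self.query = nn.Parameter(torch.randn(n_out, d_model))
        enc2 = nn.TransformerEncoderLayer(d_model, n_heads,
                                          dim_feedforward=128, batch_first=True)
        # 2nd hierarchy: Transformer encoder along the time axis
        self.time_enc = nn.TransformerEncoder(enc2, num_layers=1)
        self.head = nn.Linear(d_model, 1)

    def forward(self, spec):                   # spec: (batch, n_bins, n_frames)
        b, f, t = spec.shape
        x = self.conv(spec.unsqueeze(1))       # (b, d_model, f, t)
        x = x.permute(0, 3, 2, 1).reshape(b * t, f, -1)  # frames as batch
        x = self.freq_enc(x)                   # attention across frequency bins
        q = self.query.unsqueeze(0).expand(b * t, -1, -1)
        x = self.freq_dec(q, x)                # (b*t, n_out, d): freq dim -> keys
        d = x.shape[-1]
        x = x.reshape(b, t, -1, d).permute(0, 2, 1, 3).reshape(-1, t, d)
        x = self.time_enc(x)                   # attention across time frames
        return self.head(x).squeeze(-1).reshape(b, -1, t)  # (b, n_out, t)
```

A forward pass with a `(batch, n_bins, n_frames)` spectrogram returns per-key, per-frame activations of shape `(batch, n_out, n_frames)`; the real model produces separate heads for frame, onset, offset, and velocity.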

    Transcribing vocal expression from polyphonic music

    A method is described for transcribing vocal expressions such as vibrato, glissando, and kobushi separately from polyphonic music. These expressions appear as fluctuations in the fundamental frequency contour of the singing voice. Because they strongly reflect the individuality of the singer, they can be used for music search and retrieval and for expressive singing voice synthesis based on singing style. The fundamental frequency contour of the singing voice is estimated using the Viterbi algorithm, constrained by a corresponding note sequence. Next, the notes are temporally aligned with the fundamental frequency sequence. Finally, each expression is identified and parameterized according to designed rules. Experiments demonstrated that this method can transcribe expressions in the singing voice from commercial recordings.
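The note-constrained Viterbi step can be illustrated with a small dynamic program. This is a simplified sketch, not the paper's implementation: per-frame pitch-candidate scores stand in for the acoustic likelihoods, the allowed deviation `max_dev` and the jump penalty `switch_cost` are invented parameters, and restricting candidates to lie near the aligned note models the constraint from the note sequence.

```python
def viterbi_f0(scores, note_track, max_dev=2, switch_cost=1.0):
    """scores[t][p]: likelihood-like score of pitch candidate p at frame t.
    note_track[t]: pitch index of the aligned note at frame t.
    Only candidates within max_dev of the note are allowed, modeling the
    constraint from a corresponding note sequence (illustrative only)."""
    n_frames, n_pitch = len(scores), len(scores[0])
    NEG = float("-inf")
    # allowed candidate set per frame, derived from the note sequence
    allowed = [
        [p for p in range(n_pitch) if abs(p - note_track[t]) <= max_dev]
        for t in range(n_frames)
    ]
    dp = [[NEG] * n_pitch for _ in range(n_frames)]
    back = [[0] * n_pitch for _ in range(n_frames)]
    for p in allowed[0]:
        dp[0][p] = scores[0][p]
    for t in range(1, n_frames):
        for p in allowed[t]:
            # penalize pitch jumps to favor a smooth F0 contour
            best_prev, best_val = 0, NEG
            for q in allowed[t - 1]:
                v = dp[t - 1][q] - switch_cost * abs(p - q)
                if v > best_val:
                    best_prev, best_val = q, v
            dp[t][p] = best_val + scores[t][p]
            back[t][p] = best_prev
    # backtrack from the best final state
    p = max(allowed[-1], key=lambda c: dp[-1][c])
    path = [p]
    for t in range(n_frames - 1, 0, -1):
        p = back[t][p]
        path.append(p)
    return path[::-1]
```

The returned path is a frame-wise pitch trajectory that stays near the notes while following the score peaks; in the described method, expressions such as vibrato would then be detected as structured deviations of this contour around the aligned notes.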