N-Gram Unsupervised Compoundation and Feature Injection for Better Symbolic Music Understanding
The first step in applying deep learning techniques to symbolic music
understanding is to transform musical pieces (mainly in MIDI format) into
sequences of predefined tokens like note pitch, note velocity, and chords.
Subsequently, the sequences are fed into a neural sequence model to accomplish
specific tasks. Music sequences exhibit strong correlations between adjacent
elements, making them prime candidates for N-gram techniques from Natural
Language Processing (NLP). Consider classical piano music: specific melodies
might recur throughout a piece, with subtle variations each time. In this
paper, we propose a novel method, NG-Midiformer, for understanding symbolic
music sequences that leverages the N-gram approach. Our method involves first
processing music pieces into word-like sequences with our proposed unsupervised
compoundation, followed by encoding with our N-gram Transformer encoder, which
effectively incorporates N-gram information to enhance the primary encoder for
a better understanding of music sequences. The pre-training process on
large-scale music datasets enables the model to thoroughly learn the N-gram
information contained within music sequences, and subsequently apply this
information for making inferences during the fine-tuning stage. Experiments on
various datasets demonstrate the effectiveness of our method, which achieves
state-of-the-art performance on a series of downstream music understanding
tasks. The code and model weights will be released at
https://github.com/CinqueOrigin/NG-Midiformer.
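
To make the idea of unsupervised compoundation concrete, here is a minimal
Python sketch of one plausible BPE-style reading of it: repeatedly merge the
most frequent adjacent token pair into a single word-like compound token. This
is an illustration only; the function names, the "+" joining convention, and
the frequency-based merge criterion are assumptions, not the authors'
algorithm (their code is at the URL above).

    from collections import Counter

    def compound_tokens(sequences, num_merges=100, min_count=2):
        # Hypothetical BPE-style compounding: repeatedly merge the most
        # frequent adjacent token pair into one compound token.
        for _ in range(num_merges):
            pair_counts = Counter()
            for seq in sequences:
                pair_counts.update(zip(seq, seq[1:]))
            if not pair_counts:
                break
            (a, b), count = pair_counts.most_common(1)[0]
            if count < min_count:
                break
            merged = a + "+" + b
            rewritten = []
            for seq in sequences:
                out, i = [], 0
                while i < len(seq):
                    if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                        out.append(merged)  # replace the pair with its compound
                        i += 2
                    else:
                        out.append(seq[i])
                        i += 1
                rewritten.append(out)
            sequences = rewritten
        return sequences

    # Toy pitch/velocity event tokens, purely illustrative.
    seqs = [["P60", "V64", "P62", "V64", "P60", "V64"],
            ["P60", "V64", "P62", "V64", "Cmaj"]]
    print(compound_tokens(seqs, num_merges=2))

A real pipeline would run such a procedure over a large MIDI-derived corpus
and keep the learned merge table, so the same compounds can be reapplied
consistently at fine-tuning time.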
Enhancing Visually-Rich Document Understanding via Layout Structure Modeling
In recent years, the use of multi-modal pre-trained Transformers has led to
significant advancements in visually-rich document understanding. However,
existing models have mainly focused on features such as text and vision while
neglecting the importance of layout relationships between text nodes. In this
paper, we propose GraphLayoutLM, a novel document understanding model that
leverages the modeling of layout structure graph to inject document layout
knowledge into the model. GraphLayoutLM utilizes a graph reordering algorithm
to adjust the text sequence based on the graph structure. Additionally, our
model uses a layout-aware multi-head self-attention layer to learn document
layout knowledge. The proposed model enables the understanding of the spatial
arrangement of text elements, improving document comprehension. We evaluate our
model on various benchmarks, including FUNSD, XFUND, and CORD, and achieve
state-of-the-art results on these datasets. Our experimental results
demonstrate that our proposed method provides a significant improvement over
existing approaches and showcases the importance of incorporating layout
information into document understanding models. We also conduct an ablation
study to investigate the contribution of each component of our model. The
results show that both the graph reordering algorithm and the layout-aware
multi-head self-attention layer play a crucial role in achieving the best
performance.
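
The two components named in this abstract, graph reordering and layout-aware
self-attention, can be pictured with a short sketch. The PyTorch code below is
a hypothetical single-head illustration, not GraphLayoutLM's implementation:
reading order is approximated by a simple top-to-bottom, left-to-right sort in
place of the paper's graph reordering, and the attention layer adds a learned
bias looked up from bucketized relative offsets between box centers. All class
and parameter names are invented for the example.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def reading_order(boxes):
        # Stand-in for graph reordering: sort nodes top-to-bottom, then
        # left-to-right, by the (y, x) of each box's top-left corner.
        return sorted(range(len(boxes)), key=lambda i: (boxes[i][1], boxes[i][0]))

    class LayoutAwareSelfAttention(nn.Module):
        # Single-head self-attention plus an additive spatial bias indexed
        # by bucketized relative (dx, dy) offsets between box centers.
        def __init__(self, dim, num_buckets=32, max_dist=1000):
            super().__init__()
            self.qkv = nn.Linear(dim, 3 * dim)
            self.num_buckets = num_buckets
            self.max_dist = max_dist
            self.bias = nn.Embedding(num_buckets * num_buckets, 1)

        def _bucket(self, rel):
            # Map signed coordinate offsets to buckets in [0, num_buckets).
            rel = rel.clamp(-self.max_dist, self.max_dist) + self.max_dist
            return (rel * (self.num_buckets - 1) // (2 * self.max_dist)).long()

        def forward(self, x, centers):
            # x: (seq, dim) token features; centers: (seq, 2) box centers.
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            scores = q @ k.t() / q.shape[-1] ** 0.5
            dx = self._bucket(centers[:, 0:1] - centers[:, 0:1].t())
            dy = self._bucket(centers[:, 1:2] - centers[:, 1:2].t())
            spatial = self.bias(dx * self.num_buckets + dy).squeeze(-1)
            return F.softmax(scores + spatial, dim=-1) @ v

A multi-head version would learn one bias table per head, and the paper's
actual layer may condition on the layout graph's edge relations rather than
raw distance buckets, which this sketch does not attempt.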