High-dimensional sequence transduction
We investigate the problem of transforming an input sequence into a
high-dimensional output sequence in order to transcribe polyphonic audio music
into symbolic notation. We introduce a probabilistic model based on a recurrent
neural network that is able to learn realistic output distributions given the
input, and we devise an efficient algorithm to search for the global mode of
that distribution. The resulting method produces musically plausible
transcriptions even under high levels of noise and drastically outperforms
previous state-of-the-art approaches on five datasets of synthesized sounds and
real recordings, approximately halving the test error rate.
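To make the setup concrete, below is a minimal PyTorch sketch of the kind of model the abstract describes: a recurrent network that, at each frame, outputs a distribution over a high-dimensional binary pitch vector conditioned on the acoustic input and the previous output. All layer sizes and names are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class HighDimTransducer(nn.Module):
    """Sketch: map acoustic frames to per-frame distributions over an
    88-dimensional binary pitch vector, conditioning on the previous
    output frame (hypothetical sizes, not the paper's exact model)."""

    def __init__(self, n_features=229, n_pitches=88, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_features + n_pitches, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_pitches)

    def forward(self, acoustic, prev_pitches):
        # acoustic: (batch, time, n_features); prev_pitches: (batch, time, n_pitches)
        h, _ = self.rnn(torch.cat([acoustic, prev_pitches], dim=-1))
        return torch.sigmoid(self.out(h))   # per-pitch Bernoulli probabilities
```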
End-to-End Attention-based Large Vocabulary Speech Recognition
Many of the current state-of-the-art Large Vocabulary Continuous Speech
Recognition (LVCSR) systems are hybrids of neural networks and Hidden Markov
Models (HMMs). Most of these systems contain separate components that deal with
the acoustic modelling, language modelling and sequence decoding. We
investigate a more direct approach in which the HMM is replaced with a
Recurrent Neural Network (RNN) that performs sequence prediction directly at
the character level. Alignment between the input features and the desired
character sequence is learned automatically by an attention mechanism built
into the RNN. For each predicted character, the attention mechanism scans the
input sequence and chooses relevant frames. We propose two methods to speed up
this operation: limiting the scan to a subset of the most promising frames, and
pooling the information contained in neighboring frames over time, thereby
reducing source sequence length. Integrating an n-gram language model into the
decoding process yields recognition accuracies similar to other HMM-free
RNN-based approaches.
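The pooling idea is simple to state in code. Below is a hedged sketch, not the paper's exact operator, of averaging neighbouring encoder frames so that the attention mechanism has a shorter source sequence to scan.

```python
import torch

def pool_over_time(frames: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Average neighbouring encoder frames so attention scans a sequence
    `factor` times shorter; an illustrative version of the pooling idea."""
    batch, time, feats = frames.shape
    time = time - time % factor                 # drop a ragged tail frame if any
    frames = frames[:, :time].reshape(batch, time // factor, factor, feats)
    return frames.mean(dim=2)

# e.g. a (1, 1000, 40) feature sequence becomes (1, 250, 40) after two passes
x = torch.randn(1, 1000, 40)
print(pool_over_time(pool_over_time(x)).shape)
```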
A Hybrid Recurrent Neural Network For Music Transcription
We investigate the problem of incorporating higher-level symbolic score-like information into Automatic Music Transcription (AMT) systems to improve their performance. We use recurrent neural networks (RNNs) and their variants as music language models (MLMs) and present a generative architecture for combining these models with predictions from a frame-level acoustic classifier. We also compare different neural network architectures for acoustic modeling. The proposed model computes a distribution over possible output sequences given the acoustic input signal, and we present an algorithm for performing a global search for good candidate transcriptions. The performance of the proposed model is evaluated on piano music from the MAPS dataset, and we observe that the proposed model consistently outperforms existing transcription methods.
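As a rough illustration of how frame-level acoustic predictions and an MLM prior can be fused, the following sketch scores a candidate binary pitch vector under both models and combines the log-probabilities; the interpolation weight alpha is a hypothetical tuning knob, not something specified in the abstract.

```python
import numpy as np

def candidate_log_score(pitches, p_acoustic, p_mlm, alpha=0.5):
    """Score one candidate binary pitch vector for a frame by combining the
    frame-level acoustic classifier's per-pitch posteriors with the music
    language model's predictions; alpha is an illustrative interpolation
    weight, not a value taken from the paper."""
    eps = 1e-12
    def bernoulli_logp(p):
        return np.sum(pitches * np.log(p + eps) + (1 - pitches) * np.log(1 - p + eps))
    return bernoulli_logp(p_acoustic) + alpha * bernoulli_logp(p_mlm)

# e.g. an 88-pitch frame scored under both models (random stand-in posteriors)
pitches = np.zeros(88); pitches[[60, 64, 67]] = 1          # a C major triad
score = candidate_log_score(pitches, np.random.rand(88), np.random.rand(88))
```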
An End-to-End Neural Network for Polyphonic Music Transcription
We present a neural network model for polyphonic music transcription. The architecture of the proposed model is analogous to speech recognition systems and comprises an acoustic model and a music language model. The acoustic model is a neural network used for estimating the probabilities of pitches in a frame of audio. The language model is a recurrent neural network that models the correlations between pitch combinations over time. The proposed model is general and can be used to transcribe polyphonic music without imposing any constraints on the polyphony or the number or type of instruments. The acoustic and language model predictions are combined using a probabilistic graphical model. Inference over the output variables is performed using the beam search algorithm. We investigate various neural network architectures for the acoustic models and compare their performance to two popular state-of-the-art acoustic models. We also present an efficient variant of beam search that improves performance and reduces run-times by an order of magnitude, making the model suitable for real-time applications. We evaluate the model's performance on the MAPS dataset and show that the proposed model outperforms state-of-the-art transcription systems.
An End-to-End Neural Network for Polyphonic Piano Music Transcription
We present a supervised neural network model for polyphonic piano music
transcription. The architecture of the proposed model is analogous to speech
recognition systems and comprises an acoustic model and a music language model.
The acoustic model is a neural network used for estimating the probabilities of
pitches in a frame of audio. The language model is a recurrent neural network
that models the correlations between pitch combinations over time. The proposed
model is general and can be used to transcribe polyphonic music without
imposing any constraints on the polyphony. The acoustic and language model
predictions are combined using a probabilistic graphical model. Inference over
the output variables is performed using the beam search algorithm. We perform
two sets of experiments. We investigate various neural network architectures
for the acoustic models and also investigate the effect of combining acoustic
and music language model predictions using the proposed architecture. We
compare the performance of the neural-network-based acoustic models with two
popular unsupervised acoustic models. Results show that convolutional neural
network acoustic models yield the best performance across all evaluation
metrics. We also observe improved performance with the application of the music
language models. Finally, we present an efficient variant of beam search that
improves performance and reduces run-times by an order of magnitude, making the
model suitable for real-time applications.
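For readers unfamiliar with the decoding step, here is a plain beam-search sketch over per-frame pitch posteriors with a pluggable language-model score. It is a generic illustration under stated assumptions, not the efficient variant the paper proposes; lm_logp stands in for the RNN music language model, and the candidate generator is a deliberately simple heuristic.

```python
import numpy as np

def frame_candidates(p, k=4):
    """Generate k plausible binary pitch vectors from per-pitch posteriors p:
    the thresholded vector plus single flips of the least confident pitches
    (a simple heuristic candidate generator, not the paper's method)."""
    base = (p > 0.5).astype(int)
    order = np.argsort(np.abs(p - 0.5))       # least confident pitches first
    cands = [base]
    for i in order[: k - 1]:
        flipped = base.copy()
        flipped[i] ^= 1
        cands.append(flipped)
    return cands

def beam_search(acoustic, lm_logp, beam=8, k=4):
    """Decode a piano roll by combining acoustic posteriors with a language
    model score; lm_logp(prefix, cand) is a stand-in for the RNN MLM."""
    eps = 1e-12
    beams = [([], 0.0)]
    for p in acoustic:                        # p: per-pitch posteriors, one frame
        scored = []
        for prefix, score in beams:
            for cand in frame_candidates(p, k):
                a = np.sum(cand * np.log(p + eps) + (1 - cand) * np.log(1 - p + eps))
                scored.append((prefix + [cand], score + a + lm_logp(prefix, cand)))
        beams = sorted(scored, key=lambda b: b[1], reverse=True)[:beam]
    return beams[0][0]
```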
Learning and Evaluation Methodologies for Polyphonic Music Sequence Prediction with LSTMs
Music language models (MLMs) play an important role in various music signal and symbolic music processing tasks, such as music generation, symbolic music classification, or automatic music transcription (AMT). In this paper, we investigate Long Short-Term Memory (LSTM) networks for polyphonic music prediction, in the form of binary piano rolls. A preliminary experiment, assessing the influence of the timestep of piano rolls on system performance, highlights the need for more musical evaluation metrics. We introduce a range of metrics, focusing on temporal and harmonic aspects. We propose to combine them into a parametrisable loss to train our network. We then conduct a range of experiments with this new loss, both for polyphonic music prediction (intrinsic evaluation) and using our predictive model as a language model for AMT (extrinsic evaluation). Intrinsic evaluation shows that tuning the behaviour of a model is possible by adjusting loss parameters, with consistent results across timesteps. Extrinsic evaluation shows consistent behaviour across timesteps in terms of precision and recall with respect to the loss parameters, leading to an improvement in AMT performance without changing the complexity of the model. In particular, we show that intrinsic performance (in terms of cross-entropy) is not related to extrinsic performance, highlighting the importance of using custom training losses for each specific application. Our model also compares favourably with previously proposed MLMs.
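As an illustration of what a parametrisable loss of this kind might look like, here is a hedged PyTorch sketch: standard binary cross-entropy plus weighted penalty terms for temporal and harmonic behaviour. The parameters lam_t and lam_h and the harmonic_mask are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def parametrised_loss(pred, target, lam_t=0.1, lam_h=0.1, harmonic_mask=None):
    """Hedged sketch of a parametrisable MLM training loss on binary piano
    rolls of shape (batch, time, pitches); lam_t, lam_h and harmonic_mask
    are hypothetical knobs, not the paper's formulation."""
    bce = F.binary_cross_entropy(pred, target)
    # temporal term: discourage spurious frame-to-frame onsets and offsets
    temporal = (pred[:, 1:] - pred[:, :-1]).abs().mean()
    loss = bce + lam_t * temporal
    if harmonic_mask is not None:
        # harmonic term: down-weight activity on pitches flagged as out of key
        loss = loss + lam_h * (pred * harmonic_mask).mean()
    return loss
```

Adjusting lam_t and lam_h would trade precision against recall in the spirit of the tunable behaviour the abstract describes.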
An Analytical Survey of End-to-End Speech Recognition Systems
This article presents an analytical survey of end-to-end speech recognition systems and of approaches to their construction, training, and optimization. We consider models that use connectionist temporal classification (CTC) as the loss function for a neural network, models based on the encoder-decoder architecture with an attention mechanism, and networks built with conditional random fields (CRFs), which generalize hidden Markov models (HMMs) and thereby remove several shortcomings of standard hybrid speech recognition systems, such as the assumption that the elements of an input sequence of speech sounds are independent random variables. We also describe how language models can be integrated at the decoding stage, which significantly reduces recognition error rates for end-to-end models, and we cover various modifications and improvements of standard end-to-end architectures, such as generalizations of connectionist temporal classification and the use of regularization in attention-based models. A survey of research in this subject area shows that end-to-end systems achieve results close to those of state-of-the-art hybrid HMM-based systems while using a simpler configuration and training and decoding faster. Finally, we review the most popular and actively developed libraries and toolkits for building end-to-end speech recognition systems, such as TensorFlow, Eesen, and Kaldi, and compare them by how simple and accessible they are to use for implementing end-to-end systems.
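Since the survey repeatedly refers to CTC as a loss function for end-to-end models, a minimal PyTorch example of using it is shown below; the alphabet size, lengths, and random tensors are purely illustrative, not drawn from any system in the survey.

```python
import torch
import torch.nn as nn

# Minimal CTC usage: log-probabilities over a 29-symbol alphabet
# (28 characters + blank) for a batch of two utterances.
T, N, C, S = 100, 2, 29, 20        # input frames, batch, classes, target length
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # 0 is the blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                    # gradients flow back to the network outputs
```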