14 research outputs found

    High-dimensional sequence transduction

    We investigate the problem of transforming an input sequence into a high-dimensional output sequence in order to transcribe polyphonic audio music into symbolic notation. We introduce a probabilistic model based on a recurrent neural network that is able to learn realistic output distributions given the input, and we devise an efficient algorithm to search for the global mode of that distribution. The resulting method produces musically plausible transcriptions even under high levels of noise and drastically outperforms previous state-of-the-art approaches on five datasets of synthesized sounds and real recordings, approximately halving the test error rate.
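    The global-mode search the abstract refers to is a search over exponentially many candidate output sequences. As a rough illustration, here is a minimal Python sketch of beam search, a standard approximation for this kind of search; the `step_log_probs` function is a toy placeholder for a trained RNN, not the authors' model, and the paper's own algorithm is more sophisticated.

```python
# A minimal beam-search sketch over a sequence distribution. The toy
# `step_log_probs` stands in for a trained RNN and is an assumption
# made for illustration only.
import numpy as np

def step_log_probs(prefix, num_classes=4):
    """Toy stand-in for an RNN: log P(y_t | y_<t, x) per candidate class."""
    rng = np.random.default_rng(abs(hash(prefix)) % (2**32))
    return np.log(rng.dirichlet(np.ones(num_classes)))

def beam_search(T, beam_width=3, num_classes=4):
    beams = [((), 0.0)]                          # (prefix, cumulative log-prob)
    for _ in range(T):
        candidates = []
        for prefix, score in beams:
            logp = step_log_probs(prefix, num_classes)
            for c in range(num_classes):
                candidates.append((prefix + (c,), score + logp[c]))
        candidates.sort(key=lambda b: b[1], reverse=True)
        beams = candidates[:beam_width]          # keep only the best prefixes
    return beams[0]                              # best sequence and its score

print(beam_search(T=5))
```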

    End-to-End Attention-based Large Vocabulary Speech Recognition

    Many current state-of-the-art large vocabulary continuous speech recognition (LVCSR) systems are hybrids of neural networks and Hidden Markov Models (HMMs). Most of these systems contain separate components that deal with acoustic modelling, language modelling, and sequence decoding. We investigate a more direct approach in which the HMM is replaced with a Recurrent Neural Network (RNN) that performs sequence prediction directly at the character level. Alignment between the input features and the desired character sequence is learned automatically by an attention mechanism built into the RNN. For each predicted character, the attention mechanism scans the input sequence and chooses relevant frames. We propose two methods to speed up this operation: limiting the scan to a subset of the most promising frames, and pooling over time the information contained in neighboring frames, thereby reducing the source sequence length. Integrating an n-gram language model into the decoding process yields recognition accuracies similar to other HMM-free RNN-based approaches.
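    For concreteness, below is a minimal numpy sketch of the two speed-ups: pooling neighboring frames to shorten the source sequence, and restricting attention to a window of promising frames. The shapes, pooling factor, and window radius are illustrative assumptions, not the paper's configuration.

```python
# Sketch of temporal pooling plus windowed soft attention. All sizes
# and the fixed window centre are toy assumptions for illustration.
import numpy as np

def pool_frames(frames, k=2):
    """Average every k neighboring frames: (T, d) -> (T//k, d)."""
    T, d = frames.shape
    T = (T // k) * k
    return frames[:T].reshape(T // k, k, d).mean(axis=1)

def windowed_attention(query, keys, values, center, radius=5):
    """Soft attention restricted to frames near `center`."""
    lo, hi = max(0, center - radius), min(len(keys), center + radius + 1)
    scores = keys[lo:hi] @ query                 # dot-product scores
    weights = np.exp(scores - scores.max())      # stable softmax
    weights /= weights.sum()
    context = weights @ values[lo:hi]            # weighted sum of frames
    return context, (lo, hi)

frames = np.random.randn(100, 16)
pooled = pool_frames(frames, k=2)                # 50 frames instead of 100
ctx, span = windowed_attention(np.random.randn(16), pooled, pooled, center=25)
print(pooled.shape, ctx.shape, span)
```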

    A Hybrid Recurrent Neural Network For Music Transcription

    We investigate the problem of incorporating higher-level symbolic score-like information into Automatic Music Transcription (AMT) systems to improve their performance. We use recurrent neural networks (RNNs) and their variants as music language models (MLMs) and present a generative architecture for combining these models with predictions from a frame-level acoustic classifier. We also compare different neural network architectures for acoustic modeling. The proposed model computes a distribution over possible output sequences given the acoustic input signal, and we present an algorithm for performing a global search for good candidate transcriptions. The performance of the proposed model is evaluated on piano music from the MAPS dataset, and we observe that the proposed model consistently outperforms existing transcription methods.
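    A common way to combine a frame-level acoustic classifier with an MLM is a weighted log-linear combination of their predictions. The sketch below illustrates that idea with toy scores and an assumed weight `alpha`; it is not the paper's exact generative architecture.

```python
# Hedged sketch of combining acoustic and music-language-model scores.
# The weighting alpha and the toy probabilities are assumptions.
import numpy as np

def combine(acoustic_logp, mlm_logp, alpha=0.5):
    """Weighted log-linear combination over the same candidate set."""
    return acoustic_logp + alpha * mlm_logp

# toy scores for 3 candidate pitch combinations at one frame
acoustic = np.log(np.array([0.6, 0.3, 0.1]))
language = np.log(np.array([0.2, 0.7, 0.1]))
scores = combine(acoustic, language)
print("best candidate:", int(np.argmax(scores)))
```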

    An End-to-End Neural Network for Polyphonic Music Transcription

    We present a neural network model for polyphonic music transcription. The architecture of the proposed model is analogous to speech recognition systems and comprises an acoustic model and a music language model. The acoustic model is a neural network used for estimating the probabilities of pitches in a frame of audio. The language model is a recurrent neural network that models the correlations between pitch combinations over time. The proposed model is general and can be used to transcribe polyphonic music without imposing any constraints on the polyphony or the number or type of instruments. The acoustic and language model predictions are combined using a probabilistic graphical model. Inference over the output variables is performed using the beam search algorithm. We investigate various neural network architectures for the acoustic models and compare their performance to two popular state-of-the-art acoustic models. We also present an efficient variant of beam search that improves performance and reduces run-times by an order of magnitude, making the model suitable for real-time applications. We evaluate the model's performance on the MAPS dataset and show that the proposed model outperforms state-of-the-art transcription systems.
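    The sketch below illustrates one way such an efficient beam-search variant can work: hypotheses are keyed by their recent output history, so near-duplicate candidates are merged and only the best survivor of each key is kept. The key length, toy model, and pruning rule are assumptions for illustration, not the paper's exact algorithm.

```python
# Beam search with hash-based pruning of near-duplicate hypotheses.
# The toy `step_log_probs` and the last-h-outputs key are assumptions.
import numpy as np

def step_log_probs(prefix, num_classes=3):
    """Toy stand-in for the combined acoustic/language model."""
    rng = np.random.default_rng(abs(hash(prefix)) % (2**32))
    return np.log(rng.dirichlet(np.ones(num_classes)))

def hashed_beam_search(T, beam_width=4, h=2, num_classes=3):
    beams = {(): 0.0}                            # prefix -> cumulative log-prob
    for _ in range(T):
        candidates = {}
        for prefix, score in beams.items():
            logp = step_log_probs(prefix, num_classes)
            for c in range(num_classes):
                new = prefix + (c,)
                key = new[-h:]                   # key on recent history
                s = score + logp[c]
                if key not in candidates or s > candidates[key][1]:
                    candidates[key] = (new, s)   # keep best survivor per key
        top = sorted(candidates.values(), key=lambda b: b[1], reverse=True)
        beams = dict(top[:beam_width])
    return max(beams.items(), key=lambda b: b[1])

print(hashed_beam_search(T=6))
```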

    An End-to-End Neural Network for Polyphonic Piano Music Transcription

    We present a supervised neural network model for polyphonic piano music transcription. The architecture of the proposed model is analogous to speech recognition systems and comprises an acoustic model and a music language model. The acoustic model is a neural network used for estimating the probabilities of pitches in a frame of audio. The language model is a recurrent neural network that models the correlations between pitch combinations over time. The proposed model is general and can be used to transcribe polyphonic music without imposing any constraints on the polyphony. The acoustic and language model predictions are combined using a probabilistic graphical model. Inference over the output variables is performed using the beam search algorithm. We perform two sets of experiments. We investigate various neural network architectures for the acoustic models and also investigate the effect of combining acoustic and music language model predictions using the proposed architecture. We compare the performance of the neural network acoustic models with two popular unsupervised acoustic models. Results show that convolutional neural network acoustic models yield the best performance across all evaluation metrics. We also observe improved performance with the application of the music language models. Finally, we present an efficient variant of beam search that improves performance and reduces run-times by an order of magnitude, making the model suitable for real-time applications.
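    For concreteness, here is a minimal PyTorch sketch of a convolutional acoustic model of this kind: spectrogram frames in, per-frame pitch probabilities out. The layer sizes, the 229-bin input, and the 88-pitch output are illustrative choices, not the paper's architecture.

```python
# Hedged sketch of a convolutional acoustic model for piano transcription.
# All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ConvAcousticModel(nn.Module):
    def __init__(self, n_bins=229, n_pitches=88):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),                # pool along frequency only
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        self.fc = nn.Linear(32 * (n_bins // 4), n_pitches)

    def forward(self, spec):                     # spec: (batch, 1, frames, bins)
        h = self.conv(spec)                      # (batch, 32, frames, bins//4)
        h = h.permute(0, 2, 1, 3).flatten(2)     # (batch, frames, features)
        return torch.sigmoid(self.fc(h))         # per-frame pitch probabilities

model = ConvAcousticModel()
print(model(torch.randn(2, 1, 10, 229)).shape)   # -> (2, 10, 88)
```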

    Learning and Evaluation Methodologies for Polyphonic Music Sequence Prediction with LSTMs

    Music language models (MLMs) play an important role in various music signal and symbolic music processing tasks, such as music generation, symbolic music classification, or automatic music transcription (AMT). In this paper, we investigate Long Short-Term Memory (LSTM) networks for polyphonic music prediction, in the form of binary piano rolls. A preliminary experiment, assessing the influence of the timestep of piano rolls on system performance, highlights the need for more musical evaluation metrics. We introduce a range of metrics, focusing on temporal and harmonic aspects. We propose to combine them into a parametrisable loss to train our network. We then conduct a range of experiments with this new loss, both for polyphonic music prediction (intrinsic evaluation) and using our predictive model as a language model for AMT (extrinsic evaluation). Intrinsic evaluation shows that tuning the behaviour of a model is possible by adjusting loss parameters, with consistent results across timesteps. Extrinsic evaluation shows consistent behaviour across timesteps in terms of precision and recall with respect to the loss parameters, leading to an improvement in AMT performance without changing the complexity of the model. In particular, we show that intrinsic performance (in terms of cross entropy) is not related to extrinsic performance, highlighting the importance of using custom training losses for each specific application. Our model also compares favourably with previously proposed MLMs.
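    The sketch below shows the shape such a parametrisable loss can take: binary cross-entropy plus weighted auxiliary terms targeting temporal and harmonic behaviour. The two auxiliary terms and the weights `lambda_t` and `lambda_h` are illustrative assumptions; the paper defines its own metrics and parametrisation.

```python
# Hedged sketch of a parametrisable loss for binary piano rolls.
# Both auxiliary terms are illustrative stand-ins, not the paper's metrics.
import torch
import torch.nn.functional as F

def parametrised_loss(pred, target, lambda_t=0.1, lambda_h=0.1):
    """pred, target: (batch, time, pitches), pred in (0, 1)."""
    bce = F.binary_cross_entropy(pred, target)
    # temporal term: discourage predictions flickering between frames
    temporal = (pred[:, 1:] - pred[:, :-1]).abs().mean()
    # harmonic term: penalise probability mass on inactive target pitches
    harmonic = (pred * (1 - target)).mean()
    return bce + lambda_t * temporal + lambda_h * harmonic

pred = torch.rand(2, 16, 88, requires_grad=True)
target = (torch.rand(2, 16, 88) > 0.9).float()
print(parametrised_loss(pred, target))
```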

    An Analytical Survey of End-to-End Speech Recognition Systems

    This article presents an analytical survey of end-to-end speech recognition systems and of approaches to their construction, training, and optimization. We consider models based on connectionist temporal classification (CTC) as the loss function for a neural network, models based on the encoder-decoder architecture with an attention mechanism, and models built with conditional random fields (CRFs). CRFs generalize hidden Markov models and thereby correct several shortcomings of standard hybrid speech recognition systems, such as the assumption that successive elements of the input sound sequence are independent random variables. We also describe how language models can be integrated at the decoding stage; this integration significantly reduces recognition error rates for end-to-end models. Various modifications and improvements of standard end-to-end architectures are described, such as generalizations of connectionist temporal classification and the use of regularization in attention-based models. A survey of research in this subject area shows that end-to-end systems achieve results close to those of state-of-the-art hybrid models while using a simpler configuration and training and decoding faster. Finally, we review the most popular and actively developed libraries and toolkits for building end-to-end speech recognition systems, such as TensorFlow, Eesen, and Kaldi, and compare them in terms of the simplicity and accessibility of their use.
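    A minimal sketch of language-model integration at the decoding stage (often called shallow fusion) is given below: each hypothesis is rescored with a weighted sum of acoustic and LM log-probabilities. The hypotheses, scores, and weight are toy values for illustration.

```python
# Sketch of shallow-fusion rescoring. Real systems apply this inside
# beam search; the scores and lambda here are toy assumptions.
import numpy as np

def fused_score(am_logp, lm_logp, lam=0.3):
    """Combine acoustic and language-model log-probabilities."""
    return am_logp + lam * lm_logp

hyps = ["speech recognition", "speech wreck ignition"]
am = np.array([-4.1, -3.9])      # acoustic log-probabilities (toy values)
lm = np.array([-2.0, -9.5])      # LM log-probabilities (toy values)
best = hyps[int(np.argmax(fused_score(am, lm)))]
print(best)                      # the LM favours the fluent hypothesis
```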