292 research outputs found
Onsets and Velocities: Affordable Real-Time Piano Transcription Using Convolutional Neural Networks
Polyphonic Piano Transcription has recently experienced substantial progress,
driven by the use of sophisticated Deep Learning approaches and the
introduction of new subtasks such as note onset, offset, velocity and pedal
detection. This progress was coupled with an increased complexity and size of
the proposed models, typically relying on non-realtime components and
high-resolution data. In this work we focus on onset and velocity detection,
showing that a substantially smaller and simpler convolutional approach, using
lower temporal resolution (24ms), is still competitive: our proposed
ONSETS&VELOCITIES model achieves state-of-the-art performance on the MAESTRO
dataset for onset detection (F1=96.78%) and sets a good novel baseline for
onset+velocity (F1=94.50%), while having ~3.1M parameters and maintaining
real-time capabilities on modest commodity hardware. We provide open-source
code to reproduce our results and a real-time demo with a pretrained model.
Comment: Accepted at EUSIPCO 202
Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information
This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages that lack resources for speech and language processing. We focus on approaches that use data from multiple languages to improve performance for those languages at different levels, such as feature extraction, acoustic modeling and language modeling. On the application side, this thesis also includes research on non-native and code-switching speech.
Recent advances in LVCSR : A benchmark comparison of performances
Large Vocabulary Continuous Speech Recognition (LVCSR), characterized by high variability in the speech, is the most challenging task in automatic speech recognition (ASR). Believing that evaluating ASR systems on relevant, common speech corpora is one of the key factors that accelerate research, we present in this paper a benchmark comparison of the performance of current state-of-the-art LVCSR systems across different speech recognition tasks. Furthermore, we objectively identify the best-performing technologies and the best accuracy achieved so far on each task. The benchmarks show that Deep Neural Networks and Convolutional Neural Networks have proven their efficiency on several LVCSR tasks, outperforming the traditional Hidden Markov Models and Gaussian Mixture Models. They also show that, despite satisfactory performance on some LVCSR tasks, the problem of large-vocabulary speech recognition is far from solved in others, where more research effort is still needed.
Spoken Term Detection on Low Resource Languages
Developing efficient speech processing systems for low-resource languages is an immensely challenging problem. One potentially effective approach to the lack of resources for any particular language is to employ data from multiple languages when building speech processing sub-systems. This thesis investigates methodologies for Spoken Term Detection (STD) in low-resource Indian languages. The task of STD is to search for a query keyword, given in text form, in a considerably large speech database. This is usually done by matching templates of feature vectors, representing the sequence of phonemes from the query word and the continuous speech from the database. The typical features used to represent speech signals in most speech processing systems are mel frequency cepstral coefficients (MFCCs). As speech is a very complex signal, carrying information about the textual message, speaker identity, and the emotional and health state of the speaker, the MFCC features derived from it also contain information about all these factors. For efficient template matching, we need to neutralize the speaker variability in the features and stabilize them so that they represent the speech variability alone.
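The template-matching step described above can be sketched as follows. This is a minimal illustration, assuming length-normalised dynamic time warping (DTW) between the query template and a sliding window over the search utterance; the random feature arrays stand in for real MFCC frames, and the threshold value is a placeholder, not a configuration from the thesis.

```python
# Sketch of DTW-based spoken term detection over frame-level features.
# The features here are random stand-ins for MFCC frames; a real system
# would extract MFCCs from audio (e.g. with a library such as librosa).
import numpy as np

def dtw_distance(query, segment):
    """Length-normalised DTW cost between two (frames x dims) sequences."""
    n, m = len(query), len(segment)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(query[i - 1] - segment[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m] / (n + m)

def detect_term(query, utterance, threshold):
    """Slide the query template over the utterance; return (cost, start)
    of the best-matching window, or (cost, None) if above threshold."""
    n = len(query)
    best_cost, best_start = np.inf, -1
    for start in range(len(utterance) - n + 1):
        c = dtw_distance(query, utterance[start:start + n])
        if c < best_cost:
            best_cost, best_start = c, start
    return (best_cost, best_start) if best_cost < threshold else (best_cost, None)
```

Because the DTW cost is normalised by path length, a single threshold can be applied to queries of different durations; speaker variability in the raw MFCCs, as the abstract notes, is what makes choosing such a threshold hard in practice.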
The AMI System for the Transcription of Speech in Meetings
This paper describes the AMI transcription system for speech in
meetings developed in collaboration by five research groups. The
system includes generic techniques such as discriminative and speaker
adaptive training, vocal tract length normalisation, heteroscedastic
linear discriminant analysis, maximum likelihood linear regression,
and phone posterior based features, as well as techniques specifically
designed for meeting data. These include segmentation and
cross-talk suppression, beam-forming, domain adaptation, web-data
collection, and channel adaptive training. The system was improved
by more than 20% relative in word error rate compared to our previous
system and was used in the NIST RT’06 evaluations where it was
found to yield competitive performance.