60 research outputs found
MASR: Metadata Aware Speech Representation
In the recent years, speech representation learning is constructed primarily
as a self-supervised learning (SSL) task, using the raw audio signal alone,
while ignoring the side-information that is often available for a given speech
recording. In this paper, we propose MASR, a Metadata Aware Speech
Representation learning framework, which addresses the aforementioned
limitations. MASR enables the inclusion of multiple external knowledge sources
to enhance the utilization of meta-data information. The external knowledge
sources are incorporated in the form of sample-level pair-wise similarity
matrices that are useful in a hard-mining loss. A key advantage of the MASR
framework is that it can be combined with any choice of SSL method. Using MASR
representations, we perform evaluations on several downstream tasks such as
language identification, speech recognition and other non-semantic tasks such
as speaker and emotion recognition. In these experiments, we illustrate
significant performance improvements for the MASR over other established
benchmarks. We perform a detailed analysis on the language identification task
to provide insights on how the proposed loss function enables the
representations to separate closely related languages
Embedded Speech Technology
openEnd-to-End models in Automatic Speech Recognition simplify the speech recognition process. They convert audio data directly into text representation without exploiting multiple stages and systems. This direct approach is efficient and reduces potential points of error. On the contrary, Sequence-to-Sequence models adopt a more integrative approach where they use distinct models for retrieving the acoustic and language-specific features, which are respectively known as acoustic and language models. This integration allows for better coordination between different speech aspects, potentially leading to more accurate transcriptions.
In this thesis, we explore various Speech-to-Text (STT) models, mainly focusing on End-to-End and Sequence-to-Sequence techniques. We also look into using offline STT tools such as Wav2Vec2.0, Kaldi and Vosk. These tools face challenges when handling new voice data or various accents of the same language. To address this challenge, we fine-tune the models to make them better at handling new, unseen data. Through our comparison, Wav2Vec2.0 emerged as the top performer, though with a larger model size. Our approach also proves that using Kaldi and Vosk together creates a robust STT system that can identify new words using phonemes
An Efficient Method for Number Plate Detection and Extraction Using White Pixel Detection (WPD) Method
Intelligent transport systems play an important role in supporting smart cities because of their promising applications in various areas, such as electronic toll collection, highway surveillance, urban logistics and traffic management. One of the key components of intelligent transport systems is vehicle license plate recognition, which enables the identification of each vehicle by recognizing the characters on its license plate through various image processing and computer vision techniques. Vehicle license plate recognition typically consists of smoothing image using median filter, White pixel detection (WPD), and number plate extraction. In this work an efficient White pixel detection method has been describing a license plates in various luminance conditions. Mostly we will focus on vehicle number plate detection along with the white pixel detection method we will use median filters and Line density filters to increase the detection accuracy for number plate. Subjective and objective quality assessment parameters will give us robustness of proposed work compared to state of License Plate Detection(LPD) techniques
A Novel Approach for Speech to Text Recognition System Using Hidden Markov Model
Speech recognition is the application of sophisticated algorithms which involve the transforming of the human voice to text. Speech identification is essential as it utilizes by several biometric identification systems and voice-controlled automation systems. Variations in recording equipment, speakers, situations, and environments make speech recognition a tough undertaking. Three major phases comprise speech recognition: speech pre-processing, feature extraction, and speech categorization. This work presents a comprehensive study with the objectives of comprehending, analyzing, and enhancing these models and approaches, such as Hidden Markov Models and Artificial Neural Networks, employed in the voice recognition system for feature extraction and classification
On the Effectiveness of Neural Text Generation based Data Augmentation for Recognition of Morphologically Rich Speech
Advanced neural network models have penetrated Automatic Speech Recognition
(ASR) in recent years, however, in language modeling many systems still rely on
traditional Back-off N-gram Language Models (BNLM) partly or entirely. The
reason for this are the high cost and complexity of training and using neural
language models, mostly possible by adding a second decoding pass (rescoring).
In our recent work we have significantly improved the online performance of a
conversational speech transcription system by transferring knowledge from a
Recurrent Neural Network Language Model (RNNLM) to the single pass BNLM with
text generation based data augmentation. In the present paper we analyze the
amount of transferable knowledge and demonstrate that the neural augmented LM
(RNN-BNLM) can help to capture almost 50% of the knowledge of the RNNLM yet by
dropping the second decoding pass and making the system real-time capable. We
also systematically compare word and subword LMs and show that subword-based
neural text augmentation can be especially beneficial in under-resourced
conditions. In addition, we show that using the RNN-BNLM in the first pass
followed by a neural second pass, offline ASR results can be even significantly
improved.Comment: 8 pages, 2 figures, accepted for publication at TSD 202
Deep Learning Methods for Industry and Healthcare
L'abstract è presente nell'allegato / the abstract is in the attachmen
- …