
    Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

    We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages. (Comment: 20 pages, 7 figures, 8 tables)
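    The random-projection quantization mentioned above can be illustrated with a short, hypothetical sketch: speech features are projected through a frozen random matrix and matched to the nearest entry of a frozen random codebook, and the resulting discrete ids serve as prediction targets for masked-frame pre-training. The dimensions and the PyTorch implementation below are illustrative assumptions, not the USM code.

        # Minimal sketch of random-projection quantization targets (assumed setup).
        import torch

        feat_dim, proj_dim, codebook_size = 80, 16, 8192  # illustrative sizes

        # Frozen random projection and codebook: initialised once, never trained.
        torch.manual_seed(0)
        projection = torch.randn(feat_dim, proj_dim)
        codebook = torch.nn.functional.normalize(
            torch.randn(codebook_size, proj_dim), dim=-1)

        def quantize(features: torch.Tensor) -> torch.Tensor:
            """Map (time, feat_dim) speech features to discrete target ids."""
            projected = torch.nn.functional.normalize(features @ projection, dim=-1)
            # Nearest codebook entry by cosine similarity becomes the BERT-style label.
            return (projected @ codebook.T).argmax(dim=-1)

        # The encoder would then be trained to predict these ids for masked frames.
        targets = quantize(torch.randn(300, feat_dim))  # e.g. 300 frames of log-mel features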

    Low-cost deep learning UAV and Raspberry Pi solution to real time pavement condition assessment

    In this thesis, a real-time and low-cost solution for the autonomous condition assessment of pavement is proposed using deep learning, Unmanned Aerial Vehicle (UAV) and Raspberry Pi tiny-computer technologies, which makes road maintenance and renovation management more efficient and cost-effective. A comparison study was conducted to compare the performance of seven different combinations of meta-architectures for pavement distress classification. It was observed that the real-time object detection architecture SSD with the MobileNet feature extractor is the best combination for real-time defect detection on tiny computers. A low-cost Raspberry Pi smart defect-detector camera was configured using the trained SSD MobileNet v1, which can be deployed with a UAV for real-time and remote pavement condition assessment. The preliminary results show that the smart pavement detector camera achieves an accuracy of 60% at 1.2 frames per second on the Raspberry Pi and 96% at 13.8 frames per second on a CPU-based computer.
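    As a rough illustration of the deployment step described above, the sketch below runs a converted SSD MobileNet detector on single camera frames with TensorFlow Lite, the kind of lightweight runtime typically used on a Raspberry Pi. The model file name and the output-tensor ordering are assumptions for illustration, not the thesis configuration.

        # Hedged sketch of on-device defect detection with a TFLite SSD MobileNet model.
        import cv2
        from tflite_runtime.interpreter import Interpreter  # or tensorflow.lite.Interpreter off-device

        interpreter = Interpreter(model_path="ssd_mobilenet_v1_pavement.tflite")  # hypothetical file
        interpreter.allocate_tensors()
        inp = interpreter.get_input_details()[0]
        outs = interpreter.get_output_details()

        def detect_defects(frame, threshold=0.5):
            """Run one BGR camera frame through the detector and return (box, class) pairs."""
            h, w = inp["shape"][1], inp["shape"][2]
            resized = cv2.resize(frame, (w, h))[None].astype(inp["dtype"])
            interpreter.set_tensor(inp["index"], resized)
            interpreter.invoke()
            # Standard post-processed SSD outputs: boxes, classes, scores
            # (the ordering can differ between converted models).
            boxes = interpreter.get_tensor(outs[0]["index"])[0]
            classes = interpreter.get_tensor(outs[1]["index"])[0]
            scores = interpreter.get_tensor(outs[2]["index"])[0]
            return [(b, int(c)) for b, c, s in zip(boxes, classes, scores) if s >= threshold]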

    Artificial Intelligence in the Creative Industries: A Review

    This paper reviews the current state of the art in Artificial Intelligence (AI) technologies and applications in the context of the creative industries. A brief background of AI, and specifically Machine Learning (ML) algorithms, is provided, including Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), Recurrent Neural Networks (RNNs) and Deep Reinforcement Learning (DRL). We categorise creative applications into five groups related to how AI technologies are used: i) content creation, ii) information analysis, iii) content enhancement and post-production workflows, iv) information extraction and enhancement, and v) data compression. We critically examine the successes and limitations of this rapidly advancing technology in each of these areas. We further differentiate between the use of AI as a creative tool and its potential as a creator in its own right. We foresee that, in the near future, machine learning-based AI will be adopted widely as a tool or collaborative assistant for creativity. In contrast, we observe that the successes of machine learning in domains with fewer constraints, where AI is the 'creator', remain modest. The potential of AI (or its developers) to win awards for its original creations in competition with human creatives is also limited, based on contemporary technologies. We therefore conclude that, in the context of creative industries, maximum benefit from AI will be derived where its focus is human-centric -- where it is designed to augment, rather than replace, human creativity.

    Deep representation learning for speech recognition

    Representation learning is a fundamental ingredient of deep learning. However, learning a good representation is a challenging task. For speech recognition, such a representation should contain the information needed to perform well in this task. A robust representation should also be reusable, hence it should capture the structure of the data. Interpretability is another desired characteristic. In this thesis we strive to learn an optimal deep representation for speech recognition using feed-forward Neural Networks (NNs) with different connectivity patterns. First and foremost, we aim to improve the robustness of the acoustic models. We use attribute-aware and adaptive training strategies to model the underlying factors of variation related to the speakers and the acoustic conditions. We focus on low-latency and real-time decoding scenarios. We explore different utterance summaries (referred to as utterance embeddings), capturing various sources of speech variability, and we seek to optimise speaker adaptive training (SAT) with control networks acting on the embeddings. We also propose a multi-scale CNN layer to learn factorised representations; the proposed multi-scale approach also improves computational and memory efficiency. We then present a number of approaches in an attempt to better understand the learned representations. First, with a controlled design, we assess the role of individual components of deep CNN acoustic models. Next, with saliency maps, we evaluate the importance of each input feature with respect to the classification criterion. Then, we propose to evaluate layer-wise and model-wise learned representations in different diagnostic verification tasks (speaker and acoustic condition verification). We propose a deep CNN model as the embedding extractor, merging the information learned at different layers in the network. Similarly, we perform the analyses for the embeddings used in SAT-DNNs to gain more insight. For the multi-scale models, we also show how to compare learned representations (and assess their robustness) with a metric invariant to affine transformations.
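    A minimal sketch of the multi-scale CNN layer idea referred to above: several parallel convolutions with different kernel sizes are applied to the same input and concatenated along the channel axis, so each layer sees the signal at several resolutions. Kernel sizes and channel counts below are illustrative assumptions, not the thesis hyper-parameters.

        # Assumed multi-scale convolutional layer: parallel branches, one per kernel size.
        import torch
        import torch.nn as nn

        class MultiScaleConv(nn.Module):
            def __init__(self, in_ch=1, ch_per_scale=32, kernel_sizes=(3, 5, 9)):
                super().__init__()
                # One branch per scale; padding k//2 keeps the time-frequency shape aligned.
                self.branches = nn.ModuleList(
                    [nn.Conv2d(in_ch, ch_per_scale, k, padding=k // 2) for k in kernel_sizes])

            def forward(self, x):  # x: (batch, in_ch, freq, time)
                return torch.cat([branch(x) for branch in self.branches], dim=1)

        layer = MultiScaleConv()
        out = layer(torch.randn(4, 1, 40, 100))  # -> (4, 96, 40, 100)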

    Learning speech embeddings for speaker adaptation and speech understanding

    In recent years, deep neural network models have gained popularity as a modeling approach for many speech processing tasks, including automatic speech recognition (ASR) and spoken language understanding (SLU). This dissertation has two main goals. The first is to propose modeling approaches for learning speaker embeddings for speaker adaptation, or for learning semantic speech embeddings. The second is to introduce training objectives that achieve fairness for the ASR and SLU problems. In the case of speaker adaptation, we introduce an auxiliary network to an ASR model and learn to simultaneously detect speaker changes and adapt to the speaker in an unsupervised way. We show that this joint model leads to lower error rates compared to a two-step approach where the signal is segmented into single-speaker regions and then fed into an adaptation model. We then reformulate the speaker adaptation problem from a counterfactual fairness point of view and introduce objective functions that match the ASR performance of individuals in the dataset to that of their counterfactual counterparts. We show that we can achieve a lower error rate in an ASR system while reducing the performance disparity between protected groups. In the second half of the dissertation, we focus on SLU and tackle two problems associated with SLU datasets. The first is the lack of large speech corpora. To handle this issue, we propose using available non-parallel text data so that we can leverage the information in text to guide the learning of speech embeddings. We show that this technique increases intent classification accuracy compared to a speech-only system. The second is the label imbalance problem, which is also related to fairness, since a model trained on skewed data usually leads to biased results. To achieve fair SLU, we propose maximizing the F-measure instead of conventional cross-entropy minimization and show that it is possible to increase the number of classes with nonzero recall. In the last two chapters, we provide additional discussion of the impact of these projects from both technical and social perspectives, propose directions for future research and summarize the findings.
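    The F-measure maximization mentioned above can be approximated with a differentiable (soft) F1 computed from predicted class probabilities, which then replaces cross-entropy as the training loss. The sketch below is a generic macro soft-F1 surrogate under that assumption, not the exact objective of the dissertation.

        # Hedged sketch of a differentiable macro-F1 loss for imbalanced classification.
        import torch

        def soft_f1_loss(probs: torch.Tensor, targets: torch.Tensor, eps: float = 1e-8):
            """probs: (batch, num_classes) softmax outputs; targets: (batch,) class ids."""
            one_hot = torch.nn.functional.one_hot(targets, probs.shape[1]).float()
            tp = (probs * one_hot).sum(dim=0)         # soft true positives per class
            fp = (probs * (1 - one_hot)).sum(dim=0)   # soft false positives per class
            fn = ((1 - probs) * one_hot).sum(dim=0)   # soft false negatives per class
            f1 = 2 * tp / (2 * tp + fp + fn + eps)    # per-class soft F1
            return 1 - f1.mean()                      # macro-averaged, minimised

        # Example usage on random data.
        probs = torch.softmax(torch.randn(8, 5), dim=-1)
        loss = soft_f1_loss(probs, torch.randint(0, 5, (8,)))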

    On the semantic information in zero-shot action recognition

    Advisor: Dr. David Menotti. Co-advisor: Dr. Hélio Pedrini. Doctoral thesis, Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 14/04/2023. Includes references: p. 117-132. Area of concentration: Computer Science.
    Abstract: The advancements of the last decade in deep learning models, together with the high availability of examples on platforms such as YouTube, have driven notable progress on the problem of Human Action Recognition (HAR) in videos. These advances bring the challenge of adding new classes to existing models, since including them demands time and computational resources. In addition, new classes of actions are frequently created, either through the use of new objects or new forms of interaction between humans. This scenario motivates the Zero-Shot Action Recognition (ZSAR) problem, defined as classifying instances belonging to classes not available during the model training phase. ZSAR methods aim to learn projection functions that relate video representations to the semantic representations of known class labels; it is therefore a multi-modal representation problem. In this thesis, we investigate the semantic gap problem in ZSAR: the vector spaces of the video and label representations do not coincide, and the learned projection functions are often insufficient to correct the distortions. We argue that the semantic gap derives from what we call semantic lack, which occurs on both sides of the problem (i.e., videos and labels) and is not sufficiently investigated in the literature. We present three approaches to the problem, investigating different semantic information and representation strategies for videos and labels. We show that an efficient way to represent videos is to transform them into descriptive sentences using video captioning methods; this approach can describe scenes, objects, and the spatial and temporal interactions between humans, and yields high-performing models compared to the literature. We also propose including descriptive information about the objects present in the scenes, obtained from models trained for object recognition. We show that class labels are better represented using sentences extracted from descriptive texts collected from the Internet; using only text, we employ deep neural network models pre-trained on the paraphrasing task to encode this information and perform ZSAR classification with a reduced semantic gap. Finally, we show how conditioning the representation of video frames on their corresponding textual descriptions produces a model capable of representing both videos and texts in a joint vector space. The approaches presented in this thesis effectively reduce the semantic gap through contributions both in the information added and in the way it is encoded.
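    The label-matching idea described above can be sketched as nearest-neighbour search between sentence embeddings: a captioning model (not shown) turns the video into a descriptive sentence, a paraphrase-trained text encoder embeds both that caption and one descriptive sentence per unseen class, and the class with the highest cosine similarity is predicted. The encoder name and the example sentences are assumptions for illustration, not the thesis setup.

        # Hedged sketch: ZSAR as sentence-embedding matching with a paraphrase encoder.
        from sentence_transformers import SentenceTransformer, util

        encoder = SentenceTransformer("paraphrase-MiniLM-L6-v2")  # assumed paraphrase-trained model

        # Hypothetical descriptive sentences for two unseen action classes.
        class_descriptions = {
            "archery": "a person draws a bow and shoots an arrow at a target",
            "kayaking": "a person paddles a small boat through the water",
        }
        label_emb = encoder.encode(list(class_descriptions.values()), convert_to_tensor=True)

        caption = "a man in a field aims a bow and releases an arrow"  # output of a video captioner
        cap_emb = encoder.encode(caption, convert_to_tensor=True)

        scores = util.cos_sim(cap_emb, label_emb)[0]               # cosine similarity to each class
        predicted = list(class_descriptions)[int(scores.argmax())]  # -> "archery"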