
    Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems

    Neural models have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models within more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to predict text directly from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features, which are fed to a classifier trained to classify frames into phones. We evaluate representations from different layers of the deep model and compare their quality for predicting phone labels. Our experiments shed light on important aspects of the end-to-end model, such as layer depth, model complexity, and other design choices. Comment: NIPS 201
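    The probing setup described above can be sketched in a toy form: extract frame-level features from some layer, train a simple frame classifier on phone labels, and compare accuracy across layers. The sketch below is hypothetical and self-contained: the "features" are synthetic Gaussian clusters standing in for real network activations (noisier clusters mimic a less separable layer), and the probe is a nearest-centroid classifier rather than whatever classifier the paper actually used.

```python
import random
from collections import defaultdict

random.seed(0)
PHONES = ["aa", "iy", "s"]

def synth_frames(noise, n=300):
    # Hypothetical stand-in for frame-level features extracted from one
    # layer of a pre-trained CTC model; each phone gets a distinct mean.
    means = {"aa": (0.0, 0.0), "iy": (4.0, 0.0), "s": (0.0, 4.0)}
    frames = []
    for _ in range(n):
        p = random.choice(PHONES)
        mx, my = means[p]
        frames.append(((mx + random.gauss(0, noise),
                        my + random.gauss(0, noise)), p))
    return frames

def probe(frames):
    # Minimal frame-level phone classifier: nearest class centroid,
    # fit on the first half of the frames, evaluated on the second.
    train, test = frames[: len(frames) // 2], frames[len(frames) // 2:]
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), p in train:
        s = sums[p]
        s[0] += x; s[1] += y; s[2] += 1
    cents = {p: (s[0] / s[2], s[1] / s[2]) for p, s in sums.items()}
    correct = sum(
        min(cents, key=lambda q: (x - cents[q][0]) ** 2 + (y - cents[q][1]) ** 2) == p
        for (x, y), p in test
    )
    return correct / len(test)

acc_shallow = probe(synth_frames(noise=3.0))  # noisier: less separable "layer"
acc_deep = probe(synth_frames(noise=0.5))     # cleaner: more separable "layer"
```

    Comparing the two accuracies mirrors the paper's layer-wise evaluation: whichever layer's features let a simple probe predict phones better is, by this measure, the better phonetic representation.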

    Text detection and recognition in natural scene images

    This thesis addresses the problem of end-to-end text detection and recognition in natural scene images based on deep neural networks. Scene text detection and recognition aim to find regions in an image that human beings would consider text, generate a bounding box for each word, and output the corresponding sequence of characters. As a useful task in image analysis, scene text detection and recognition attract much attention in the computer vision field. In this thesis, we tackle this problem by taking advantage of recent successes in deep learning. Car license plates can be viewed as a special case of scene text, as both consist of characters and appear in natural scenes; nevertheless, each has its own specificities. We start from car license plate detection and recognition, and then extend the methods to general scene text with additional ideas. For both tasks, we develop two approaches: a stepwise one and an integrated one. Stepwise methods tackle text detection and recognition step by step with separate models, while integrated methods handle both detection and recognition simultaneously in one model. All approaches are based on the powerful deep Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), given the tremendous breakthroughs they have brought to the computer vision community. To begin with, a stepwise framework is proposed to tackle text detection and recognition, applied to car license plates and general scene text respectively. A character CNN classifier is trained to detect characters in an image in a sliding-window manner. The detected characters are then grouped into license plates or text lines according to heuristic rules. A sequence-labeling-based method is proposed to recognize the whole license plate or text line without character-level segmentation.
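    The abstract does not name the exact sequence-labeling scheme, but CTC-style greedy decoding is one common way to read out a whole plate or text line without segmenting characters: the network labels every frame or feature-map column (with a "blank" symbol allowed), and decoding collapses repeats and drops blanks. A minimal sketch of that decoding step, under the assumption of CTC-style labeling:

```python
BLANK = "-"

def ctc_greedy_decode(frame_labels):
    # Collapse consecutive repeated labels, then drop blanks.
    # A blank between two identical labels (e.g. "B-B") preserves a
    # genuine double character, which is why the blank symbol exists.
    out, prev = [], None
    for ch in frame_labels:
        if ch != prev and ch != BLANK:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_greedy_decode("--SS--77--AA-BB--11-"))  # -> S7AB1
print(ctc_greedy_decode("AAB-BA"))                # -> ABBA
```

    Because the mapping from per-column labels to the final string is handled entirely by this collapse rule, no character-level segmentation of the input image is ever needed.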
On the basis of this sequence-labeling recognition method, and to accelerate processing, an integrated deep neural network is then proposed to address car license plate detection and recognition concurrently. It integrates CNNs and RNNs in one network and can be trained end-to-end. Both car license plate bounding boxes and their labels are generated in a single forward pass of the network. The whole process involves no heuristic rules and avoids intermediate procedures such as image cropping and feature recalculation, which not only prevents error accumulation but also reduces the computational burden. Lastly, the unified network is extended to simultaneous general text detection and recognition in natural scenes. In contrast to the network for car license plates, several innovations are proposed to accommodate the special characteristics of general text. A varying-size RoI encoding method is proposed to handle the varied aspect ratios of general text. An attention-based sequence-to-sequence learning structure is adopted for word recognition, with the expectation that a character-level language model can be learned in this manner. The whole framework can be trained end-to-end, requiring only images, ground-truth bounding boxes, and text labels. Through end-to-end training, the learned features become more discriminative, which improves overall performance. The convolutional features are computed only once and shared by both detection and recognition, which saves processing time. The proposed method achieves state-of-the-art performance on several standard benchmark datasets. Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 201
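    The motivation behind varying-size RoI encoding can be illustrated with a small sketch. The function below is hypothetical (the thesis does not specify its exact rule): it pools every region of interest to a fixed height while letting the pooled width track the region's aspect ratio, so a long word keeps more horizontal resolution than a square, fixed-size RoI pooling grid would give it.

```python
def pooled_roi_size(roi_w, roi_h, fixed_h=8, max_w=64):
    # Hypothetical varying-size RoI encoding rule: fixed pooled height,
    # pooled width proportional to the region's aspect ratio, clamped
    # to [1, max_w] so downstream layers see a bounded tensor size.
    w = round(fixed_h * roi_w / max(roi_h, 1))
    return (min(max_w, max(1, w)), fixed_h)

print(pooled_roi_size(200, 25))   # wide word: keeps horizontal resolution
print(pooled_roi_size(30, 30))    # near-square region: small pooled width
print(pooled_roi_size(2000, 10))  # extreme ratio: capped at max_w
```

    A fixed 8x8 pooling would squash the 200x25 word by a factor of 25 horizontally but only ~3 vertically, distorting character shapes; scaling the pooled width with aspect ratio avoids that imbalance.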

    Emotion Recognition on Twitter Using Neural Networks

    Deep learning has recently revolutionised many fields of natural language processing, but it has not yet been applied to emotion recognition. Most recent studies of emotion recognition on tweets used simple classifiers on a combination of bag-of-words and human-engineered features. We therefore worked on improving emotion-recognition algorithms using neural networks. To this end, we created three large emotion-labelled data sets corresponding to Ekman's, Plutchik's, and POMS's emotions by exploiting Twitter's popular self-annotation mechanism: hashtags. We compared the performance of bag-of-words and latent semantic indexing models with that of neural networks. We trained several word- and character-based, recurrent and convolutional neural networks. Further, we investigated the transferability of the final hidden-state representations of neural networks: how appropriate is a representation trained on one classification for recognising another? Finally, we developed a single model for recognising all three emotion classifications from a shared representation. We show that neural networks can surpass traditional text classification approaches for emotion recognition. A recurrent neural network working directly on characters, without any text preprocessing, in a completely end-to-end fashion was the most successful architecture. Although models trained on single data sets showed poor transferability, we improved the generality of the final hidden-state representation in the unison model. When training the unison model, the standard training heuristic yielded unbalanced performance due to the vast difference in data-set sizes, but the newly proposed training strategy produced a unison model with performance comparable to that of the single models.
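    The hashtag self-annotation idea can be sketched concretely for the Ekman case: a trailing emotion hashtag supplies the label and is stripped from the text so the classifier cannot simply read the label back out. This is a simplified, hypothetical sketch; the actual data-set construction involved additional filtering and all three emotion taxonomies.

```python
EKMAN = ("joy", "sadness", "anger", "fear", "disgust", "surprise")

def self_annotate(tweet):
    # If the tweet ends with an Ekman emotion hashtag, return the tweet
    # with the hashtag removed plus the derived label; otherwise return
    # None, meaning the tweet is not self-annotated and is discarded.
    text = tweet.rstrip()
    for emotion in EKMAN:
        tag = "#" + emotion
        if text.lower().endswith(tag):
            return text[: -len(tag)].rstrip(), emotion
    return None

print(self_annotate("finally finished my thesis #joy"))
print(self_annotate("no emotion hashtag here"))
```

    A character-based recurrent model would then consume the cleaned text directly, one character at a time, with no tokenisation or other preprocessing, matching the end-to-end setup the abstract reports as most successful.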
