65 research outputs found

    Deep Learning for Scene Text Detection, Recognition, and Understanding

    Detecting and recognizing text in images is a long-standing task in computer vision. The goal is to extract textual information from images and videos, such as recognizing license plates. Although great progress has been made in recent years, the task remains challenging due to the wide range of variations in text appearance. In this thesis, we review the issues that hinder current Optical Character Recognition (OCR) development and explore potential solutions. Specifically, we first investigate the unfair comparisons between different OCR algorithms caused by the lack of a consistent evaluation framework. The absence of a unified evaluation protocol leads to inconsistent and unreliable results, making it difficult to compare and improve upon existing methods. To tackle this issue, we design a new evaluation framework covering datasets, metrics, and models, enabling consistent and fair comparisons between OCR systems. Another issue in the field is the imbalanced distribution of training samples. In particular, the sample distribution depends largely on where and how the data was collected, and the resulting data bias may lead to poor performance and low generalizability on under-represented classes. To address this problem, we take the driving license plate recognition task as an example and propose a text-to-image model that can synthesize photo-realistic text samples. Using this model, we synthesized more than one million samples to augment the training dataset, significantly improving the generalization capability of OCR models. Additionally, this thesis explores text visual question answering (text VQA), a new and emerging research topic in the OCR community. This task challenges OCR models to understand the relationships between the text and backgrounds and to answer the given questions. We propose to investigate evidence-based text VQA, which involves designing models that can provide reasonable evidence for their predictions, thus improving generalization ability.
    Thesis (Ph.D.) -- University of Adelaide, School of Computer and Mathematical Sciences, 202
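A unified OCR evaluation protocol like the one argued for above can be grounded in standard recognition metrics. The sketch below is a generic illustration, not the thesis's actual framework: it computes exact-match word accuracy and mean normalized edit distance over prediction/ground-truth pairs.

```python
# Two common OCR evaluation metrics: exact-match word accuracy and
# normalized edit (Levenshtein) distance. Generic sketch only; the
# thesis's framework also standardizes datasets and models.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def evaluate(preds, gts):
    """Return (word accuracy, mean normalized edit distance)."""
    acc = sum(p == g for p, g in zip(preds, gts)) / len(gts)
    ned = sum(edit_distance(p, g) / max(len(p), len(g), 1)
              for p, g in zip(preds, gts)) / len(gts)
    return acc, ned
```

Reporting both metrics avoids the common pitfall of ranking systems by exact-match accuracy alone, which ignores near-miss predictions.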

    CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap

    After addressing the state of the art during the first year of CHORUS and establishing the existing landscape in multimedia search engines, we identified and analyzed gaps in the European research effort during our second year. In this period we focused on three directions, namely technological issues, user-centred issues and use-cases, and socio-economic and legal aspects. These were assessed by two central studies: firstly, a concerted vision of the functional breakdown of a generic multimedia search engine, and secondly, representative use-case descriptions with a related discussion of the requirements for technological challenges. Both studies were carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations at international conferences, and surveys addressed to EU project coordinators as well as national initiative coordinators. Based on the feedback obtained, we identified two types of gaps, namely core technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research challenges but have an impact on innovation progress. New socio-economic trends are presented as well as emerging legal challenges

    Detection and Recognition of License Plates by Convolutional Neural Networks

    Current advancements in machine intelligence have expedited the process of recognizing vehicles and other objects on the roads. License Plate Recognition (LPR) remains an open challenge for researchers seeking a reliable and accurate system for automatic license plate recognition. Several methods, including deep learning techniques, have been proposed recently for LPR, yet those methods are limited to specific regions or privately collected datasets. In this thesis, we propose an end-to-end deep convolutional neural network system for license plate recognition that is not limited to a specific region or country. We apply a modified version of YOLO v2 to first detect the vehicle and then localize the license plate. We further improve an Optical Character Recognition network (OCR-Net) to recognize the license plate numbers and letters. Our method performs well for different vehicle types such as sedans, SUVs, buses, motorbikes, and trucks. The system works reliably on images of the front and rear views of the vehicle; it also handles tilted or distorted license plate images and performs adequately under various illumination conditions and noisy backgrounds. Several experiments were carried out on various types of images from privately collected and publicly available datasets, including OPEN-ALPR (BR, EU, US), which consists of 115 Brazilian, 108 European, and 222 North American images; CENPARMI, which includes 440 images from China, the US, and different provinces of Canada; and UFPR-ALPR, which includes 4,500 Brazilian license plate images. These datasets present several challenges: single to multiple vehicles per image, license plates of different countries, vehicles at different distances, and images taken by several types of cameras, including cellphone cameras.
    Our experimental results show that the proposed system achieves 98.04% average accuracy on the OPEN-ALPR dataset, 88.5% on the more challenging CENPARMI dataset, and 97.42% on the UFPR-ALPR dataset, outperforming state-of-the-art commercial and academic systems
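The detection-then-recognition pipeline described above can be sketched as a small driver that chains the stages: detect vehicles, localize the plate inside each vehicle crop, then read the characters. The three model functions below are hypothetical stubs standing in for the modified YOLO v2 detectors and OCR-Net, not the thesis's actual implementations.

```python
# Skeleton of a two-stage LPR pipeline. The detector/recognizer
# callables are hypothetical placeholders for trained models.

from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)

@dataclass
class PlateResult:
    vehicle_box: Box
    plate_box: Box   # coordinates relative to the vehicle crop
    text: str

def crop(image, box: Box):
    """Cut a (height x width) region out of a row-major image."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

def run_pipeline(image,
                 detect_vehicles: Callable,
                 localize_plate: Callable,
                 recognize_text: Callable) -> List[PlateResult]:
    results = []
    for vbox in detect_vehicles(image):      # stage 1: find vehicles
        vehicle = crop(image, vbox)
        pbox = localize_plate(vehicle)       # stage 2: find the plate
        plate = crop(vehicle, pbox)
        results.append(PlateResult(vbox, pbox, recognize_text(plate)))
    return results
```

Detecting the vehicle first shrinks the search region for the plate, which is one reason staged detectors tolerate multi-vehicle scenes and varied distances.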

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals, methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications able to operate in real-world environments, like mobile communication services and smart homes

    End-to-end Lip-reading: A Preliminary Study

    Deep lip-reading combines the domains of computer vision and natural language processing: it uses deep neural networks to extract speech from silent videos. Most works in lip-reading use a multi-stage training approach due to the complex nature of the task. A single-stage, end-to-end, unified training approach, which is an ideal of machine learning, is also the goal in lip-reading. However, pure end-to-end systems have not yet been able to perform as well as non-end-to-end systems. Some exceptions to this are the very recent Temporal Convolutional Network (TCN) based architectures. This work lays out a preliminary study of deep lip-reading, with a special focus on various end-to-end approaches. The research aims to test whether a purely end-to-end approach is justifiable for a task as complex as deep lip-reading. To achieve this, the meaning of pure end-to-end is first defined and several lip-reading systems that follow the definition are analysed. The system that most closely matches the definition is then adapted for pure end-to-end experiments. Four main contributions have been made: i) an analysis of 9 different end-to-end deep lip-reading systems; ii) the creation and public release of a pipeline to adapt the sentence-level Lipreading Sentences in the Wild 3 (LRS3) dataset into word level; iii) pure end-to-end training of a TCN-based network and evaluation on the LRS3 word-level dataset as a proof of concept; iv) a public online portal to analyse visemes and experiment with live end-to-end lip-reading inference. The study verifies that pure end-to-end is a sensible approach and an achievable goal for deep machine lip-reading
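The core building block of the TCN architectures mentioned above is the causal dilated convolution: each output depends only on the current and past inputs, and doubling the dilation at each layer grows the receptive field exponentially. A minimal, dependency-free sketch (weights here are arbitrary illustrations, not trained parameters):

```python
# Causal dilated 1D convolution, the basic TCN operation. The input is
# left-padded with zeros so output[t] sees only x[t], x[t-d], x[t-2d], ...

def causal_dilated_conv(x, weights, dilation):
    """weights[-1] multiplies the current sample, weights[0] the oldest."""
    k = len(weights)
    pad = (k - 1) * dilation
    xp = [0.0] * pad + list(x)
    return [sum(w * xp[t + i * dilation] for i, w in enumerate(weights))
            for t in range(len(x))]

def tcn_forward(x, layers):
    """Stack of causal conv layers with ReLU; dilation doubles per layer."""
    out = x
    for depth, w in enumerate(layers):
        out = causal_dilated_conv(out, w, 2 ** depth)
        out = [max(0.0, v) for v in out]   # ReLU
    return out
```

With kernel size k and L layers, the receptive field is 1 + (k - 1)(2^L - 1) frames, which is why TCNs can model long lip-movement context without recurrence.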

    Investigating face perception in humans and DCNNs

    This thesis aims to compare the strengths and weaknesses of AI and humans performing face identification tasks, and to use recent advances in machine learning to develop new techniques for understanding face identity processing. A better understanding of the underlying processing differences between Deep Convolutional Neural Networks (DCNNs) and humans can help improve the ways in which AI technology is used to support human decision-making and deepen understanding of face identity processing in humans and DCNNs. In Chapter 2, I test how the accuracy of humans and DCNNs is affected by image quality and find that humans and DCNNs are affected differently. This has important applied implications, for example when identifying faces from poor-quality imagery in police investigations, and also points to different processing strategies used by humans and DCNNs. Given these diverging processing strategies, in Chapter 3 I investigate the potential for human and DCNN decisions to be combined in face identification decisions. I find a large overall benefit of 'fusing' algorithm and human face identity judgments, and that this benefit depends on the idiosyncratic accuracy and response patterns of the particular DCNNs and humans in question. This points to new, optimal ways that individual humans and DCNNs can be aggregated to improve the accuracy of face identity decisions in applied settings. Building on my background in computer vision, in Chapters 4 and 5 I then aim to better understand face information sampling by humans using a novel combination of eye-tracking and machine-learning approaches. In Chapter 4, I develop exploratory methods for studying individual differences in face information sampling strategies. This reveals differences in the way that 'super-recognisers' sample face information compared to typical viewers.
I then use DCNNs to assess the computational value of the face information sampled by these two groups of human observers, finding that sampling by 'super-recognisers' contains more computationally valuable face identity information. In Chapter 5, I develop a novel approach to measuring fixations to people in unconstrained natural settings by combining wearable eye-tracking technology with face and body detection algorithms. Together, these new approaches provide novel insight into individual differences in face information sampling, both when looking at faces in lab-based tasks performed on computer monitors and when looking at faces 'in the wild'
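One simple way to 'fuse' human and algorithm face identity judgments, as studied in Chapter 3, is to put each rater's same-identity scores on a common scale and average them. The min-max normalization below is an assumed choice for illustration, not the thesis's actual fusion rule:

```python
# Fuse two raters' same-identity ratings (e.g. one human, one DCNN)
# by min-max normalizing each rater to [0, 1], then averaging.
# Normalization choice is an assumption made for this sketch.

def fuse(human_scores, dcnn_scores):
    def norm(scores):
        lo, hi = min(scores), max(scores)
        if hi == lo:                 # degenerate: rater gave one value
            return [0.5] * len(scores)
        return [(v - lo) / (hi - lo) for v in scores]
    h, d = norm(human_scores), norm(dcnn_scores)
    return [(a + b) / 2 for a, b in zip(h, d)]
```

When the two raters make uncorrelated errors, the averaged score tends to sit closer to the correct decision boundary than either rater alone, which is the intuition behind the fusion benefit.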

    End-to-End Deep Lip-reading: A Preliminary Study

    Deep lip-reading is the use of deep neural networks to extract speech from silent videos. Most works in lip-reading use a multi-stage training approach due to the complex nature of the task. A single-stage, end-to-end, unified training approach, which is an ideal of machine learning, is also the goal in lip-reading. However, pure end-to-end systems have so far failed to perform as well as non-end-to-end systems. Some exceptions to this are the very recent Temporal Convolutional Network (TCN) based architectures (Martinez et al., 2020; Martinez et al., 2021). This work lays out a preliminary study of deep lip-reading, with a special focus on various end-to-end approaches. The research aims to test whether a purely end-to-end approach is justifiable for a task as complex as deep lip-reading. To achieve this, the meaning of pure end-to-end is first defined and several lip-reading systems that follow the definition are analysed. The system that most closely matches the definition is then adapted for pure end-to-end experiments. We make four main contributions: i) an analysis of 9 different end-to-end deep lip-reading systems; ii) the creation and public release of a pipeline to adapt the sentence-level Lipreading Sentences in the Wild 3 (LRS3) dataset into word level; iii) pure end-to-end training of a TCN-based network and evaluation on the LRS3 word-level dataset as a proof of concept; iv) a public online portal to analyse visemes and experiment with live end-to-end lip-reading inference. The study verifies that pure end-to-end is a sensible approach and an achievable goal for deep machine lip-reading

    Temporally Varying Weight Regression for Speech Recognition

    Ph.D. thesis (Doctor of Philosophy)

    Hyperspectral Image Classification -- Traditional to Deep Models: A Survey for Future Prospects

    Hyperspectral Imaging (HSI) has been extensively utilized in many real-life applications because it benefits from the detailed spectral information contained in each pixel. Notably, the complex characteristics of HSI data, i.e., the nonlinear relation between the captured spectral information and the corresponding object, make accurate classification challenging for traditional methods. In the last few years, Deep Learning (DL) has been substantiated as a powerful feature extractor that effectively addresses the nonlinear problems arising in a number of computer vision tasks. This has prompted the deployment of DL for HSI classification (HSIC), which has shown good performance. This survey presents a systematic overview of DL for HSIC and compares state-of-the-art strategies on the topic. Primarily, we encapsulate the main challenges of traditional machine learning for HSIC and then introduce the superiority of DL in addressing these problems. The survey breaks down the state-of-the-art DL frameworks into spectral-feature, spatial-feature, and joint spatial-spectral-feature approaches to systematically analyze the achievements of these frameworks for HSIC, along with future research directions. Moreover, we consider the fact that DL requires a large number of labeled training examples, whereas acquiring such a number for HSIC is challenging in terms of time and cost. Therefore, this survey also discusses strategies to improve the generalization performance of DL methods, which can provide some future guidelines
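The spectral vs. spatial-spectral distinction drawn in the survey shows up directly in input preparation: spectral-feature models classify each pixel from its band vector alone, while spatial-spectral models feed a small neighborhood cube around the pixel to a network. A minimal sketch of that patch extraction, assuming replicate padding at image borders (a common but not universal choice):

```python
# Extract a size x size spatial-spectral patch around pixel (i, j) of a
# hyperspectral cube stored as cube[row][col] = list of band values.
# Border pixels are handled by clamping indices (replicate padding).

def extract_patch(cube, i, j, size):
    h, w = len(cube), len(cube[0])
    r = size // 2
    return [[cube[min(max(i + di, 0), h - 1)][min(max(j + dj, 0), w - 1)]
             for dj in range(-r, r + 1)]
            for di in range(-r, r + 1)]

def spectral_feature(cube, i, j):
    """Spectral-only input: just the pixel's band vector."""
    return cube[i][j]
```

A spectral model would consume `spectral_feature` vectors (shape: bands), whereas a spatial-spectral model consumes `extract_patch` cubes (shape: size x size x bands), which is why the latter can exploit neighborhood context.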