92 research outputs found

    The Challenges of Recognizing Offline Handwritten Chinese: A Technical Review

    Offline handwritten Chinese recognition is an important research area of pattern recognition. It comprises offline handwritten Chinese character recognition (offline HCCR) and offline handwritten Chinese text recognition (offline HCTR), both of which are closely related to daily life. With new deep learning techniques and their combination with knowledge from other domains, offline handwritten Chinese recognition has achieved breakthroughs in methods and performance in recent years. However, no article has provided a technical review of this field since 2016. In light of this, this paper reviews the research progress and challenges of offline handwritten Chinese recognition from 2016 to 2022, covering traditional techniques, deep learning methods, methods combining deep learning with traditional techniques, and knowledge from other areas. Firstly, it introduces the research background and status of handwritten Chinese recognition, standard datasets, and evaluation metrics. Secondly, a comprehensive summary and analysis of offline HCCR and offline HCTR approaches during the last seven years is provided, along with an explanation of their concepts, specifics, and performances. Finally, the main research problems in this field over the past few years are presented, and the challenges that still exist in offline handwritten Chinese recognition are discussed, with the aim of inspiring future research work.

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.

    Recognizing Visual Object Using Machine Learning Techniques

    Visual Object Recognition (VOR) has received growing interest from researchers and has become a very active area of research due to its vital applications, including handwriting recognition, disease classification, and face identification. However, extracting the relevant features that faithfully describe the image remains the main challenge for most existing VOR systems. This thesis is mainly dedicated to the development of two VOR systems, which are presented as two separate contributions. As a first contribution, we propose a novel generic feature-independent pyramid multilevel (GFIPML) model for extracting features from images. GFIPML addresses the shortcomings of two existing schemes, namely multi-level (ML) and pyramid multi-level (PML), while retaining their advantages. As its name indicates, the proposed model can be used with any of the wide variety of existing feature extraction methods. We applied GFIPML to the task of Arabic literal amount recognition. This task is challenging due to the specific characteristics of Arabic handwriting. While most works in the literature have considered structural features that are sensitive to word deformations, we opt for Local Phase Quantization (LPQ) and Binarized Statistical Image Feature (BSIF), as Arabic handwriting can be considered as texture. To further improve recognition performance, we considered a multimodal system based on the combination of LPQ with multiple BSIF descriptors, each one with a different filter size. As a second contribution, a novel, simple yet efficient and fast TR-ICANet model for extracting features from unconstrained ear images is proposed. To mitigate unconstrained conditions (e.g., scale and pose variations), we first normalize all images using a CNN. The normalized images are then fed to the TR-ICANet model, which uses ICA to learn filters. Binary hashing and block-wise histogramming are then used to compute the local features. At the final stage of TR-ICANet, we use an effective normalization method, namely Tied Rank normalization, in order to eliminate the disparity within block-wise feature vectors. Furthermore, to improve the identification performance of the proposed system, we propose a softmax-average fusion of CNN-based feature extraction approaches with the proposed TR-ICANet at the decision level, using an SVM classifier.
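
    The binary hashing, block-wise histogramming, and rank normalization steps named in this abstract can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: the function names, the zero threshold, the non-overlapping blocks, and the average-rank definition of tied rank normalization are all assumptions made here for illustration, and the filter responses are taken as given rather than learned with ICA.

```python
import numpy as np

def binary_hash(responses):
    """Fuse K filter-response maps of shape (K, H, W) into one integer code map.

    Assumption: each map is binarized at zero and contributes one bit.
    """
    bits = (responses > 0).astype(np.int64)
    weights = 2 ** np.arange(responses.shape[0], dtype=np.int64)
    # Weighted sum over the filter axis; result has shape (H, W).
    return np.tensordot(weights, bits, axes=1)

def blockwise_histograms(codes, block, n_bins):
    """Histogram the codes inside each non-overlapping block x block tile."""
    h, w = codes.shape
    hists = []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            tile = codes[r:r + block, c:c + block].ravel()
            hists.append(np.bincount(tile, minlength=n_bins))
    return np.array(hists, dtype=np.float64)

def tied_rank_normalize(vec):
    """Replace each value by its rank, averaging ranks over ties.

    Assumption: ranks are 1-based and divided by the vector length.
    """
    vec = np.asarray(vec, dtype=np.float64)
    order = np.argsort(vec, kind="stable")
    sorted_vals = vec[order]
    ranks = np.empty(len(vec), dtype=np.float64)
    start = 0
    while start < len(vec):
        end = start
        while end + 1 < len(vec) and sorted_vals[end + 1] == sorted_vals[start]:
            end += 1
        # All tied entries share the mean of their 1-based positions.
        ranks[order[start:end + 1]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks / len(vec)
```

    In this sketch, a feature vector would be obtained by applying `tied_rank_normalize` to each row returned by `blockwise_histograms` and concatenating the results; how the thesis combines the blocks is not stated in the abstract.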

    TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents

    Recently, automatically extracting information from visually rich documents (e.g., tickets and resumes) has become a hot and vital research topic due to its widespread commercial value. Most existing methods divide this task into two subparts: the text reading part, which obtains the plain text from the original document images, and the information extraction part, which extracts key contents. These methods mainly focus on improving the second part, while neglecting that the two parts are highly correlated. This paper proposes a unified end-to-end framework for information extraction from visually rich documents, in which text reading and information extraction can reinforce each other via a well-designed multi-modal context block. Specifically, the text reading part provides multi-modal features such as visual, textual and layout features. The multi-modal context block is developed to fuse the generated multi-modal features, and even the prior knowledge from a pre-trained language model, for better semantic representation. The information extraction part is responsible for generating key contents from the fused context features. The framework can be trained end-to-end, achieving global optimization. Furthermore, we define and group visually rich documents into four categories across two dimensions, the layout and the text type. For each document category, we provide or recommend the corresponding benchmarks, experimental settings and strong baselines, to remedy the lack of a uniform evaluation standard in this research area. Extensive experiments on four kinds of benchmarks (from fixed layout to variable layout, from full-structured text to semi-unstructured text) are reported, demonstrating the proposed method's effectiveness. Data, source code and models are available.

    Character Recognition

    Character recognition is one of the pattern recognition technologies most widely used in practical applications. This book presents recent advances relevant to character recognition, from technical topics such as image processing, feature extraction and classification to new applications including human-computer interfaces. The goal of this book is to provide a reference source for academic research and for professionals working in the character recognition field.

    Deep Learning for Scene Text Detection, Recognition, and Understanding

    Detecting and recognizing text in images is a long-standing task in computer vision. The goal of this task is to extract textual information from images and videos, for example by recognizing license plates. Although great progress has been made in recent years, the task remains challenging due to the wide range of variations in text appearance. In this thesis, we aim to review the existing issues that hinder current Optical Character Recognition (OCR) development and explore potential solutions. Specifically, we first investigate the unfair comparisons between different OCR algorithms caused by the lack of a consistent evaluation framework. The absence of a unified evaluation protocol leads to inconsistent and unreliable results, making it difficult to compare and improve upon existing methods. To tackle this issue, we design a new evaluation framework covering datasets, metrics, and models, enabling consistent and fair comparisons between OCR systems. Another issue in the field is the imbalanced distribution of training samples. In particular, the sample distribution depends largely on where and how the data was collected, and the resulting data bias may lead to poor performance and low generalizability on under-represented classes. To address this problem, we took the driving license plate recognition task as an example and proposed a text-to-image model that is able to synthesize photo-realistic text samples. Using this model, we synthesized more than one million samples to augment the training dataset, significantly improving the generalization capability of OCR models. Additionally, this thesis explores the application of text visual question answering (text VQA), a new and emerging research topic in the OCR community. This task challenges OCR models to understand the relationships between the text and backgrounds and to answer the given questions. In this thesis, we propose to investigate evidence-based text VQA, which involves designing models that can provide reasonable evidence for their predictions, thus improving the generalization ability. Thesis (Ph.D.) -- University of Adelaide, School of Computer and Mathematical Sciences, 202

    Multimodal interaction with mobile devices : fusing a broad spectrum of modality combinations

    This dissertation presents a multimodal architecture for use in mobile scenarios such as shopping and navigation. It also analyses a wide range of feasible modality input combinations for these contexts. For this purpose, two interlinked demonstrators were designed for stand-alone use on mobile devices. Of particular importance was the design and implementation of a modality fusion module capable of combining input from a range of communication modes such as speech, handwriting, and gesture. The implementation is able to account for confidence value biases arising within and between modalities and also provides a method for resolving semantically overlapped input. Tangible interaction with real-world objects and symmetric multimodality are two further themes addressed in this work. The work concludes with the results from two usability field studies that provide insight into user preference and modality intuition for different modality combinations, as well as user acceptance of anthropomorphized objects.
