737 research outputs found

    Chinese information processing

    Full text link
    A survey of the field of Chinese information processing is provided. It covers the following areas: the Chinese writing system, several popular Chinese encoding schemes and code conversions, Chinese keyboard entry methods, Chinese fonts, Chinese operating systems, basic Chinese computing techniques and applications

    Word segmentation for Akkadian cuneiform

    Get PDF
    We present experiments on word segmentation for Akkadian cuneiform, an ancient writing system and a language used for about 3 millennia in the ancient Near East. To our best knowledge, this is the first study of this kind applied to either the Akkadian language or the cuneiform writing system. As a logosyllabic writing system, cuneiform structurally resembles Eastern Asian writing systems, so, we employ word segmentation algorithms originally developed for Chinese and Japanese. We describe results of rule-based algorithms, dictionary-based algorithms, statistical and machine learning approaches. Our results may indicate possible promising steps in cuneiform word segmentation that can create and improve natural language processing in this area

    InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write

    Full text link
    Digital note-taking is gaining popularity, offering a durable, editable, and easily indexable way of storing notes in the vectorized form, known as digital ink. However, a substantial gap remains between this way of note-taking and traditional pen-and-paper note-taking, a practice still favored by a vast majority. Our work, InkSight, aims to bridge the gap by empowering physical note-takers to effortlessly convert their work (offline handwriting) to digital ink (online handwriting), a process we refer to as Derendering. Prior research on the topic has focused on the geometric properties of images, resulting in limited generalization beyond their training domains. Our approach combines reading and writing priors, allowing training a model in the absence of large amounts of paired samples, which are difficult to obtain. To our knowledge, this is the first work that effectively derenders handwritten text in arbitrary photos with diverse visual characteristics and backgrounds. Furthermore, it generalizes beyond its training domain into simple sketches. Our human evaluation reveals that 87% of the samples produced by our model on the challenging HierText dataset are considered as a valid tracing of the input image and 67% look like a pen trajectory traced by a human. Interactive visualizations of 100 word-level model outputs for each of the three public datasets are available in our Hugging Face space: https://huggingface.co/spaces/Derendering/Model-Output-Playground. Model release is in progress

    Character Recognition

    Get PDF
    Character recognition is one of the pattern recognition technologies that are most widely used in practical applications. This book presents recent advances that are relevant to character recognition, from technical topics such as image processing, feature extraction or classification, to new applications including human-computer interfaces. The goal of this book is to provide a reference source for academic research and for professionals working in the character recognition field

    Article Segmentation in Digitised Newspapers

    Get PDF
    Digitisation projects preserve and make available vast quantities of historical text. Among these, newspapers are an invaluable resource for the study of human culture and history. Article segmentation identifies each region in a digitised newspaper page that contains an article. Digital humanities, information retrieval (IR), and natural language processing (NLP) applications over digitised archives improve access to text and allow automatic information extraction. The lack of article segmentation impedes these applications. We contribute a thorough review of the existing approaches to article segmentation. Our analysis reveals divergent interpretations of the task, and inconsistent and often ambiguously defined evaluation metrics, making comparisons between systems challenging. We solve these issues by contributing a detailed task definition that examines the nuances and intricacies of article segmentation that are not immediately apparent. We provide practical guidelines on handling borderline cases and devise a new evaluation framework that allows insightful comparison of existing and future approaches. Our review also reveals that the lack of large datasets hinders meaningful evaluation and limits machine learning approaches. We solve these problems by contributing a distant supervision method for generating large datasets for article segmentation. We manually annotate a portion of our dataset and show that our method produces article segmentations over characters nearly as well as costly human annotators. We reimplement the seminal textual approach to article segmentation (Aiello and Pegoretti, 2006) and show that it does not generalise well when evaluated on a large dataset. We contribute a framework for textual article segmentation that divides the task into two distinct phases: block representation and clustering. We propose several techniques for block representation and contribute a novel highly-compressed semantic representation called similarity embeddings. We evaluate and compare different clustering techniques, and innovatively apply label propagation (Zhu and Ghahramani, 2002) to spread headline labels to similar blocks. Our similarity embeddings and label propagation approach substantially outperforms Aiello and Pegoretti but still falls short of human performance. Exploring visual approaches to article segmentation, we reimplement and analyse the state-of-the-art Bansal et al. (2014) approach. We contribute an innovative 2D Markov model approach that captures reading order dependencies and reduces the structured labelling problem to a Markov chain that we decode with Viterbi (1967). Our approach substantially outperforms Bansal et al., achieves accuracy as good as human annotators, and establishes a new state of the art in article segmentation. Our task definition, evaluation framework, and distant supervision dataset will encourage progress in the task of article segmentation. Our state-of-the-art textual and visual approaches will allow sophisticated IR and NLP applications over digitised newspaper archives, supporting research in the digital humanities

    Advances in Character Recognition

    Get PDF
    This book presents advances in character recognition, and it consists of 12 chapters that cover wide range of topics on different aspects of character recognition. Hopefully, this book will serve as a reference source for academic research, for professionals working in the character recognition field and for all interested in the subject

    The Challenges of Recognizing Offline Handwritten Chinese: A Technical Review

    Get PDF
    Offline handwritten Chinese recognition is an important research area of pattern recognition, including offline handwritten Chinese character recognition (offline HCCR) and offline handwritten Chinese text recognition (offline HCTR), which are closely related to daily life. With new deep learning techniques and the combination with other domain knowledge, offline handwritten Chinese recognition has gained breakthroughs in methods and performance in recent years. However, there have yet to be articles that provide a technical review of this field since 2016. In light of this, this paper reviews the research progress and challenges of offline handwritten Chinese recognition based on traditional techniques, deep learning methods, methods combining deep learning with traditional techniques, and knowledge from other areas from 2016 to 2022. Firstly, it introduces the research background and status of handwritten Chinese recognition, standard datasets, and evaluation metrics. Secondly, a comprehensive summary and analysis of offline HCCR and offline HCTR approaches during the last seven years is provided, along with an explanation of their concepts, specifics, and performances. Finally, the main research problems in this field over the past few years are presented. The challenges still exist in offline handwritten Chinese recognition are discussed, aiming to inspire future research work

    Arabic Typed Text Recognition in Graphics Images (ATTR-GI)

    Get PDF
    While optical character recognition (OCR) techniques may perform well on standard text documents, their performance degrades significantly in graphics images. In standard scanned text documents OCR techniques enjoy a number of convenient assumptions such as clear backgrounds, standard fonts, predefined line orientation, page size, the start point of written. These assumptions are not true in graphics documents such as Arabic advertisements, personal cards, screenshot. Therefore, in such types of images, greater attention is required in the initial stage of detecting Arabic text regions in order for subsequent character recognition steps to be successful. Special features of Arabic alphabet characters introduce additional challenges which are not present in Latin alphabet characters. In this research we propose a new technique for automatically detecting text in graphics documents, and preparing them for OCR processing. Our detection approach is based on some mathematical measurements to know is it a text or not and to know is it Arabic Based Text or Latin Based. These measurements are follows, measure the Base Line (the line has maximum number of black pixels). Also, measure Item Area (the content of extracted sub images). Finally, find maximum peak for the adjacent black pixels in Base line and maximum length for sub adjacent black pixels. Our experiment results will come in more details. We believe our technique will enable OCR systems to overcome their major shortcoming when dealing with text in graphics images. This will further enable a variety of OCR-based applications to extend their operation to graphics documents such as SPAM detection from image, reading advertisement for blind people, search and index document which contain image, enhancing for printer property (black white or color printer) and enhancing OCR
    • …
    corecore