    New Perspectives in Sinographic Language Processing Through the Use of Character Structure

    Chinese characters have a complex and hierarchical graphical structure carrying both semantic and phonetic information. We use this structure to enhance the text model and obtain better results in standard NLP operations. First, to tackle the problem of graphical variation, we define allographic classes of characters. Next, the relation of inclusion of a subcharacter in a character provides us with a directed graph of allographic classes. We provide this graph with two weights, semanticity (the semantic relation between subcharacter and character) and phoneticity (the phonetic relation), and calculate "most semantic subcharacter paths" for each character. Finally, by adding the information contained in these paths to unigrams, we claim to increase the efficiency of text mining methods. We evaluate our method on a text classification task on two corpora (Chinese and Japanese) totalling 18 million characters and get an improvement of 3% on an already high baseline of 89.6% precision, obtained by a linear SVM classifier. Other possible applications and perspectives of the system are discussed. Comment: 17 pages, 5 figures, presented at CICLing 201

    Research on Event Extraction Model Based on Semantic Features of Chinese Words

    Event Extraction (EE) is an important task in Natural Language Understanding (NLU). Because of the complexity of Chinese structure, Chinese EE is more difficult than English EE. Based on the characteristics of Chinese, this paper designs a Semantic-GRU (Sem-GRU) model, which integrates Chinese word context semantics, Chinese word glyph semantics and Chinese word structure semantics, and applies the model to the Chinese Event Trigger Extraction (ETE) task. The model is evaluated on two tasks: ETE and Named Entity Recognition (NER). In ETE, the paper uses the ACE 2005 Chinese event dataset to compare against existing research, and the result reaches 75.8%. In NER, the paper uses the MSRA dataset, on which the model reaches 90.3%, better than other models.

    SCML: A Structural Representation for Chinese Characters

    Chinese characters are used daily by well over a billion people. They constitute the main writing system of China and Taiwan, form a major part of written Japanese, and are also used in South Korea. Anything more than a cursory glance at these characters will reveal a high degree of structure to them, but computing systems do not currently have a means to operate on this structure. Existing character databases and dictionaries treat them as numerical code points, and associate with them additional 'hand-computed' data, such as stroke count, stroke order, and other information to aid in specific searches. Searching by a character's 'shape' is effectively impossible in these systems. I propose a new approach to representing these characters, through an XML-based language called SCML. This language, by encoding an abstract form of a character, allows the direct retrieval of important information such as stroke count and stroke order, and permits useful but previously impossible automated analysis of characters. In addition, the system allows the design of a view that takes abstract SCML representations as character models and outputs glyphs based on an aesthetic, facilitating the creation of 'meta-fonts' for Chinese characters. Finally, through the creation of a specialized database, SCML allows for efficient structural character queries to be performed against the body of inserted characters, thus allowing people to search by the most obvious of a character's characteristics: its shape.

    Recognition of off-line handwritten cursive text

    The author presents novel algorithms to design unconstrained handwriting recognition systems, organized in three parts. In Part One, novel algorithms are presented for processing of Arabic text prior to recognition. Algorithms are described to convert a thinned image of a stroke to a straight line approximation. Novel heuristic algorithms and novel theorems are presented to determine start and end vertices of an off-line image of a stroke. A straight line approximation of an off-line stroke is converted to a one-dimensional representation by a novel algorithm which aims to recover the original sequence of writing. The resulting ordering of the stroke segments is a suitable preprocessed representation for subsequent handwriting recognition algorithms, as it helps to segment the stroke. The algorithm was tested against one data set of isolated handwritten characters and another data set of cursive handwriting, each provided by 20 subjects, and was 91.9% and 91.8% successful for these two data sets, respectively. In Part Two, an entirely novel fuzzy set-sequential machine character recognition system is presented. Fuzzy sequential machines are defined to work as recognizers of handwritten strokes. An algorithm is presented to obtain, from a stroke representation, a deterministic fuzzy sequential machine that is capable of recognizing that stroke and its variants. An algorithm is developed to merge two fuzzy machines into one machine. The learning algorithm is a combination of many described algorithms. The system was tested against isolated handwritten characters provided by 20 subjects, resulting in a 95.8% recognition rate, which is encouraging and shows that the system is highly flexible in dealing with shape and size variations. In Part Three, an entirely novel text recognition system is also presented, capable of recognizing off-line handwritten Arabic cursive text with high variability. This system is an extension of the above recognition system. Tokens are extracted from a one-dimensional representation of a stroke. Fuzzy sequential machines are defined to work as recognizers of tokens. It is shown how to obtain a deterministic fuzzy sequential machine from a token representation that is capable of recognizing that token and its variants. An algorithm for token learning is presented. The tokens of a stroke are re-combined to meaningful strings of tokens. Algorithms to recognize and learn token strings are described. The recognition stage uses algorithms of the learning stage. The process of extracting the best set of basic shapes which represent the best set of token strings that constitute an unknown stroke is described. A method is developed to extract lines from pages of handwritten text, arrange main strokes of extracted lines in the same order as they were written, and present secondary strokes to main strokes. Presented secondary strokes are combined with basic shapes to obtain the final characters by formulating and solving assignment problems for this purpose. Some secondary strokes which remain unassigned are individually manipulated. The system was tested against the handwritings of 20 subjects, yielding overall subword and character recognition rates of 55.4% and 51.1%, respectively.

    Feature Extraction Methods for Character Recognition

    Abstract not included

    A Sketch-Based Educational System for Learning Chinese Handwriting

    Learning Chinese as a Second Language (CSL) is a difficult task for students in English-speaking countries due to the large symbol set and complicated writing techniques. Traditional classroom methods of teaching Chinese handwriting have major drawbacks due to human experts’ bias and the lack of assessment of writing techniques. In this work, we propose a sketch-based educational system to help CSL students learn Chinese handwriting faster and better in a novel way. Our system allows students to draw freehand symbols to answer questions, and uses sketch recognition and AI techniques to recognize, assess, and provide feedback in real time. Results show that the system reaches a recognition accuracy of 86% on novice learners’ inputs, a detection rate above 95% for mistakes in writing techniques, and an F-measure of 80.3% on the classification between expert and novice handwriting inputs.

    Character Recognition

    Character recognition is one of the pattern recognition technologies that are most widely used in practical applications. This book presents recent advances that are relevant to character recognition, from technical topics such as image processing, feature extraction or classification, to new applications including human-computer interfaces. The goal of this book is to provide a reference source for academic research and for professionals working in the character recognition field

    Automated Building of Sentence-Level Parallel Corpus and Chinese-Hungarian Dictionary

    Decades of work have been conducted on the automated building of parallel corpora and automatic dictionaries in the field of natural language processing. However, few studies have addressed pairs of high-density character-based languages and medium-density word-based languages, due to the lack of resources and fundamental linguistic differences. In this paper, we describe a methodology for creating a sentence-level parallel corpus and an automatic bilingual dictionary between Chinese (a high-density character-based language) and Hungarian (a medium-density word-based language). This method may be applied to create a Chinese-Hungarian bilingual dictionary for the Sztaki Dictionary project [http://szotar.sztaki.hu/]