New Perspectives in Sinographic Language Processing Through the Use of Character Structure
Chinese characters have a complex and hierarchical graphical structure
carrying both semantic and phonetic information. We use this structure to
enhance the text model and obtain better results in standard NLP operations.
First of all, to tackle the problem of graphical variation we define
allographic classes of characters. Next, the relation of inclusion of a
subcharacter in a character provides us with a directed graph of allographic
classes. We provide this graph with two weights: semanticity (semantic relation
between subcharacter and character) and phoneticity (phonetic relation) and
calculate "most semantic subcharacter paths" for each character. Finally,
by adding the information contained in these paths to unigrams, we claim to
increase the efficiency of text mining methods. We evaluate our method on a
text classification task on two corpora (Chinese and Japanese) of a total of 18
million characters and get an improvement of 3% on an already high baseline of
89.6% precision, obtained by a linear SVM classifier. Other possible
applications and perspectives of the system are discussed.
Comment: 17 pages, 5 figures, presented at CICLing 201
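The path idea in this abstract can be illustrated with a minimal sketch. It assumes one plausible reading of "most semantic subcharacter path": starting from a character, repeatedly follow the inclusion edge with the highest semanticity until a class with no subcharacters is reached. The toy graph, its weights, the `sub:` feature prefix, and the function names are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical inclusion graph: class -> [(subcharacter class, semanticity)].
# Decompositions and weights are illustrative only, not the paper's data.
Graph = Dict[str, List[Tuple[str, float]]]

INCLUSION: Graph = {
    "媽": [("女", 0.9), ("馬", 0.1)],
    "嬤": [("女", 0.8), ("麼", 0.2)],
    "麼": [("麻", 0.3), ("幺", 0.4)],
    "女": [],
    "馬": [],
    "麻": [],
    "幺": [],
}


def most_semantic_path(char: str, graph: Graph) -> List[str]:
    """Greedily follow the highest-semanticity edge (assumes an acyclic graph)."""
    path = [char]
    current = char
    while graph.get(current):
        # Pick the subcharacter whose edge weight (semanticity) is largest.
        current, _ = max(graph[current], key=lambda edge: edge[1])
        path.append(current)
    return path


def enrich_unigrams(text: str, graph: Graph) -> List[str]:
    """Emit each character unigram followed by features from its path."""
    features: List[str] = []
    for ch in text:
        features.append(ch)
        features.extend(f"sub:{s}" for s in most_semantic_path(ch, graph)[1:])
    return features
```

Under these assumptions, `enrich_unigrams("媽馬", INCLUSION)` yields the two unigrams plus one extra feature for the semantic component of the first character. The resulting feature list could then be fed to any bag-of-features classifier, such as the linear SVM mentioned in the abstract.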
Research on Event Extraction Model Based on Semantic Features of Chinese Words
Event Extraction (EE) is an important task in Natural Language Understanding (NLU). Owing to the complexity of Chinese structure, Chinese EE is more difficult than English EE. Based on the characteristics of Chinese, this paper designs a Semantic-GRU (Sem-GRU) model that integrates Chinese word context semantics, Chinese word glyph semantics, and Chinese word structure semantics, and applies it to the Chinese Event Trigger Extraction (ETE) task. The model is evaluated on two tasks: ETE and Named Entity Recognition (NER). For ETE, the paper compares against existing research on the ACE 2005 Chinese event dataset, where the model reaches 75.8%. For NER, on the MSRA dataset, it reaches 90.3%, better than other models
SCML: A Structural Representation for Chinese Characters
Chinese characters are used daily by well over a billion people. They constitute the main writing system of China and Taiwan, form a major part of written Japanese, and are also used in South Korea. Anything more than a cursory glance at these characters will reveal a high degree of structure to them, but computing systems do not currently have a means to operate on this structure. Existing character databases and dictionaries treat them as numerical code points, and associate with them additional 'hand-computed' data, such as stroke count, stroke order, and other information to aid in specific searches. Searching by a character's 'shape' is effectively impossible in these systems. I propose a new approach to representing these characters, through an XML-based language called SCML. This language, by encoding an abstract form of a character, allows the direct retrieval of important information such as stroke count and stroke order, and permits useful but previously impossible automated analysis of characters. In addition, the system allows the design of a view that takes abstract SCML representations as character models and outputs glyphs based on an aesthetic, facilitating the creation of 'meta-fonts' for Chinese characters. Finally, through the creation of a specialized database, SCML allows for efficient structural character queries to be performed against the body of inserted characters, thus allowing people to search by the most obvious of a character's characteristics: its shape
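To make the idea of deriving stroke data from a structural encoding concrete, here is a minimal sketch. The XML vocabulary below (`character`, `component`, `stroke`, and the `layout` and `type` attributes) is a hypothetical stand-in invented for illustration; it is not the actual SCML schema, which the abstract does not describe. The sketch only shows that stroke count, stroke order, and a simple shape query can be computed from a nested description rather than stored by hand.

```python
import xml.etree.ElementTree as ET
from typing import List, Sequence

# Hypothetical markup, NOT real SCML: a character as nested components
# whose leaves are strokes listed in writing order.
DOC = """
<character name="example">
  <component layout="left">
    <stroke type="horizontal"/>
    <stroke type="vertical"/>
  </component>
  <component layout="right">
    <stroke type="dot"/>
    <component layout="bottom">
      <stroke type="hook"/>
      <stroke type="horizontal"/>
    </component>
  </component>
</character>
"""


def stroke_order(root: ET.Element) -> List[str]:
    """Stroke types in document order (depth-first traversal of the tree)."""
    return [el.get("type", "") for el in root.iter("stroke")]


def stroke_count(root: ET.Element) -> int:
    """Stroke count is derived from the structure, not stored separately."""
    return len(stroke_order(root))


def has_component_with(root: ET.Element, stroke_types: Sequence[str]) -> bool:
    """True if some component's direct strokes match the given sequence."""
    wanted = list(stroke_types)
    for comp in root.iter("component"):
        direct = [s.get("type", "") for s in comp.findall("stroke")]
        if direct == wanted:
            return True
    return False


ROOT = ET.fromstring(DOC.strip())
```

A structural database of the kind the abstract mentions would presumably index such trees so that queries like `has_component_with` do not require scanning every character; that indexing is outside the scope of this sketch.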
Recognition of off-line handwritten cursive text
The author presents novel algorithms to design unconstrained handwriting
recognition systems organized in three parts:
In Part One, novel algorithms are presented for processing of Arabic text prior to
recognition. Algorithms are described to convert a thinned image of a stroke to a straight
line approximation. Novel heuristic algorithms and novel theorems are presented to
determine start and end vertices of an off-line image of a stroke. A straight line
approximation of an off-line stroke is converted to a one-dimensional representation by
a novel algorithm which aims to recover the original sequence of writing. The resulting
ordering of the stroke segments is a suitable preprocessed representation for subsequent
handwriting recognition algorithms as it helps to segment the stroke. The algorithm was
tested against one data set of isolated handwritten characters and another data set of
cursive handwriting, each provided by 20 subjects, and has been 91.9% and 91.8%
successful for these two data sets, respectively.
In Part Two, an entirely novel fuzzy set-sequential machine character recognition
system is presented. Fuzzy sequential machines are defined to work as recognizers of
handwritten strokes. An algorithm to obtain a deterministic fuzzy sequential machine from
a stroke representation, that is capable of recognizing that stroke and its variants, is
presented. An algorithm is developed to merge two fuzzy machines into one machine. The
learning algorithm is a combination of many described algorithms. The system was tested
against isolated handwritten characters provided by 20 subjects resulting in 95.8%
recognition rate which is encouraging and shows that the system is highly flexible in
dealing with shape and size variations.
In Part Three, an entirely novel text recognition system is also presented, capable of
recognizing off-line handwritten Arabic cursive text of high variability. This system
is an extension of the above recognition system. Tokens are extracted from a one-dimensional
representation of a stroke. Fuzzy sequential machines are defined to work as
recognizers of tokens. It is shown how to obtain a deterministic fuzzy sequential machine
from a token representation that is capable of recognizing that token and its variants. An
algorithm for token learning is presented. The tokens of a stroke are re-combined to
meaningful strings of tokens. Algorithms to recognize and learn token strings are
described. The recognition stage uses algorithms of the learning stage. The process of
extracting the best set of basic shapes which represent the best set of token strings that
constitute an unknown stroke is described. A method is developed to extract lines from
pages of handwritten text, arrange main strokes of extracted lines in the same order as
they were written, and present secondary strokes to main strokes. Presented secondary
strokes are combined with basic shapes to obtain the final characters by formulating and
solving assignment problems for this purpose. Some secondary strokes which remain
unassigned are individually manipulated. The system was tested against the handwritings
of 20 subjects yielding overall subword and character recognition rates of 55.4% and
51.1%, respectively
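The step of combining secondary strokes with basic shapes by solving assignment problems can be sketched as follows. This is an illustration under stated assumptions, not the thesis's formulation: each secondary stroke and each basic shape is reduced to a hypothetical centroid point, the cost is Euclidean distance, each stroke goes to a distinct shape, and the optimum is found by brute force over permutations, which is only practical for a handful of strokes.

```python
from itertools import permutations
from math import dist
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def assign_secondary(
    secondary: List[Point], shapes: List[Point]
) -> Tuple[float, Dict[int, int]]:
    """Return (total cost, {secondary index: shape index}) of minimum cost.

    Assumes a one-to-one assignment and centroid distance as the cost;
    both are illustrative choices, not taken from the thesis.
    """
    if len(secondary) > len(shapes):
        raise ValueError("more secondary strokes than basic shapes")
    best_cost = float("inf")
    best_perm: Tuple[int, ...] = ()
    # Try every way of giving each secondary stroke its own shape.
    for perm in permutations(range(len(shapes)), len(secondary)):
        cost = sum(dist(secondary[i], shapes[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, dict(enumerate(best_perm))
```

For example, with secondary strokes at `(1.0, 2.0)` and `(5.0, 2.0)` and shapes at `(5.0, 0.0)`, `(1.0, 0.0)`, and `(9.0, 0.0)`, each stroke is matched to the shape directly below it. A polynomial-time method such as the Hungarian algorithm would replace the brute-force loop for larger inputs.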
A Sketch-Based Educational System for Learning Chinese Handwriting
Learning Chinese as a Second Language (CSL) is a difficult task for students in English-speaking countries due to the large symbol set and complicated writing techniques. Traditional classroom methods of teaching Chinese handwriting have major drawbacks due to human experts’ bias and the lack of assessment on writing techniques. In this work, we propose a sketch-based educational system to help CSL students learn Chinese handwriting faster and better in a novel way. Our system allows students to draw freehand symbols to answer questions, and uses sketch recognition and AI techniques to recognize, assess, and provide feedback in real time. Results show that the system reaches a recognition accuracy of 86% on novice learners’ inputs, a detection rate above 95% for mistakes in writing techniques, and an F-measure of 80.3% on the classification between expert and novice handwriting inputs
Character Recognition
Character recognition is one of the pattern recognition technologies that are most widely used in practical applications. This book presents recent advances that are relevant to character recognition, from technical topics such as image processing, feature extraction or classification, to new applications including human-computer interfaces. The goal of this book is to provide a reference source for academic research and for professionals working in the character recognition field
Automated Building of Sentence-Level Parallel Corpus and Chinese-Hungarian Dictionary
Decades of work have been conducted on the automated building of parallel corpora and automatic dictionaries in the field of natural language processing. However, few studies have addressed pairs of high-density character-based languages and medium-density word-based languages, owing to the lack of resources and fundamental linguistic differences. In this paper, we describe a methodology for creating a sentence-level parallel corpus and an automatic bilingual dictionary between Chinese (a high-density character-based language) and Hungarian (a medium-density word-based language). This method may be applied to create a Chinese-Hungarian bilingual dictionary for the Sztaki Dictionary project [http://szotar.sztaki.hu/]