Tracing a Loose Wordhood for Chinese Input Method Engine
Chinese input methods convert sequences in pinyin or another Latin
encoding into Chinese character sentences. For more effective
pinyin-to-character conversion, typical Input Method Engines (IMEs) rely on a
predefined vocabulary that demands regular manual maintenance. To remove
this inconvenient vocabulary setup, this work focuses on
automatic wordhood acquisition, taking into account that Chinese input is a
free human-computer interaction procedure. Instead of strictly defining words,
a loose word likelihood is introduced to measure how likely a character
sequence is to be a user-recognized word when using an IME. An
online algorithm is then proposed to adjust the word likelihood or generate new
words by comparing the user's actual input choice with the algorithm's prediction.
The experimental results show that the proposed solution adapts quickly to
diverse typing behavior and achieves performance approaching that of a
highly optimized IME with a fixed vocabulary.
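The idea of adjusting a word likelihood online can be illustrated with a minimal sketch. The abstract does not give the update rule, so everything here is an assumption: the class name `LooseWordhood`, the `learning_rate` parameter, and the exponential-moving-average update are hypothetical and only show the general shape of comparing the engine's prediction with the user's choice.

```python
from collections import defaultdict


class LooseWordhood:
    """Toy online tracker of how word-like a character sequence is.

    Hypothetical illustration only; not the algorithm from the paper.
    """

    def __init__(self, learning_rate=0.2):
        self.learning_rate = learning_rate
        # Unseen sequences start with likelihood 0.0.
        self.likelihood = defaultdict(float)

    def update(self, predicted, chosen):
        """Compare predicted segments with the segments the user committed."""
        predicted_set, chosen_set = set(predicted), set(chosen)
        for seq in chosen_set:
            # Sequences the user actually chose move toward 1.0
            # (this also creates entries for new words).
            current = self.likelihood[seq]
            self.likelihood[seq] = current + self.learning_rate * (1.0 - current)
        for seq in predicted_set - chosen_set:
            # Sequences predicted but rejected by the user decay toward 0.0.
            current = self.likelihood[seq]
            self.likelihood[seq] = current - self.learning_rate * current


tracker = LooseWordhood(learning_rate=0.5)
# The engine predicted two single characters; the user chose one word.
tracker.update(predicted=["北", "京"], chosen=["北京"])
tracker.update(predicted=["北", "京"], chosen=["北京"])
print(tracker.likelihood["北京"])
```

Under this assumed rule, a sequence that the user keeps choosing gains likelihood even if it was never in any vocabulary, which is the behavior the abstract describes as generating new words.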
SinSpell: A Comprehensive Spelling Checker for Sinhala
We have built SinSpell, a comprehensive spelling checker for the Sinhala
language, which is spoken by over 16 million people, mainly in Sri Lanka.
Until recently, Sinhala had no spelling checker with acceptable
coverage. SinSpell is still the only open source Sinhala spelling checker.
SinSpell identifies possible spelling errors and suggests corrections. It also
contains a module which auto-corrects evident errors. To maintain accuracy,
SinSpell was designed as a rule-based system based on Hunspell. A set of words
was compiled from several sources and verified. These were divided into
morphological classes, and the valid roots, suffixes and prefixes for each
class were identified, together with lists of irregular words and exceptions.
The errors in a corpus of Sinhala documents were analysed and commonly
misspelled words and types of common errors were identified. We found that the
most common errors were in vowel length and similar sounding letters. Errors
due to incorrect typing and encoding were also found. This analysis was used to
develop the suggestion generator and auto-corrector.
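A suggestion generator driven by vowel-length and similar-sounding-letter confusions might look roughly like the sketch below. This is not SinSpell's implementation: the `CONFUSABLE` groups, the `suggest` function, and the one-word lexicon are assumptions chosen for illustration, and a real system would use the verified word list and morphological rules described above.

```python
from itertools import product

# Assumed confusion groups (illustrative, not taken from SinSpell):
# similar-sounding consonants and short/long vowel signs.
CONFUSABLE = [
    ("න", "ණ"),
    ("ල", "ළ"),
    ("ි", "ී"),
    ("ු", "ූ"),
]

# Map each character to the group of characters it may be confused with.
ALTERNATIVES = {ch: group for group in CONFUSABLE for ch in group}


def suggest(word, lexicon):
    """Return lexicon words reachable by swapping confusable characters."""
    # Each position offers either the original character or its group.
    options = [ALTERNATIVES.get(ch, (ch,)) for ch in word]
    candidates = ("".join(combo) for combo in product(*options))
    return sorted(c for c in candidates if c in lexicon and c != word)


toy_lexicon = {"ගණන"}  # hypothetical one-word lexicon
print(suggest("ගනන", toy_lexicon))
```

The candidate set grows exponentially with the number of confusable characters in a word, so this brute-force enumeration is only reasonable for short words; it is meant to show the idea, not an efficient design.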
Chinese Spelling Error Detection Using a Fusion Lattice LSTM
Spelling error detection serves as a crucial preprocessing step in many natural
language processing applications. Due to the characteristics of the Chinese
language, Chinese spelling error detection is more challenging than error
detection in English. Existing methods mainly follow a pipeline framework,
which artificially divides the error detection process into two steps. As a
result, these methods suffer from error propagation and cannot always work well
given the complexity of the language environment. In addition, existing methods
adopt only character or word information, and ignore the positive effect of fusing
character, word and pinyin information together. We propose an LF-LSTM-CRF model,
which is an extension of the LSTM-CRF with word lattices and
character-pinyin-fusion inputs. Our model takes advantage of the end-to-end
framework to detect errors as a whole process, and dynamically integrates
character, word and pinyin information. Experiments on the SIGHAN data show
that our LF-LSTM-CRF consistently outperforms existing methods with similar
external resources, and confirm the feasibility of adopting the end-to-end
framework and the viability of integrating character, word and pinyin
information.
Comment: 8 pages, 5 figures
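To make the term "word lattice" concrete, the sketch below shows only the lexicon-matching step that produces lattice spans over a character sequence. It is an assumption-laden illustration: the function name `build_lattice`, the `max_len` parameter, and the small lexicon are hypothetical, and the neural model itself (LSTM, CRF, pinyin fusion) is not reproduced here.

```python
def build_lattice(sentence, lexicon, max_len=4):
    """Return (start, end, word) spans for lexicon words of 2+ characters.

    Spans use Python slice indices, so sentence[start:end] == word.
    """
    spans = []
    for start in range(len(sentence)):
        # Try every substring of length 2..max_len beginning at `start`.
        longest_end = min(len(sentence), start + max_len)
        for end in range(start + 2, longest_end + 1):
            word = sentence[start:end]
            if word in lexicon:
                spans.append((start, end, word))
    return spans


# Hypothetical lexicon; overlapping matches are kept on purpose.
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
for span in build_lattice("南京市长江大桥", lexicon):
    print(span)
```

Keeping all overlapping matches, rather than committing to one segmentation, is what lets a lattice-based model avoid the early segmentation errors that a pipeline approach would propagate.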