Research and Implementation of Statistics-Based Chinese Part-of-Speech Tagging Methods

Abstract

In recent years, with advances in computer technology and the growing number of available corpora, statistics-based natural language processing has become an active research area in computational linguistics. Part-of-speech tagging is an important practical problem in many application areas and a fundamental topic in natural language processing, so research on tagging methods has both practical and theoretical significance. This thesis examines statistics-based Chinese part-of-speech tagging from several angles and implements an automatic Chinese part-of-speech tagging system.

The thesis first analyzes the characteristics of Chinese words that belong to more than one part of speech, and discusses the criteria for classifying Chinese words and the issues involved in selecting a tag set. It then presents the n-gram language model used for part-of-speech tagging and analyzes and describes the Viterbi tagging algorithm, which is based on dynamic programming.

Next, the thesis studies statistics-based Chinese part-of-speech tagging under both supervised and unsupervised training. For supervised training, it first implements a commonly used relative-frequency training and tagging mode, RF_Basic, and analyzes the nonlinear relationship between training-set size and tagging accuracy in terms of the structure and numerical changes of the tag probability matrix and the lexical probability matrix. Given this nonlinear relationship, and in order to make full use of the training set and raise accuracy, the thesis analyzes the tagging results of RF_Basic and improves the mode in three respects: exploiting word-related grammatical attributes, strengthening the handling of words whose parts of speech are easily mistagged, and strengthening the handling of unknown words. The resulting enhanced supervised mode, RF_Enhenced, improves tagging performance, reaching an accuracy of 96.5% in the closed test and 96% in the open test.

For unsupervised training, there were as yet no experimental reports on this approach in China, so the thesis offers some analysis of unsupervised Chinese part-of-speech tagging. It first introduces the Baum-Welch method for statistically training a hidden Markov model (HMM) and implements an unsupervised training and tagging mode, HMM_Basic. It then examines how the choice of initial model affects tagging performance and discusses the problems that arise in this approach, such as tag set size and the choice of initial model.

Finally, the thesis describes the overall structure of the system and several implementation techniques, including the organization of the word list, the part-of-speech tags, and the classified dictionary, and the handling of sparse matrices.
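
As an illustration of the kind of pipeline the abstract describes, the following is a minimal sketch of relative-frequency training for a bigram tag model followed by Viterbi decoding. It is not the thesis's RF_Basic implementation. The toy corpus, the tag names (r, v, n), the function names, and the add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start marker


def train_relative_frequency(tagged_sentences):
    """Count tag bigrams and word/tag pairs from a hand-tagged corpus."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    tags = sorted(emit)
    vocab = {word for counts in emit.values() for word in counts}
    return trans, emit, tags, vocab


def log_prob(counts, key, n_outcomes):
    # Relative frequency with add-one smoothing (an assumption of this
    # sketch), so unseen events get a small nonzero probability.
    return math.log((counts[key] + 1) / (sum(counts.values()) + n_outcomes))


def viterbi(words, trans, emit, tags, vocab):
    """Return the most probable tag sequence under the bigram model."""
    if not words:
        return []
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unknown words
    scores = {
        t: log_prob(trans[START], t, n_tags) + log_prob(emit[t], words[0], n_words)
        for t in tags
    }
    backpointers = []
    for word in words[1:]:
        new_scores, pointers = {}, {}
        for t in tags:
            # Best previous tag for reaching tag t at this position.
            best_prev = max(
                tags, key=lambda p: scores[p] + log_prob(trans[p], t, n_tags)
            )
            new_scores[t] = (
                scores[best_prev]
                + log_prob(trans[best_prev], t, n_tags)
                + log_prob(emit[t], word, n_words)
            )
            pointers[t] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag.
    path = [max(tags, key=lambda t: scores[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy corpus (hypothetical): 研究 appears once as a noun and once as a verb.
corpus = [
    [("我", "r"), ("爱", "v"), ("研究", "n")],
    [("我", "r"), ("研究", "v"), ("语言", "n")],
]
model = train_relative_frequency(corpus)
print(viterbi(["我", "研究", "语言"], *model))
```

The word 研究 is ambiguous between noun and verb in this toy corpus, so its tag is decided by the surrounding tag context rather than by the word alone, which is the multi-category situation the abstract refers to. Scores are kept in log space to avoid underflow on longer sentences.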
