An Entropy-based Assessment of the Unicode Encoding for Tibetan
This paper presents an analysis of the Unicode encoding scheme for Tibetan from the standpoint of morpheme entropy. We can speak of two levels of entropy in Tibetan: syllable entropy, a measure of the probability of the sequential occurrence of syllables, and morpheme entropy, a measure of the probability of the sequential occurrence of characters or morphemes; the latter is a measure of the redundancy of the language. Syllable entropy is a purely statistical calculation that depends on the domain of the literature sampled, while morpheme entropy, we show, is relatively domain-independent given a statistically significant sample. Morpheme entropy can be calculated statistically, though a theoretical upper bound can also be postulated from language-dependent morphology rules. This paper presents both theoretical and statistical estimates of the morpheme entropy for Tibetan, and examines the Tibetan Unicode encoding scheme with respect to data compression and other issues, analyzed in light of entropy-based language modeling.
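As a rough illustration of the kind of entropy estimate the abstract describes (not the paper's actual procedure), the sketch below computes a zeroth-order empirical Shannon entropy over a symbol sequence; treating tsheg-delimited syllables as symbols gives a toy version of syllable entropy. The romanized sample text is invented for the example.

```python
from collections import Counter
import math

def shannon_entropy(symbols):
    """Empirical Shannon entropy (bits per symbol) of a sequence.

    Illustrative only: the paper's syllable and morpheme entropies
    concern sequential (conditional) occurrence; this unigram
    estimate is the simplest zeroth-order special case.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Syllable-level toy example: each whitespace-delimited syllable is a symbol
# (invented romanized Tibetan sample, not from the paper).
sample = "bkra shis bde legs bkra shis"
print(round(shannon_entropy(sample.split()), 3))
```

A higher-order (conditional) estimate over a large corpus would be needed to approach the domain-independent morpheme entropy the abstract claims.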
Evaluating Parsing Schemes with Entropy Indicators
This paper introduces an objective metric for evaluating a parsing scheme. It is based on Shannon's original work with letter sequences, which can be extended to part-of-speech tag sequences. It is shown that such a regular language is an inadequate model for natural language, so a representation is used that models language slightly higher in the Chomsky hierarchy. We show how the entropy of parsed and unparsed sentences can be measured. If the entropy of the parsed sentence is lower, this indicates that some of the structure of the language has been captured. We apply this entropy indicator to support one particular parsing scheme that effects a top-down segmentation. This approach could be used to decompose the parsing task into computationally more tractable subtasks. It also lends itself to the extraction of predicate-argument structure.
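A minimal sketch of the comparison the abstract describes, under invented data: estimate a first-order conditional entropy over a part-of-speech tag stream, then over the same stream with hypothesised segment-boundary symbols inserted, and compare the two values. The tag streams, the `|` boundary marker, and the bigram estimator are all assumptions for illustration, not the paper's actual indicator.

```python
from collections import Counter
import math

def conditional_entropy(seq):
    """First-order conditional entropy H(X_n | X_{n-1}) in bits,
    estimated from bigram counts over a symbol sequence
    (symbols here stand in for part-of-speech tags)."""
    bigrams = Counter(zip(seq, seq[1:]))
    prev = Counter(seq[:-1])
    total = sum(bigrams.values())
    h = 0.0
    for (a, b), count in bigrams.items():
        p_ab = count / total          # joint probability of the bigram
        p_b_given_a = count / prev[a] # conditional probability of b after a
        h -= p_ab * math.log2(p_b_given_a)
    return h

# Invented toy streams: an unparsed tag sequence, and the same sequence
# with '|' marking a hypothesised top-down segmentation.
flat = "DT NN VB DT NN DT NN VB DT NN".split()
parsed = "DT NN | VB | DT NN | DT NN | VB | DT NN".split()
print(conditional_entropy(flat), conditional_entropy(parsed))
```

In the paper's terms, a lower value for the segmented stream would indicate that the segmentation has captured some of the language's structure; on a toy sample this small, the estimates are too noisy to be meaningful either way.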
Evaluating parsing schemes with entropy indicators
This paper introduces an objective metric for assessing the effectiveness of a parsing scheme. Information-theoretic indicators can be used to show whether a given scheme captures some of the structure of natural language text. We then use this method to support a proposal to decompose the parsing task into computationally more tractable subtasks.