164 research outputs found
Recommended from our members
An Entropy-based Assessment of the Unicode Encoding for Tibetan
This paper presents an analysis of the Unicode encoding scheme for Tibetan from the standpoint of morpheme entropy. We can speak of two levels of entropy in Tibetan: syllable entropy (a measure of the probability of the sequential occurrence of syllables), and morpheme entropy (a measure of the probability of the sequential occurrence of characters or morphemes), the latter being a measure of the redundancy of the language. Syllable entropy is a purely statistical calculation that is a function of the domain of the literature sampled, while morpheme entropy, we show, is relatively domain-independent given a statistically significant sample. Morpheme entropy can be calculated statistically, though a theoretical upper bound can also be postulated based on language-dependent morphology rules. This paper presents both theoretical and statistical estimates of the morpheme entropy for Tibetan, and explores the Tibetan Unicode encoding scheme in relation to data compression, and other issues analyzed in light of entropy-based language modeling
Proposal for encoding the Meitei Mayek script in the BMP of the UCS
This is the penultimate proposal to encode the Meetei Mayek script (also spelled Meitei Mayek) in the international character encoding standard Unicode. The main, modern repertoire was published in Unicode Standard version 5.2 in October 2009. A set of characters to represent historical orthographies was published later, in Unicode 6.1 in January 2012. This proposal included both the modern repertoire and the set of historical characters. However, later proposals split the repertoire into two documents: for the modern characters, an
The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries
The first alphabetized dictionary of Tibetan appeared in 1829 (cf. Bray 2008) and the intervening 184 years have witnessed the publication of scores of other Tibetan dictionaries (cf. Simon 1964). Hundreds of Tibetan dictionaries are now available; these include bilingual dictionaries, both to and from such languages as English, French, German, Latin, Japanese, etc. and specialized dictionaries focusing on medicine, plants, dialects, archaic terms, neologisms, etc. (cf. Walter 2006, McGrath 2008). However, if one classifies Tibetan dictionaries by the methods of their compilation the accomplishments of Tibetan lexicography are less impressive. Methodologies of dictionary compilation divide heuristically into three types. First, some dictionaries lack explicit methodology; these works assemble words in an ad hoc manner and illustrate them with invented examples. Second, there are dictionaries that are compiled over very long periods of time on the basis of collections of slips recording attestations of words as used in context. Third, more recent dictionaries are compiled on the basis of electronic text corpora, which are processed computationally to aid in the precision, consistency and speed of dictionary compilation. These methods may be called respectively the 'informal method', the 'traditional method', and the 'modern method'. The overwhelming majority of Tibetan dictionaries were compiled with the informal method. Only five Tibetan dictionaries use the traditional methodology. No Tibetan dictionary yet compiled makes use of the modern method
Two Proposals for Critically Editing the Texts of the rNying ma'i rGyud 'bum
Revue d’Etudes Tibétaines Number 10, April 200
Proposal to Encode the Kaithi Script in ISO/IEC 10646
This is a proposal to encode the Kaithi script in the international character encoding standard Unicode. The script was published in Unicode Standard version 5.2 in October 2009. The script was used for administrative communication from at least the 19th century until the early 20th century to write Bhojpuri, Magahi, Awadhi, Maithili, Urdu, and other languages related to Hindi. It was also used in religious and literary materials, to record commercial transactions, and in correspondence and personal communication. The Kaithi script was eventually supplanted by Devanagari
Proceedings of the fifth International Conference on Asian Geolinguistics
This volume contains papers presented at the fifth International Conference on Asian Geolinguistics (ICAG) held at the University of Social Sciences and Humanities, VNU, Ha Noi, Vietnam, from 4 to 5 May, 2023
The internal structure of compounds: a phase account of aphasia
This study uses aphasia to support a phase-based derivation of compounds. Our research is nestled within the overarching and truly foundational debate between holists (Butterworth 1983, Bybee 2001, Starosta to appear) and atomists (Taft and Forster 1975, Rastle et al. 2004, Fiorentino and Poeppel 2007). The former camp maintains that compounds are stored devoid of any internal morphological structure, while the latter insists that compounds are derived by concatenation of constituent parts. Morphophonological analysis of the contrasting behaviour of simplex and compound words in Dinka and English (based on Kaye 1995) bears a striking similarity to derivation by phase (Chomsky 2001) (cf. Newell and Piggott 2006, Newell and Scheer 2008, Scheer 2008, forth.). To confirm this novel phase-based account, contra the holists' null hypothesis, we ran an experiment. We tested an aphasic patient (RC), who produced high error rates with trisyllabic simplex words and negligible error rates with disyllabic simplex words. The divisive question: what would trisyllabic compounds pattern with? The surface-inclined holists predict they should pattern with the long simplex words; conversely, the atomists, for whom a trisyllabic compound will be processed as either [[σ σ] [σ]] or [[σ] [σ σ]], predict they should pattern with the short simplex words. The latter prediction turns out to be correct. Our experiment shows that a compound is derived by independently sending its constituent parts to spell-out; once there, the constituent parts are no longer accessible to grammatical operations.
A (Presumably Chinese) tantric scripture and its Japanese exegesis: the Yuqi Jing 瑜祇經 and the practices of the Yogin
The Yuqi jing [Sūtra of the Yogin] is often listed as one of the most important scriptures of Tantric Buddhism in East Asia, but its content and contribution to the esoteric system have so far been little understood. Traditionally regarded as a translation by Vajrabodhi, it was probably compiled in China in the late eighth century. The role that it played in Chinese Buddhism, however, remains unclear. In medieval Japan on the other hand, the scripture appears to have been rediscovered and enjoyed great fortunes. Medieval interpreters intervened on the text by articulating novel conceptual associations, often expressed through curious imagery. At the same time, a new type of initiatory abhiṣeka informed by the sūtra emerged, which engendered a distinctive discourse on the yogic identities pursued by a tantric practitioner. What spurred such sudden interest in the Yuqi jing in medieval Japan? What did Japanese exegetes read into the text? This article addresses these issues by exploring ‘canonical’ commentaries and unpublished initiatory documents that have recently come to light in temple archives