Thematically Reinforced Explicit Semantic Analysis
We present an extended, thematically reinforced version of Gabrilovich and
Markovitch's Explicit Semantic Analysis (ESA), where we obtain thematic
information through the category structure of Wikipedia. For this we first
define a notion of categorical tfidf which measures the relevance of terms in
categories. Using this measure as a weight we calculate a maximal spanning tree
of the Wikipedia corpus considered as a directed graph of pages and categories.
This tree provides us with a unique path of "most related categories" between
each page and the top of the hierarchy. We reinforce tfidf of words in a page
by aggregating it with categorical tfidfs of the nodes of these paths, and
define a thematically reinforced ESA semantic relatedness measure which is more
robust than standard ESA and less sensitive to noise caused by out-of-context
words. We apply our method to the French Wikipedia corpus, evaluate it through
a text classification on a 37.5 MB corpus of 20 French newsgroups and obtain a
precision increase of 9-10% compared with standard ESA.
Comment: 13 pages, 2 figures, presented at CICLing 201
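The reinforcement step described in this abstract can be illustrated with a minimal sketch. The abstract does not give the aggregation formula, so everything here is an assumption made for illustration: the function names, the geometric `decay` weighting, and the greedy choice of each node's heaviest parent edge (which yields a maximal spanning tree only when the chosen edges form no cycle).

```python
def best_parent_tree(edges):
    """Keep, for each node, its heaviest edge toward a parent category.

    edges maps child -> {parent: weight}; the weight stands in for the
    categorical tfidf-based weight (hypothetical values in this sketch).
    """
    return {child: max(parents, key=parents.get)
            for child, parents in edges.items() if parents}


def path_to_root(node, parent_of):
    """Follow the retained edges from a page up to the top of the hierarchy."""
    path, seen = [], {node}
    while node in parent_of:
        node = parent_of[node]
        if node in seen:  # guard against cycles left by the greedy choice
            break
        seen.add(node)
        path.append(node)
    return path


def reinforced_tfidf(term, page, page_tfidf, cat_tfidf, parent_of, decay=0.5):
    """Add decayed categorical tfidf along the page's path (assumed formula)."""
    score = page_tfidf.get(page, {}).get(term, 0.0)
    for depth, cat in enumerate(path_to_root(page, parent_of), start=1):
        score += (decay ** depth) * cat_tfidf.get(cat, {}).get(term, 0.0)
    return score


# Toy data, invented for this example only.
edges = {"Page": {"CatA": 0.9, "CatB": 0.2},
         "CatA": {"Root": 0.5},
         "CatB": {"Root": 0.7}}
parent_of = best_parent_tree(edges)
page_tfidf = {"Page": {"jaguar": 1.0}}
cat_tfidf = {"CatA": {"jaguar": 0.4}, "Root": {"jaguar": 0.0}}
score = reinforced_tfidf("jaguar", "Page", page_tfidf, cat_tfidf, parent_of)
```

In this toy setup the page keeps its edge to `CatA` because that edge is heavier, so only the categories on the path through `CatA` contribute to the reinforced weight of the term.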
Les mathématiques de la langue : l'approche formelle de Montague
We present a method for modeling natural language that relies heavily on
mathematics. This method, called "formal semantics," was initiated by
the American linguist Richard M. Montague in the 1970s. It uses mathematical
tools such as formal languages and grammars, first-order logic, type theory, and
λ-calculus. Our goal is to introduce the reader to both Montagovian
formal semantics and the mathematical tools that Montague used in his method.
Comment: 14 pages, in French. Will appear in the journal Quadrature
(http://www.quadrature.info) in 201
The Khmer Script Tamed by the Lion (of TeX)
This paper presents a Khmer typesetting system based on TeX, METAFONT, and an ANSI-C filter. A 128-character encoding within the 8-bit ASCII table is proposed for the Khmer script. Text is input phonically, following the spoken order: consonant, subscript consonant, second subscript consonant, vowel, diacritic. The filter converts the phonic description of consonantal clusters into a graphic TeXnical description of them. Thanks to TeX booleans, independent vowels can be automatically decomposed according to recent reforms of Khmer spelling. The last section presents a forthcoming implementation of Khmer in a 16-bit TeX output font, which solves the kerning problem of consonantal clusters.