Evaluation of automatic hypernym extraction from technical corpora in English and Dutch
In this research, we evaluate different approaches to the automatic extraction of hypernym relations from English and Dutch technical text. The detected hypernym relations should enable us to semantically structure automatically obtained term lists from domain- and user-specific data. We investigated three hypernymy extraction approaches for Dutch and English: a lexico-syntactic pattern-based approach, a distributional model, and a morpho-syntactic method. To test the performance of the approaches on domain-specific data, we collected and manually annotated English and Dutch data from two technical domains, viz. the dredging and financial domains. The experimental results show that the morpho-syntactic approach in particular obtains good results for automatic hypernym extraction from technical and domain-specific texts.
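The lexico-syntactic pattern-based approach mentioned above can be sketched with a single Hearst-style pattern. The regex below and the dredging-domain example sentence are illustrative assumptions, not the authors' actual pattern set or data:

```python
import re

# One illustrative Hearst-style pattern: "HYPERNYM such as HYPONYM, HYPONYM and HYPONYM".
PATTERN = re.compile(
    r"(?P<hyper>\w+)\s+such as\s+"
    r"(?P<hypos>\w+(?:\s*,\s*\w+)*(?:\s*(?:,\s*)?(?:and|or)\s+\w+)?)"
)

def extract_hypernyms(text):
    """Return (hyponym, hypernym) pairs matched by the 'such as' pattern."""
    pairs = []
    for m in PATTERN.finditer(text):
        hyper = m.group("hyper")
        # Split the coordinated hyponym list on commas and 'and'/'or'.
        hypos = re.split(r"\s*,\s*|\s+(?:and|or)\s+", m.group("hypos"))
        pairs.extend((h, hyper) for h in hypos if h)
    return pairs

print(extract_hypernyms("vessels such as dredgers, barges and tugboats"))
# → [('dredgers', 'vessels'), ('barges', 'vessels'), ('tugboats', 'vessels')]
```

A real system would use many such patterns per language plus part-of-speech constraints; a single surface regex like this over-matches on nested noun phrases.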
Part of Speech Tagging of Marathi Text Using Trigram Method
In this paper we present a part-of-speech tagger for Marathi, a morphologically rich language spoken by the native people of Maharashtra. The tagger is statistical and uses the trigram method: the most likely POS tag for a token is chosen on the basis of the previous two tags, by computing the probabilities of candidate tag sequences and selecting the best one. We describe the development of the tagger and also report its evaluation.
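The trigram idea can be illustrated in a few lines: score every candidate tag sequence by the product of trigram transition probabilities and word-given-tag emission probabilities, and keep the best. The tag set, words, and all probabilities below are hypothetical toy values, not estimates from a real Marathi corpus, and exhaustive search stands in for the usual Viterbi decoding:

```python
from itertools import product

TAGS = ("N", "V")  # toy tag set

def trans(prev2, prev1, tag):
    # Hypothetical P(tag | prev2, prev1): mildly penalise repeating a tag.
    return 0.2 if prev1 == tag else 0.8

# Hypothetical P(word | tag) emission table.
emit = {("mulga", "N"): 0.9, ("mulga", "V"): 0.1,
        ("khelto", "N"): 0.2, ("khelto", "V"): 0.8}

def tag_sentence(words):
    """Pick the tag sequence maximising trigram * emission probability."""
    best, best_p = None, -1.0
    for seq in product(TAGS, repeat=len(words)):  # exhaustive, not Viterbi
        p = 1.0
        hist = ("<s>", "<s>")  # two start-padding tags
        for w, t in zip(words, seq):
            p *= trans(hist[0], hist[1], t) * emit.get((w, t), 0.01)
            hist = (hist[1], t)
        if p > best_p:
            best, best_p = seq, p
    return list(best)

print(tag_sentence(["mulga", "khelto"]))  # → ['N', 'V']
```

A practical tagger estimates the trigram and emission tables from an annotated corpus, smooths unseen trigrams, and decodes with Viterbi so the cost stays linear in sentence length.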
Identification of Fertile Translations in Medical Comparable Corpora: a Morpho-Compositional Approach
This paper defines a method for lexicon extraction in the biomedical domain from comparable corpora. The method is based on compositional translation and exploits morpheme-level translation equivalences. It can generate translations for a large variety of morphologically constructed words and can also generate 'fertile' translations. We show that fertile translations increase the overall quality of the extracted lexicon for English-to-French translation.
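The compositional idea can be sketched as: split the source word into known morphemes, translate each morpheme, and recompose; a 'fertile' translation arises when a bound morpheme is rendered by a free multi-word gloss. The tiny morpheme dictionary and glosses below are hypothetical stand-ins for the paper's actual bilingual resources:

```python
# Hypothetical English->French morpheme resources.
MORPH_TRANS = {"cyto": "cyto", "toxic": "toxique"}
FERTILE_GLOSS = {"cyto": "pour les cellules"}  # free-phrase gloss of a bound morpheme

def decompose(word):
    """Greedy left-to-right split of `word` into known source morphemes."""
    parts, rest = [], word
    while rest:
        for m in sorted(MORPH_TRANS, key=len, reverse=True):  # longest match first
            if rest.startswith(m):
                parts.append(m)
                rest = rest[len(m):]
                break
        else:
            return None  # unanalysable residue: give up
    return parts

def translate(word):
    parts = decompose(word)
    if parts is None:
        return []
    # Plain compositional candidate: concatenate morpheme translations.
    cands = ["".join(MORPH_TRANS[p] for p in parts)]
    # Fertile candidate: replace a bound morpheme by its free-phrase gloss,
    # yielding a target term with more words than the source word.
    for i, p in enumerate(parts):
        if p in FERTILE_GLOSS:
            rest = [MORPH_TRANS[q] for j, q in enumerate(parts) if j != i]
            cands.append(" ".join(rest + [FERTILE_GLOSS[p]]))
    return cands

print(translate("cytotoxic"))  # → ['cytotoxique', 'toxique pour les cellules']
```

Candidate generation like this overproduces, so a full system ranks or filters the candidates against the target side of the comparable corpus before adding them to the lexicon.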
Determination: a universal dimension for inter-language comparison (preliminary version)
The basic idea I want to develop and substantiate in this paper consists in replacing – where necessary – the traditional concept of a linguistic category or linguistic relation understood as a 'thing', a reified hypostasis, by the more dynamic concept of a dimension. A dimension of language structure is not coterminous with one single category or relation but, instead, accommodates several of them. It corresponds to certain well-circumscribed purposive functions of linguistic activity as well as to certain definite principles and techniques for satisfying these functions. The true universals of language are represented by these dimensions, principles, and techniques, which constitute the true basis for non-historical inter-language comparison. The categories and relations used in grammar are condensations – hypostases, as it were – of such dimensions, principles, and techniques. Elsewhere I have outlined the theory which I want to test here in a case study.
Macro- and microstructural issues in Mazuna lexicography
All the works in Mazuna lexicography have a common denominator: they are translation dictionaries biased towards French, compiled by Catholic and Protestant missionaries or colonial administrators. These dictionaries have both strong and weak points. The macrostructure, although it does not display features of sophistication (i.e. the use of niching and nesting procedures), tends to survey the full lexicon of the language, which makes these dictionaries real reservoirs of knowledge. The microstructure contains many useful entries. However, no metalexicographic discussion is provided in the user's guide to make it accessible to the target reader. There are also some shortcomings, especially in the areas of suprasegmental phonology (absence of tonal indications) and orthography.
Automatic Discovery of Non-Compositional Compounds in Parallel Data
Automatic segmentation of text into minimal content-bearing units is an
unsolved problem even for languages like English. Spaces between words offer an
easy first approximation, but this approximation is not good enough for machine
translation (MT), where many word sequences are not translated word-for-word.
This paper presents an efficient automatic method for discovering sequences of
words that are translated as a unit. The method proceeds by comparing pairs of
statistical translation models induced from parallel texts in two languages. It
can discover hundreds of non-compositional compounds on each iteration, and
constructs longer compounds out of shorter ones. Objective evaluation on a
simple machine translation task has shown the method's potential to improve the
quality of MT output. The method makes few assumptions about the data, so it
can be applied to parallel data other than parallel texts, such as word
spellings and pronunciations.
Using distributional similarity to organise biomedical terminology
We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are defined for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of different measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy.
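One common distributional similarity measure is the cosine between terms' context-count vectors. The sketch below uses invented dependency-context counts for a few GENIA-style terms; in the paper the contexts come from Pro3Gres parses and several similarity measures are compared, of which cosine is only one:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (dicts)."""
    num = sum(u[k] * v[k] for k in u if k in v)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

# Hypothetical syntactic-context counts (relation:lemma -> frequency).
contexts = {
    "IL-2":      Counter({"obj:activate": 3, "mod:human": 2}),
    "IL-4":      Counter({"obj:activate": 2, "mod:human": 1}),
    "apoptosis": Counter({"obj:induce": 4}),
}

# Terms sharing contexts score high; disjoint contexts score zero.
print(round(cosine(contexts["IL-2"], contexts["IL-4"]), 3))   # → 0.992
print(cosine(contexts["IL-2"], contexts["apoptosis"]))        # → 0.0
```

Raw counts are usually replaced by association weights such as pointwise mutual information before computing similarity, which downweights frequent but uninformative contexts.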