
    Evaluation of automatic hypernym extraction from technical corpora in English and Dutch

    In this research, we evaluate different approaches for the automatic extraction of hypernym relations from English and Dutch technical text. The detected hypernym relations should enable us to semantically structure automatically obtained term lists from domain- and user-specific data. We investigated three hypernymy extraction approaches for Dutch and English: a lexico-syntactic pattern-based approach, a distributional model and a morpho-syntactic method. To test the performance of the approaches on domain-specific data, we collected and manually annotated English and Dutch data from two technical domains, viz. the dredging and financial domains. The experimental results show that the morpho-syntactic approach in particular obtains good results for automatic hypernym extraction from technical and domain-specific texts.
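    The lexico-syntactic pattern-based approach mentioned above can be illustrated with a minimal sketch: surface patterns in the Hearst style, such as "X such as Y", signal that X is a hypernym of Y. The patterns and example sentence below are illustrative only, not the pattern set used in the study.

```python
import re

# Illustrative Hearst-style patterns; a real system would use
# part-of-speech-aware patterns and a much larger inventory.
PATTERNS = [
    re.compile(r"(?P<hyper>\w+) such as (?P<hypo>\w+)"),
    re.compile(r"(?P<hypo>\w+) and other (?P<hyper>\w+)"),
    re.compile(r"(?P<hyper>\w+) including (?P<hypo>\w+)"),
]

def extract_hypernyms(text):
    """Return (hyponym, hypernym) pairs matched by any pattern."""
    pairs = set()
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            pairs.add((m.group("hypo"), m.group("hyper")))
    return pairs

pairs = extract_hypernyms("vessels such as dredgers are used, "
                          "bonds and other securities are traded")
```

Pattern-based extraction of this kind is precise when a pattern fires, which is one reason such approaches are often combined with distributional and morpho-syntactic evidence, as in the study above.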

    Part of Speech Tagging of Marathi Text Using Trigram Method

    In this paper we present a part-of-speech tagger for Marathi, a morphologically rich language spoken by the native people of Maharashtra. The tagger follows a statistical approach based on the trigram method: the most likely POS tag for a token is determined from the previous two tags, by calculating probabilities to find the best tag sequence. We describe the development of the tagger and report its evaluation.
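    The trigram idea described above can be sketched as follows: estimate P(tag | previous two tags) from a tagged corpus and pick the most likely next tag. The toy tagset and corpus are invented for illustration; a real tagger would also model word emissions and decode whole sequences (e.g. with Viterbi).

```python
from collections import defaultdict

# Toy corpus of tag sequences (invented; a real tagger trains on
# word/tag pairs from an annotated Marathi corpus).
tagged_corpus = [
    ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "ADJ", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB"],
]

# Count trigrams: how often each tag follows a given pair of tags.
trigram_counts = defaultdict(lambda: defaultdict(int))
for sent in tagged_corpus:
    padded = ["<s>", "<s>"] + sent  # sentence-start padding
    for i in range(2, len(padded)):
        trigram_counts[(padded[i - 2], padded[i - 1])][padded[i]] += 1

def most_likely_tag(prev2, prev1):
    """Argmax of P(tag | prev2, prev1) under maximum-likelihood estimates."""
    counts = trigram_counts[(prev2, prev1)]
    if sum(counts.values()) == 0:
        return None  # unseen history; a real tagger would smooth
    return max(counts, key=counts.get)

best = most_likely_tag("DET", "ADJ")
```

In the toy corpus, NOUN is the only tag ever seen after the history (DET, ADJ), so `most_likely_tag("DET", "ADJ")` returns `"NOUN"`; unseen histories return `None`, which is where smoothing would enter in a real system.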

    Identification of Fertile Translations in Medical Comparable Corpora: a Morpho-Compositional Approach

    This paper defines a method for compiling a bilingual lexicon in the biomedical domain from comparable corpora. The method is based on compositional translation and exploits morpheme-level translation equivalences. It can generate translations for a large variety of morphologically constructed words and can also generate 'fertile' translations. We show that fertile translations increase the overall quality of the extracted lexicon for English-to-French translation.
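    The compositional, morpheme-level idea can be sketched as follows: decompose a morphologically constructed word into known morphemes and translate each part with a morpheme table. The tiny English-to-French table and the greedy decomposition below are illustrative assumptions, not the authors' actual resources or algorithm.

```python
# Illustrative English -> French morpheme equivalences (invented sample).
EN_FR_MORPHEMES = {
    "cardio": "cardio",
    "vascular": "vasculaire",
    "post": "post",
    "menopausal": "ménopausique",
}

def translate_compositionally(word, table=EN_FR_MORPHEMES):
    """Greedy longest-prefix decomposition into known morphemes.

    Returns the composed translation, or None if some part of the
    word is not covered by the morpheme table.
    """
    parts = []
    rest = word
    while rest:
        for length in range(len(rest), 0, -1):
            piece = rest[:length]
            if piece in table:
                parts.append(table[piece])
                rest = rest[length:]
                break
        else:
            return None  # no known morpheme matches the remaining prefix
    return "".join(parts)

fr = translate_compositionally("cardiovascular")
```

A 'fertile' translation, in which the target term has more words than the source (e.g. one English compound rendered as a French multi-word phrase), would require composing the morpheme translations into a phrase rather than concatenating them, which this sketch does not attempt.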

    Determination: a universal dimension for inter-language comparison (preliminary version)

    The basic idea I want to develop and to substantiate in this paper consists in replacing – where necessary – the traditional concept of linguistic category or linguistic relation understood as 'things', as reified hypostases, by the more dynamic concept of dimension. A dimension of language structure is not coterminous with one single category or relation but, instead, accommodates several of them. It corresponds to certain well-circumscribed purposive functions of linguistic activity as well as to certain definite principles and techniques for satisfying these functions. The true universals of language are represented by these dimensions, principles, and techniques, which constitute the true basis for non-historical inter-language comparison. The categories and relations used in grammar are condensations – hypostases as it were – of such dimensions, principles, and techniques. Elsewhere I have outlined the theory which I want to test here in a case study.

    Danish Academic Vocabulary: Four studies on the words of academic written Danish


    Macro- and microstructural issues in Mazuna lexicography

    All the works in Mazuna lexicography have a common denominator: they are translation dictionaries biased towards French and were compiled by Catholic and Protestant missionaries or colonial administrators. These dictionaries have both strong and weak points. The macrostructure, although it does not display features of sophistication, i.e. the use of niching and nesting procedures, tends to survey the full lexicon of the language, which makes these dictionaries real reservoirs of knowledge. The microstructure contains a lot of useful entries. However, no metalexicographic discussion is provided in the user's guide to make it accessible to the target reader. There are also some shortcomings, especially in the areas of suprasegmental phonology (absence of tonal indications) and orthography.

    Automatic Discovery of Non-Compositional Compounds in Parallel Data

    Automatic segmentation of text into minimal content-bearing units is an unsolved problem even for languages like English. Spaces between words offer an easy first approximation, but this approximation is not good enough for machine translation (MT), where many word sequences are not translated word-for-word. This paper presents an efficient automatic method for discovering sequences of words that are translated as a unit. The method proceeds by comparing pairs of statistical translation models induced from parallel texts in two languages. It can discover hundreds of non-compositional compounds on each iteration, and constructs longer compounds out of shorter ones. Objective evaluation on a simple machine translation task has shown the method's potential to improve the quality of MT output. The method makes few assumptions about the data, so it can be applied to parallel data other than parallel texts, such as word spellings and pronunciations.
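    The model-comparison idea can be sketched in miniature: a word pair is a candidate non-compositional compound if a translation model that treats it as a single unit assigns its translation a higher probability than composing the word-for-word translations. All probabilities and entries below are invented for illustration; the paper's actual method compares full statistical translation models, not hand-written tables.

```python
import math

# Invented single-word translation probabilities P(f | e).
word_trans = {
    ("hot", "chaud"): 0.7,
    ("dog", "chien"): 0.8,
}
# Invented probabilities when the pair is treated as one token.
unit_trans = {
    ("hot dog", "hot-dog"): 0.9,
}

def compositional_score(e1, e2, f1, f2):
    """Word-for-word score, with a small floor for unseen pairs."""
    return word_trans.get((e1, f1), 1e-9) * word_trans.get((e2, f2), 1e-9)

def gain(e1, e2, f_unit, f1, f2):
    """Log-probability gain from translating (e1, e2) as a unit."""
    unit = unit_trans.get((e1 + " " + e2, f_unit), 1e-9)
    return math.log(unit) - math.log(compositional_score(e1, e2, f1, f2))

g = gain("hot", "dog", "hot-dog", "chaud", "chien")
```

A positive gain (here log 0.9 − log 0.56) flags "hot dog" as worth treating as a unit; ranking candidate sequences by such a gain and re-inducing the models is the iterative loop the abstract describes.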

    Using distributional similarity to organise biomedical terminology

    We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked-up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are defined for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of different measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy.
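    The core of distributional similarity can be sketched as follows: represent each term by counts of the syntactic contexts it occurs in (here, the kind of data a dependency parser like Pro3Gres provides), then compare terms with a similarity measure such as cosine. The terms and context counts below are invented for illustration.

```python
import math
from collections import Counter

# Invented context-count vectors; keys name syntactic relations
# plus the co-occurring word, values are corpus counts.
contexts = {
    "interleukin-2": Counter({"obj:activate": 3, "mod:human": 2, "obj:express": 1}),
    "interferon":    Counter({"obj:activate": 2, "mod:human": 3, "obj:induce": 1}),
    "promoter":      Counter({"obj:bind": 4, "mod:viral": 2}),
}

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    denom = norm(u) * norm(v)
    return dot / denom if denom else 0.0

sim_close = cosine(contexts["interleukin-2"], contexts["interferon"])
sim_far = cosine(contexts["interleukin-2"], contexts["promoter"])
```

Terms sharing many contexts (the two cytokines) score high, while terms with disjoint contexts score zero; thresholding or clustering such scores is one way similarity judgments can be lifted into an ontology-like organisation.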