5 research outputs found

    Disambiguierung deutschsprachiger Diskursmarker: Eine Pilot-Studie (Disambiguation of German Discourse Markers: A Pilot Study)

    Discourse markers such as German aber, wohl or obwohl are valuable for a wide range of text-linguistic applications, since they provide important cues for interpreting texts or text segments. Unfortunately, many of them are highly ambiguous. Applications such as automatic text summarization therefore need reliable disambiguation of discourse markers. This should be done automatically, since manual disambiguation is feasible only for small amounts of data. The aim of this pilot study was thus to investigate the methodological requirements of automatic disambiguation of German discourse markers. Two methods known from word-sense disambiguation, Naive Bayes and decision lists, were applied to the highly ambiguous marker wenn. The two methods and different feature combinations were compared statistically.
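
As a rough illustration of the Naive Bayes method named in the abstract, the sketch below picks a reading of wenn from bag-of-words context features with add-one smoothing. The toy training contexts, the sense labels ("conditional", "temporal"), and the function names are assumptions made for this example; they are not the study's data, features, or sense inventory.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect the counts a Naive Bayes classifier needs.

    examples: list of (context_words, sense) pairs.
    """
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        for w in words:
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab


def classify(model, words):
    """Return the sense with the highest smoothed log-probability."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        score = math.log(n / total)  # log prior of the sense
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical context words around "wenn"; invented for illustration only.
TOY_DATA = [
    (["falls", "dann", "würde"], "conditional"),
    (["dann", "könnte", "würde"], "conditional"),
    (["immer", "abends", "kommt"], "temporal"),
    (["immer", "morgens", "geht"], "temporal"),
]

model = train(TOY_DATA)
print(classify(model, ["dann", "würde"]))
print(classify(model, ["immer", "kommt"]))
```

A decision list would instead rank individual features by how strongly each one favours a single sense and apply only the first rule that matches, whereas Naive Bayes combines evidence from all context words.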

    Open-source resources and standards for Arabic word structure analysis: Fine grained morphological analysis of Arabic text corpora

    Morphological analyzers are preprocessors for text analysis, and many text analytics applications need them to perform their tasks. The aim of this thesis is to develop standards, tools and resources that widen the scope of Arabic word structure analysis, particularly morphological analysis, so as to process Arabic text corpora of different domains, formats and genres, in both vowelized and non-vowelized text. We want to morphologically tag our Arabic Corpus, but evaluation of existing morphological analyzers has highlighted shortcomings and shown that more research is required. Tag assignment is significantly more complex for Arabic than for many other languages. The morphological analyzer should add the appropriate linguistic information to each part or morpheme of the word (proclitic, prefix, stem, suffix and enclitic); in effect, instead of one tag per word, we need a subtag for each part. Very fine-grained distinctions may cause problems for automatic morphosyntactic analysis, particularly for probabilistic taggers that require training data, if some words can change grammatical tag depending on function and context; on the other hand, fine-grained distinctions may actually help to disambiguate other words in the local context. The SALMA – Tagger is a fine-grained morphological analyzer which depends mainly on linguistic information extracted from traditional Arabic grammar books and on prior-knowledge broad-coverage lexical resources, namely the SALMA – ABCLexicon. More fine-grained tag sets may be more appropriate for some tasks. The SALMA – Tag Set is a theory standard for encoding, which captures long-established traditional fine-grained morphological features of Arabic in a notation format intended to be compact yet transparent. The SALMA – Tagger has been used to lemmatize the 176-million-word Arabic Internet Corpus. It has been proposed as a language-engineering toolkit for Arabic lexicography and for phonetically annotating the Qur’an with syllable and primary stress information, as well as for fine-grained morphological tagging.
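
To make the idea of "a subtag for each part" concrete, the following minimal sketch stores one subtag per morpheme rather than one tag per word. The class names, the transliterated example word, its segmentation, and the subtag labels are all hypothetical; they do not reproduce the SALMA – Tag Set notation or the output of the SALMA – Tagger.

```python
from dataclasses import dataclass
from typing import List

# The five word parts listed in the abstract.
PARTS = ("proclitic", "prefix", "stem", "suffix", "enclitic")


@dataclass
class Morpheme:
    form: str    # transliterated surface form of this part
    part: str    # one of PARTS
    subtag: str  # illustrative label, not a SALMA tag

    def __post_init__(self):
        if self.part not in PARTS:
            raise ValueError(f"unknown word part: {self.part}")


@dataclass
class AnalyzedWord:
    morphemes: List[Morpheme]

    def surface(self) -> str:
        """Rebuild the word by concatenating its parts."""
        return "".join(m.form for m in self.morphemes)

    def tag(self) -> str:
        """Compose the word-level tag from the per-part subtags."""
        return "+".join(m.subtag for m in self.morphemes)


# Illustrative segmentation of a transliterated word ("and they will write it").
word = AnalyzedWord([
    Morpheme("wa", "proclitic", "CONJ"),
    Morpheme("sa", "proclitic", "FUT"),
    Morpheme("ya", "prefix", "IMPF"),
    Morpheme("ktub", "stem", "VERB"),
    Morpheme("ūna", "suffix", "PL"),
    Morpheme("hā", "enclitic", "PRON"),
])

print(word.surface())
print(word.tag())
```

Keeping the subtags separate means a downstream tool can query a single part, for example every word carrying a pronoun enclitic, without parsing a composite word-level tag.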