48 research outputs found

    Morphological annotation of Korean with Directly Maintainable Resources

    Get PDF
    This article describes an exclusively resource-based method of morphological annotation of written Korean text. Korean is an agglutinative language. Our annotator is designed to process text before the operation of a syntactic parser. In its present state, it annotates one-stem words only. The output is a graph of morphemes annotated with accurate linguistic information. The granularity of the tagset is 3 to 5 times higher than usual tagsets. A comparison with a reference annotated corpus showed that it achieves 89% recall without any corpus training. The language resources used by the system are lexicons of stems, transducers of suffixes and transducers of generation of allomorphs. All can be easily updated, which allows users to control the evolution of the performances of the system. It has been claimed that morphological annotation of Korean text could only be performed by a morphological analysis module accessing a lexicon of morphemes. We show that it can also be performed directly with a lexicon of words and without applying morphological rules at annotation time, which speeds up annotation to 1,210 word/s. The lexicon of words is obtained from the maintainable language resources through a fully automated compilation process

    A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-of-Speech Tagging

    Full text link
    In this paper, we propose a new approach to construct a system of transformation rules for the Part-of-Speech (POS) tagging task. Our approach is based on an incremental knowledge acquisition method where rules are stored in an exception structure and new rules are only added to correct the errors of existing rules; thus allowing systematic control of the interaction between the rules. Experimental results on 13 languages show that our approach is fast in terms of training time and tagging speed. Furthermore, our approach obtains very competitive accuracy in comparison to state-of-the-art POS and morphological taggers.Comment: Version 1: 13 pages. Version 2: Submitted to AI Communications - the European Journal on Artificial Intelligence. Version 3: Resubmitted after major revisions. Version 4: Resubmitted after minor revisions. Version 5: to appear in AI Communications (accepted for publication on 3/12/2015

    Spoken content retrieval: A survey of techniques and technologies

    Get PDF
    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR

    Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

    Get PDF
    This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages, which lack resources for speech and language processing. We focus on finding approaches which allow using data from multiple languages to improve the performance for those languages on different levels, such as feature extraction, acoustic modeling and language modeling. Under application aspects, this thesis also includes research work on non-native and Code-Switching speech

    Open-source resources and standards for Arabic word structure analysis: Fine grained morphological analysis of Arabic text corpora

    Get PDF
    Morphological analyzers are preprocessors for text analysis. Many Text Analytics applications need them to perform their tasks. The aim of this thesis is to develop standards, tools and resources that widen the scope of Arabic word structure analysis - particularly morphological analysis, to process Arabic text corpora of different domains, formats and genres, of both vowelized and non-vowelized text. We want to morphologically tag our Arabic Corpus, but evaluation of existing morphological analyzers has highlighted shortcomings and shown that more research is required. Tag-assignment is significantly more complex for Arabic than for many languages. The morphological analyzer should add the appropriate linguistic information to each part or morpheme of the word (proclitic, prefix, stem, suffix and enclitic); in effect, instead of a tag for a word, we need a subtag for each part. Very fine-grained distinctions may cause problems for automatic morphosyntactic analysis – particularly probabilistic taggers which require training data, if some words can change grammatical tag depending on function and context; on the other hand, finegrained distinctions may actually help to disambiguate other words in the local context. The SALMA – Tagger is a fine grained morphological analyzer which is mainly depends on linguistic information extracted from traditional Arabic grammar books and prior knowledge broad-coverage lexical resources; the SALMA – ABCLexicon. More fine-grained tag sets may be more appropriate for some tasks. The SALMA –Tag Set is a theory standard for encoding, which captures long-established traditional fine-grained morphological features of Arabic, in a notation format intended to be compact yet transparent. The SALMA – Tagger has been used to lemmatize the 176-million words Arabic Internet Corpus. It has been proposed as a language-engineering toolkit for Arabic lexicography and for phonetically annotating the Qur’an by syllable and primary stress information, as well as, fine-grained morphological tagging

    Corpus-based typology: Applications, challenges and some solutions

    Get PDF
    Over the last few years, the number of corpora that can be used for language comparison has dramatically increased. The corpora are so diverse in their structure, size and annotation style, that a novice might not know where to start. The present paper charts this new and changing territory, providing a few landmarks, warning signs and safe paths. Although no corpora corpus at present can replace the traditional type of typological data based on language description in reference grammars, they corpora can help with diverse tasks, being particularly well suited for investigating probabilistic and gradient properties of languages and for discovering and interpreting cross-linguistic generalizations based on processing and communicative mechanisms. At the same time, the use of corpora for typological purposes has not only advantages and opportunities, but also numerous challenges. This paper also contains an empirical case study addressing two pertinent problems: the role of text types in language comparison and the problem of the word as a comparative concept

    An automatic morphological analysis system for Indonesian

    Get PDF
    This thesis reports the creation of SANTI-morf (Sistem Analisis Teks Indonesia – morfologi), a rule-based system that performs morphological annotation for Indonesian. The system has been built across three stages, namely preliminaries, annotation scheme creation (the linguistic aspect of the project), and system implementation (the computational aspect of the project). The preliminary matters covered include the necessary key concepts in morphology and Natural Language Processing (NLP), as well as a concise description of Indonesian morphology (largely based on the two primary reference grammars of Indonesian, Alwi et al. 1998 and Sneddon et al. 2010, together with work in the linguistic literature on Indonesian morphology (e.g. Kridalaksana 1989; Chaer 2008). As part of this preliminary stage, I created a testbed corpus for evaluation purposes. The design of the testbed is justified by considering the design of existing evaluation corpora, such as the testbed used by the English Constraint Grammar or EngCG system (Voutilanen 1992), the British National Corpus (BNC) 1994 evaluation data , and the training data used by MorphInd (Larasati et al. 2011), a morphological analyser (MA) for Indonesian. The dataset for this testbed was created by narrowing down an existing very large bit unbalanced collection of texts (drawn from the Leipzig corpora; see Goldhahn et al. 2012). The initial collection was reduced to a corpus composed of nine domains following the domain categorisation of the BNC) . A set of texts from each domain, proportional in size, was extracted and combined to form a testbed that complies with the design cited informed by the prior literature. The second stage, scheme creation, involved the creation of a new Morphological Annotation Scheme (MAS) for Indonesian, for use in the SANTI-morf system. First, a review of MASs in different languages (Finnish, Turkish, Arabic, Indonesian) as well as the Universal Dependencies MAS identifies the best practices in the field. From these, 15 design principles for the novel MAS were devised. This MAS consists of a morphological tagset, together with comprehensive justification of the morphological analyses used in the system. It achieves full morpheme-level annotation, presenting each morpheme’s orthographic and citation forms in the defined output, accompanied by robust morphological analyses, both formal and functional; to my knowledge, this is the first MAS of its kind for Indonesian. The MAS’s design is based not only on reference grammars of Indonesian and other linguistic sources, but also on the anticipated needs of researchers and other users of texts and corpora annotated using this scheme of analysis. The new MAS aims at The third stage of the project, implementation, consisted of three parts: a benchmarking evaluation exercise, a survey of frameworks and tools, leading ultimately to the actual implementation and evaluation of SANTI-morf. MorphInd (Larasati et al. 2012) is the prior state-of-the-art MA for Indonesian. That being the case, I evaluated MorphInd’s performance against the aforementioned testbed, both as just5ification of the need for an improved system, and to serve as a benchmark for SANTI-morf. MorphInd scored 93% on lexical coverage and 89% on tagging accuracy. Next, I surveyed existing MAs frameworks and tools. This survey justifies my choice for the rule-based approach (inspired by Koskenniemi’s 1983 Two Level Morphology, and NooJ (Silberztein 2S003) as respectively the framework and the software tool for SANTI-morf. After selection of this approach and tool, the language resources that constitute the SANTI-morf system were created. These are, primarily, a number of lexicons and sets of analysis rules, as well as necessary NooJ system configuration files. SANTI-morf’s 3 lexicon files (in total 86,590 entries) and 15 rule files (in total 659 rules) are organised into four modules, namely the Annotator, the Guesser, the Improver and the Disambiguator. These modules are applied one after another in a pipeline. The Annotator provides initial morpheme-level annotation for Indonesian words by identifying their having been built according to various morphological processes (affixation, reduplication, compounding, and cliticisation). The Guesser ensures that words not covered by the Annotator, because they are not covered by its lexicons, receive best guesses as to the correct analysis from the application of a set of probable but not exceptionless rules. The Improver improves the existing annotation, by adding probable analyses that the Annotator might have missed. Finally, the Disambiguator resolves ambiguities, that is, words for which the earlier elements of the pipeline have generated two or more possible analyses in terms of the morphemes identified or their annotation. NooJ annotations are saved in a binary file, but for evaluation purposes, plain-text output is required. I thus developed a system for data export using an in-NooJ mapping to and from a modified, exportable expression of the MAS, and wrote a small program to enable re-conversion of the output in plain-text format. For purposes of the evaluation, I created a 10,000 -word gold-standard SANTI-morf manually-annotated dataset. The outcome of the evaluation is that SANTI-morf has 100% coverage (because a best-guess analysis is always provided for unrecognised word forms), and 99% precision and recall for the morphological annotations, with a 1% rate of remaining ambiguity in the final output. SANTI-morf is thus shown to present a number of advancements over MorphInd, the state-of-the-art MA for Indonesian, exhibiting more robust annotation and better coverage. Other performance indicators, namely the high precision and recall, make SANTI-morf a concrete advance in the field of automated morphological annotation for Indonesian, and in consequence a substantive contribution to the field of Indonesian linguistics overall
    corecore