84 research outputs found

    Clitic Climbing, Finiteness and the Raising-Control Distinction : A Corpus-based study

    In the paper, we discuss the phenomenon of clitic climbing out of finite da2-complements in contemporary Serbian. Scholars' opinions on the acceptability and occurrence of this construction, based on a handful of self-made examples, vary considerably. Expanding on the assumption that the correctness of the phenomenon has often been denied because of its rarity, we employ large corpora to examine the problem. We focus on possible constraints arising from the syntactic properties of clause-embedding predicates. Peer reviewed.

    Otvoreni resursi i tehnologije za obradu srpskog jezika [Open resources and technologies for processing the Serbian language]

    Open language resources and tools are very important for increasing the quality and speeding up the development of technologies for natural language processing. This paper presents a set of open resources available for processing the Serbian language. We describe several manually annotated corpora, as well as a range of computational models, including a web service designed to facilitate their use.

    Corpus-Based Approaches to Igbo Diacritic Restoration

    With natural language processing (NLP), researchers aim to get the computer to identify and understand the patterns in human languages. This is often difficult because a language embeds many dynamic and varied properties in its syntax, pragmatics and phonology, which need to be captured and processed. The capacity of computers to process natural languages is increasing because NLP researchers are pushing its boundaries. But this research focuses more on well-resourced languages such as English, Japanese, German, French, Russian and Mandarin Chinese. Over 95% of the world's 7000 languages are low-resourced for NLP, i.e. they have little or no data, tools, and techniques for NLP work. In this thesis, we present an overview of diacritic ambiguity and a review of previous diacritic disambiguation approaches on other languages. Focusing on the Igbo language, we report the steps taken to develop a flexible framework for generating datasets for diacritic restoration. Three main approaches were proposed: the standard n-gram model, the classification models and the embedding models. The standard n-gram models use a sequence of words preceding the target stripped word as key predictors of the correct variants. For the classification models, a window of words on both sides of the target stripped word was used. The embedding models compare the similarity scores of the combined context word embeddings and the embeddings of each of the candidate variant vectors. The processes and techniques involved in projecting embeddings from a model trained with English texts to an Igbo embedding space and the creation of intrinsic evaluation tasks to validate the models are also discussed. A comparative analysis of the results indicates that all the approaches significantly improved on the baseline performance, which uses the unigram model. The details of the processes involved in building the models as well as the possible directions for future work are discussed in this work.

    The Future of Information Sciences : INFuture2009 : Digital Resources and Knowledge Sharing


    Orthographies in Early Modern Europe

    This volume provides, for the first time, a pan-European view of the development of written languages at a key time in their history: that of the 16th century. The major cultural and intellectual upheavals that affected Europe at the time - Humanism, the Reformation and the emergence of modern nation-states - were not isolated phenomena, and the evolution of the orthographical systems of European languages shows a large number of convergences, due to the mobility of scholars, ideas and technological innovations throughout the period

    JANES v0.4: Korpus slovenskih spletnih uporabniških vsebin [JANES v0.4: Corpus of Slovene user-generated web content]

    In this paper we present the latest version of Janes, a corpus of Slovene web content, which contains tweets, web forums, news articles and user comments on them, blog posts and comments on them, and user and talk pages from Wikipedia. We first describe the text collection procedure for each of the included sources and give a quantitative analysis of the resulting corpus. We then present the automatic and manual procedures for enriching the corpus with useful metadata, such as the type, gender and region of the author, and the sentiment and the level of technical and linguistic standardness of each text. We conclude with a description of the workflow for linguistic annotation of the corpus, which comprises tokenization, sentence segmentation, rediacritization, normalization, morphosyntactic tagging and lemmatization.
