88 research outputs found

    A study on creating a custom South Sotho spellchecking and correcting software desktop application

    Thesis (B. Tech.) - Central University of Technology, Free State, 200

    AmAMorph: Finite State Morphological Analyzer for Amazighe

    This paper presents AmAMorph, a morphological analyzer for the Amazighe language built on the NooJ linguistic development environment. The work begins with the development of large-coverage Amazighe lexicons. The resulting electronic lexicons, named ‘NAmLex’, ‘VAmLex’ and ‘PAmLex’ (for ‘Noun Amazighe Lexicon’, ‘Verb Amazighe Lexicon’ and ‘Particles Amazighe Lexicon’), link inflectional, morphological, and syntactic-semantic information to the list of lemmas. Automated inflectional and derivational routines are applied to each lemma, producing a large number of inflected forms. To our knowledge, AmAMorph is the first morphological analyzer for Amazighe. It identifies the component morphemes of word forms using large-coverage morphological grammars. Along with a description of how the analyzer is implemented, the paper gives an evaluation of the analyzer.
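    The abstract gives only the broad architecture: lexicons linking features to lemmas, inflection routines expanding each lemma, and grammars mapping surface forms back to morphemes. As a rough illustration of that lemma-to-form expansion and its inversion for analysis, the following Python sketch uses entirely invented lexicon entries, paradigms, and feature labels; it is not the NooJ/NAmLex data or the AmAMorph implementation.

```python
# Hypothetical sketch of a lexicon-driven morphological generator/analyzer,
# loosely modelled on the lemma -> inflected-forms expansion described above.
# Lemmas, affixes and feature labels are invented for illustration only.

from collections import defaultdict

# Toy lexicon: lemma -> (part of speech, paradigm id)
LEXICON = {
    "argaz": ("N", "noun_a"),
    "tamgart": ("N", "noun_b"),
}

# Toy paradigms: paradigm id -> list of (feature bundle, prefix, suffix)
PARADIGMS = {
    "noun_a": [("sg", "", ""), ("pl", "i", "en")],
    "noun_b": [("sg", "", ""), ("pl", "ti", "in")],
}

def generate(lemma):
    """Expand one lemma into its (surface form, features) pairs."""
    pos, paradigm = LEXICON[lemma]
    for features, prefix, suffix in PARADIGMS[paradigm]:
        yield prefix + lemma + suffix, f"{pos}+{features}"

# Build the analysis table by inverting generation (surface -> analyses).
ANALYSES = defaultdict(list)
for lemma in LEXICON:
    for form, tag in generate(lemma):
        ANALYSES[form].append((lemma, tag))

def analyze(form):
    """Return all (lemma, tag) analyses of a surface form, if any."""
    return ANALYSES.get(form, [])

if __name__ == "__main__":
    print(analyze("iargazen"))   # -> [('argaz', 'N+pl')]
```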

    Terminology extraction from medical texts in Polish


    Models of the Serbian language and their application in speech and language technologies

    A statistical language model, in theory, represents a probability distribution over sequences of words of a language. In practice, it is a tool for estimating the probabilities of word sequences of interest. The mathematical apparatus behind language models is largely language-independent. However, the quality of trained models depends not only on the training algorithms but primarily on the amount and quality of available training data. For languages with complex morphology, such as Serbian, the textual corpora used for training language models need to be significantly larger than those needed for languages with relatively simple morphology, such as English. This research covers the entire process of developing language models for Serbian, starting with the collection and preprocessing of textual content, continuing with the adaptation of algorithms and the development of methods for addressing the problem of insufficient training data, and concluding with the adaptation and application of the models in different technologies, such as text-to-speech synthesis, automatic speech recognition, and automatic detection and correction of grammatical and semantic errors in texts; it also lays the groundwork for applying the models to automatic document classification and other tasks. The core of the development of language models for Serbian is the definition of morphological word classes based on the information contained in the morphological dictionary of Serbian, which was one of the results of previous research.
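    The abstract defines a statistical language model as a probability distribution over word sequences and names morphological word classes as the core of the Serbian models. One common way such classes ease data sparsity is class-based n-gram estimation; the Python sketch below illustrates the idea on an invented toy corpus with an invented word-to-class mapping, and is not the method or data used in the thesis.

```python
# Minimal sketch of a class-based bigram model: P(w_i | w_{i-1}) is approximated
# by P(class(w_i) | class(w_{i-1})) * P(w_i | class(w_i)).
# The toy corpus and the word -> class mapping are invented for illustration.

from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
word_class = {"the": "DET", "cat": "N", "dog": "N", "mat": "N",
              "rug": "N", "sat": "V", "on": "P"}

classes = [word_class[w] for w in corpus]
class_unigrams = Counter(classes)
class_bigrams = Counter(zip(classes, classes[1:]))
word_counts = Counter(corpus)

def p_class_bigram(c_prev, c_cur, alpha=0.1):
    """Add-alpha smoothed estimate of P(c_cur | c_prev)."""
    vocab = len(class_unigrams)
    return ((class_bigrams[(c_prev, c_cur)] + alpha)
            / (class_unigrams[c_prev] + alpha * vocab))

def p_word_given_class(word):
    """P(word | its class), estimated from class-internal frequencies."""
    c = word_class[word]
    total = sum(n for w, n in word_counts.items() if word_class[w] == c)
    return word_counts[word] / total

def p_next(prev_word, word):
    """Class-based estimate of P(word | prev_word)."""
    return (p_class_bigram(word_class[prev_word], word_class[word])
            * p_word_given_class(word))

print(f"P(cat | the) ~ {p_next('the', 'cat'):.3f}")
```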

    Noun formation in the scientific register of Late Modern English: a corpus-based approach

    This PhD dissertation analyses noun-formation processes in the scientific register of late Modern English using corpus-linguistics methodology. By examining forty-one samples written in the eighteenth century, after the so-called ‘scientific revolution’, extracted from two subcorpora of the Coruña Corpus of English Scientific Writing (the Corpus of English Texts on Astronomy and the Corpus of English Philosophy Texts), I will analyse nearly half a million words. My aim is to determine which processes are the most productive, not only from a synchronic point of view within that century, but also from a diachronic standpoint throughout the history of the language. Furthermore, complex nouns will be decomposed into their basic constituents, which will then be analysed to establish quantitative patterns of the most productive bases and affixes, the dates when they were coined and first used in the language, and their source languages across different periods. This research is complemented by several extralinguistic variables that allow the linguistic resources to be compared across scientific discipline, genre/text type, sex of the author, place of education, and age at the time the work in question was published.
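    The dissertation sets out to establish quantitative patterns of the most productive bases and affixes. As a rough illustration of that kind of counting, the sketch below tallies type, token, and hapax counts per suffix on an invented word-frequency list and computes a hapax-based productivity score, one common corpus-linguistic measure; the word list and suffix inventory are made up and are not drawn from the Coruña Corpus.

```python
# Toy sketch of affix productivity counting over a word-frequency list.
# The word list and the suffix inventory are invented; a real study would
# draw both from the corpus samples described above.

from collections import Counter

word_freq = Counter({
    "observation": 12, "calculation": 7, "refraction": 1,
    "brightness": 3, "exactness": 1, "remoteness": 1,
    "movement": 5, "improvement": 1,
})
suffixes = ["-tion", "-ness", "-ment"]

for suffix in suffixes:
    ending = suffix.lstrip("-")
    types = [w for w in word_freq if w.endswith(ending)]
    tokens = sum(word_freq[w] for w in types)
    hapaxes = sum(1 for w in types if word_freq[w] == 1)
    # Hapax-based productivity: share of hapax legomena among the affix's tokens.
    productivity = hapaxes / tokens if tokens else 0.0
    print(f"{suffix}: {len(types)} types, {tokens} tokens, P = {productivity:.2f}")
```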

    Natural language software registry (second edition)


    Compiling and annotating a learner corpus for a morphologically rich language: CzeSL, a corpus of non-native Czech

    Learner corpora, linguistic collections documenting a language as used by learners, provide an important empirical foundation for language acquisition research and teaching practice. This book presents CzeSL, a corpus of non-native Czech, against the background of theoretical and practical issues in current learner corpus research. Languages with rich morphology and relatively free word order, including Czech, are particularly challenging for the analysis of learner language. The authors address both the complexity of learner error annotation, describing three complementary annotation schemes, and the complexity of describing non-native Czech in terms of standard linguistic categories. The book discusses in detail the practical aspects of corpus creation: the process of collection and annotation itself, the supporting tools, the resulting data, their formats, and the search platforms. The chapter on use cases exemplifies the usefulness of learner corpora for teaching, language acquisition research, and computational linguistics. Any researcher developing learner corpora will appreciate the concluding chapter listing lessons learned and pitfalls to avoid.
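    The book describes complementary error-annotation schemes that relate a learner's original form to corrected forms and error labels. The sketch below shows one plausible shape for such a multi-tier record in Python; the tier names, error labels, and example token are invented and do not reflect the actual CzeSL annotation format.

```python
# Hypothetical multi-tier annotation record for one learner token.
# Tier names and error labels are invented; the actual CzeSL schemes
# and data format are described in the book itself.

from dataclasses import dataclass, field

@dataclass
class TokenAnnotation:
    original: str                       # form as written by the learner
    tier1: str                          # e.g. orthographically corrected form
    tier2: str                          # e.g. morphosyntactically corrected form
    errors: list[str] = field(default_factory=list)  # error-type labels

example = TokenAnnotation(
    original="kniha",      # invented example
    tier1="kniha",
    tier2="knihu",
    errors=["wrong-case"],
)
print(example)
```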

    The TXM Portal Software giving access to Old French Manuscripts Online

    Full text online: http://www.lrec-conf.org/proceedings/lrec2012/workshops/13.ProceedingsCultHeritage.pdf
    This paper presents the new TXM software platform giving online access to Old French manuscript images and tagged transcriptions for concordancing and text mining. The platform can import medieval sources encoded in XML according to the TEI Guidelines for linking manuscript images to transcriptions, and can encode several diplomatic levels of transcription, including abbreviations and word-level corrections. It includes a sophisticated tokenizer able to deal with TEI tags at different levels of the linguistic hierarchy. Words are tagged on the fly during the import process using the IMS TreeTagger tool with a specific language model. Synoptic editions displaying manuscript images and text transcriptions side by side are produced automatically during import. Texts are organized in a corpus with their own metadata (title, author, date, genre, etc.), and several word-property indexes are produced for the CQP search engine to allow efficient word-pattern searches for building different types of frequency lists or concordances. For syntactically annotated texts, special indexes are produced for the Tiger Search engine to allow efficient building of syntactic concordances. The platform has also been tested on classical Latin, Ancient Greek, Old Slavonic and Old Hieroglyphic Egyptian corpora (including various types of encoding and annotations).
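    The abstract describes importing TEI-encoded transcriptions, tokenizing them, and tagging words so that the CQP search engine can index word properties. As a rough illustration of the kind of token/attribute data such a pipeline works with, the Python sketch below extracts word elements from a TEI-like fragment using only the standard library; the fragment and its attributes are invented for illustration, and this is not the TXM importer.

```python
# Toy extraction of word tokens from a TEI-like fragment, illustrating the kind
# of <w> elements a tokenizer/tagger pipeline would produce and index.
# The fragment and its lemma/pos values are invented; this is not TXM code.

import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
fragment = """
<div xmlns="http://www.tei-c.org/ns/1.0">
  <p>
    <w lemma="roi" pos="NOM">rois</w>
    <w lemma="estre" pos="VER">fu</w>
  </p>
</div>
"""

root = ET.fromstring(fragment)
tokens = [(w.text, w.get("lemma"), w.get("pos"))
          for w in root.iter(f"{TEI_NS}w")]
for form, lemma, pos in tokens:
    print(f"{form}\t{lemma}\t{pos}")
```

    Once such attributes are indexed, a CQP query along the lines of [lemma="roi"] [pos="VER"] is the sort of word-pattern search the abstract mentions.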