2 research outputs found

    Senųjų raštų rašybos keitimas paieškos sistemai [Changing the spelling of old writings for a search engine]

    [full article and abstract in Lithuanian; abstract in English] The Lithuanian historical corpus consists of machine-readable texts transcribed according to the principles of documentary edition; the original spelling and the language features it encodes are preserved. Several orthographic systems were used at various stages in the history of the Lithuanian language, and some of them differ considerably from the modern one. The historical orthography prevents the use of language analysis tools that were developed for the modern spelling. A link is therefore needed that connects the historical orthography to the modern one. Normalizing spelling poses several challenges: the same grapheme must be differently realized without changing the orthography and by rewriting the form in the modern Lithuanian alphabet. At the same time, phonetics has to be normalized, which includes eliminating dialectal phonetic features and representing phonemes in the assimilated position. These principles can be used to construct a universal search engine in which queries can be processed across different orthographic systems (http://sr.lki.lt). The size of the corpus and the limited resources available motivate the search for an automated way of normalizing orthography. A set of rules was developed based on empirical research on the history of orthography; these rules were then arranged hierarchically according to the length of the character sequence they process, and their application was restricted by metadata describing the spelling features of a particular source. An accuracy of 82–97% in normalization was achieved.
The advantage of rule-based transliteration is the consistency of changes; the disadvantage is that it generates not one but several equivalents of a word, and in certain cases ambiguous rules generate many tokens that do not exist in the natural language. The number of generated forms fed to the search engine was reduced by excluding non-existent letter sequences and by narrowing the query alphabet. A further selection of the correct forms could be done using dictionaries or tools for analyzing the morphology and syntax of modern Lithuanian.

[article and abstract in Lithuanian; abstract in English] Linguistic analysis requires digital texts suitable for software processing. For the database of the Institute of the Lithuanian Language, old writings are digitized following the principles of documentary transcription, without changing the original spelling. The old orthography is often variable and unsettled, and it differs considerably from the modern one, which hinders the application of technologies developed for studying modern Lithuanian. The article describes a method, based on empirical rules, for automatically generating forms in modern orthography from word forms in the old orthography, while preserving the features of the original spelling in the transcription. The generated equivalents are used in the search system
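
The rule-based normalization described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rule table, the forbidden sequences, and the function names (`segment`, `normalize`) are assumptions made for illustration only. It shows rules tried longest-first, ambiguous rules producing several candidates, and candidates pruned by a restricted alphabet and by non-existent letter sequences.

```python
from itertools import product

# Hypothetical rule table (illustrative only, not taken from the article):
# historical character sequence -> possible modern equivalents.
RULES = {
    "sz": ["š"],
    "cz": ["č"],
    "w": ["v"],
    "ł": ["l"],
    "y": ["y", "i"],  # an ambiguous rule: yields more than one candidate
}

# Letters of the modern Lithuanian alphabet, used to narrow the output.
MODERN_ALPHABET = set("aąbcčdeęėfghiįyjklmnoprsštuųūvzž")

# Hypothetical letter sequences assumed not to occur in modern forms.
FORBIDDEN_SEQUENCES = {"šš", "čč"}


def segment(word, rules):
    """Return, for each matched segment, its list of modern equivalents.

    At every position the longest matching rule is tried first, which mirrors
    ordering the rules by the length of the processed character sequence.
    """
    keys = sorted(rules, key=len, reverse=True)
    options = []
    i = 0
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                options.append(rules[key])
                i += len(key)
                break
        else:
            # No rule applies: keep the character unchanged.
            options.append([word[i]])
            i += 1
    return options


def normalize(word, rules=RULES):
    """Generate candidate modern spellings and drop implausible ones."""
    options = segment(word.lower(), rules)
    candidates = {"".join(parts) for parts in product(*options)}
    return sorted(
        c
        for c in candidates
        if set(c) <= MODERN_ALPHABET
        and not any(seq in c for seq in FORBIDDEN_SEQUENCES)
    )
```

In this sketch a per-source rule set could be passed through the `rules` parameter, which loosely corresponds to restricting rules by source metadata; choosing among the remaining candidates would still need a dictionary or morphological analyzer, as the abstract notes.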

    Historiae, History of Socio-Cultural Transformation as Linguistic Data Science. A Humanities Use Case

    The paper proposes an interdisciplinary approach including methods from disciplines such as history of concepts, linguistics, natural language processing (NLP) and Semantic Web, to create a comparative framework for detecting semantic change in multilingual historical corpora and generating diachronic ontologies as linguistic linked open data (LLOD). Initiated as a use case (UC4.2.1) within the COST Action Nexus Linguarum, European network for Web-centred linguistic data science, the study will explore emerging trends in knowledge extraction, analysis and representation from linguistic data science, and apply the devised methodology to datasets in the humanities to trace the evolution of concepts from the domain of socio-cultural transformation. The paper will describe the main elements of the methodological framework and preliminary planning of the intended workflow