988 research outputs found

    Online Dictionary - Tool for Preservation of Language Heritage

    The paper presents a bilingual online dictionary as a useful tool for the preservation of natural languages. The author focuses on the approach taken to develop a compatible bilingual lexical database for the Bulgarian-Polish online dictionary. A formal model for the dictionary encoding is developed in accordance with the complex structures of the dictionary entries. These structures vary depending on the grammatical characteristics of Bulgarian headwords. The Web application for presenting the bilingual dictionary is also described.

    Information Technologies for the Preservation of Language Heritage

    In this paper we show how information technologies, as tools for creating digital bilingual dictionaries, can help preserve natural languages. Natural languages are an outstanding part of human cultural values, and for that reason they should be preserved as part of the world's cultural heritage. We describe our work on the bilingual lexical database supporting the Bulgarian-Polish online dictionary. The main software tools for the web presentation of the dictionary are briefly described. We pay special attention to the presentation of verbs, the linguistic category in Bulgarian that is richest in specific characteristics.

    MONDILEX – towards the research infrastructure for digital resources in Slavic lexicography

    Low hanging fruit and the Boasian trilogy in digital lexicography of morphologically rich languages: Lessons from a survey of Indigenous language resources in Canada

    Online lexicographical resources for the morphologically rich Indigenous languages in Canada use a wide range of strategies for conveying their language’s morphological system, i.e. how words are inflected and derived, which this paper illustrates in a survey of seventeen bilingual online resources. The strategies these resources employ boil down to two basic approaches to the underlying structure of the resource: 1) a lexical database, or 2) a computational model. Most resources we surveyed are constructed around lexical databases. These assume the word(form) as the basic unit, an assumption that makes it difficult to incorporate the language’s sub-word, morphological structure in full detail. However, one resource uses a computational morphological model to bring the language’s morphology into the core of the lexicon – this proved to be a “low-hanging fruit” in the application of language technology that had been accomplished within a reasonable time-frame, as has been advocated by Trond Trosterud. We discuss the value created and questions raised by this approach and argue that it successfully overcomes the traditional Boasian three-way partition of dictionary, grammar, and text, creating integrated language resources that meet the modern needs of low-resource endangered languages and their communities

    Multilingual digital resources with Bulgarian language

    The paper briefly presents Bulgarian language resources as part of multilingual digital resources developed within the framework of several international projects, among them parallel annotated and aligned corpora, comparable corpora, morpho-syntactic specifications for corpus annotation and dictionary encoding, lexicons, lexical databases, and electronic dictionaries.

    Simulating the Machine Translation of Low-Resource Languages by Designing a Translator Between English and an Artificially Constructed Language

    Natural language processing (NLP), or the use of computers to analyze natural language, is a field that relies heavily on syntax. It would seem intuitive that computers would thrive in this area due to their strict syntax requirements, but the syntax of natural languages leaves them unable to properly parse and generate sentences that seem normal to the average speaker. A subfield of NLP, machine translation, works mainly to computerize translation between different languages. Unfortunately, such translation is not without its weaknesses: language documentation is not created equal, and many low-resource languages (languages with relatively few kinds of documentation, most often written) are left with no way to benefit effectively from machine translation. As a step toward better translation processors for low-resource languages, this thesis examined the possibility of machine translation between high-resource and low-resource languages by analyzing different machine learning techniques and ultimately constructing a simple translator between English and an artificially constructed language using a context-free grammar (CFG).

    Vector Search with OpenAI Embeddings: Lucene Is All You Need

    We provide a reproducible, end-to-end demonstration of vector search with OpenAI embeddings using Lucene on the popular MS MARCO passage ranking test collection. The main goal of our work is to challenge the prevailing narrative that a dedicated vector store is necessary to take advantage of recent advances in deep neural networks as applied to search. Quite the contrary, we show that hierarchical navigable small-world network (HNSW) indexes in Lucene are adequate to provide vector search capabilities in a standard bi-encoder architecture. This suggests that, from a simple cost-benefit analysis, there does not appear to be a compelling reason to introduce a dedicated vector store into a modern "AI stack" for search, since such applications have already received substantial investments in existing, widely deployed infrastructure

    Bulgarian-Polish Language Resources (Current State and Future Development)

    The paper briefly reviews the first Bulgarian-Polish digital bilingual resources, corpora and dictionaries, which are currently being developed under a bilateral collaboration between IMI-BAS and ISS-PAS: the joint research project “Semantics and contrastive linguistics with a focus on a bilingual electronic dictionary”, coordinated by L. Dimitrova (IMI-BAS) and V. Koseska (ISS-PAS).