988 research outputs found
Online Dictionary - Tool for Preservation of Language Heritage
The paper aims to present a bilingual online dictionary as a useful
tool for the preservation of natural languages. The author focuses on the
approach taken to develop a compatible bilingual lexical database for the
Bulgarian-Polish online dictionary. A formal model for the dictionary encoding
is developed in accordance with the complex structures of the dictionary
entries. These structures vary depending on the grammatical characteristics of
Bulgarian headwords. The web application for presentation of the bilingual
dictionary is also described
Information Technologies for the Preservation of Language Heritage
In this paper we try to present how information technologies as tools
for the creation of digital bilingual dictionaries can help the preservation of
natural languages. Natural languages are an outstanding part of human cultural
values and for that reason they should be preserved as part of the world cultural
heritage. We describe our work on the bilingual lexical database supporting the
Bulgarian-Polish Online dictionary. The main software tools for the
web-presentation of the dictionary are briefly described. We pay special
attention to the presentation of verbs, the linguistic category in Bulgarian
that is richest in specific characteristics
Low hanging fruit and the Boasian trilogy in digital lexicography of morphologically rich languages: Lessons from a survey of Indigenous language resources in Canada
Online lexicographical resources for the morphologically rich Indigenous languages in Canada use a wide range of strategies for conveying their language’s morphological system, i.e. how words are inflected and derived, which this paper illustrates in a survey of seventeen bilingual online resources. The strategies these resources employ boil down to two basic approaches to the underlying structure of the resource: 1) a lexical database, or 2) a computational model. Most resources we surveyed are constructed around lexical databases. These assume the word(form) as the basic unit, an assumption that makes it difficult to incorporate the language’s sub-word, morphological structure in full detail. However, one resource uses a computational morphological model to bring the language’s morphology into the core of the lexicon – this proved to be a “low-hanging fruit” in the application of language technology that had been accomplished within a reasonable time-frame, as has been advocated by Trond Trosterud. We discuss the value created and questions raised by this approach and argue that it successfully overcomes the traditional Boasian three-way partition of dictionary, grammar, and text, creating integrated language resources that meet the modern needs of low-resource endangered languages and their communities
Multilingual digital resources with Bulgarian language
The paper briefly presents Bulgarian language resources as part of the multilingual digital resources developed in the framework of several international projects, among them parallel annotated and aligned corpora, comparable corpora, morpho-syntactic specifications for corpora annotation and dictionaries encoding, lexicons, lexical databases, and electronic dictionaries
Simulating the Machine Translation of Low-Resource Languages by Designing a Translator Between English and an Artificially Constructed Language
Natural language processing (NLP), or the use of computers to analyze natural language, is a field that relies heavily on syntax. It would seem intuitive that computers would thrive in this area due to their strict syntax requirements, but the syntax of natural languages leaves them unable to properly parse and generate sentences that seem normal to the average speaker. A subfield of NLP, machine translation, works mainly to computerize translation between different languages. Unfortunately, such translation is not without its weaknesses; language documentation is not created equal, and many low-resource languages (languages with relatively few kinds of documentation, most often written) are left with no way to effectively benefit from machine translation. As a step toward better translation processors for low-resource languages, this thesis examined the possibility of machine translation between high-resource and low-resource languages through an analysis of different machine learning techniques, ultimately constructing a simple translator between English and an artificially constructed language using a context-free grammar (CFG)
Vector Search with OpenAI Embeddings: Lucene Is All You Need
We provide a reproducible, end-to-end demonstration of vector search with
OpenAI embeddings using Lucene on the popular MS MARCO passage ranking test
collection. The main goal of our work is to challenge the prevailing narrative
that a dedicated vector store is necessary to take advantage of recent advances
in deep neural networks as applied to search. Quite the contrary, we show that
hierarchical navigable small-world network (HNSW) indexes in Lucene are
adequate to provide vector search capabilities in a standard bi-encoder
architecture. This suggests that, from a simple cost-benefit analysis, there
does not appear to be a compelling reason to introduce a dedicated vector store
into a modern "AI stack" for search, since such applications have already
received substantial investments in existing, widely deployed infrastructure
Bulgarian-Polish Language Resources (Current State and Future Development)
The paper briefly reviews the first Bulgarian-Polish digital bilingual resources, corpora and dictionaries, which are currently being developed under a bilateral collaboration between IMI-BAS and ISS-PAS: the joint research project “Semantics and contrastive linguistics with a focus on a bilingual electronic dictionary”, coordinated by L. Dimitrova (IMI-BAS) and V. Koseska (ISS-PAS)