23 research outputs found

    Lemmatic machine translation

    Statistical MT is limited by its reliance on large parallel corpora. We propose Lemmatic MT, a new paradigm that extends MT to a far broader set of languages but requires substantial manual encoding effort. We present PANLINGUAL TRANSLATOR, a prototype Lemmatic MT system that achieved high translation adequacy on 59% to 99% of sentences (average 84%) on a sample of 6 language pairs that Google Translate (GT) handles; GT itself ranged from 34% to 93% (average 65%). PANLINGUAL TRANSLATOR also achieved high translation adequacy on 27% to 82% of sentences (average 62%) on a sample of 5 language pairs not handled by GT.

    Linking Discourse Marker Inventories

    The paper describes the first comprehensive edition of machine-readable discourse marker lexicons. Discourse markers such as and, because, but, though or thereafter are essential communicative signals in human conversation, as they indicate how an utterance relates to its communicative context. As much of this information is implicit or expressed differently in different languages, discourse parsing, context-adequate natural language generation and machine translation are considered particularly challenging aspects of Natural Language Processing. Providing this data in machine-readable, standard-compliant form will thus facilitate such technical tasks and, moreover, make it possible to apply translation inference techniques to this particular group of lexical resources, which was previously largely neglected in the context of Linguistic Linked (Open) Data.

    Quantitative analysis of the English lexicon in Wiktionaries and WordNet

    This paper presents a quantitative analysis of the English lexicon based on three electronic dictionaries: the English Wiktionary, WordNet, and the Russian Wiktionary. The number of English words and of their meanings (senses) is calculated. The distribution of words by part of speech, the numbers of monosemous and polysemous words, and the distribution of words by number of meanings are calculated and compared across these dictionaries. The analysis shows that the average polysemy and the number and distribution of word senses follow similar patterns in both expert and collaborative resources, with relatively minor differences.

    Bilingual dictionary generation and enrichment via graph exploration

    In recent years, we have witnessed a steady growth of linguistic information represented and exposed as linked data on the Web. Such linguistic linked data have stimulated the development and use of openly available linguistic knowledge graphs, as is the case with the Apertium RDF, a collection of interconnected bilingual dictionaries represented and accessible through Semantic Web standards. In this work, we explore techniques that exploit the graph nature of bilingual dictionaries to automatically infer new links (translations). We build upon a cycle-density-based method: partitioning the graph into biconnected components for a speed-up, and simplifying the pipeline through a careful structural analysis that reduces hyperparameter tuning requirements. We also analyse the shortcomings of traditional evaluation metrics used for translation inference and propose to complement them with new ones, both-word precision (BWP) and both-word recall (BWR), aimed at being more informative of algorithmic improvements. Over twenty-seven language pairs, our algorithm produces dictionaries about 70% of the size of existing Apertium RDF dictionaries at a high BWP of 85% from scratch within a minute. Human evaluation shows that 78% of the additional translations generated for dictionary enrichment are correct as well. We further describe an interesting use-case: inferring synonyms within a single language, on which our initial human-based evaluation shows an average accuracy of 84%. We release our tool as free/open-source software which can not only be applied to RDF data and Apertium dictionaries, but is also easily usable for other formats and communities.
    This work was partially funded by the Prêt-à-LLOD project within the European Union's Horizon 2020 research and innovation programme under grant agreement no. 825182. This article is also based upon work from COST Action CA18209 NexusLinguarum, "European network for Web-centred linguistic data science", supported by COST (European Cooperation in Science and Technology). It has also been partially supported by the Spanish projects TIN2016-78011-C4-3-R and PID2020-113903RB-I00 (AEI/FEDER, UE), by DGA/FEDER, and by the Agencia Estatal de Investigación of the Spanish Ministry of Economy and Competitiveness and the European Social Fund through the "Ramón y Cajal" program (RYC2019-028112-I).

    Quantitative analysis of the lexicon of the Russian WordNet and Wiktionaries

    This paper presents a quantitative analysis of the Russian lexicon based on the thesaurus Russian WordNet and two electronic dictionaries: the Russian Wiktionary and the English Wiktionary. The numbers of Russian words and of their meanings (senses) are compared by part of speech. The distribution of words by part of speech, the numbers of monosemous and polysemous words, and the distribution of words by number of meanings are calculated and compared across these resources. The analysis of the distribution of words by number of meanings revealed a problem: the English Wiktionary contains too few or no ambiguous Russian words with more than four meanings (in comparison with the Russian Wiktionary). The analysis shows that the average polysemy and the number and distribution of word senses follow similar patterns in both expert and collaborative resources, with relatively minor differences.