Lemmatic machine translation
Statistical MT is limited by its reliance on large parallel corpora. We propose Lemmatic MT, a new paradigm that extends MT to a far broader set of languages but requires substantial manual encoding effort. We present PANLINGUAL TRANSLATOR, a prototype Lemmatic MT system with high translation adequacy on 59% to 99% of sentences (average 84%) on a sample of 6 language pairs that Google Translate (GT) handles. For comparison, GT ranged from 34% to 93% (average 65%). PANLINGUAL TRANSLATOR also had high translation adequacy on 27% to 82% of sentences (average 62%) on a sample of 5 language pairs not handled by GT.
Linking Discourse Marker Inventories
The paper describes the first comprehensive edition of machine-readable discourse marker lexicons. Discourse markers such as "and", "because", "but", "though" or "thereafter" are essential communicative signals in human conversation, as they indicate how an utterance relates to its communicative context. Because much of this information is implicit or expressed differently in different languages, discourse parsing, context-adequate natural language generation and machine translation are considered particularly challenging aspects of Natural Language Processing. Providing this data in machine-readable, standard-compliant form will thus facilitate such technical tasks and, moreover, allow techniques for translation inference to be explored for this particular group of lexical resources, which was previously largely neglected in the context of Linguistic Linked (Open) Data.
Quantitative analysis of the English lexicon in Wiktionaries and WordNet
This paper presents a quantitative analysis of the English lexicon based on three electronic dictionaries: the English Wiktionary, WordNet, and the Russian Wiktionary. The number of English words and of their meanings (senses) is calculated. The distribution of words by part of speech, the numbers of monosemous and polysemous words, and the distribution of words by number of meanings were calculated and compared across these dictionaries. The analysis shows that the average polysemy and the number and distribution of word senses follow similar patterns in both expert and collaborative resources, with relatively minor differences.
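As a rough illustration of the statistics named in this abstract (word and sense counts, monosemous versus polysemous words, distribution by number of meanings, average polysemy), the following minimal sketch computes them for a toy word list. The function name `polysemy_stats`, the input format, and the toy data are all hypothetical, and "average polysemy" is assumed here to mean the mean number of senses per word over all words; the paper may define it differently.

```python
from collections import Counter


def polysemy_stats(sense_counts):
    """Summarise a dictionary given as {word: number_of_senses}.

    Assumption: every word has at least one sense, and average polysemy
    is the mean number of senses per word (monosemous words included).
    """
    if not sense_counts:
        raise ValueError("sense_counts must contain at least one word")

    # How many words have exactly 1, 2, 3, ... senses.
    distribution = Counter(sense_counts.values())
    total_words = len(sense_counts)
    total_senses = sum(sense_counts.values())
    monosemous = distribution.get(1, 0)

    return {
        "words": total_words,
        "senses": total_senses,
        "monosemous": monosemous,
        "polysemous": total_words - monosemous,
        "average_polysemy": total_senses / total_words,
        "distribution": dict(sorted(distribution.items())),
    }


# Invented toy data, not taken from any of the dictionaries discussed.
TOY_DICTIONARY = {"cat": 2, "run": 5, "oxygen": 1, "table": 3, "thereafter": 1}
TOY_STATS = polysemy_stats(TOY_DICTIONARY)
```

Running the same function on sense counts extracted from each resource would give directly comparable summaries, which is the kind of side-by-side comparison the abstract describes.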
Bilingual dictionary generation and enrichment via graph exploration
In recent years, we have witnessed a steady growth of linguistic information represented and exposed as linked data on the Web. Such linguistic linked data have stimulated the development and use of openly available linguistic knowledge graphs, as is the case with the Apertium RDF, a collection of interconnected bilingual dictionaries represented and accessible through Semantic Web standards. In this work, we explore techniques that exploit the graph nature of bilingual dictionaries to automatically infer new links (translations). We build upon a cycle-density-based method, partitioning the graph into biconnected components for a speed-up and simplifying the pipeline through a careful structural analysis that reduces hyperparameter tuning requirements. We also analyse the shortcomings of traditional evaluation metrics used for translation inference and propose to complement them with new ones, both-word precision (BWP) and both-word recall (BWR), aimed at being more informative of algorithmic improvements. Over twenty-seven language pairs, our algorithm produces dictionaries about 70% the size of existing Apertium RDF dictionaries at a high BWP of 85% from scratch within a minute. Human evaluation shows that 78% of the additional translations generated for dictionary enrichment are correct as well. We further describe an interesting use case: inferring synonyms within a single language, on which our initial human-based evaluation shows an average accuracy of 84%. We release our tool as free/open-source software that not only can be applied to RDF data and Apertium dictionaries, but is also easily usable for other formats and communities.
This work was partially funded by the Prêt-à-LLOD project within the European Union's Horizon 2020 research and innovation programme under grant agreement no. 825182. This article is also based upon work from COST Action CA18209 NexusLinguarum, "European network for Web-centred linguistic data science", supported by COST (European Cooperation in Science and Technology). It has also been partially supported by the Spanish projects TIN2016-78011-C4-3-R and PID2020-113903RB-I00 (AEI/FEDER, UE), by DGA/FEDER, and by the Agencia Estatal de Investigación of the Spanish Ministry of Economy and Competitiveness and the European Social Fund through the "Ramón y Cajal" program (RYC2019-028112-I).
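The partitioning step mentioned in the abstract above (splitting the translation graph into biconnected components) can be sketched as follows. This is not the authors' cycle-density method or their released tool: it only illustrates the partition, using a standard edge-stack variant of Tarjan's algorithm. The candidate rule (propose unlinked cross-language pairs inside one component) is a simplifying assumption, and the function names and toy edges are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def biconnected_components(edges):
    """Return a list of vertex sets, one per biconnected component."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    depth, low = {}, {}   # DFS depth and lowest depth reachable per vertex
    edge_stack, components = [], []

    def visit(u, parent, d):
        depth[u] = low[u] = d
        for v in sorted(graph[u]):
            if v == parent:
                continue
            if v not in depth:
                edge_stack.append((u, v))
                visit(v, u, d + 1)
                low[u] = min(low[u], low[v])
                if low[v] >= depth[u]:
                    # u cuts off v's subtree: pop one whole component.
                    comp = set()
                    while True:
                        a, b = edge_stack.pop()
                        comp.update((a, b))
                        if (a, b) == (u, v):
                            break
                    components.append(comp)
            elif depth[v] < depth[u]:
                # Back edge to an ancestor of u.
                edge_stack.append((u, v))
                low[u] = min(low[u], depth[v])

    for start in sorted(graph):
        if start not in depth:
            visit(start, None, 0)
    return components


def candidate_translations(edges):
    """Unlinked cross-language pairs that share a biconnected component."""
    known = {frozenset(edge) for edge in edges}
    candidates = set()
    for comp in biconnected_components(edges):
        for a, b in combinations(sorted(comp), 2):
            if a[0] != b[0] and frozenset((a, b)) not in known:
                candidates.add((a, b))
    return sorted(candidates)


# Toy graph: vertices are (language, word); edges are known translations.
EXAMPLE_EDGES = [
    (("en", "house"), ("es", "casa")),
    (("es", "casa"), ("fr", "maison")),
    (("fr", "maison"), ("ca", "casa")),
    (("ca", "casa"), ("en", "house")),
    (("en", "house"), ("de", "Haus")),   # bridge: forms its own component
]
EXAMPLE_CANDIDATES = candidate_translations(EXAMPLE_EDGES)
```

Because no cycle can cross a component boundary, each component can be searched independently, which is one plausible reading of the speed-up the abstract reports. The naive candidate rule over-generates on polysemous words, so a scoring step such as cycle density is still needed to filter the proposals.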
Quantitative analysis of the lexicon of the Russian WordNet and Wiktionaries
This paper presents a quantitative analysis of the Russian lexicon based on the Russian WordNet thesaurus and two electronic dictionaries: the Russian Wiktionary and the English Wiktionary. The numbers of Russian words and of their meanings (senses) are compared by part of speech. The distribution of words by part of speech, the numbers of monosemous and polysemous words, and the distribution of words by number of meanings were calculated and compared across these dictionaries. The analysis of the distribution of words by number of meanings revealed a problem: compared with the Russian Wiktionary, the English Wiktionary contains few or no ambiguous Russian words with more than four meanings. The analysis shows that the average polysemy and the number and distribution of word senses follow similar patterns in both expert and collaborative resources, with relatively minor differences.