3 research outputs found

    A hybrid approach for automatic extraction of bilingual Multiword Expressions from parallel corpora

    Domain-specific bilingual lexicons play an important role in domain adaptation for machine translation. The entries of such lexicons are mostly MultiWord Expressions (MWEs). Manually constructing bilingual MWE lexicons is costly and time-consuming, so word alignment approaches are often used to build them automatically from parallel corpora. We present in this paper a hybrid approach that extracts and aligns MWEs from parallel corpora in a one-step process. We formalize the alignment process as an integer linear programming problem in order to find an approximately optimal solution. This process generates lists of MWEs with their translations, which are then filtered using linguistic patterns to construct the bilingual MWE lexicons. We evaluate the lexicons produced by this approach using two methods: a manual evaluation of the alignment quality, and an evaluation of the impact of this alignment on the translation quality of the phrase-based statistical machine translation system Moses. We show experimentally that integrating the bilingual MWEs and their linguistic information into the translation model improves the performance of Moses.

    Uticaj klasifikacije teksta na primene u obradi prirodnih jezika (The impact of text classification on applications in natural language processing)

    The main goal of this dissertation is to put different text classification tasks in the same frame, by mapping the input data into the common vector space of linguistic attributes. Subsequently, several classification problems of great importance for natural language processing are solved by applying the appropriate classification algorithms. The dissertation deals with the problem of validation of bilingual translation pairs, so that the final goal is to construct a classifier which provides a substitute for human evaluation and which decides whether the pair is a proper translation between the appropriate languages by means of applying a variety of linguistic information and methods. In dictionaries it is useful to have a sentence that demonstrates use for a particular dictionary entry. This task is called the classification of good dictionary examples. In this thesis, a method is developed which automatically estimates whether an example is good or bad for a specific dictionary entry. Two cases of short message classification are also discussed in this dissertation. In the first case, classes are the authors of the messages, and the task is to assign each message to its author from that fixed set. This task is called authorship identification. The other observed classification of short messages is called opinion mining, or sentiment analysis. Starting from the assumption that a short message carries a positive or negative attitude about a thing, or is purely informative, classes can be: positive, negative and neutral. These tasks are of great importance in the field of natural language processing and the proposed solutions are language-independent, based on machine learning methods: support vector machines, decision trees and gradient boosting. 
For all of these tasks, the effectiveness of the proposed methods is demonstrated for the Serbian language.

    Terminology development in power engineering based on natural language processing methods

    This work analyzes terminology development in the power engineering domain using natural language processing methods. It is divided into eight chapters and covers the general theory of terminology as an academic field, the international and domestic institutions involved in terminology development, the development of specialized power engineering terminology in Serbian, the application of corpus linguistics in terminological research, and the corpus tools and language resources used in processing corpus texts. Parallel corpora are bilingual or multilingual corpora of texts that are very important in linguistic research. The development of a parallel corpus of texts from the power engineering domain (ElEner) started alongside the preparation of this doctoral dissertation. The corpus is composed of 76 technical, scientific and legislative texts in both Serbian and English, published from 2006 until 2015 (the Serbian-language abstract gives the period as 2005 to 2016). The dissertation thoroughly analyzes the process of text selection and collection, text processing using appropriate language resources and tools for Serbian and English, parallelization of texts, extraction of terminology in Serbian and English, alignment and matching of chunks and terms, and evaluation of the obtained terms and term pairs. After the evaluation process is completed, all correctly evaluated pairs are included in the Termi terminology database, which supports the development of terminological dictionaries in various fields (mathematics, computing, mining, librarianship, computational linguistics, power engineering, etc.), as well as the processing and presentation of terms in Serbian, English, German and French and their export to various output formats. This database is thus extended with new lexical units and their synonyms from the power engineering domain in Serbian and English. The obtained list of translation pairs was used to generate a bilingual power engineering dictionary. The new aligned ElEner corpus is stored in the digital library Bibliša, which enables multilingual search of large collections of aligned texts. Search in this digital library relies on lexical resources that enable morphological and semantic expansion of queries. The obtained terminological pairs form the basis for the development of a new modern dictionary in the field of power engineering, and provide an opportunity to improve and expand the Electropedia terminology base. The text processing procedure proposed in this dissertation has proven to be applicable and useful in other domains as well. In future research, the goal is to improve the proposed technique by adding automatic validation of the obtained bilingual term candidates to the existing procedure, based on state-of-the-art machine learning techniques.