3 research outputs found
A hybrid approach for automatic extraction of bilingual Multiword Expressions from parallel corpora
Domain-specific bilingual lexicons play an important role in domain adaptation for machine translation. The entries of such lexicons are mostly MultiWord Expressions (MWEs). Manually constructing bilingual MWE lexicons is costly and time-consuming, so word alignment approaches are often used to build them automatically from parallel corpora. We present in this paper a hybrid approach that extracts and aligns MWEs from parallel corpora in a one-step process. We formalize the alignment process as an integer linear programming problem in order to find an approximately optimal solution. This process generates lists of MWEs with their translations, which are then filtered using linguistic patterns to construct the bilingual MWE lexicons. We evaluate the lexicons produced by this approach using two methods: a manual evaluation of the alignment quality, and an evaluation of the impact of this alignment on the translation quality of the phrase-based statistical machine translation system Moses. We show experimentally that integrating the bilingual MWEs and their linguistic information into the translation model improves the performance of Moses.
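As an illustration of what formalizing alignment as an integer linear program can look like, the following is a minimal sketch and not the authors' actual model. It assumes a 0/1 variable for each candidate pair, an objective that sums association scores, and constraints that allow each expression at most one link. The scores, the example expressions and the function name `align` are hypothetical, and the toy instance is solved by exhaustive enumeration rather than by an ILP solver.

```python
from itertools import product

# Hypothetical association scores between source and target MWE candidates
# (illustrative numbers, not taken from the paper).
scores = {
    ("machine translation", "traduction automatique"): 0.9,
    ("machine translation", "lexique bilingue"): 0.1,
    ("bilingual lexicon", "traduction automatique"): 0.2,
    ("bilingual lexicon", "lexique bilingue"): 0.8,
}


def align(scores):
    """Solve a tiny 0/1 program by enumeration: choose x[s, t] in {0, 1}
    to maximise the total score, with each source and each target
    expression linked at most once."""
    pairs = sorted(scores)
    best_value, best_pairs = 0.0, []
    for bits in product((0, 1), repeat=len(pairs)):
        chosen = [p for p, b in zip(pairs, bits) if b]
        sources = [s for s, _ in chosen]
        targets = [t for _, t in chosen]
        # Constraints: at most one link per source and per target.
        if len(set(sources)) != len(sources) or len(set(targets)) != len(targets):
            continue
        value = sum(scores[p] for p in chosen)
        if value > best_value:
            best_value, best_pairs = value, chosen
    return best_pairs


alignment = align(scores)
```

Enumeration is only workable for a handful of candidates; on real corpora the same objective and constraints would be handed to a dedicated solver, which is presumably where the approximate solution mentioned in the abstract comes in.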
The impact of text classification on applications in natural language processing (Uticaj klasifikacije teksta na primene u obradi prirodnih jezika)
The main goal of this dissertation is to put different text classification tasks in
the same frame, by mapping the input data into the common vector space of linguistic
attributes. Subsequently, several classification problems of great importance for natural
language processing are solved by applying the appropriate classification algorithms.
The dissertation deals with the problem of validation of bilingual translation pairs, so
that the final goal is to construct a classifier which provides a substitute for human evaluation
and which decides whether the pair is a proper translation between the appropriate
languages by means of applying a variety of linguistic information and methods.
In dictionaries it is useful to have a sentence that demonstrates use for a particular dictionary
entry. This task is called the classification of good dictionary examples. In this thesis,
a method is developed which automatically estimates whether an example is good or bad
for a specific dictionary entry.
Two cases of short message classification are also discussed in this dissertation. In the
first case, classes are the authors of the messages, and the task is to assign each message
to its author from that fixed set. This task is called authorship identification. The other
observed classification of short messages is called opinion mining, or sentiment analysis.
Starting from the assumption that a short message carries a positive or negative attitude
about a thing, or is purely informative, classes can be: positive, negative and neutral.
These tasks are of great importance in the field of natural language processing and the
proposed solutions are language-independent, based on machine learning methods: support
vector machines, decision trees and gradient boosting. For all of these tasks, a
demonstration of the effectiveness of the proposed methods is shown for the Serbian
language.
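The idea of mapping texts into a common vector space of attributes and then applying a classifier can be sketched as follows. This is a minimal illustration under stated assumptions: the three surface attributes in `featurize`, the toy training data and the helper names are hypothetical, and a depth-1 decision tree (a decision stump) stands in for the support vector machines, decision trees and gradient boosting named in the abstract.

```python
import string


def featurize(text):
    """Map a text to a fixed-length vector of simple surface attributes.
    These attributes are illustrative, not the dissertation's feature set."""
    tokens = text.split()
    n_tokens = len(tokens)
    avg_len = sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0
    n_punct = sum(ch in string.punctuation for ch in text)
    return [float(n_tokens), avg_len, float(n_punct)]


def train_stump(vectors, labels):
    """Fit a depth-1 decision tree: pick the (feature, threshold, labels)
    split that makes the fewest errors on the training data."""
    best = None
    for f in range(len(vectors[0])):
        for threshold in sorted({v[f] for v in vectors}):
            for left, right in ((0, 1), (1, 0)):
                errors = sum(
                    (left if v[f] <= threshold else right) != y
                    for v, y in zip(vectors, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, left, right)
    return best[1:]


def predict(stump, vector):
    """Classify one feature vector with a trained stump."""
    f, threshold, left, right = stump
    return left if vector[f] <= threshold else right


# Toy data: 1 = usable dictionary example, 0 = unusable fragment (assumed labels).
texts = [
    "The corpus contains aligned legal texts.",
    "corpus texts",
    "She opened the dictionary and read the entry.",
    "entry",
]
labels = [1, 0, 1, 0]

stump = train_stump([featurize(t) for t in texts], labels)
prediction = predict(stump, featurize("A good example is a complete sentence."))
```

Because every task shares the same `featurize` step, swapping the labels (authors, sentiment classes, valid or invalid translation pairs) changes the task without changing the pipeline, which is the framing the abstract describes.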
Terminology development in power engineering based on natural language processing methods
This paper analyzes terminology development in the power engineering domain using natural language processing methods. The paper is divided into eight chapters and covers the general theory of terminology as an academic field, the international and domestic institutions involved in terminology development, the development of specialized power engineering terminology in Serbian, the application of corpus linguistics in terminological research, and the corpus processing tools and language resources used.
Parallel corpora are bilingual or multilingual text corpora that are very important in linguistic research. The development of a parallel corpus of power engineering texts (ElEner) started alongside the preparation of this doctoral dissertation. The corpus comprises 76 technical, scientific and legislative documents in Serbian and English, published from 2006 until 2015.
The dissertation thoroughly analyzes the process of text selection and collection, text processing with appropriate language resources and tools for Serbian and English, parallelization of texts, extraction of terminology in Serbian and English, alignment and matching of chunks and terms, and evaluation of the obtained terms and term pairs. After the evaluation process is completed, all pairs evaluated as correct are included in the Termi terminology database, which supports the development of terminological dictionaries in various fields (mathematics, computing, mining, librarianship, computational linguistics, power engineering, etc.), as well as the processing and presentation of terms in Serbian, English, German and French and their export to various output formats. This database is thus extended with new lexical units and their synonyms from the power engineering domain in Serbian and English. The obtained list of translation pairs was used to generate a bilingual power engineering dictionary.
The new aligned ElEner corpus is stored in the digital library Bibliša, which enables multilingual search of large collections of aligned texts. Searching this digital library relies on lexical resources that enable morphological and semantic expansion of the queries.
The obtained terminological pairs form the basis for developing a new, modern dictionary in the field of power engineering, and provide an opportunity to improve and expand the Electropedia terminology base. The text processing procedure proposed in this dissertation has proven applicable and useful in other domains as well. In future research, the goal is to improve the proposed technique by adding automatic validation of the obtained bilingual term candidates to the existing procedure, based on state-of-the-art machine learning techniques.
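The morphological and semantic query expansion mentioned for the digital library search can be sketched roughly as follows. This is an illustration only and does not reflect the actual Bibliša implementation: the mini-lexicons `INFLECTIONS` and `SYNONYMS`, their entries and the function `expand_query` are all hypothetical stand-ins for full morphological dictionaries and semantic resources.

```python
# Hypothetical mini-lexicons; a real system would draw on full
# morphological dictionaries and a wordnet-like semantic resource.
INFLECTIONS = {
    "transformer": ["transformer", "transformers"],
    "transformator": ["transformator", "transformatora", "transformatoru"],
}
SYNONYMS = {
    "transformer": ["transformator"],  # assumed cross-lingual equivalent
}


def expand_query(term):
    """Return the term plus its related lemmas, each in all listed forms."""
    lemmas = [term] + SYNONYMS.get(term, [])
    expanded = []
    for lemma in lemmas:
        # Fall back to the lemma itself when no inflected forms are known.
        for form in INFLECTIONS.get(lemma, [lemma]):
            if form not in expanded:
                expanded.append(form)
    return expanded


expanded_query = expand_query("transformer")
```

Searching an aligned corpus with every form in `expanded_query` lets a single query term match inflected occurrences in both languages, which matters for a morphologically rich language such as Serbian.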