
3 research outputs found

A hybrid approach for automatic extraction of bilingual Multiword Expressions from parallel corpora

Author: Semmar Nasredine
Publication venue: HAL CCSD
Publication date: 07/05/2018

Domain-specific bilingual lexicons play an important role in domain adaptation for machine translation. The entries of these lexicons are mostly MultiWord Expressions (MWEs). Manually constructing bilingual MWE lexicons is costly and time-consuming, so word alignment approaches are often used to build them automatically from parallel corpora. We present in this paper a hybrid approach that extracts and aligns MWEs from parallel corpora in a one-step process. We formalize the alignment process as an integer linear programming problem in order to find an approximate optimal solution. This process generates lists of MWEs with their translations, which are then filtered using linguistic patterns to construct the bilingual MWE lexicons. We evaluate the lexicons produced by this approach using two methods: a manual evaluation of the alignment quality, and an evaluation of the impact of this alignment on the translation quality of the phrase-based statistical machine translation system Moses. We show experimentally that integrating the bilingual MWEs and their linguistic information into the translation model improves the performance of Moses.

HAL-CEA

Uticaj klasifikacije teksta na primene u obradi prirodnih jezika (The impact of text classification on applications in natural language processing)

Author: Šandrih Branislava
Publication venue: University of Belgrade, Faculty of Mathematics
Publication date: 08/07/2020

The main goal of this dissertation is to put different text classification tasks in the same frame, by mapping the input data into the common vector space of linguistic attributes. Subsequently, several classification problems of great importance for natural language processing are solved by applying the appropriate classification algorithms. The dissertation deals with the problem of validation of bilingual translation pairs, so that the final goal is to construct a classifier which provides a substitute for human evaluation and which decides whether the pair is a proper translation between the appropriate languages by means of applying a variety of linguistic information and methods. In dictionaries it is useful to have a sentence that demonstrates use for a particular dictionary entry. This task is called the classification of good dictionary examples. In this thesis, a method is developed which automatically estimates whether an example is good or bad for a specific dictionary entry. Two cases of short message classification are also discussed in this dissertation. In the first case, classes are the authors of the messages, and the task is to assign each message to its author from that fixed set. This task is called authorship identification. The other observed classification of short messages is called opinion mining, or sentiment analysis. Starting from the assumption that a short message carries a positive or negative attitude about a thing, or is purely informative, classes can be: positive, negative and neutral. These tasks are of great importance in the field of natural language processing and the proposed solutions are language-independent, based on machine learning methods: support vector machines, decision trees and gradient boosting. 
For all of these tasks, the effectiveness of the proposed methods is demonstrated for the Serbian language.

National Repository of Dissertations in Serbia (NaRDuS)


Terminology development in power engineering based on natural language processing methods

Author: Ivanović Tanja D.
Publication venue: University of Belgrade, Faculty of Philology
Publication date: 09/03/2022

This dissertation analyzes terminology development in the power engineering domain using natural language processing methods. It is divided into eight chapters and covers the general theory of terminology as an academic field, the international and domestic institutions involved in terminology development, the development of specialized terminology in Serbian, the application of corpus linguistics in terminological research, and the corpus tools and language resources used to process corpus texts. Parallel corpora are bilingual or multilingual corpora of texts that are very important in linguistic research. The development of a parallel corpus of texts from the power engineering domain (ElEner) started alongside the preparation of this doctoral dissertation. In the course of this work, 76 documents were analyzed, and these make up the corpus. The corpus is composed of technical, scientific and legislative texts in both Serbian and English, published from 2006 until 2015.

The dissertation thoroughly analyzes the process of text selection and collection, text processing using appropriate language resources and tools for Serbian and English, parallelization of texts, extraction of terminology in Serbian and English, alignment and matching of chunks and terms, and evaluation of the obtained terms and term pairs. After the evaluation process was completed, all pairs evaluated as correct were included in the Termi terminology database, which supports the development of terminological dictionaries in various fields (mathematics, computing, mining, librarianship, computational linguistics, power engineering, etc.), as well as the processing and presentation of terms in Serbian, English, German and French and their export to various output formats. The database is thus extended with new lexical units from the power engineering domain in Serbian and English, along with their synonyms. The resulting list of translation pairs was used to generate a bilingual power engineering dictionary.

The new aligned ElEner corpus is stored in the digital library Bibliša, which enables multilingual search of large collections of aligned texts. Searching this digital library relies on lexical resources that enable morphological and semantic expansion of queries. The obtained terminological pairs form the basis for developing a new, modern dictionary of power engineering, and provide an opportunity to improve and expand the Electropedia terminology base. The text processing procedure proposed in this dissertation has proven applicable and useful in other domains as well. In future research, the goal is to improve the proposed technique by adding automatic validation of the obtained bilingual term candidates to the existing procedure, based on state-of-the-art machine learning techniques.

Nardus