    The aim of this paper is to analyse thoroughly inter-annotator agreement in error analysis, a task known for its subjectivity and low level of agreement. Because the process is tiresome by nature and the available user interfaces differ considerably from what the average annotator is accustomed to, a user-friendly Windows 10 application with a more attractive user interface was developed to simplify error analysis. Translations are produced with the Google Translate engine, and English-Croatian is the selected language pair. Since inter-annotator agreement has been widely disputed and the need for guidelines has often been emphasized as crucial, the annotators are given a very detailed introduction to the process of error analysis, including a presentation listing the MQM guidelines enriched with tricky cases. All annotators are native speakers of Croatian, the target language, and have a linguistic background. The results show that stronger agreement points to more similar backgrounds and that annotators should be selected more carefully. Furthermore, a training phase on a similar test set is deemed necessary to achieve stronger agreement.
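
The abstract above does not say which agreement statistic was used. As a minimal sketch, assuming two annotators and Cohen's kappa as the measure (an assumption, not the paper's stated method), agreement on error labels could be computed as follows. The label names and the six example segments are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical error labels for six machine-translated segments.
annotator_1 = ["accuracy", "fluency", "fluency", "none", "accuracy", "none"]
annotator_2 = ["accuracy", "fluency", "accuracy", "none", "accuracy", "fluency"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few error categories dominate the annotations.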

    An Insight into the Automatic Extraction of Metaphorical Collocations (Uvid u automatsko izlučivanje metaforičkih kolokacija)

    Collocations have been the subject of much scientific research over the years. This research focuses on a subset of collocations, namely metaphorical collocations, in which one of the components has undergone a semantic shift, i.e., takes on a transferred meaning. The main goal of this paper is to review the existing literature and provide a systematic overview of existing research on collocation extraction, together with the methods, measures, and resources it relies on. The research is classified by approach (statistical, hybrid, and based on distributional semantics) and presented in three separate sections. Various association measures and existing ways of evaluating the results of automatic collocation extraction are also described. The methods, tools, and resources that may prove useful for future work are highlighted. The insights gained serve as a first step in exploring the possibility of developing a method for the automatic extraction of metaphorical collocations.

    The Official Bilingualism in the Istrian County: State of the Art and Perspectives

    Official bilingualism entails the daily creation of parallel texts in bilingual areas. This is also the case in the Istrian County, where texts are usually written in Croatian and then translated into Italian. Given the official nature of the texts and the context in which Italian is used, it is extremely important to have precise and uniform terminology, as well as well-developed language technologies that would allow faster and more accurate translation in the bilingual institutions of the Istrian County. The aims of this paper are (1) to present how the equal official use of Italian as a minority language is implemented in the Istrian County through (1a) an analysis of the translated content on the official websites of the officially bilingual cities and municipalities and (1b) an overview of current translation practices, and (2) to highlight the importance and necessity of developing language technologies through (2a) a presentation of trends in the development of language technologies in similar bilingual and multilingual institutions, (2b) a presentation of the preparation of a parallel corpus of administrative texts of the Istrian County, and (2c) an analysis of existing terminology conducted on the compiled corpus. The analyses of the availability of bilingual content on websites and of the terminology in the compiled parallel corpus indicate the need to develop and use translation tools and language resources tailored to Italian as a minority language, in order to facilitate and accelerate translation and thus enable a more successful implementation of the equal use of Italian as a minority language.

    A general framework for detecting metaphorical collocations

    This paper aims to identify a specific set of collocations known as metaphorical collocations. In this type of collocation, a semantic shift has taken place in one of the components. Since an appropriate gold standard needs to be compiled prior to any serious endeavour to extract metaphorical collocations automatically, this paper first presents the steps taken to compile it and then establishes an appropriate evaluation framework. The process of compiling the gold standard is illustrated on one of the most frequent Croatian nouns, which resulted in the preliminary relation significance set. To investigate whether the process can be facilitated, frequency, logDice, relation, and pretrained word embeddings are used as features in a classification task conducted on the logDice-based word sketch relation lists. Preliminary results are presented.
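
The logDice score mentioned above is commonly defined as 14 + log2(2·f_xy / (f_x + f_y)), where f_xy is the co-occurrence frequency of the two words and f_x, f_y are their individual frequencies. The sketch below ranks candidate pairs by that score. The word pairs and counts are hypothetical and are not taken from the paper's data.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice association score; the theoretical maximum is 14."""
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts: (co-occurrence, frequency of word 1, frequency of word 2).
candidates = {
    ("heavy", "rain"): (120, 4000, 3000),
    ("big", "rain"): (5, 9000, 3000),
}

# Rank the candidate pairs from the strongest to the weakest association.
ranked = sorted(candidates, key=lambda pair: log_dice(*candidates[pair]), reverse=True)
for pair in ranked:
    print(pair, round(log_dice(*candidates[pair]), 2))
```

Because the score depends only on the frequencies of the two words and of the pair, not on corpus size, values are comparable across corpora, which is one reason it suits word sketch lists.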

    Neural machine translation for translating into Croatian and Serbian

    In this work, we systematically investigate different set-ups for training neural machine translation (NMT) systems for translation into Croatian and Serbian, two closely related South Slavic languages. We explore English and German as source languages, different sizes and types of training corpora, as well as bilingual and multilingual systems. We also explore translation of English IMDb user movie reviews, a domain/genre where only monolingual data are available. First, our results confirm that multilingual systems with joint target languages perform better. Furthermore, translation performance from English is much better than from German, partly because German is morphologically more complex and partly because the corpus consists mostly of parallel human translations instead of original text and its human translation. Translation from German should be further investigated systematically. For translating user reviews, creating synthetic in-domain parallel data through back- and forward-translation and adding them to a small out-of-domain parallel corpus can yield performance comparable with that of a system trained on a full out-of-domain corpus. However, it is still not clear what the optimal size of synthetic in-domain data is, especially for forward-translated data, where the target language is machine translated. More detailed research, including manual evaluation and analysis, is needed in this direction.

    On machine translation of user reviews

    This work investigates neural machine translation (NMT) systems for translating English user reviews into Croatian and Serbian, two similar morphologically complex languages. Two types of reviews are used for testing the systems: IMDb movie reviews and Amazon product reviews. Two types of training data are explored: large out-of-domain bilingual parallel corpora and a small synthetic in-domain parallel corpus obtained by machine translation of monolingual English Amazon reviews into the target languages. Both automatic scores and human evaluation show that using the synthetic in-domain corpus together with a selected subset of out-of-domain data is the best option. Separate results on IMDb and Amazon reviews indicate that MT systems perform differently on different review types, so user reviews generally should not be considered a homogeneous genre. Nevertheless, more detailed research on a larger number of different reviews covering different domains/topics is needed to fully understand these differences.

    Utilization of Explainable Machine Learning Algorithms for Determination of Important Features in ‘Suncrest’ Peach Maturity Prediction

    Peaches (Prunus persica (L.) Batsch) are a popular fruit in Europe and Croatia. Maturity at harvest has a crucial influence on peach fruit quality, storage life, and consequently consumer acceptance. The main goal of this study is to develop a machine learning model that will detect the most important features for predicting peach maturity by first training models and then using the importance ratings of these models to detect nonlinear (and linear) relationships. Thus, the most important peach features at a given stage of its ripening could be revealed. To date, this method has not been used for this purpose, and at the same time, it has the potential to be applied to other similar peach varieties. A total of 33 fruit features are measured on the harvested peaches, and three imbalanced datasets are created using firmness thresholds of 1.84, 3.57, and 4.59 kg·cm−2. These datasets are balanced using the SMOTE and ROSE techniques, and the Random Forest machine learning model is trained on them. Permutation Feature Importance (PFI), Variable Importance (VI), and LIME interpretability methods are used to detect variables that most influence predictions in the given machine learning models. PFI shows that the h° and a* ground color parameters, COL ground color index, SSC/TA, and TA inner quality parameters are among the top ten most contributing variables in all three models. Meanwhile, VI shows that this is the case for the a* ground color parameter, COL and CCL ground color indexes, and the SSC/TA inner quality parameter. The fruit flesh ratio is highly positioned (among the top three according to PFI) in two models, but it is not even among the top ten in the third.
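
The idea behind Permutation Feature Importance can be illustrated with a minimal sketch: shuffle one feature column at a time and record how much the model's accuracy drops. The code below is not the study's pipeline. In place of the Random Forest it uses a fixed toy rule, and the two features ("hue" and "noise"), the threshold of 90, and all values are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Share of rows for which the model predicts the given label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean accuracy drop per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical data: each row is (hue, noise); label 1 stands for "mature".
rows = [(70.0, 0.3), (75.0, 0.9), (80.0, 0.1), (85.0, 0.7),
        (95.0, 0.5), (100.0, 0.2), (105.0, 0.8), (110.0, 0.4)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def toy_model(row):
    """Stand-in for a trained classifier: uses only the first feature."""
    return 1 if row[0] < 90 else 0


importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

A feature the model ignores gets an importance of exactly zero, while shuffling a feature the model depends on lowers accuracy. This is the property that allows variables to be ranked.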

    Assessment of Various Machine Learning Models for Peach Maturity Prediction Using Non-Destructive Sensor Data

    To date, many machine learning models have been used for peach maturity prediction using non-destructive data, but no performance comparison of the models on these datasets has been conducted. In this study, eight machine learning models were trained on a dataset containing data from 180 ‘Suncrest’ peaches. Before the models were trained, the dataset was subjected to dimensionality reduction using the least absolute shrinkage and selection operator (LASSO) regularization, and 8 input variables (out of 29) were chosen. At the same time, a subgroup consisting of the peach ground color measurements was singled out by dividing the set of variables into three subgroups and by using group LASSO regularization. This type of variable subgroup selection provided valuable information on the contribution of specific groups of peach traits to the maturity prediction. The area under the receiver operating characteristic curve (AUC) values of the selected models were compared, and the artificial neural network (ANN) model achieved the best performance, with an average AUC of 0.782. The second-best machine learning model was linear discriminant analysis with an AUC of 0.766, followed by logistic regression, gradient boosting machine, random forest, support vector machines, a classification and regression trees model, and k-nearest neighbors. Although the primary parameter used to determine the performance of the model was AUC, accuracy, F1 score, and kappa served as control parameters and ultimately confirmed the obtained results. By outperforming the other models, ANN proved to be the most accurate model for peach maturity prediction on the given dataset.
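
AUC, the primary comparison metric above, equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. The labels and scores are hypothetical and unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A correctly ordered pair counts 1, a tied pair counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical maturity labels (1 = mature) and model scores for six fruits.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(round(roc_auc(labels, scores), 3))
```

Because it depends only on how the scores rank the examples, AUC allows classifiers to be compared without fixing a decision threshold.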

