
    Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish

    In morphologically complex languages, many high-level tasks in natural language processing rely on accurate morphosyntactic analyses of the input. However, in light of the risk of error propagation in present-day pipeline architectures for basic linguistic pre-processing, the state of the art for morphosyntactic tagging is still not satisfactory. The main obstacle here is data sparsity, inherent to natural language in general and to highly inflected languages in particular. In this work, we investigate whether semi-supervised systems may alleviate the data sparsity problem. Our approach uses word clusters obtained from large amounts of unlabelled text in an unsupervised manner in order to provide a supervised probabilistic tagger with morphologically informed features. Our evaluations on a number of datasets for the Polish language suggest that this simple technique improves tagging accuracy, especially with regard to out-of-vocabulary words. This may prove useful for increasing the cross-domain performance of taggers and for alleviating the dependency on large amounts of supervised training data, which is especially important from the perspective of less-resourced languages.
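    A minimal sketch of the kind of feature augmentation the abstract describes, assuming Brown-style hierarchical cluster bit-strings; the cluster table, feature names, and Polish word forms below are illustrative assumptions, not the authors' implementation:

```python
# Sketch: augmenting a supervised tagger's feature set with unsupervised
# word clusters learned from unlabelled text (illustrative, not the
# authors' system). Cluster bit-string prefixes generalise over unseen
# inflected forms, which is why they help on out-of-vocabulary words.

# Hypothetical cluster table: word -> hierarchical cluster bit-string.
CLUSTERS = {
    "kota": "110100",
    "kotem": "110101",   # inflected forms of one lemma land nearby
    "domu": "011010",
}

def features(sentence, i, clusters=CLUSTERS):
    """Feature dict for token i, mixing lexical and cluster features."""
    word = sentence[i]
    feats = {
        "word": word.lower(),
        "suffix3": word[-3:],        # morphological cue in inflected languages
        "is_title": word.istitle(),
    }
    bits = clusters.get(word.lower())
    if bits is not None:
        for k in (4, 6):             # two prefix lengths of the bit-string
            feats[f"cluster:{k}"] = bits[:k]
    return feats

if __name__ == "__main__":
    print(features(["kotem", "domu"], 0))
```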

    Automatic Question Generation Using Semantic Role Labeling for Morphologically Rich Languages

    In this paper, a novel approach to automatic question generation (AQG) using semantic role labeling (SRL) for morphologically rich languages is presented. A model for AQG is developed for our native language, Croatian, a highly inflected language belonging to the Balto-Slavic family. The article is divided into two stages. In the first stage we present a novel approach to SRL of texts written in Croatian using Conditional Random Fields (CRF). SRL traditionally consists of predicate disambiguation, argument identification and argument classification. After these steps, most approaches use beam search to find the optimal sequence of arguments for a given predicate. We propose an architecture for predicate identification and argument classification in which finding the best sequence of arguments is handled by Viterbi decoding. We enrich the SRL features with attributes tailored to this language. Our SRL system achieves an F1 score of 78% in the argument classification step on the Croatian hr500k corpus. In the second stage, the proposed SRL model is used to develop an AQG system for question generation from texts written in Croatian. We propose custom templates for AQG, which were used to generate a total of 628 questions; these were evaluated by experts scoring every question on a Likert scale. The expert evaluation showed that our AQG achieved good results: 68% of the generated questions could be used for educational purposes. With these results, the proposed AQG system is a candidate for implementation inside educational systems such as Intelligent Tutoring Systems.
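    The decoding step the abstract mentions can be illustrated with a standard Viterbi sketch over argument labels; the label set, scores, and transition table below are toy assumptions, not the paper's CRF model:

```python
# Sketch: Viterbi decoding of the best argument-label sequence for a
# predicate, the alternative to beam search described in the abstract.
import math

LABELS = ["O", "A0", "A1"]

def viterbi(emissions, transitions):
    """emissions: list of {label: log-score} per token;
    transitions: {(prev, cur): log-score}. Returns best label sequence."""
    best = [{lab: (emissions[0][lab], None) for lab in LABELS}]
    for t in range(1, len(emissions)):
        col = {}
        for cur in LABELS:
            # choose the predecessor maximising path score + transition
            prev, score = max(
                ((p, best[t - 1][p][0] + transitions.get((p, cur), -math.inf))
                 for p in LABELS),
                key=lambda x: x[1],
            )
            col[cur] = (score + emissions[t][cur], prev)
        best.append(col)
    # backtrack from the best final label
    last = max(LABELS, key=lambda lab: best[-1][lab][0])
    path = [last]
    for t in range(len(emissions) - 1, 0, -1):
        last = best[t][last][1]
        path.append(last)
    return list(reversed(path))

if __name__ == "__main__":
    ems = [{"O": -1.0, "A0": -0.2, "A1": -2.0},
           {"O": -0.5, "A0": -2.0, "A1": -0.3}]
    trans = {(p, c): 0.0 for p in LABELS for c in LABELS}
    print(viterbi(ems, trans))  # -> ['A0', 'A1']
```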

    Application of Weighted Voting Taggers to Languages Described with Large Tagsets

    The paper presents baseline and complex part-of-speech taggers applied to a modified corpus of the Frequency Dictionary of Contemporary Polish, annotated with a large tagset. First, the paper examines the accuracy of six baseline part-of-speech taggers. The main part of the work presents simple weighted voting and complex voting taggers. Special attention is paid to lexical voting methods and to the issues of ties and fallbacks. The TagPair and WPDV voting methods achieve the top accuracy among all considered methods. An error reduction of 10.8% with respect to the best baseline tagger for the large tagset is comparable with other authors' results for small tagsets.
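    A minimal sketch of simple weighted voting with a fallback for ties, in the spirit of the methods the paper compares; the tagger names, weights, and Polish tags below are illustrative, and the TagPair/WPDV specifics are not reproduced:

```python
# Sketch: weighted voting over component taggers, with a designated
# fallback tagger (e.g. the best baseline) to break ties.
from collections import defaultdict

def vote(predictions, weights, fallback="t1"):
    """predictions: {tagger_name: [tag per token]};
    weights: {tagger_name: float}. Returns one tag per token."""
    n = len(next(iter(predictions.values())))
    result = []
    for i in range(n):
        scores = defaultdict(float)
        for name, tags in predictions.items():
            scores[tags[i]] += weights[name]
        top = max(scores.values())
        winners = [tag for tag, s in scores.items() if s == top]
        # unique winner, or the fallback tagger's choice on a tie
        result.append(winners[0] if len(winners) == 1
                      else predictions[fallback][i])
    return result

if __name__ == "__main__":
    preds = {"t1": ["subst:sg:nom", "fin:sg"],
             "t2": ["subst:sg:nom", "adj:sg"],
             "t3": ["subst:sg:acc", "adj:sg"]}
    w = {"t1": 0.95, "t2": 0.93, "t3": 0.90}
    print(vote(preds, w))  # -> ['subst:sg:nom', 'adj:sg']
```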

    Public Opinion Analysis Based on Probabilistic Topic Modeling and Deep Learning

    With the rapid development of the Internet, and of social media technologies in particular, the public has increasingly published its perceptions of social events online through social media. In the Web 2.0 era, with broad public participation in sharing event-related information, effective content analysis and clear presentation of results for media published online are of significant importance for public opinion analysis and monitoring. In view of this, this paper proposes a novel method for public opinion analysis on social media websites. First, the probabilistic topic model Latent Dirichlet Allocation (LDA) is adopted to extract the public's views on the distinct topics of a given event, and then the deep learning model word2vec is used to calculate the emotional intensity of each text. Next, the underlying themes and their emotional intensities are investigated, and the variation trend of the public's emotional intensity is tracked using time series analysis. Finally, the rationality and effectiveness of the method are verified through the analysis of a real case.
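    A minimal sketch of the two-stage pipeline, assuming gensim for both models; the toy documents and the seed-word lexicon used to score emotional intensity are my assumptions, not the paper's setup:

```python
# Sketch: stage 1 extracts LDA topics, stage 2 uses word2vec similarity
# to small positive/negative seed sets as a crude emotional-intensity
# proxy per text (illustrative; not the paper's exact scoring).
from gensim import corpora
from gensim.models import LdaModel, Word2Vec

texts = [["service", "was", "great", "happy"],
         ["terrible", "delay", "angry", "service"],
         ["policy", "news", "report", "public"]]

# Stage 1: LDA topics over the bag-of-words corpus.
dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(bows, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())

# Stage 2: word2vec embeddings; score each text by mean similarity of
# its words to the seed sets (hypothetical seed lexicon).
w2v = Word2Vec(texts, vector_size=50, min_count=1, epochs=50)
POS, NEG = ["happy", "great"], ["angry", "terrible"]

def intensity(tokens):
    def sim(word, seeds):
        return sum(w2v.wv.similarity(word, s) for s in seeds) / len(seeds)
    return sum(sim(w, POS) - sim(w, NEG) for w in tokens) / len(tokens)

for t in texts:
    print(t, round(intensity(t), 3))
```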

    Comparison of Latent Semantic Analysis and Probabilistic Latent Semantic Analysis for Documents Clustering

    In this paper we compare the usefulness of statistical dimensionality reduction techniques for improving the clustering of documents in Polish. We start with partitional and agglomerative algorithms applied to the Vector Space Model. We then investigate two transformations: Latent Semantic Analysis and Probabilistic Latent Semantic Analysis. The obtained results show an advantage of Latent Semantic Analysis over the probabilistic model. We also analyse the time and memory consumption of these transformations and present runtime details for an IBM BladeCenter HS21 machine.
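    A minimal sketch of the LSA side of this comparison using scikit-learn: a TF-IDF vector space model, a truncated-SVD (LSA) projection, and partitional (k-means) clustering; the toy Polish documents and parameters are illustrative, not the paper's corpora:

```python
# Sketch: Vector Space Model -> LSA projection -> partitional clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["kot siedzi na macie", "pies goni kota",
        "giełda rośnie dziś", "akcje spadają na giełdzie"]

tfidf = TfidfVectorizer().fit_transform(docs)             # vector space model
lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)   # LSA reduction
labels = KMeans(n_clusters=2, n_init=10).fit_predict(lsa) # partitional step
print(labels)   # two clusters: animals vs. stock-market documents
```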

    A Corpus of Seimas Session Transcripts for Authorship Attribution and Author Profiling Research

    In our paper we present a corpus of transcribed Lithuanian parliamentary speeches, prepared in a format suited to various authorship identification tasks. The corpus consists of approximately 111 thousand texts (24 million words). Each text corresponds to one parliamentary speech delivered during an ordinary session over seven parliamentary terms, from March 10, 1990 to December 23, 2013. The texts are grouped into 147 categories corresponding to individual authors, so they can be used for authorship attribution tasks; they are also grouped by age, gender and political views, which makes them suitable for author profiling tasks. Because short texts obscure an author's speaking style and are ambiguous with respect to the styles of other authors, we included only texts containing no fewer than 100 words. To make each category as comprehensive and representative as possible, we included only those authors who produced speeches at least 200 times. All texts are lemmatized, morphologically and syntactically annotated, and tokenized into character n-grams; statistical information about the corpus is also available. We demonstrate that the corpus can be used effectively for authorship attribution and author profiling with supervised machine learning methods. Its structure also allows the use of unsupervised machine learning methods and the creation of rule-based methods, as well as various other linguistic analyses.
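    A minimal sketch of supervised authorship attribution over character n-grams, the representation the corpus provides; the toy speeches, author labels, and classifier choice below are illustrative assumptions, not the paper's experiments:

```python
# Sketch: character n-gram features + a linear classifier for
# authorship attribution on short parliamentary speeches.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

speeches = ["gerbiami kolegos, siūlau pritarti projektui",
            "gerbiami kolegos, balsuokime dėl pataisos",
            "ponia pirmininke, noriu paprieštarauti",
            "ponia pirmininke, kategoriškai nesutinku"]
authors = ["A", "A", "B", "B"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # char n-grams
    LinearSVC(),
)
clf.fit(speeches, authors)
print(clf.predict(["gerbiami kolegos, pritariu"]))  # likely -> ['A']
```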

    Increasing Motivation for Studying New Foreign Languages Using ICT: Introduction to Modern Greek

    It is generally believed that motivation plays a significant role in foreign language learning; students who feel motivated to learn a foreign language are more likely to become successful learners. In Japan, where foreign language education has long been primarily teacher-centered, students rarely get the opportunity to practice speaking or listening in class and usually fare poorly at those skills. To provide an opportunity to increase motivation toward foreign languages, a Greek postgraduate student created teaching materials in PowerPoint and HTML to introduce the Greek language to a class of undergraduate students at Kyoto University. The students were asked to introduce themselves and then record their self-introductions using iPod nano® mobile digital devices distributed to them. This paper reports on the class content as well as the results of a questionnaire conducted at the end of the class.

    Finding Translation Examples for Under-Resourced Language Pairs or for Narrow Domains; the Case for Machine Translation

    Cyberspace is populated with valuable information sources expressed in about 1,500 different languages and dialects. Yet for the vast majority of Web users this wealth of information is practically inaccessible or meaningless. Recent advances in cross-lingual information retrieval, multilingual summarization, cross-lingual question answering and machine translation promise to narrow the linguistic gaps and lower the communication barriers between humans and/or software agents. Most of these language technologies are based on statistical machine learning techniques, which require large volumes of cross-lingual data. The most adequate type of cross-lingual data is parallel corpora: collections of reciprocal translations. However, it is not easy to find enough parallel data for any language pair that might be of interest, and when the required parallel data concerns specialized (narrow) domains, the scarcity becomes even more acute. Intelligent information extraction from comparable corpora provides one possible answer to this lack of translation data.
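    One classic instance of such extraction can be sketched as scoring candidate sentence pairs by bilingual-dictionary overlap with a length-ratio penalty; the toy dictionary and sentences below are assumptions, and real systems add alignment models, classifiers, or cross-lingual embeddings:

```python
# Sketch: rank candidate sentence pairs from comparable corpora as
# translation examples, keeping pairs that score above a threshold.

# Hypothetical bilingual dictionary (English -> Spanish).
DICT = {"house": "casa", "red": "roja", "big": "grande"}

def pair_score(src_tokens, tgt_tokens, dictionary=DICT):
    """Fraction of source words whose dictionary translation occurs in
    the target, damped by a sentence-length-ratio penalty."""
    hits = sum(1 for w in src_tokens if dictionary.get(w) in tgt_tokens)
    overlap = hits / max(len(src_tokens), 1)
    ratio = (min(len(src_tokens), len(tgt_tokens))
             / max(len(src_tokens), len(tgt_tokens)))
    return overlap * ratio

src = "the big red house".split()
for tgt in ["la casa roja grande".split(), "el perro come".split()]:
    print(tgt, round(pair_score(src, tgt), 2))  # 0.75 vs. 0.0
```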