    A corpus of Seimas session transcripts for authorship attribution and author profiling research

    This paper presents a corpus of transcribed Lithuanian parliamentary (Seimas) speeches, prepared in a format suited to a range of authorship identification tasks. The corpus consists of approximately 111 thousand texts (24 million words). Each text corresponds to one parliamentary speech delivered during an ordinary session within 7 parliamentary terms, from March 10, 1990 to December 23, 2013. The texts are grouped into 147 categories corresponding to individual authors, so they can be used for authorship attribution; they are also grouped by age group, gender and political views, so they are suitable for author profiling as well. Because a short text does not reveal its author's speaking style and is ambiguous with respect to other authors, only texts of at least 100 words were included. To make each category as comprehensive and representative as possible, only authors who spoke at least 200 times were included. All texts are automatically lemmatized, morphologically and syntactically annotated, and split into character n-grams; statistical information about the corpus is also available. The paper demonstrates that the corpus can be used effectively for authorship attribution and author profiling with supervised machine learning methods. The corpus structure also allows unsupervised machine learning methods to be applied, and is convenient for building rule-based methods and for various linguistic analyses.
Contact: Ligita Šarkutė, Viešosios politikos ir administravimo institutas, Kauno technologijos universitetas, K. Donelaičio g. 20-217, LT-44239 Kaunas, Lietuva.
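The two inclusion thresholds and the character n-gram split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `(author, text)` input shape, whitespace word counting, the order in which the two filters are applied, and the function names are all assumptions.

```python
from collections import defaultdict

MIN_WORDS = 100      # minimum speech length stated in the abstract
MIN_SPEECHES = 200   # minimum number of speeches per author stated in the abstract


def filter_corpus(speeches, min_words=MIN_WORDS, min_speeches=MIN_SPEECHES):
    """Drop short speeches, then drop authors with too few remaining speeches.

    `speeches` is a hypothetical list of (author, text) pairs. Words are
    counted by whitespace splitting, which is an assumption.
    """
    by_author = defaultdict(list)
    for author, text in speeches:
        if len(text.split()) >= min_words:
            by_author[author].append(text)
    return {a: texts for a, texts in by_author.items() if len(texts) >= min_speeches}


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For example, `char_ngrams("Seimas", 3)` yields the four trigrams `Sei`, `eim`, `ima`, `mas`.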

    A method for complex analysis of the differentiation of phonostatistical structures of English styles

    A method for complex analysis of the differentiation of phonostatistical structures of English styles has been developed. The method combines two statistical tests of the hypothesis of sample homogeneity: Student's t-test and the Kolmogorov-Smirnov test. Combining the two tests gives more precise style differentiation. On the basis of the method, a statistical model for determining the degree of influence of the author's style factor has been built; the model improves the accuracy of style and authorship attribution. The method and model have been implemented in Java. The program exposes two types of POST request, /process and /process/transcription. The first is used when a transcription of the text under study is not available, the second when it is; the second request shortens the program's running time. The program uses H2, an open-source, cross-platform embedded database written in Java. H2 supports SQL, integrates well with the Spring Boot framework used, and needs no additional installation. The reply from the transcription conversion site is stored in a HashMap, a key-value data structure that avoids duplicates. The more data is processed, the fewer requests are made over the Internet, which secures the program's independence and autonomy. The program consumes little time. To test the program, texts in the publicistic style were chosen ("Freedom Paper", papers by S. Logan and D. Webster). Student's t-test established essential differences in the nasal, dorsal and velar phoneme groups; the Kolmogorov-Smirnov test established essential differences in all eight phoneme groups. A statistical model of author-differentiating capability for the fricative phoneme group has been built from the results of both tests. Testing showed that the method makes it possible to minimize the number of phoneme groups by which the styles are differentiated.
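The pairing of the two tests can be sketched as follows for a single phoneme group. This is a minimal Python illustration, not the authors' Java program. It assumes `x` and `y` are hypothetical lists of relative frequencies of one phoneme group in samples from two texts, that the t-test uses pooled variance, that the caller supplies the critical t value from a table, and that the Kolmogorov-Smirnov threshold uses the large-sample approximation. The abstract does not say how the two verdicts are merged, so the sketch reports them separately.

```python
import math
from bisect import bisect_right
from statistics import mean, variance


def student_t(x, y):
    """Two-sample Student's t statistic with pooled variance (equal variances assumed)."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in sorted(set(xs + ys)):
        fx = bisect_right(xs, v) / len(xs)  # share of x values <= v
        fy = bisect_right(ys, v) / len(ys)  # share of y values <= v
        d = max(d, abs(fx - fy))
    return d


def compare_group(x, y, t_crit, alpha=0.05):
    """Report both test verdicts for one phoneme group.

    `t_crit` is the two-sided critical t value for len(x) + len(y) - 2 degrees
    of freedom, looked up by the caller (an assumption of this sketch).
    """
    n, m = len(x), len(y)
    t = student_t(x, y)
    d = ks_statistic(x, y)
    # Large-sample critical value for the two-sample KS statistic.
    d_crit = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return {"t": t, "t_differs": abs(t) > t_crit, "D": d, "ks_differs": d > d_crit}
```

Running `compare_group` once per phoneme group gives a table of verdicts from which the groups that separate two styles under either test, or under both, can be read off.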

    Gender influences in Digital Humanities co-authorship networks

    PURPOSE: This paper presents a co-authorship study of authors who published in Digital Humanities journals and examines the apparent influence of gender, or more specifically, the quantitatively detectable influence of gender in the networks they form. DESIGN/METHODOLOGY/APPROACH: This study applied co-authorship network analysis. Data has been collected from three canonical Digital Humanities journals over 52 years (1966–2017) and analysed. FINDINGS: The results are presented as visualised networks and suggest that female scholars in Digital Humanities play more central roles and act as the main bridges of collaborative networks even though overall female authors are fewer in number than male authors in the network. ORIGINALITY/VALUE: This is the first co-authorship network study in Digital Humanities to examine the role that gender appears to play in these co-authorship networks using statistical analysis and visualisation
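A co-authorship network of the kind analysed here can be sketched as follows. The abstract does not name the centrality measures used, so betweenness centrality (computed with Brandes' algorithm) is an assumption, chosen because it quantifies how often an author acts as a "bridge" on shortest paths between others. The input shape and function names are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def coauthor_graph(papers):
    """Build an undirected co-authorship graph from lists of author names.

    Two authors are linked if they share at least one paper; solo authors
    appear as isolated nodes.
    """
    graph = {}
    for authors in papers:
        for a in authors:
            graph.setdefault(a, set())
        for a, b in combinations(set(authors), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def betweenness(graph):
    """Unnormalised betweenness centrality (Brandes' algorithm, unweighted graph)."""
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        order = []                       # nodes in non-decreasing distance from s
        preds = defaultdict(list)        # predecessors on shortest paths from s
        sigma = dict.fromkeys(graph, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:                     # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):        # accumulate dependencies, farthest first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair of endpoints was counted from both sides.
    return {v: c / 2 for v, c in score.items()}
```

With a per-author attribute such as gender (not modelled here), the scores could be averaged per group to compare how central each group is in the network.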


    The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure (CLARIN) for the humanities. It covers a broad range of CLARIN language resources and services, the underlying technological infrastructure, the achievements of national consortia, and the challenges that CLARIN will tackle in the future. The book is published 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium.

    CLARIN. The infrastructure for language resources

    CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors representing various fields, from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future. The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU)


    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012

    The influence of text classification on applications in natural language processing

    The main goal of this dissertation is to put different text classification tasks in the same frame, by mapping the input data into the common vector space of linguistic attributes. Subsequently, several classification problems of great importance for natural language processing are solved by applying the appropriate classification algorithms. The dissertation deals with the problem of validation of bilingual translation pairs, so that the final goal is to construct a classifier which provides a substitute for human evaluation and which decides whether the pair is a proper translation between the appropriate languages by means of applying a variety of linguistic information and methods. In dictionaries it is useful to have a sentence that demonstrates use for a particular dictionary entry. This task is called the classification of good dictionary examples. In this thesis, a method is developed which automatically estimates whether an example is good or bad for a specific dictionary entry. Two cases of short message classification are also discussed in this dissertation. In the first case, classes are the authors of the messages, and the task is to assign each message to its author from that fixed set. This task is called authorship identification. The other observed classification of short messages is called opinion mining, or sentiment analysis. Starting from the assumption that a short message carries a positive or negative attitude about a thing, or is purely informative, classes can be: positive, negative and neutral. These tasks are of great importance in the field of natural language processing and the proposed solutions are language-independent, based on machine learning methods: support vector machines, decision trees and gradient boosting. 
For all of these tasks, the effectiveness of the proposed methods is demonstrated on the Serbian language.

    Visualising the intellectual and social structures of digital humanities using an invisible college model

    This thesis explores the intellectual and social structures of an emerging field, Digital Humanities (DH). After around 70 years of development, DH claims to differentiate itself from the traditional Humanities for its inclusiveness, diversity, and collaboration. However, the ‘big tent’ concept not only limits our understandings of its research structure, but also results in a lack of empirical review and sustainable support. Under this umbrella, whether there are merely fragmented topics, or a consolidated knowledge system is still unknown. This study seeks to answer three research questions: a) Subject: What research topics is the DH subject composed of? b) Scholar: Who has contributed to the development of DH? c) Environment: How diverse are the backgrounds of DH scholars? The Invisible College research model is refined and applied as the methodological framework that produces four visualised networks. As the results show, DH currently contributes more towards the general historical literacy and information science, while longitudinally, it was heavily involved in computational linguistics. Humanistic topics are more popular and central, while technical topics are relatively peripheral and have stronger connections with non-Anglophone communities. DH social networks are at the early stages of development, and the formation is heavily influenced by non-academic and non-intellectual factors, e.g., language, working country, and informal relationships. Although male scholars have dominated the field, female scholars have encouraged more communication and built more collaborations. Despite the growing appeals for more diversity, the level of international collaboration in DH is more extensive than in many other disciplines. These findings can help us gain new understandings on the central and critical questions about DH. 
To the best of the candidate’s knowledge, this study is the first to investigate the formal and informal structures in DH with a well-grounded research model