Authorship Identification in Bengali Literature: a Comparative Analysis
Stylometry is the study of the unique linguistic styles and writing behaviors
of individuals. It underlies core text categorization tasks such as
authorship identification and plagiarism detection. Although a reasonable number
of studies have been conducted for English, no major work has been done
so far for Bengali. In this work, we present a demonstration of authorship
identification for documents written in Bengali. We adopt a set of
fine-grained stylistic features for the analysis of the text and use them to
develop two different models: a statistical similarity model consisting of three
measures and their combination, and a machine learning model with Decision Tree,
Neural Network and SVM. Experimental results show that SVM outperforms other
state-of-the-art methods under 10-fold cross-validation. We also validate the
relative importance of each stylistic feature to show that some of them remain
consistently significant in every model used in this experiment.
Comment: 9 pages, 5 tables, 4 pictures
Tune your brown clustering, please
Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration, and the appropriateness of this configuration has gone largely unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyperparameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the interplay between the input corpus size, the chosen number of classes, and the quality of the resulting clusters, which has an impact on any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.
EVALITA Evaluation of NLP and Speech Tools for Italian Proceedings of the Final Workshop
Editor of the proceedings of EVALITA 2016
CLARIN
The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and the challenges that CLARIN will tackle in the future. The book is published 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium.
CLARIN. The infrastructure for language resources
CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors representing various fields, from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future.
The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU).
Uticaj klasifikacije teksta na primene u obradi prirodnih jezika (The Impact of Text Classification on Applications in Natural Language Processing)
The main goal of this dissertation is to put different text classification tasks in
the same frame, by mapping the input data into the common vector space of linguistic
attributes. Subsequently, several classification problems of great importance for natural
language processing are solved by applying the appropriate classification algorithms.
The dissertation deals with the problem of validating bilingual translation pairs; the
final goal is to construct a classifier that provides a substitute for human evaluation
and decides whether a pair is a proper translation between the given
languages by applying a variety of linguistic information and methods.
In dictionaries it is useful to have a sentence that demonstrates use for a particular dictionary
entry. This task is called the classification of good dictionary examples. In this thesis,
a method is developed which automatically estimates whether an example is good or bad
for a specific dictionary entry.
Two cases of short message classification are also discussed in this dissertation. In the
first case, classes are the authors of the messages, and the task is to assign each message
to its author from that fixed set. This task is called authorship identification. The other
observed classification of short messages is called opinion mining, or sentiment analysis.
Starting from the assumption that a short message carries a positive or negative attitude
about a thing, or is purely informative, classes can be: positive, negative and neutral.
These tasks are of great importance in the field of natural language processing and the
proposed solutions are language-independent, based on machine learning methods: support
vector machines, decision trees and gradient boosting. For all of these tasks, a
demonstration of the effectiveness of the proposed methods is shown for the Serbian
language.
Adaptive Reuse
The present volume explores a specific aspect of creativity in South Asian systems of knowledge, literature and rituals. Under the heading of “adaptive reuse,” it discusses the relationship between innovation and the perpetuation of earlier forms and contents of knowledge and aesthetic expression within the process of creating new works. Although this relation rarely became the topic of explicit reflection in the South Asian intellectual traditions, it is investigated here by taking a closer look at the treatment of older materials by later authors. "Adaptive reuse" is an important theoretical concept from the field of architecture, where it denotes the use of a partially rebuilt building for purposes other than those for which it was originally constructed. In the present volume, this concept is applied for the first time to a wider spectrum of cultural production, namely to the composition of texts and to the creation of new concepts and rituals