2,182 research outputs found
Automatic categorization of Ottoman poems
Cataloged from PDF version of article. This work is partially supported by the Scientific and Technical Research Council of Turkey (TÜBİTAK) under the grant number 109E006. Authorship attribution and identifying time period of literary works are fundamental problems
in quantitative analysis of languages. We investigate two fundamentally different machine learning text
categorization methods, Support Vector Machines (SVM) and Naïve Bayes (NB), and several style
markers in the categorization of Ottoman poems according to their poets and time periods. We use the
collected works (divans) of ten different Ottoman poets: two poets from each of the five different
hundred-year periods ranging from the 15th to the 19th century. Our experimental evaluation and statistical
assessments show that it is possible to obtain highly accurate and reliable classifications and to
distinguish the methods and style markers in terms of their effectiveness.
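As a rough illustration of the Naïve Bayes side of such a comparison, the sketch below implements a multinomial Naïve Bayes classifier with add-one smoothing over word tokens. It is not the authors' implementation: the function names, the toy English tokens, and the labels `poet_a` and `poet_b` are hypothetical placeholders, and the style markers used in the study are not modeled.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Collect per-class document and token counts from (tokens, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_docs": class_docs, "word_counts": word_counts,
            "vocab": vocab, "n_docs": len(docs)}


def predict_nb(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n in model["class_docs"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n / model["n_docs"])  # log prior
        for t in tokens:
            if t in model["vocab"]:  # tokens unseen in training are skipped
                score += math.log((counts[t] + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data; real input would be tokenized divan texts.
train = [
    ("rose nightingale garden rose".split(), "poet_a"),
    ("wine tavern cup wine".split(), "poet_b"),
    ("garden rose spring".split(), "poet_a"),
    ("cup wine friend".split(), "poet_b"),
]
model = train_nb(train)
print(predict_nb(model, ["rose", "garden"]))
```

The same `(tokens, label)` interface would work for time-period labels instead of poet labels.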
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges. Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Developing a text categorization template for Turkish news portals
In news portals, text category information is needed for news presentation. However, for many news stories the category information is unavailable, incorrectly assigned or too generic. This makes text categorization a necessary tool for news portals. Automated text categorization (ATC) is a multifaceted, difficult process that involves decisions regarding tuning of several parameters, term weighting, word stemming, word stopping, and feature selection. In this study we aim to find a categorization setup that will provide highly accurate results in ATC for Turkish news portals. We also examine some other aspects such as the effects of training dataset size and robustness issues. Two Turkish test collections with different characteristics are created using Bilkent News Portal. Experiments are conducted with four classification methods: C4.5, KNN, Naive Bayes, and SVM (using polynomial and RBF kernels). Our results recommend a text categorization template for Turkish news portals and provide some future research pointers. © 2011 IEEE
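Two of the preprocessing decisions named in this abstract, word stopping and term weighting, can be sketched as follows. This is a minimal illustration under stated assumptions, not the setup the study recommends: the three-word stop list, the whitespace tokenizer, the TF-IDF variant (raw term frequency times `log(N / df)`), and the toy headlines are all hypothetical, and stemming and feature selection are left out.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"ve", "bir", "bu"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf(corpus):
    """Return one {term: weight} dict per document."""
    docs = [Counter(tokenize(text)) for text in corpus]
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())  # count each term once per document
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in doc.items()}
        for doc in docs
    ]


# Hypothetical ASCII-only toy headlines.
corpus = ["bu bir spor haberi", "ekonomi ve borsa haberi", "spor ve futbol"]
weights = tfidf(corpus)
print(weights[2])
```

Terms that occur in fewer documents receive higher weights, so `futbol` outweighs `spor` in the third toy headline.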
Text categorization and ensemble pruning in Turkish news portals
Ankara : The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2011. Thesis (Master's) -- Bilkent University, 2011. Includes bibliographical references leaves 53-60. In news portals, text category information is needed for news presentation. However,
for many news stories the category information is unavailable, incorrectly
assigned or too generic. This makes text categorization a necessary tool
for news portals. Automated text categorization (ATC) is a multifaceted, difficult
process that involves decisions regarding tuning of several parameters, term
weighting, word stemming, word stopping, and feature selection. It is important
to find a categorization setup that will provide highly accurate results in ATC for
Turkish news portals. Two Turkish test collections with different characteristics
are created using Bilkent News Portal. Experiments are conducted with four classification
methods: C4.5, KNN, Naive Bayes, and SVM (using polynomial and
rbf kernels). Results recommend a text categorization template for Turkish news
portals. Building on the recommended text categorization template, ensemble learning
methods are applied to increase effectiveness. Since they impose a heavy computational
workload, ensemble pruning strategies are developed. Data partitioning
ensembles are constructed and rank-based ensemble pruning is applied with
several machine learning categorization algorithms. The aim is to answer the following
questions: (1) How much data can we prune using data partitioning on the
text categorization domain? (2) Which partitioning and categorization methods
are more suitable for ensemble pruning? (3) How do English and Turkish differ
in ensemble pruning? (4) Can we increase effectiveness with ensemble pruning
in text categorization? Experiments are conducted on two text collections:
Reuters-21578 and BilCat-TRT. 90% of ensemble members can be pruned with
almost no decrease in accuracy. Toraman, Çağrı. M.S.
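The general idea of a data partitioning ensemble with rank-based pruning can be sketched as below. This is an illustration under assumptions, not the thesis's procedure: the base learner is a simple token-count scorer standing in for any categorization algorithm, members are ranked by accuracy on a held-out validation set (the ranking criterion is assumed), and all function names and toy documents are hypothetical.

```python
from collections import Counter, defaultdict


def train_member(docs):
    """Base classifier: per-label token counts (a stand-in for any learner)."""
    counts = defaultdict(Counter)
    for tokens, label in docs:
        counts[label].update(tokens)
    return counts


def predict_member(member, tokens):
    """Pick the label whose training documents contained these tokens most often."""
    scores = {label: sum(c[t] for t in tokens) for label, c in member.items()}
    return max(sorted(scores), key=scores.get)  # ties: alphabetically first label


def accuracy(member, docs):
    return sum(predict_member(member, t) == y for t, y in docs) / len(docs)


def build_pruned_ensemble(train, validation, n_parts, keep):
    """Train one member per disjoint partition, then keep the top-ranked ones."""
    parts = [train[i::n_parts] for i in range(n_parts)]
    members = [train_member(p) for p in parts]
    ranked = sorted(members, key=lambda m: accuracy(m, validation), reverse=True)
    return ranked[:keep]


def vote(ensemble, tokens):
    """Majority vote over the remaining members."""
    ballots = Counter(predict_member(m, tokens) for m in ensemble)
    return ballots.most_common(1)[0][0]


# Hypothetical toy data: 8 training documents split into 4 partitions.
train = [
    (["match", "goal"], "sports"), (["market", "bank"], "economy"),
    (["team", "goal"], "sports"), (["bank", "rate"], "economy"),
    (["market", "rate"], "economy"), (["match", "team"], "sports"),
    (["rate", "stock"], "economy"), (["goal", "league"], "sports"),
]
validation = [(["goal", "match"], "sports"), (["rate", "market"], "economy")]
ensemble = build_pruned_ensemble(train, validation, n_parts=4, keep=2)
print(vote(ensemble, ["goal", "team"]))
```

Here half of the members are pruned; the pruning ratio is controlled by `keep` relative to `n_parts`.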
Evaluation and Sociolinguistic Analysis of Text Features for Gender and Age Identification
The paper presents an interdisciplinary study in the field of automatic gender and age identification, drawing on sociolinguistic knowledge of the gender- and age-related linguistic choices that social media users make. The authors gathered and investigated standard and novel text features used in text mining approaches to author demographic profiling, and they examined their efficacy in gender and age detection tasks on a corpus consisting of social media texts. An analysis of the most informative features is attempted according to the nature of each feature, and the information derived from the features' importance scores is discussed.
Deep Learning for Multi-Structured Javanese Gamelan Note Generator
Javanese gamelan, a traditional Indonesian musical style, has several song structures called gendhing. Gendhing (songs) are written in conventional notation and require gamelan musicians to recognize patterns in the structure of each song. Previous research on gendhing has usually focused on artistic and ethnomusicological perspectives, whereas this study explores the relationship between gendhing as traditional Indonesian music and deep learning technology that performs the task of gamelan composers. This research proposes CNN-LSTM to generate notation for ricikan struktural instruments as an accompaniment to Javanese gamelan compositions, based on balungan notation, rhythm, song structure, and gatra information. The proposed method (CNN-LSTM) is compared with LSTM and CNN. The musical data in this study is represented using numerical notation for the main melody in balungan notation. The experimental results show that the CNN-LSTM model performs better than the LSTM and CNN models, with accuracy values of 91.9%, 91.5%, and 91.2% for CNN-LSTM, LSTM, and CNN, respectively. The note distance for the Sampak song structure is 4 for the CNN-LSTM model, 8 for the LSTM model, and 12 for the CNN model. The smaller the note distance, the closer the output is to the original notation provided by the gamelan composer. This study is relevant for novice gamelan musicians who are interested in learning karawitan, especially for understanding ricikan struktural notation and the art of composing gamelan songs.