2,182 research outputs found

    Automatic categorization of Ottoman poems

    Get PDF
    Cataloged from PDF version of article.This work is partially supported by the Scientific and Technical Research Council of Turkey (TÜBİTAK) under the grant number 109E006.Authorship attribution and identifying time period of literary works are fundamental problems in quantitative analysis of languages. We investigate two fundamentally different machine learning text categorization methods, Support Vector Machines (SVM) and Naïve Bayes (NB), and several style markers in the categorization of Ottoman poems according to their poets and time periods. We use the collected works (divans) of ten different Ottoman poets: two poets from each of the five different hundred-year periods ranging from the 15th to 19 th century. Our experimental evaluation and statistical assessments show that it is possible to obtain highly accurate and reliable classifications and to distinguish the methods and style markers in terms of their effectiveness

    Computational Sociolinguistics: A Survey

    Get PDF
    Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of "Computational Sociolinguistics" that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved, by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions employed in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges.Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201

    Developing a text categorization template for Turkish news portals

    Get PDF
    In news portals, text category information is needed for news presentation. However, for many news stories the category information is unavailable, incorrectly assigned or too generic. This makes the text categorization a necessary tool for news portals. Automated text categorization (ATC) is a multifaceted difficult process that involves decisions regarding tuning of several parameters, term weighting, word stemming, word stopping, and feature selection. In this study we aim to find a categorization setup that will provide highly accurate results in ATC for Turkish news portals. We also examine some other aspects such as the effects of training dataset set size and robustness issues. Two Turkish test collections with different characteristics are created using Bilkent News Portal. Experiments are conducted with four classification methods: C4.5, KNN, Naive Bayes, and SVM (using polynomial and rbf kernels). Our results recommends a text categorization template for Turkish news portals and provides some future research pointers. © 2011 IEEE

    Text categorization and ensemble pruning in Turkish news portals

    Get PDF
    Ankara : The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2011.Thesis (Master's) -- Bilkent University, 2011.Includes bibliographical references leaves 53-60.In news portals, text category information is needed for news presentation. However, for many news stories the category information is unavailable, incorrectly assigned or too generic. This makes the text categorization a necessary tool for news portals. Automated text categorization (ATC) is a multifaceted diffi- cult process that involves decisions regarding tuning of several parameters, term weighting, word stemming, word stopping, and feature selection. It is important to find a categorization setup that will provide highly accurate results in ATC for Turkish news portals. Two Turkish test collections with different characteristics are created using Bilkent News Portal. Experiments are conducted with four classification methods: C4.5, KNN, Naive Bayes, and SVM (using polynomial and rbf kernels). Results recommend a text categorization template for Turkish news portals. Regarding recommended text categorization template, ensemble learning methods are applied to increase effectiveness. Since they require many computational workload, ensemble pruning strategies are developed. Data partitioning ensembles are constructed and ranked-based ensemble pruning is applied with several machine learning categorization algorithms. The aim is to answer the following questions: (1) How much data can we prune using data partitioning on the text categorization domain? (2) Which partitioning and categorization methods are more suitable for ensemble pruning? (3) How do English and Turkish differ in ensemble pruning? (4) Can we increase effectiveness with ensemble pruning in the text categorization? Experiments are conducted on two text collections: Reuters-21578 and BilCat-TRT. 90% of ensemble members can be pruned with almost no decreasing in accuracy.Toraman, ÇağrıM.S

    Evaluation and Sociolinguistic Analysis of Text Features for Gender and Age Identification

    Get PDF
    The paper presents an interdisciplinary study in the field of automatic gender and age identification, under the scope of sociolinguistic knowledge on gendered and age linguistic choices that social media users make. The authors investigated and gathered standard and novel text features used in text mining approaches on the author's demographic information and profiling and they examined their efficacy in gender and age detection tasks on a corpus consisted of social media texts. An analysis of the most informative features is attempted according to the nature of each feature and the information derived after the characteristics' score of importance is discussed

    Deep Learning for Multi-Structured Javanese Gamelan Note Generator

    Get PDF
    Javanese gamelan, a traditional Indonesian musical style, has several song structures called gendhing. Gendhing (songs) are written in conventional notation and require gamelan musicians to recognize patterns in the structure of each song. Usually, previous research on gendhing focuses on artistic and ethnomusicological perspectives, but this study is to explore the correlation between gendhing as traditional music in Indonesia and deep learning technology that replaces the task of gamelan composers. This research proposes CNN-LSTM to generate notation of ricikan struktural instruments as an accompaniment to Javanese gamelan music compositions based on balungan notation, rhythm, song structure, and gatra information. This proposed method (CNN-LSTM) is compared with LSTM and CNN. The musical data in this study is represented using numerical notation for the main melody in balungan notation. The experimental results showed that the CNN-LSTM model showed better performance compared to the LSTM and CNN models, with accuracy values of 91.9%, 91.5%, and 91.2% for CNN-LSTM, LSTM, and CNN, respectively. And the value of note distance for the Sampak song structure is 4 for the CNN-LSTM model, 8 for the LSTM model, and 12 for the CNN model. The smaller the note distance, the closer it is to the original notation provided by the gamelan composer. This study provides relevance for novice gamelan musicians who are interested in learning karawitan, especially in understanding ricikan struktural music notation and gamelan art in composing musical compositions of a song
    corecore