17 research outputs found

    Approaching Questions of Text Reuse in Ancient Greek Using Computational Syntactic Stylometry

    We are investigating methods by which data from dependency syntax treebanks of ancient Greek can be applied to questions of authorship in ancient Greek historiography. Syntax words (sWords) were constructed from the Ancient Greek Dependency Treebank by tracing the shortest path from each leaf node to the root of each sentence tree. This paper presents the results of a preliminary test of the usefulness of the sWord as a stylometric discriminator. The sWord data were subjected to clustering analysis, and the resulting groupings accorded with traditional classifications. The use of sWords also allows a more fine-grained heuristic exploration of difficult questions of text reuse. A comparison of relative frequencies of sWords in the directly transmitted Polybius book 1 and the excerpted books 9–10 indicates that the measurements of the two texts are generally very close, but when frequencies do vary, the differences are surprisingly large. These differences reveal that a certain syntactic simplification is a salient characteristic of Polybius’ excerptor, who leaves conspicuous syntactic indicators of his modifications.
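
The sWord construction described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which node attributes make up an sWord, so the sketch assumes the dependency relation labels along the leaf-to-root path are joined with hyphens. The tuple layout `(id, head, relation)`, the toy labels, and the function names `swords` and `relative_frequencies` are hypothetical and not taken from the paper.

```python
from collections import Counter

def swords(sentence):
    """Return one sWord per leaf node of a dependency tree.

    `sentence` is a list of (token_id, head_id, relation) tuples; head_id 0
    marks the root. Assumption: an sWord is the sequence of relation labels
    met while walking from a leaf up to the root.
    """
    heads = {tok_id: head for tok_id, head, _ in sentence}
    rels = {tok_id: rel for tok_id, _, rel in sentence}
    has_child = set(heads.values())
    leaves = [tok_id for tok_id in heads if tok_id not in has_child]

    result = []
    for leaf in leaves:
        path, node = [], leaf
        while node != 0:  # climb until we pass the root token
            path.append(rels[node])
            node = heads[node]
        result.append("-".join(path))
    return result

def relative_frequencies(sentences):
    """Relative frequency of each sWord across a collection of sentences."""
    counts = Counter(sw for s in sentences for sw in swords(s))
    total = sum(counts.values())
    return {sw: n / total for sw, n in counts.items()}

# Toy tree with invented labels: token 2 is the root, tokens 1 and 3 are leaves.
toy = [(1, 2, "SBJ"), (2, 0, "PRED"), (3, 4, "ATR"), (4, 2, "OBJ")]
print(swords(toy))
print(relative_frequencies([toy]))
```

Frequency tables of this kind, computed per text, are what a clustering step or a comparison of two texts would take as input. Because every node in a tree has exactly one path to the root, the "shortest path" here is simply that path.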

    Software and Algorithms for Creating a Syntactic-Statistical Model of the Russian Language from a Text Corpus

    Creating a language model is one of the stages in training a continuous speech recognition system. This paper describes the algorithm and the software developed for building a syntactic-statistical model of the Russian language from a text corpus. The main stages of the algorithm are preliminary processing of the text material, creation of a statistical n-gram language model, and extension of the statistical model with n-grams obtained by syntactic analysis. Syntactic analysis increases the number of distinct bigrams produced during text processing and thereby improves the quality of the language model by extracting grammatically connected word pairs. The language models created with the software module are evaluated in terms of information entropy, perplexity, the relative number of out-of-vocabulary words, and n-gram hits.

    Hate speech and offensive language detection: a new feature set with filter-embedded combining feature selection

    Social media has changed the world and plays an important role in people's lives. Platforms such as Twitter, Facebook and YouTube create a new dimension of communication by providing channels to express and exchange ideas freely. Although this evolution brings numerous benefits, the dynamic environment and the allowance of anonymous posts can expose the uglier side of humanity. Irresponsible people abuse freedom of speech by aggressively expressing opinions or ideas that incite hatred. This study performs hate speech and offensive language detection. The difficulty of this task is that there is no clear boundary between hate speech and offensive language. In this study, a new selected feature set is proposed for detecting hate speech and offensive language. Using a Twitter dataset, experiments are performed with a combination of word n-grams and enhanced syntactic n-grams. To reduce the feature set, filter-embedded combining feature selection is used. The experimental results indicate that combining word n-grams and enhanced syntactic n-grams with feature selection gives good performance when classifying the data into three classes: hate speech, offensive language, or neither. The result reaches 91% for accuracy and the averages of precision, recall and F1.