17 research outputs found
Approaching Questions of Text Reuse in Ancient Greek Using Computational Syntactic Stylometry
We are investigating methods by which data from dependency syntax treebanks of ancient Greek can be applied to questions of authorship in ancient Greek historiography. Syntax words (sWords) were constructed from the Ancient Greek Dependency Treebank by tracing the shortest path from each leaf node to the root of each sentence tree. This paper presents the results of a preliminary test of the usefulness of the sWord as a stylometric discriminator. The sWord data was subjected to clustering analysis, and the resultant groupings were in accord with traditional classifications. The use of sWords also allows a more fine-grained heuristic exploration of difficult questions of text reuse. A comparison of relative frequencies of sWords in the directly transmitted Polybius book 1 and the excerpted books 9–10 indicates that the measurements of the two texts are generally very close, but when frequencies do vary, the differences are surprisingly large. These differences reveal that a certain syntactic simplification is a salient characteristic of Polybius' excerptor, who leaves conspicuous syntactic indicators of his modifications.
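The sWord construction described in this abstract can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes that an sWord is the sequence of dependency relation labels met on the way from a leaf to the root, and the token tuples, the labels (`ATR`, `SBJ`, `PRED`, `OBJ`), and the function names `swords` and `relative_frequencies` are all hypothetical.

```python
from collections import Counter


def swords(tokens):
    """Return one label path (leaf to root) for every leaf of a dependency tree.

    tokens is a list of (token_id, head_id, relation) tuples; head_id 0 marks
    the root. The input is assumed to be a well-formed tree (no cycles).
    """
    head = {tid: h for tid, h, _ in tokens}
    rel = {tid: r for tid, _, r in tokens}
    used_as_head = set(head.values())
    # A leaf is a token that no other token points to as its head.
    leaves = [tid for tid in head if tid not in used_as_head]
    paths = []
    for leaf in leaves:
        labels, node = [], leaf
        while node != 0:  # climb until we step past the root
            labels.append(rel[node])
            node = head[node]
        paths.append("-".join(labels))
    return paths


def relative_frequencies(items):
    """Map each distinct item to its share of the total count."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}


# Toy four-token sentence (illustrative labels, not real treebank data).
sentence = [
    (1, 2, "ATR"),   # attribute of token 2
    (2, 3, "SBJ"),   # subject of token 3
    (3, 0, "PRED"),  # main predicate, the root
    (4, 3, "OBJ"),   # object of token 3
]
paths = swords(sentence)
freqs = relative_frequencies(paths)
```

Relative frequencies computed this way for two texts could then be compared sWord by sWord, which is the kind of comparison the abstract reports for Polybius book 1 and books 9–10.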
Software and algorithms for creating a syntactic-statistical model of the Russian language from a text corpus
Creating a language model is one of the stages of training a continuous speech recognition system. This paper describes an algorithm and the software developed for creating a syntactic-statistical model of the Russian language from a text corpus. The main stages of the algorithm are preliminary processing of the text material, creation of a statistical n-gram language model, and extension of the statistical model with n-grams obtained by syntactic analysis. Syntactic analysis increases the number of distinct bigrams created during text processing and thereby improves the quality of the language model by extracting grammatically connected word pairs. The language models created with the software module are evaluated in terms of information entropy, perplexity, the relative number of out-of-vocabulary words, and n-gram hits.
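The idea of extending a bigram model with syntactically linked word pairs can be illustrated with a small sketch. It is not the software described in the abstract: the English toy sentence, the head indices, the choice to keep linked pairs in surface order, the add-one smoothing, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def adjacent_bigrams(words):
    """Ordinary bigrams: each word paired with its right neighbour."""
    return list(zip(words, words[1:]))


def syntactic_bigrams(words, heads):
    """Word pairs joined by a dependency link.

    heads[i] is the index of the head of words[i], or -1 for the root.
    Each pair is kept in surface order (an assumption of this sketch).
    """
    pairs = []
    for i, h in enumerate(heads):
        if h >= 0:
            a, b = sorted((i, h))
            pairs.append((words[a], words[b]))
    return pairs


def count_bigrams(corpus, use_syntax):
    """Count adjacent bigrams, optionally adding non-adjacent linked pairs."""
    counts = Counter()
    for words, heads in corpus:
        adjacent = adjacent_bigrams(words)
        counts.update(adjacent)
        if use_syntax:
            seen = set(adjacent)
            counts.update(p for p in syntactic_bigrams(words, heads)
                          if p not in seen)
    return counts


def perplexity(counts, vocab, words):
    """Perplexity of a word sequence under an add-one smoothed bigram model."""
    first = Counter()
    for (w1, _), c in counts.items():
        first[w1] += c
    pairs = adjacent_bigrams(words)
    log_sum = 0.0
    for w1, w2 in pairs:
        p = (counts[(w1, w2)] + 1) / (first[w1] + len(vocab))
        log_sum += math.log2(p)
    return 2 ** (-log_sum / len(pairs))


# Toy training sentence with hypothetical dependency heads.
words = ["the", "old", "dog", "barked", "loudly"]
heads = [2, 2, 3, -1, 3]
corpus = [(words, heads)]
vocab = set(words)

plain = count_bigrams(corpus, use_syntax=False)
extended = count_bigrams(corpus, use_syntax=True)
test = ["the", "dog", "barked"]
pp_plain = perplexity(plain, vocab, test)
pp_extended = perplexity(extended, vocab, test)
```

In this toy case the dependency link between "the" and "dog" adds a bigram that never occurs adjacently in the training sentence, so the extended model assigns the test phrase a lower perplexity than the plain one. Whether that gain carries over to real corpora is what the evaluation described in the abstract is meant to measure.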
Hate speech and offensive language detection: a new feature set with filter-embedded combining feature selection
Social media has changed the world and plays an important role in people's lives. Platforms such as Twitter, Facebook and YouTube create a new dimension of communication by providing channels to express and exchange ideas freely. Although this evolution brings numerous benefits, the dynamic environment and the allowance of anonymous posts can expose the uglier side of humanity. Irresponsible people abuse freedom of speech by aggressively expressing opinions or ideas that incite hatred. This study performs hate speech and offensive language detection. The difficulty of this task is that there is no clear boundary between hate speech and offensive language. In this study, a new selected feature set is proposed for detecting hate speech and offensive language. Using a Twitter dataset, the experiments consider the combination of word n-grams and enhanced syntactic n-grams. To reduce the feature set, filter-embedded combining feature selection is used. The experimental results indicate that combining word n-grams and enhanced syntactic n-grams with feature selection gives good performance when classifying the data into three classes: hate speech, offensive language, or neither. The result reaches 91% for accuracy and the averages of precision, recall and F1