16 research outputs found
Paraphrased plagiarism detection using sentence similarity
The paper describes an approach to plagiarism detection developed for the PlagEvalRus-2017 competition. Our system leverages deep parsing to detect moderately disguised plagiarism. We participated in both tracks of the competition: source retrieval (source detection) and text alignment (paraphrased plagiarism detection). The datasets of both tracks contain various cases of plagiarism that differ in the level of disguise applied while reusing text. The results show that our method performs well at detecting moderately disguised forms of plagiarism.
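The text alignment track asks a system to pair reused sentences in a suspicious document with their sources. As a hedged illustration only (the paper's actual system relies on deep parsing; all names below are hypothetical), a minimal sentence-similarity aligner can be sketched with cosine similarity over bag-of-words vectors:

```python
# Illustrative sketch of sentence-level text alignment via cosine
# similarity of bag-of-words count vectors. The paper's real method
# uses deep parsing; this simplification only shows the alignment idea.
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity between two sentences' word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def align(suspicious, source, threshold=0.4):
    """Pair each suspicious sentence with its best-matching source sentence."""
    pairs = []
    for i, s in enumerate(suspicious):
        j, score = max(((j, cosine_sim(s, t)) for j, t in enumerate(source)),
                       key=lambda x: x[1])
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs
```

A paraphrased reuse shares enough vocabulary to exceed the threshold, while unrelated sentences fall below it; deep parsing would additionally match sentences whose surface words differ.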
The ParaPlag: Russian dataset for paraphrased plagiarism detection
The paper presents ParaPlag, a large Russian-language text dataset for evaluating and comparing quality metrics of plagiarism detection approaches that deal with big data. The PlagEvalRus-2017 competition, which evaluated plagiarism detection methods, uses ParaPlag as its main dataset for the source retrieval and text alignment tasks. ParaPlag is open and available on the Web. We propose a guide for writers who want to contribute to and extend ParaPlag. Our research also presents an analysis of the text-rewriting techniques used by unscrupulous authors.
The Hybrid Method for Accurate Patent Classification
This article is dedicated to stacking two approaches to patent classification. The first is based on a linguistically supported k-nearest-neighbors algorithm that searches for topically similar documents by comparing vectors of lexical descriptors. The second is the word-embedding-based fastText, in which a sentence (or document) vector is obtained by averaging the n-gram embeddings, and a multinomial logistic regression then exploits these vectors as features. We show on Russian and English datasets that the stacking classifier outperforms the single classifiers. © 2019, Pleiades Publishing, Ltd.
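The two building blocks of that pipeline can be sketched as follows. This is a hedged toy illustration, not the paper's implementation: the embedding table, the soft-voting combiner, and all names are assumptions, and the authors' exact stacking scheme may differ.

```python
# Hypothetical sketch of the abstract's two ingredients:
# (1) a fastText-style document vector = mean of token embeddings,
# (2) stacking two base classifiers by combining their class probabilities.
# The embedding values and the averaging combiner are illustrative only.

def average_embeddings(tokens, emb):
    """fastText-style document vector: mean of the known token vectors."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * len(next(iter(emb.values())))
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def soft_vote(probs_a, probs_b, weight_a=0.5):
    """Combine two classifiers' class-probability vectors by weighted average."""
    return [weight_a * pa + (1 - weight_a) * pb
            for pa, pb in zip(probs_a, probs_b)]

def predict(probs, labels):
    """Return the label with the highest combined probability."""
    return labels[max(range(len(probs)), key=probs.__getitem__)]
```

In the full system the averaged vectors would feed a multinomial logistic regression, and the kNN over lexical descriptors would supply the second probability vector.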
Evaluating host-based intrusion detection on the ADFA-WD and ADFA-WD:SAA datasets
With the growth of the internet and the development of new technologies come advances in methods of cyber-attack, such as zero-day and stealth attacks, so a more effective approach to network safety is essential for network stability for both personal and business use. This paper assesses anomalous patterns of normal and abnormal behaviour comprised of system calls based on the Dynamic-Link Library. The two datasets assessed were designed on the Windows operating system for a host-based intrusion detection system: the Australian Defence Force Academy Windows Dataset (ADFA-WD) and the Australian Defence Force Academy Windows Dataset: Stealth Attacks Addendum (ADFA-WD:SAA). A binary feature space is developed based on the common vulnerabilities and exposures known at the time of the creation of the dataset. Among the data mining techniques implemented, a Support Vector Machine classifier with sigmoid and RBF kernels is compared to a Random Forest classifier. © 2017 CEUR-WS. All rights reserved.
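The binary feature space mentioned above can be illustrated with a small sketch: each trace is encoded as a fixed-order 0/1 vector over a vocabulary of DLL-based call features, which then feeds an SVM or Random Forest. The feature names below are invented for illustration and are not from the datasets.

```python
# Hedged sketch of a binary feature space for host-based intrusion
# detection: a trace of observed calls becomes a 0/1 presence vector
# over a fixed feature vocabulary. Feature names are hypothetical.

def binary_features(trace, vocabulary):
    """Map a list of observed calls to a fixed-order 0/1 vector."""
    present = set(trace)
    return [1 if feat in present else 0 for feat in vocabulary]

# Hypothetical vocabulary of DLL/system-call features.
VOCAB = ["ntdll.NtOpenFile",
         "kernel32.CreateProcessW",
         "advapi32.RegSetValueExW"]
```

The resulting vectors are what kernel choice (sigmoid vs. RBF) or tree ensembles then operate on.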
Method for Author Attribution Using Word Embeddings
In this paper we look at a methodology for revealing the author of an unknown document by extracting the author's characteristics from their writing style. The method identifies the sources of unknown documents by using a distributional semantics model to form a set of queries to a search engine. The dataset used is from the PAN @ CLEF 2019 shared task on Cross-domain Authorship Attribution and covers four languages: English, French, Italian, and Spanish, each of which contains 5 problems, for a total of 20 problems. The task belongs to Natural Language Processing, where attribution of a writer's characteristics can be used to identify an author's work. The method for revealing unknown authors is based on distributional semantics and the following hypothesis: linguistic units that are observed in similar contexts have similar semantic meaning. The analysed linguistic units are therefore scored by the proximity of linguistic elements in terms of semantic load, based on their distribution in large text passages.
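The query-formation step described above can be sketched as expanding a document's characteristic terms with their nearest neighbours in embedding space before issuing them to a search engine. This is a hedged toy version under assumed names and embeddings, not the paper's implementation:

```python
# Illustrative sketch of query expansion via distributional semantics:
# each characteristic term is augmented with its nearest neighbours in
# a word-embedding space. Embedding values and names are hypothetical.
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_query(terms, emb, k=1):
    """Add the k nearest embedding-space neighbours of each query term."""
    expanded = list(terms)
    for t in terms:
        if t not in emb:
            continue
        neighbours = sorted((w for w in emb if w != t),
                            key=lambda w: cosine(emb[t], emb[w]),
                            reverse=True)
        expanded.extend(neighbours[:k])
    return expanded
```

The expanded term set is then used as search-engine queries whose results point to candidate source documents.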
Evaluating feature informativeness based on a topical importance characteristic for classifying a stream of news messages
The paper presents an approach for ranking the most valuable features for the text classification task. The introduced Topical Importance Characteristic underlies a feature selection method that incorporates information about the distribution of words or phrases among topics. We compare this method to the well-known TF-IDF approach and use the introduced word-ranking scheme in two classifiers: Random Forest and Multinomial Naïve Bayes. Classification accuracy was tested on the "20 Newsgroups" dataset. The developed approach outperforms TF-IDF-based methods and matches the accuracy achieved by more powerful state-of-the-art approaches, such as SVC, on the same dataset.
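The intuition behind topic-aware feature weighting can be illustrated with a toy sketch. The abstract does not give the exact formula of the Topical Importance Characteristic, so the version below is an assumption: a word's weight is simply the largest share of its occurrences concentrated in a single topic, so topic-specific words score high and evenly spread words score low.

```python
# Hedged toy illustration of topic-aware feature weighting. This is NOT
# the paper's actual Topical Importance Characteristic (the formula is
# not given in the abstract): here a word's weight is the maximum share
# of its occurrences falling within one topic.

def topical_importance(counts_per_topic):
    """counts_per_topic: occurrences of one word in each topic class."""
    total = sum(counts_per_topic)
    return max(counts_per_topic) / total if total else 0.0
```

Under this proxy, a word appearing almost exclusively in one newsgroup gets a weight near 1, while a word spread evenly over n topics gets 1/n, which is the kind of signal TF-IDF's document-frequency term cannot capture.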