16 research outputs found

    Paraphrased plagiarism detection using sentence similarity

    No full text
    The paper describes an approach to plagiarism detection developed for the PlagEvalRus-2017 competition. Our system uses deep parsing techniques to detect moderately disguised plagiarism. We participated in both tracks of the competition: source retrieval (source detection) and text alignment (paraphrased plagiarism detection). The datasets of both tracks contain various cases of plagiarism, which differ in the level of disguise applied when reusing text. The results show that our method performed quite well at detecting moderately disguised forms of plagiarism.

    The ParaPlag: Russian dataset for paraphrased plagiarism detection

    No full text
    The paper presents ParaPlag, a large Russian-language text dataset for evaluating and comparing the quality metrics of different plagiarism detection approaches that deal with big data. The PlagEvalRus-2017 competition, which aimed to evaluate plagiarism detection methods, uses ParaPlag as its main dataset for the source retrieval and text alignment tasks. ParaPlag is open and available on the Web. We propose a guide for writers who want to contribute to ParaPlag and extend it. We also present an analysis of the text rewriting techniques used by unscrupulous authors.

    Query Formulation for Source Retrieval based on Named Entities and N-grams Extraction

    No full text

    The Hybrid Method for Accurate Patent Classification

    No full text
    This article is dedicated to stacking two approaches to patent classification. The first is a linguistically supported k-nearest neighbors algorithm that searches for topically similar documents by comparing vectors of lexical descriptors. The second is fastText, a word-embedding-based method in which the sentence (or document) vector is obtained by averaging the n-gram embeddings, and a multinomial logistic regression then uses these vectors as features. We show on Russian and English datasets that the stacking classifier achieves better results than the single classifiers. © 2019, Pleiades Publishing, Ltd.
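
    The averaging step mentioned in the abstract above can be illustrated with a minimal sketch. It is a simplification and not the authors' code or the fastText implementation: the toy embedding table, the function names, and the choice of character trigrams are all assumptions made for illustration.

```python
from typing import Dict, List


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def document_vector(text: str,
                    embeddings: Dict[str, List[float]],
                    dim: int,
                    n: int = 3) -> List[float]:
    """Average the embeddings of every known n-gram found in the text."""
    total = [0.0] * dim
    count = 0
    for word in text.lower().split():
        for gram in char_ngrams(word, n):
            vec = embeddings.get(gram)
            if vec is None:
                continue  # n-grams without an embedding are skipped
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total] if count else total


# Hypothetical two-dimensional embedding table, for illustration only.
toy_embeddings = {"<ab": [1.0, 0.0], "ab>": [0.0, 1.0]}
doc_vec = document_vector("ab", toy_embeddings, dim=2)
```

    In the setup the abstract describes, vectors of this kind would then serve as the input features of the multinomial logistic regression.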

    Evaluating host-based intrusion detection on the ADFA-WD and ADFA-WD:SAA datasets

    No full text
    With the growth of the internet and the development of new technologies come advances in cyber-attack methods such as zero-day and stealth attacks, so a more effective method of network security is essential for network stability, for both personal and business use. This paper assesses normal and abnormal patterns of system calls based on the Dynamic-Link Library. The two datasets assessed were designed for a host-based intrusion detection system on the Windows operating system: the Australian Defence Force Academy Windows Dataset (ADFA-WD) and the Australian Defence Force Academy Windows Dataset: Stealth Attacks Addendum (ADFA-WD:SAA). A binary feature space is developed based on the common vulnerabilities and exposures known at the time the dataset was created. The data mining techniques implemented are a Support Vector Machine classifier with sigmoid and RBF kernels, which is compared to a Random Forest classifier. © 2017 CEUR-WS. All rights reserved.
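
    One way to picture a binary feature space over DLL-based system call traces is sketched below. This is an assumption-laden illustration, not the paper's feature construction: the vocabulary of DLL names and the presence/absence encoding are hypothetical, and the abstract does not say how the CVE information shapes the features.

```python
from typing import Iterable, List, Sequence


def binary_features(trace: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Encode a trace as 1/0 flags: does each vocabulary entry occur in it?"""
    seen = {call.lower() for call in trace}
    return [1 if name.lower() in seen else 0 for name in vocabulary]


# Hypothetical vocabulary and trace, for illustration only.
vocabulary = ["kernel32.dll", "ntdll.dll", "user32.dll"]
trace = ["ntdll.dll", "KERNEL32.dll", "ntdll.dll"]
features = binary_features(trace, vocabulary)
```

    Fixed-length vectors like this are the kind of input that classifiers such as an SVM or a Random Forest can be trained and compared on.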

    Method for Author Attribution Using Word Embeddings

    No full text
    In this paper we look at a methodology for revealing the author of an unknown document by extracting the author's characteristics from their writing style. The method explores identifying the sources of unknown documents, using a model of distributional semantics to form a set of queries to a search engine. The dataset used is that of the PAN @ CLEF 2019 shared task on Cross-domain Authorship Attribution. It covers English, French, Italian, and Spanish, each with 5 problems, for a total of 20 problems. The problem belongs to natural language processing, where attribution is used to identify an author's work. The method rests on the distributional hypothesis: linguistic units observed in similar contexts have similar meanings. Accordingly, the proximity of linguistic elements in terms of semantic load is calculated from their distribution in large text passages.
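
    The distributional hypothesis named in this abstract can be illustrated with a small sketch that is not taken from the paper: words are represented by counts of their neighbors within a window, and two words are compared by cosine similarity. The toy corpus, the window size, and the function names are assumptions.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def cooccurrence_vectors(sentences: List[str], window: int = 2) -> Dict[str, Counter]:
    """Count, for each word, the words appearing within `window` positions of it."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Toy corpus: "cat" and "dog" share their contexts, "car" mostly does not.
corpus = ["the cat drinks milk", "the dog drinks milk", "the car needs fuel"]
vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_car = cosine(vectors["cat"], vectors["car"])
```

    Words that occur in the same contexts end up with a higher similarity than words that do not, which is the property a distributional model exploits.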

    Distributional models and auxiliary methods for determining the hypernyms of words in Russian

    No full text

    Estimating feature informativeness based on the topical importance characteristic in the classification of a news message stream

    No full text
    The paper presents an approach for ranking the most valuable features for the text classification task. The introduced Topical Importance Characteristic underlies a feature selection method that uses information about the distribution of words or phrases among topics. We compare this method to the well-known TF-IDF approach and use the introduced word-ranking scheme in two classifiers: Random Forest and Multinomial Naïve Bayes. Classification accuracy was tested on the “20 Newsgroups” dataset. The developed approach outperforms TF-IDF-based methods and matches the accuracy achieved by more powerful state-of-the-art approaches such as SVC on the same dataset.
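
    For reference, the TF-IDF baseline that the abstract compares against can be sketched as follows. This shows only one common TF-IDF variant (relative term frequency times the logarithm of inverse document frequency) and does not reproduce the Topical Importance Characteristic, which the abstract does not define. The toy documents are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by relative frequency times log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy tokenized documents, for illustration only.
docs = [["space", "launch", "news"], ["hockey", "game", "news"]]
weights = tf_idf(docs)
```

    A term that appears in every document ("news" here) receives zero weight, while terms specific to one document receive positive weight. TF-IDF ignores topic labels, whereas the approach in the abstract draws on how words are distributed among topics.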