16 research outputs found
Paraphrased plagiarism detection using sentence similarity
The paper describes an approach to plagiarism detection developed for the PlagEvalRus-2017 competition. Our system leverages deep parsing to detect moderately disguised plagiarism. We participated in both tracks of the competition: source retrieval (source detection) and text alignment (paraphrased plagiarism detection). The datasets of both tracks contain various cases of plagiarism that differ in the level of disguise applied while reusing text. The results show that our method performs well at detecting moderately disguised forms of plagiarism.
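The text alignment track asks a system to pair reused sentences in a suspicious document with their sources. As a hedged illustration only (the paper's actual system relies on deep parsing; all names below are hypothetical), a minimal sentence-similarity aligner can be sketched with cosine similarity over bag-of-words vectors:

```python
# Illustrative sketch of sentence-level text alignment via cosine
# similarity of bag-of-words count vectors. The paper's real method
# uses deep parsing; this simplification only shows the alignment idea.
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity between two sentences' word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def align(suspicious, source, threshold=0.4):
    """Pair each suspicious sentence with its best-matching source sentence."""
    pairs = []
    for i, s in enumerate(suspicious):
        j, score = max(((j, cosine_sim(s, t)) for j, t in enumerate(source)),
                       key=lambda x: x[1])
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs
```

A paraphrased reuse shares enough vocabulary to exceed the threshold, while unrelated sentences fall below it; deep parsing would additionally match sentences whose surface words differ.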
The ParaPlag: Russian dataset for paraphrased plagiarism detection
The paper presents ParaPlag, a large Russian-language text dataset for evaluating and comparing quality metrics of plagiarism detection approaches that deal with big data. The PlagEvalRus-2017 competition, which evaluated plagiarism detection methods, uses ParaPlag as its main dataset for the source retrieval and text alignment tasks. ParaPlag is open and available on the Web. We propose a guide for writers who want to contribute to and extend ParaPlag. Our research also presents an analysis of the text-rewriting techniques used by unscrupulous authors.
The Hybrid Method for Accurate Patent Classification
This article is dedicated to stacking two approaches to patent classification. The first is based on a linguistically supported k-nearest-neighbors algorithm that searches for topically similar documents by comparing vectors of lexical descriptors. The second is the word-embedding-based fastText, in which a sentence (or document) vector is obtained by averaging the n-gram embeddings, and a multinomial logistic regression then exploits these vectors as features. We show on Russian and English datasets that the stacking classifier outperforms the single classifiers. © 2019, Pleiades Publishing, Ltd.
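The two building blocks of that pipeline can be sketched as follows. This is a hedged toy illustration, not the paper's implementation: the embedding table, the soft-voting combiner, and all names are assumptions, and the authors' exact stacking scheme may differ.

```python
# Hypothetical sketch of the abstract's two ingredients:
# (1) a fastText-style document vector = mean of token embeddings,
# (2) stacking two base classifiers by combining their class probabilities.
# The embedding values and the averaging combiner are illustrative only.

def average_embeddings(tokens, emb):
    """fastText-style document vector: mean of the known token vectors."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * len(next(iter(emb.values())))
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def soft_vote(probs_a, probs_b, weight_a=0.5):
    """Combine two classifiers' class-probability vectors by weighted average."""
    return [weight_a * pa + (1 - weight_a) * pb
            for pa, pb in zip(probs_a, probs_b)]

def predict(probs, labels):
    """Return the label with the highest combined probability."""
    return labels[max(range(len(probs)), key=probs.__getitem__)]
```

In the full system the averaged vectors would feed a multinomial logistic regression, and the kNN over lexical descriptors would supply the second probability vector.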
Evaluating host-based intrusion detection on the ADFA-WD and ADFA-WD:SAA datasets
With the growth of the internet and the development of new technologies come advances in methods of cyber-attack, such as zero-day and stealth attacks, so a more effective approach to network safety is essential for network stability for both personal and business use. This paper assesses anomalous patterns of normal and abnormal behaviour comprised of system calls based on the Dynamic-Link Library. The two datasets assessed were designed on the Windows operating system for a host-based intrusion detection system: the Australian Defence Force Academy Windows Dataset (ADFA-WD) and the Australian Defence Force Academy Windows Dataset: Stealth Attacks Addendum (ADFA-WD:SAA). A binary feature space is developed based on the common vulnerabilities and exposures known at the time of the creation of the dataset. Among the data mining techniques implemented, a Support Vector Machine classifier with sigmoid and RBF kernels is compared to a Random Forest classifier. © 2017 CEUR-WS. All rights reserved.
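The binary feature space mentioned above can be illustrated with a small sketch: each trace is encoded as a fixed-order 0/1 vector over a vocabulary of DLL-based call features, which then feeds an SVM or Random Forest. The feature names below are invented for illustration and are not from the datasets.

```python
# Hedged sketch of a binary feature space for host-based intrusion
# detection: a trace of observed calls becomes a 0/1 presence vector
# over a fixed feature vocabulary. Feature names are hypothetical.

def binary_features(trace, vocabulary):
    """Map a list of observed calls to a fixed-order 0/1 vector."""
    present = set(trace)
    return [1 if feat in present else 0 for feat in vocabulary]

# Hypothetical vocabulary of DLL/system-call features.
VOCAB = ["ntdll.NtOpenFile",
         "kernel32.CreateProcessW",
         "advapi32.RegSetValueExW"]
```

The resulting vectors are what kernel choice (sigmoid vs. RBF) or tree ensembles then operate on.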
Method for Author Attribution Using Word Embeddings
In this paper we look at a methodology for revealing the author of an unknown document by extracting the author's characteristics from their writing style. The method identifies the sources of unknown documents by using a distributional semantics model to form a set of queries to a search engine. The dataset used is from the PAN @ CLEF 2019 shared task on Cross-domain Authorship Attribution and covers four languages: English, French, Italian, and Spanish, each of which contains 5 problems, for a total of 20 problems. The task belongs to Natural Language Processing, where attribution of a writer's characteristics can be used to identify an author's work. The method for revealing unknown authors is based on distributional semantics and the following hypothesis: linguistic units that are observed in similar contexts have similar semantic meaning. The analysed linguistic units are therefore scored by the proximity of linguistic elements in terms of semantic load, based on their distribution in large text passages.
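The query-formation step described above can be sketched as expanding a document's characteristic terms with their nearest neighbours in embedding space before issuing them to a search engine. This is a hedged toy version under assumed names and embeddings, not the paper's implementation:

```python
# Illustrative sketch of query expansion via distributional semantics:
# each characteristic term is augmented with its nearest neighbours in
# a word-embedding space. Embedding values and names are hypothetical.
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_query(terms, emb, k=1):
    """Add the k nearest embedding-space neighbours of each query term."""
    expanded = list(terms)
    for t in terms:
        if t not in emb:
            continue
        neighbours = sorted((w for w in emb if w != t),
                            key=lambda w: cosine(emb[t], emb[w]),
                            reverse=True)
        expanded.extend(neighbours[:k])
    return expanded
```

The expanded term set is then used as search-engine queries whose results point to candidate source documents.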
Evaluating feature informativeness based on a topical importance characteristic for classifying a stream of news messages
The paper presents an approach for ranking the most valuable features for the text classification task. The introduced Topical Importance Characteristic underlies a feature selection method that incorporates information about the distribution of words or phrases among topics. We compare this method to the well-known TF-IDF approach and use the introduced word-ranking scheme in two classifiers: Random Forest and Multinomial Naïve Bayes. Classification accuracy was tested on the "20 Newsgroups" dataset. The developed approach outperforms TF-IDF-based methods and matches the accuracy achieved by more powerful state-of-the-art approaches, such as SVC, on the same dataset.
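The intuition behind topic-aware feature weighting can be illustrated with a toy sketch. The abstract does not give the exact formula of the Topical Importance Characteristic, so the version below is an assumption: a word's weight is simply the largest share of its occurrences concentrated in a single topic, so topic-specific words score high and evenly spread words score low.

```python
# Hedged toy illustration of topic-aware feature weighting. This is NOT
# the paper's actual Topical Importance Characteristic (the formula is
# not given in the abstract): here a word's weight is the maximum share
# of its occurrences falling within one topic.

def topical_importance(counts_per_topic):
    """counts_per_topic: occurrences of one word in each topic class."""
    total = sum(counts_per_topic)
    return max(counts_per_topic) / total if total else 0.0
```

Under this proxy, a word appearing almost exclusively in one newsgroup gets a weight near 1, while a word spread evenly over n topics gets 1/n, which is the kind of signal TF-IDF's document-frequency term cannot capture.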