Contextual Outlier Interpretation
Outlier detection plays an essential role in many data-driven applications by identifying isolated instances that differ from the majority. While many statistical learning and data mining techniques have been used to develop more effective outlier detection algorithms, the interpretation of detected outliers has received much less attention. Interpretation is becoming increasingly important for helping people trust and evaluate the developed models, as it provides intrinsic reasons why certain outliers are flagged. It is difficult, if not impossible, to simply apply feature selection to explain outliers, owing to the distinct characteristics of various detection models, the complicated structure of data in certain applications, and the imbalanced distribution of outliers and normal instances. In addition, the role of the contrastive contexts in which outliers reside, as well as the relation between outliers and their contexts, is usually overlooked in interpretation. To tackle these issues, in this paper we propose a novel Contextual Outlier INterpretation (COIN) method to explain the abnormality of existing outliers spotted by detectors. Interpretability for an outlier is achieved from three aspects: an outlierness score, the attributes that contribute to the abnormality, and a contextual description of its neighborhood. Experimental results on various types of datasets demonstrate the flexibility and effectiveness of the proposed framework compared with existing interpretation approaches.
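The three aspects of interpretability named above (outlierness score, contributing attributes, contextual neighborhood description) can be made concrete with a minimal neighborhood-based sketch. This is not the authors' COIN implementation: the scoring rule (mean distance to the k nearest neighbors) and the attribution rule (normalized squared deviation from the neighborhood mean) are simplified assumptions chosen only for illustration.

```python
import numpy as np

def explain_outlier(X, x, k=5):
    """Explain a suspected outlier x against a dataset X.

    Returns an outlierness score (mean distance to the k nearest
    neighbours), per-attribute contributions (squared deviation from
    the neighbourhood mean, normalised to sum to 1), and a contextual
    description of the neighbourhood (its mean and std)."""
    d = np.linalg.norm(X - x, axis=1)       # distance to every instance
    nearest = np.argsort(d)[:k]             # the contextual neighbourhood
    nn = X[nearest]
    score = float(d[nearest].mean())        # outlierness score
    dev = (x - nn.mean(axis=0)) ** 2        # per-attribute deviation
    contrib = dev / dev.sum()               # attribute contributions
    context = {"mean": nn.mean(axis=0), "std": nn.std(axis=0)}
    return score, contrib, context

# toy data: a tight 2-D cluster plus one point far away on the second axis
rng = np.random.default_rng(0)
X = rng.normal(0.0, 0.1, size=(50, 2))
x = np.array([0.0, 3.0])
score, contrib, context = explain_outlier(X, x)
```

On this toy data the second attribute receives almost all of the contribution mass, which is exactly the kind of "which attributes make this point abnormal, and relative to which neighbors" statement the abstract describes.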
Video Mining using LIM Based Clustering and Self Organizing Maps
Abstract: Video mining has grown into an active research area and has received increasing attention in recent years due to the impressive and rapid growth in the volume of digital video databases. The aim of this research work is to detect new objects in videos. This work proposes a novel approach to video mining that uses an LIM-based clustering technique and self-organizing maps to recognize novelty in the frames of a video sequence. The proposed work is designed and implemented in MATLAB. It has been tested with sample videos and provides promising results, and it is suitable for everyday video mining applications and object detection systems, including remote video surveillance in defense for national and international border tracking.
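The self-organizing-map side of such a pipeline can be illustrated in a few lines: a map trained on feature vectors of ordinary frames reconstructs them with low quantization error, while a frame containing a novel object maps poorly to every unit. The sketch below is a deliberately simplified winner-take-all map (no neighborhood function, so it behaves like online k-means) and has no connection to the paper's LIM-based clustering; the 3-D "frame features" are invented stand-ins.

```python
import numpy as np

def train_som(data, n_units=4, epochs=50, lr=0.5, seed=0):
    """Train a tiny winner-take-all map: the best-matching unit is
    pulled toward each sample with a decaying learning rate."""
    rng = np.random.default_rng(seed)
    w = data[rng.choice(len(data), n_units, replace=False)].astype(float)
    for t in range(epochs):
        a = lr * (1.0 - t / epochs)        # decaying learning rate
        for i in rng.permutation(len(data)):
            bmu = int(np.argmin(np.linalg.norm(w - data[i], axis=1)))
            w[bmu] += a * (data[i] - w[bmu])
    return w

def quantization_error(w, x):
    """Distance from x to its best-matching unit."""
    return float(np.min(np.linalg.norm(w - x, axis=1)))

# "frames" summarised as 3-D feature vectors drawn from two known scenes
rng = np.random.default_rng(1)
frames = np.vstack([rng.normal(0.0, 0.05, (30, 3)),
                    rng.normal(1.0, 0.05, (30, 3))])
som = train_som(frames)
novel = np.array([5.0, 5.0, 5.0])          # a frame unlike anything seen
```

A frame is flagged as novel when its quantization error exceeds a threshold calibrated on the training frames; the threshold choice is an application decision.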
The Nature of Novelty Detection
Sentence level novelty detection aims at reducing redundant sentences from a
sentence list. In the task, sentences appearing later in the list with no new
meanings are eliminated. Aiming at a better accuracy for detecting redundancy,
this paper reveals the nature of the novelty detection task currently
overlooked by the Novelty community: Novelty is a combination of the partial
overlap (PO, two sentences sharing common facts) and complete overlap (CO, the
first sentence covers all the facts of the second sentence) relations. By
formalizing novelty detection as a combination of the two relations between
sentences, new viewpoints toward techniques dealing with Novelty are proposed.
Among the methods discussed, the similarity, overlap, pool, and language
modeling approaches are commonly used. Furthermore, a novel approach, the
selected pool method, is provided, which follows directly from the nature of
the task. Experimental results obtained on all three currently available
novelty datasets show that selected pool is significantly better than, or no
worse than, the current methods. Knowledge about the nature of the task also
affects the
evaluation methodologies. We propose new evaluation measures for Novelty
according to the nature of the task, as well as possible directions for future
study.
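The PO/CO distinction can be sketched with a crude word-overlap toy: treat a sentence's content words as its "facts" and eliminate a sentence whose facts are completely covered (CO) by the pooled facts of earlier kept sentences, while sentences with only partial overlap (PO) are kept as novel. This is an illustrative toy, not the selected pool method from the paper; the stop-word list is an arbitrary assumption.

```python
STOP = frozenset({"the", "a", "an", "of", "is", "are", "in", "to", "and", "on"})

def content_words(sentence):
    """Crude 'fact' extraction: lowercased content words."""
    return {w.lower().strip(".,") for w in sentence.split()} - STOP

def novelty_filter(sentences):
    """Drop a sentence when its content words are completely covered
    by the pool of words from earlier kept sentences (the CO relation);
    sentences with only partial overlap (PO) are kept as novel."""
    kept, pool = [], set()
    for s in sentences:
        words = content_words(s)
        if words - pool:            # at least one uncovered fact -> novel
            kept.append(s)
            pool |= words
        # else: complete overlap -> redundant, eliminated
    return kept

docs = [
    "The rover landed on Mars.",
    "The rover landed on Mars.",            # complete overlap: dropped
    "The rover found water ice on Mars.",   # partial overlap: kept
]
```

Real systems replace the word-set test with similarity, overlap, or language-model scores, but the filtering structure stays the same.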
Extracting News Events from Microblogs
The Twitter stream has become a large source of information for many people, but
the magnitude of tweets and the noisy nature of its content have made
harvesting the knowledge from Twitter a challenging task for researchers for a
long time. Aiming at overcoming some of the main challenges of extracting the
hidden information from tweet streams, this work proposes a new approach for
real-time detection of news events from the Twitter stream. We divide our
approach into three steps. The first step is to use a neural network or deep
learning to detect news-relevant tweets from the stream. The second step is to
apply a novel streaming data clustering algorithm to the detected news tweets
to form news events. The third and final step is to rank the detected events
based on the size of the event clusters and growth speed of the tweet
frequencies. We evaluate the proposed system on a large, publicly available
corpus of annotated news events from Twitter. As part of the evaluation, we
compare our approach with a related state-of-the-art solution. Overall, our
experiments and user-based evaluation show that our approach to detecting
current (real) news events delivers state-of-the-art performance.
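The third step, ranking detected events by cluster size and tweet-frequency growth, can be sketched as a simple weighted score. The scoring formula, the weights, and the toy events below are illustrative assumptions, not the ranking function used in the paper.

```python
from collections import namedtuple

Event = namedtuple("Event", "name size first_ts last_ts")

def rank_events(events, size_weight=1.0, growth_weight=1.0):
    """Rank events by cluster size plus tweet-frequency growth,
    measured as tweets per second over the event's lifetime."""
    def score(e):
        span = max(e.last_ts - e.first_ts, 1.0)  # seconds; avoid div by zero
        growth = e.size / span                   # tweets per second
        return size_weight * e.size + growth_weight * growth
    return sorted(events, key=score, reverse=True)

# a slow-burning story vs. a burst of tweets about breaking news
events = [
    Event("slow story", size=40, first_ts=0.0, last_ts=4000.0),
    Event("breaking news", size=50, first_ts=0.0, last_ts=70.0),
]
ranked = rank_events(events)
```

Tuning the two weights trades off "big" events against "fast-growing" ones, which is the knob such a ranking step exposes.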
A Temporal Frequent Itemset-Based Clustering Approach For Discovering Event Episodes From News Sequence
When performing environmental scanning, organizations typically deal with numerous events and topics concerning their core business, relevant technical standards, competitors, and market, where each event or topic to monitor or track is generally associated with many news documents. To reduce information overload and information fatigue when monitoring or tracking such events, it is essential to develop an effective event episode discovery mechanism for organizing all news documents pertaining to an event of interest. In this study, we propose the time-adjoining frequent itemset-based event-episode discovery (TAFIED) technique. Based on the frequent itemset-based hierarchical clustering (FIHC) approach, our proposed TAFIED further considers the temporal characteristics of news articles, including the burst, novelty, and temporal proximity of features in an event episode, when discovering event episodes from a sequence of news articles pertaining to a specific event. Using the traditional feature-based HAC, HAC with a time-decaying function (HAC+TD), and FIHC techniques as performance benchmarks, our empirical evaluation results suggest that the proposed TAFIED technique outperforms all evaluation benchmarks in cluster recall and cluster precision.
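The core idea of combining a shared frequent itemset with temporal proximity can be sketched greedily: an article joins an episode only if it shares enough terms with the episode's running itemset and arrives soon enough after the episode's last article. This toy is neither FIHC nor TAFIED (it has no burst or novelty weighting, and no hierarchical step); `min_shared`, `max_gap`, and the sample articles are hypothetical.

```python
def episodes(articles, min_shared=2, max_gap=2.0):
    """Greedy episode discovery over (timestamp_days, term_set) articles,
    processed in time order. An article joins an episode when it arrives
    within max_gap days of the episode's last article AND shares at least
    min_shared terms with the episode's running itemset; the itemset is
    tightened to the intersection. Otherwise it opens a new episode."""
    eps = []  # each episode: [last_ts, shared_terms, members]
    for ts, terms in sorted(articles, key=lambda a: a[0]):
        for ep in eps:
            if ts - ep[0] <= max_gap and len(ep[1] & terms) >= min_shared:
                ep[1] &= terms          # tighten the shared itemset
                ep[0] = ts              # advance the episode's last article
                ep[2].append((ts, terms))
                break
        else:
            eps.append([ts, set(terms), [(ts, terms)]])
    return eps

news = [
    (0.0, {"merger", "acme", "deal"}),
    (1.0, {"merger", "acme", "regulators"}),
    (9.0, {"merger", "acme", "approved"}),  # same topic, later burst
]
```

Note how the third article shares the itemset but falls outside the temporal window, so it starts a new episode: the temporal test is what separates episodes of the same event.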
An Optimization Approach to Selecting Anomaly Detection Methods for Homogeneous Text Collections
The problem of detecting anomalous documents in text collections is considered. The existing methods for detecting anomalies are not universal and do not show stable results on different data sets. The accuracy of the results depends on the choice of parameters at each step of the algorithm, and for different collections different sets of parameters are optimal. Not all of the existing anomaly detection algorithms work effectively with text data, whose vector representation is characterized by high dimensionality and strong sparsity. The problem of finding anomalies is considered in the following setting: a new document uploaded to an applied intelligent information system must be checked for congruence with a homogeneous collection of documents stored in it. In such systems, which process legal documents, the following requirements are imposed on anomaly detection methods: high accuracy, computational efficiency, reproducibility of results, and explainability of the solution. Methods satisfying these conditions are investigated. The paper examines the possibility of evaluating text documents on an anomaly scale by deliberately introducing a foreign document into the collection. A strategy for detecting the novelty of a document with respect to the collection is proposed, which assumes a reasoned selection of methods and parameters. It is shown how the accuracy of the solution is affected by the choice of vectorization options, tokenization principles, dimensionality reduction methods, and parameters of the novelty detection algorithms. The experiment was conducted on two homogeneous collections of documents containing technical norms: standards in the field of information technology and railways.
The following approaches were used: calculation of the anomaly index as the Hellinger distance between the distributions of the remoteness of documents from the center of the collection and from the foreign document; and optimization of the novelty detection algorithms depending on the methods of vectorization and dimensionality reduction. The vector space was constructed using the TF-IDF transformation and ARTM topic modeling. The following algorithms were tested: Isolation Forest, Local Outlier Factor, and One-Class SVM (based on the Support Vector Machine). The experiment confirmed the effectiveness of the proposed optimization strategy for determining an appropriate anomaly detection method for a given text collection. When searching for anomalies in the context of topic clustering of legal documents, the Isolation Forest method proved effective. When vectorizing documents using TF-IDF, it is advisable to choose optimal dictionary parameters and use the One-Class SVM method with a corresponding feature space transformation function.
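A bare-bones version of the "foreign document vs. homogeneous collection" check can be sketched without any of the libraries above: build TF-IDF vectors and measure each document's cosine distance from the collection centroid, flagging the most remote one. This is only a distance-to-centroid toy, not the Hellinger-distance index or the Isolation Forest / LOF / One-Class SVM comparison from the paper; the four-document corpus is invented.

```python
import numpy as np

def tfidf(docs):
    """Minimal TF-IDF over a list of token lists, L2-normalised per row."""
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d:
            tf[r, idx[w]] += 1
    df = (tf > 0).sum(axis=0)                  # document frequency per term
    X = tf * np.log(len(docs) / df)            # tf * idf
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def centroid_distances(X):
    """Cosine distance of each (unit-length) row to the collection centroid."""
    c = X.mean(axis=0)
    return 1 - X @ c / np.linalg.norm(c)

# three homogeneous "technical norm" documents plus one foreign document
corpus = [t.split() for t in [
    "standard defines network protocol requirements",
    "standard specifies protocol conformance requirements",
    "network protocol standard test requirements",
    "recipe for chocolate cake with butter",
]]
dist = centroid_distances(tfidf(corpus))
```

The per-document L2 normalization matters here: without it, the foreign document's rare (high-IDF) terms would dominate both its vector and the centroid, masking its remoteness.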