35 research outputs found

    Contextual Outlier Interpretation

    Outlier detection plays an essential role in many data-driven applications by identifying isolated instances that differ from the majority. While many statistical learning and data mining techniques have been used to develop more effective outlier detection algorithms, the interpretation of detected outliers has not received much attention. Interpretation is becoming increasingly important in helping people trust and evaluate the developed models by providing intrinsic reasons why certain outliers are chosen. It is difficult, if not impossible, to simply apply feature selection to explain outliers, owing to the distinct characteristics of various detection models, the complicated structure of data in certain applications, and the imbalanced distribution of outliers and normal instances. In addition, the role of the contrastive contexts in which outliers are located, as well as the relation between outliers and contexts, is usually overlooked in interpretation. To tackle these issues, in this paper we propose a novel Contextual Outlier INterpretation (COIN) method to explain the abnormality of existing outliers spotted by detectors. The interpretability of an outlier is achieved from three aspects: its outlierness score, the attributes that contribute to the abnormality, and a contextual description of its neighborhoods. Experimental results on various types of datasets demonstrate the flexibility and effectiveness of the proposed framework compared with existing interpretation approaches.

    Video Mining using LIM Based Clustering and Self Organizing Maps

    Video mining has grown into an active research area and has received increasing attention in recent years owing to the rapid rise in the volume of digital video databases. The aim of this research is to detect new objects in videos. This work proposes a novel approach to video mining that uses an LIM-based clustering technique and self-organizing maps to recognize novelty in the frames of a video sequence. The proposed approach is designed and implemented in MATLAB. It was tested on sample videos and gave promising results. It is suitable for day-to-day video mining applications and object detection systems, including remote video surveillance in defense for national and international border tracking.

    The Nature of Novelty Detection

    Sentence-level novelty detection aims at removing redundant sentences from a sentence list. In this task, sentences appearing later in the list that carry no new meaning are eliminated. Aiming at better accuracy in detecting redundancy, this paper reveals the nature of the novelty detection task that is currently overlooked by the Novelty community: Novelty as a combination of the partial overlap (PO, two sentences sharing common facts) and complete overlap (CO, the first sentence covers all the facts of the second sentence) relations. By formalizing novelty detection as a combination of these two relations between sentences, new viewpoints on techniques dealing with Novelty are proposed. Among the methods discussed, the similarity, overlap, pool and language modeling approaches are commonly used. Furthermore, a novel approach, the selected pool method, is provided, which follows directly from the nature of the task. Experimental results obtained on all three currently available novelty datasets showed that selected pool is significantly better or no worse than the current methods. Knowledge about the nature of the task also affects the evaluation methodologies. We propose new evaluation measures for Novelty according to the nature of the task, as well as possible directions for future study. Comment: This paper pointed out the future direction for novelty detection research. 37 pages, double spaced version.
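
    The PO and CO relations defined in the abstract above can be illustrated with a minimal Python sketch. It is not the paper's selected pool method. It assumes, purely for illustration, that the "facts" of a sentence can be approximated by its set of lowercased word tokens. The names facts, relation and filter_redundant are hypothetical.

```python
def facts(sentence):
    """Crude stand-in for the facts of a sentence (assumption): its set of lowercased tokens."""
    return set(sentence.lower().split())


def relation(earlier, later):
    """Classify how `later` relates to `earlier`: 'CO', 'PO', or 'none'."""
    a, b = facts(earlier), facts(later)
    if b and b <= a:
        return "CO"    # complete overlap: the earlier sentence covers all facts of the later one
    if a & b:
        return "PO"    # partial overlap: the two sentences share some facts
    return "none"


def filter_redundant(sentences):
    """Keep a sentence unless an already-kept sentence completely overlaps it."""
    kept = []
    for s in sentences:
        if not any(relation(k, s) == "CO" for k in kept):
            kept.append(s)
    return kept


sentences = [
    "the cat sat on the mat",
    "the cat sat",
    "a dog barked at the cat",
]
novel = filter_redundant(sentences)
print(novel)
```

    In this toy version only a CO relation with a single earlier sentence removes a later one. A PO relation alone leaves the sentence in place, because part of its content is still new.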

    Extracting News Events from Microblogs

    The Twitter stream has become a large source of information for many people, but the volume of tweets and the noisy nature of their content have long made harvesting knowledge from Twitter a challenging task for researchers. Aiming to overcome some of the main challenges of extracting hidden information from tweet streams, this work proposes a new approach for real-time detection of news events from the Twitter stream. We divide our approach into three steps. The first step is to use a neural network or deep learning to detect news-relevant tweets in the stream. The second step is to apply a novel streaming data clustering algorithm to the detected news tweets to form news events. The third and final step is to rank the detected events based on the size of the event clusters and the growth speed of the tweet frequencies. We evaluate the proposed system on a large, publicly available corpus of annotated news events from Twitter. As part of the evaluation, we compare our approach with a related state-of-the-art solution. Overall, our experiments and user-based evaluation show that our approach to detecting current (real) news events delivers state-of-the-art performance.
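
    The third step, ranking events by cluster size and growth speed, could be sketched roughly as follows. The abstract does not say how the two signals are combined, so the additive score, the window length, and the names EventCluster, growth and rank_events are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EventCluster:
    label: str
    timestamps: List[float]  # posting times of the tweets in the cluster, in seconds


def growth(cluster, now, window):
    """Tweets in the latest window minus tweets in the window before it."""
    recent = sum(1 for t in cluster.timestamps if now - window < t <= now)
    previous = sum(1 for t in cluster.timestamps
                   if now - 2 * window < t <= now - window)
    return recent - previous


def rank_events(clusters, now, window=60.0, growth_weight=1.0):
    """Sort clusters by a hypothetical score: size plus weighted growth."""
    def score(c):
        return len(c.timestamps) + growth_weight * growth(c, now, window)
    return sorted(clusters, key=score, reverse=True)


# Hypothetical clusters: A is larger but stale, B is smaller but growing.
a = EventCluster("A", [0.0, 1.0, 2.0, 3.0, 4.0])
b = EventCluster("B", [150.0, 160.0, 170.0, 180.0])
ranked = rank_events([a, b], now=200.0)
labels = [c.label for c in ranked]
print(labels)
```

    With these assumed weights the smaller but fast-growing cluster ranks first. Setting growth_weight to 0 reduces the score to cluster size alone.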

    A Temporal Frequent Itemset-Based Clustering Approach For Discovering Event Episodes From News Sequence

    When performing environmental scanning, organizations typically deal with numerous events and topics concerning their core business, relevant technical standards, competitors, and market, where each event or topic to monitor or track is generally associated with many news documents. To reduce information overload and information fatigue when monitoring or tracking such events, it is essential to develop an effective event episode discovery mechanism for organizing all news documents pertaining to an event of interest. In this study, we propose the time-adjoining frequent itemset-based event-episode discovery (TAFIED) technique. Based on the frequent itemset-based hierarchical clustering (FIHC) approach, our proposed TAFIED further considers the temporal characteristics of news articles, including the burst, novelty, and temporal proximity of features in an event episode, when discovering event episodes from the sequence of news articles pertaining to a specific event. Using the traditional feature-based HAC, HAC with a time-decaying function (HAC+TD), and FIHC techniques as performance benchmarks, our empirical evaluation results suggest that the proposed TAFIED technique outperforms all evaluation benchmarks in cluster recall and cluster precision.

    An Optimization Approach to Selecting Anomaly Detection Methods in Homogeneous Text Collections

    The problem of detecting anomalous documents in text collections is considered. Existing anomaly detection methods are not universal and do not give stable results across different data sets. The accuracy of the results depends on the choice of parameters at each step of the algorithm, and different sets of parameters are optimal for different collections. Not all existing anomaly detection algorithms work effectively with text data, whose vector representation is high-dimensional and strongly sparse. The problem of finding anomalies is considered in the following setting: a new document uploaded to an applied intelligent information system must be checked for congruence with a homogeneous collection of documents stored in it. In such systems that process legal documents, the following requirements are imposed on anomaly detection methods: high accuracy, computational efficiency, reproducibility of results, and explainability of the solution. Methods satisfying these conditions are investigated. The paper examines the possibility of rating text documents on a scale of anomalousness by deliberately introducing a foreign document into the collection. A strategy for detecting the novelty of a document relative to the collection is proposed, which involves a well-founded selection of methods and parameters. It is shown how the accuracy of the solution is affected by the choice of vectorization options, tokenization principles, dimensionality reduction methods, and parameters of the novelty detection algorithms. The experiment was conducted on two homogeneous collections of documents containing technical norms: standards in the field of information technology and in the field of railways. The following approaches were used: calculation of the anomaly index as the Hellinger distance between the distributions of the distances of documents to the center of the collection and to the foreign document; and optimization of the novelty detection algorithms depending on the methods of vectorization and dimensionality reduction. The vector space was constructed using the TF-IDF transformation and ARTM topic modeling. The following algorithms were tested: Isolation Forest, Local Outlier Factor, and One-Class SVM (a variant of the support vector machine). The experiment confirmed the effectiveness of the proposed optimization strategy for determining the appropriate anomaly detection method for a given text collection. When searching for an anomaly in the context of topic clustering of legal documents, the Isolation Forest method proved effective. When vectorizing documents using TF-IDF, it is advisable to choose the optimal dictionary parameters and use the One-Class SVM method with the corresponding feature space transformation function.
