11 research outputs found

    The Nature of Novelty Detection

    Full text link
    Sentence level novelty detection aims at reducing redundant sentences from a sentence list. In the task, sentences appearing later in the list with no new meanings are eliminated. Aiming at a better accuracy for detecting redundancy, this paper reveals the nature of the novelty detection task currently overlooked by the Novelty community βˆ’- Novelty as a combination of the partial overlap (PO, two sentences sharing common facts) and complete overlap (CO, the first sentence covers all the facts of the second sentence) relations. By formalizing novelty detection as a combination of the two relations between sentences, new viewpoints toward techniques dealing with Novelty are proposed. Among the methods discussed, the similarity, overlap, pool and language modeling approaches are commonly used. Furthermore, a novel approach, selected pool method is provided, which is immediate following the nature of the task. Experimental results obtained on all the three currently available novelty datasets showed that selected pool is significantly better or no worse than the current methods. Knowledge about the nature of the task also affects the evaluation methodologies. We propose new evaluation measures for Novelty according to the nature of the task, as well as possible directions for future study.Comment: This paper pointed out the future direction for novelty detection research. 37 pages, double spaced versio

    Novel and topical business news and their impact on stock market activities

    Full text link
    We propose an indicator to measure the degree to which a particular news article is novel, as well as an indicator to measure the degree to which a particular news item attracts attention from investors. The novelty measure is obtained by comparing the extent to which a particular news article is similar to earlier news articles, and an article is regarded as novel if there was no similar article before it. On the other hand, we say a news item receives a lot of attention and thus is highly topical if it is simultaneously reported by many news agencies and read by many investors who receive news from those agencies. The topicality measure for a news item is obtained by counting the number of news articles whose content is similar to an original news article but which are delivered by other news agencies. To check the performance of the indicators, we empirically examine how these indicators are correlated with intraday financial market indicators such as the number of transactions and price volatility. Specifically, we use a dataset consisting of over 90 million business news articles reported in English and a dataset consisting of minute-by-minute stock prices on the New York Stock Exchange and the NASDAQ Stock Market from 2003 to 2014, and show that stock prices and transaction volumes exhibited a significant response to a news article when it is novel and topical.Comment: 8 pages, 6 figures, 2 table

    Exploring the technical challenges of large-scale lifelogging

    Get PDF
    Ambiently and automatically maintaining a lifelog is an activity that may help individuals track their lifestyle, learning, health and productivity. In this paper we motivate and discuss the technical challenges of developing real-world lifelogging solutions, based on seven years of experience. The gathering, organisation, retrieval and presentation challenges of large-scale lifelogging are dis- cussed and we show how this can be achieved and the benefits that may accrue

    Novelty Detection by Latent Semantic Indexing

    Get PDF
    As a new topic in text mining, novelty detection is a natural extension of information retrieval systems, or search engines. Aiming at refining raw search results by filtering out old news and saving only the novel messages, it saves modern people from the nightmare of information overload. One of the difficulties in novelty detection is the inherent ambiguity of language, which is the carrier of information. Among the sources of ambiguity, synonymy proves to be a notable factor. To address this issue, previous studies mainly employed WordNet, a lexical database which can be perceived as a thesaurus. Rather than borrowing a dictionary, we proposed a statistical approach employing Latent Semantic Indexing (LSI) to learn semantic relationship automatically with the help of language resources. To apply LSI which involves matrix factorization, an immediate problem is that the dataset in novelty detection is dynamic and changing constantly. As an imitation of real-world scenario, texts are ranked in chronological order and examined one by one. Each text is only compared with those having appeared earlier, while later ones remain unknown. As a result, the data matrix starts as a one-row vector representing the first report, and has a new row added at the bottom every time we read a new document. Such a changing dataset makes it hard to employ matrix methods directly. Although LSI has long been acknowledged as an effective text mining method when considering semantic structure, it has never been used in novelty detection, nor have other statistical treatments. We tried to change this situation by introducing external text source to build the latent semantic space, onto which the incoming news vectors were projected. We used the Reuters-21578 dataset and the TREC data as sources of latent semantic information. Topics were divided into years and types in order to take the differences between them into account. Results showed that LSI, though very effective in traditional information retrieval tasks, had only a slight improvement to the performances for some data types. The extent of improvement depended on the similarity between news data and external information. A probing into the co-occurrence matrix attributed such a limited performance to the unique features of microblogs. Their short sentence lengths and restricted dictionary made it very hard to recover and exploit latent semantic information via traditional data structure

    ΠžΠΏΡ‚ΠΈΠΌΠΈΠ·Π°Ρ†ΠΈΠΎΠ½Π½Ρ‹ΠΉ ΠΏΠΎΠ΄Ρ…ΠΎΠ΄ ΠΊ Π²Ρ‹Π±ΠΎΡ€Ρƒ ΠΌΠ΅Ρ‚ΠΎΠ΄ΠΎΠ² обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π² ΠΎΠ΄Π½ΠΎΡ€ΠΎΠ΄Π½Ρ‹Ρ… тСкстовых коллСкциях

    Get PDF
    РассматриваСтся Π·Π°Π΄Π°Ρ‡Π° обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΡŒΠ½Ρ‹Ρ… Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² Π² тСкстовых коллСкциях. Π‘ΡƒΡ‰Π΅ΡΡ‚Π²ΡƒΡŽΡ‰ΠΈΠ΅ ΠΌΠ΅Ρ‚ΠΎΠ΄Ρ‹ выявлСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π½Π΅ ΡƒΠ½ΠΈΠ²Π΅Ρ€ΡΠ°Π»ΡŒΠ½Ρ‹ ΠΈ Π½Π΅ ΠΏΠΎΠΊΠ°Π·Ρ‹Π²Π°ΡŽΡ‚ ΡΡ‚Π°Π±ΠΈΠ»ΡŒΠ½Ρ‹ΠΉ Ρ€Π΅Π·ΡƒΠ»ΡŒΡ‚Π°Ρ‚ Π½Π° Ρ€Π°Π·Π½Ρ‹Ρ… Π½Π°Π±ΠΎΡ€Π°Ρ… Π΄Π°Π½Π½Ρ‹Ρ…. Π’ΠΎΡ‡Π½ΠΎΡΡ‚ΡŒ Ρ€Π΅Π·ΡƒΠ»ΡŒΡ‚Π°Ρ‚ΠΎΠ² зависит ΠΎΡ‚ Π²Ρ‹Π±ΠΎΡ€Π° ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€ΠΎΠ² Π½Π° ΠΊΠ°ΠΆΠ΄ΠΎΠΌ ΠΈΠ· шагов Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΠ°, ΠΈ для Ρ€Π°Π·Π½Ρ‹Ρ… ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΉ ΠΎΠΏΡ‚ΠΈΠΌΠ°Π»ΡŒΠ½Ρ‹ Ρ€Π°Π·Π»ΠΈΡ‡Π½Ρ‹Π΅ Π½Π°Π±ΠΎΡ€Ρ‹ ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€ΠΎΠ². НС всС ΠΈΠ· ΡΡƒΡ‰Π΅ΡΡ‚Π²ΡƒΡŽΡ‰ΠΈΡ… Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΠΎΠ² обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ эффСктивно Ρ€Π°Π±ΠΎΡ‚Π°ΡŽΡ‚ с тСкстовыми Π΄Π°Π½Π½Ρ‹ΠΌΠΈ, Π²Π΅ΠΊΡ‚ΠΎΡ€Π½ΠΎΠ΅ прСдставлСниС ΠΊΠΎΡ‚ΠΎΡ€Ρ‹Ρ… характСризуСтся большой Ρ€Π°Π·ΠΌΠ΅Ρ€Π½ΠΎΡΡ‚ΡŒΡŽ ΠΏΡ€ΠΈ сильной разрСТСнности. Π—Π°Π΄Π°Ρ‡Π° поиска Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ рассматриваСтся Π² ΡΠ»Π΅Π΄ΡƒΡŽΡ‰Π΅ΠΉ постановкС: трСбуСтся ΠΏΡ€ΠΎΠ²Π΅Ρ€ΠΈΡ‚ΡŒ Π½ΠΎΠ²Ρ‹ΠΉ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚, Π·Π°Π³Ρ€ΡƒΠΆΠ°Π΅ΠΌΡ‹ΠΉ Π² ΠΏΡ€ΠΈΠΊΠ»Π°Π΄Π½ΡƒΡŽ ΠΈΠ½Ρ‚Π΅Π»Π»Π΅ΠΊΡ‚ΡƒΠ°Π»ΡŒΠ½ΡƒΡŽ ΠΈΠ½Ρ„ΠΎΡ€ΠΌΠ°Ρ†ΠΈΠΎΠ½Π½ΡƒΡŽ систСму (ПИИБ), Π½Π° соотвСтствиС хранящСйся Π² Π½Π΅ΠΉ ΠΎΠ΄Π½ΠΎΡ€ΠΎΠ΄Π½ΠΎΠΉ ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΈ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ². Π’ ПИИБ, ΠΎΠ±Ρ€Π°Π±Π°Ρ‚Ρ‹Π²Π°ΡŽΡ‰ΠΈΡ… ΡŽΡ€ΠΈΠ΄ΠΈΡ‡Π΅ΡΠΊΠΈ Π·Π½Π°Ρ‡ΠΈΠΌΡ‹Π΅ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚Ρ‹, Π½Π° ΠΌΠ΅Ρ‚ΠΎΠ΄Ρ‹ обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π½Π°ΠΊΠ»Π°Π΄Ρ‹Π²Π°ΡŽΡ‚ΡΡ ΡΠ»Π΅Π΄ΡƒΡŽΡ‰ΠΈΠ΅ ограничСния: высокая Ρ‚ΠΎΡ‡Π½ΠΎΡΡ‚ΡŒ, Π²Ρ‹Ρ‡ΠΈΡΠ»ΠΈΡ‚Π΅Π»ΡŒΠ½Π°Ρ ΡΡ„Ρ„Π΅ΠΊΡ‚ΠΈΠ²Π½ΠΎΡΡ‚ΡŒ, Π²ΠΎΡΠΏΡ€ΠΎΠΈΠ·Π²ΠΎΠ΄ΠΈΠΌΠΎΡΡ‚ΡŒ Ρ€Π΅Π·ΡƒΠ»ΡŒΡ‚Π°Ρ‚ΠΎΠ², Π° Ρ‚Π°ΠΊΠΆΠ΅ ΠΎΠ±ΡŠΡΡΠ½ΠΈΠΌΠΎΡΡ‚ΡŒ Ρ€Π΅ΡˆΠ΅Π½ΠΈΡ. Π˜ΡΡΠ»Π΅Π΄ΡƒΡŽΡ‚ΡΡ ΠΌΠ΅Ρ‚ΠΎΠ΄Ρ‹, ΡƒΠ΄ΠΎΠ²Π»Π΅Ρ‚Π²ΠΎΡ€ΡΡŽΡ‰ΠΈΠ΅ этим условиям. Π’ Ρ€Π°Π±ΠΎΡ‚Π΅ изучаСтся Π²ΠΎΠ·ΠΌΠΎΠΆΠ½ΠΎΡΡ‚ΡŒ ΠΎΡ†Π΅Π½ΠΊΠΈ тСкстовых Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² ΠΏΠΎ шкалС Π°Π½ΠΎΠΌΠ°Π»ΡŒΠ½ΠΎΡΡ‚ΠΈ ΠΏΡƒΡ‚Π΅ΠΌ внСдрСния Π² ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΡŽ Π·Π°Π²Π΅Π΄ΠΎΠΌΠΎ ΠΈΠ½ΠΎΡ€ΠΎΠ΄Π½ΠΎΠ³ΠΎ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚Π°. ΠŸΡ€Π΅Π΄Π»ΠΎΠΆΠ΅Π½Π° стратСгия обнаруТСния Π² Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚Π΅ Π½ΠΎΠ²ΠΈΠ·Π½Ρ‹ ΠΏΠΎ ΠΎΡ‚Π½ΠΎΡˆΠ΅Π½ΠΈΡŽ ΠΊ ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΈ, ΠΏΡ€Π΅Π΄ΠΏΠΎΠ»Π°Π³Π°ΡŽΡ‰Π°Ρ обоснованный ΠΏΠΎΠ΄Π±ΠΎΡ€ ΠΌΠ΅Ρ‚ΠΎΠ΄ΠΎΠ² ΠΈ ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€ΠΎΠ². Показано, ΠΊΠ°ΠΊ Π½Π° Ρ‚ΠΎΡ‡Π½ΠΎΡΡ‚ΡŒ Ρ€Π΅ΡˆΠ΅Π½ΠΈΡ влияСт Π²Ρ‹Π±ΠΎΡ€ Π²Π°Ρ€ΠΈΠ°Π½Ρ‚ΠΎΠ² Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΈΠ·Π°Ρ†ΠΈΠΈ, ΠΏΡ€ΠΈΠ½Ρ†ΠΈΠΏΠΎΠ² Ρ‚ΠΎΠΊΠ΅Π½ΠΈΠ·Π°Ρ†ΠΈΠΈ, ΠΌΠ΅Ρ‚ΠΎΠ΄ΠΎΠ² сниТСния размСрности ΠΈ ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€ΠΎΠ² Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΠΎΠ² поиска Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ. ЭкспСримСнт ΠΏΡ€ΠΎΠ²Π΅Π΄Π΅Π½ Π½Π° Π΄Π²ΡƒΡ… ΠΎΠ΄Π½ΠΎΡ€ΠΎΠ΄Π½Ρ‹Ρ… коллСкциях Π½ΠΎΡ€ΠΌΠ°Ρ‚ΠΈΠ²Π½ΠΎ-тСхничСских Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ²: стандартов Π² ΠΎΡ‚Π½ΠΎΡˆΠ΅Π½ΠΈΠΈ ΠΈΠ½Ρ„ΠΎΡ€ΠΌΠ°Ρ†ΠΈΠΎΠ½Π½Ρ‹Ρ… Ρ‚Π΅Ρ…Π½ΠΎΠ»ΠΎΠ³ΠΈΠΉ ΠΈ Π² сфСрС ΠΆΠ΅Π»Π΅Π·Π½Ρ‹Ρ… Π΄ΠΎΡ€ΠΎΠ³. Использовались ΠΏΠΎΠ΄Ρ…ΠΎΠ΄Ρ‹: вычислСниС индСкса Π°Π½ΠΎΠΌΠ°Π»ΡŒΠ½ΠΎΡΡ‚ΠΈ ΠΊΠ°ΠΊ расстояния Π₯Π΅Π»Π»ΠΈΠ½Π³Π΅Ρ€Π° ΠΌΠ΅ΠΆΠ΄Ρƒ распрСдСлСниями близости Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² ΠΊ Ρ†Π΅Π½Ρ‚Ρ€Ρƒ ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΈ ΠΈ ΠΊ ΠΈΠ½ΠΎΡ€ΠΎΠ΄Π½ΠΎΠΌΡƒ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚Ρƒ; оптимизация Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΠΎΠ² поиска Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π² зависимости ΠΎΡ‚ ΠΌΠ΅Ρ‚ΠΎΠ΄ΠΎΠ² Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΈΠ·Π°Ρ†ΠΈΠΈ ΠΈ сниТСния размСрности. Π’Π΅ΠΊΡ‚ΠΎΡ€Π½ΠΎΠ΅ пространство ΡΡ‚Ρ€ΠΎΠΈΠ»ΠΎΡΡŒ с ΠΏΠΎΠΌΠΎΡ‰ΡŒΡŽ прСобразования TF-IDF ΠΈ тСматичСского модСлирования ARTM. Π’Π΅ΡΡ‚ΠΈΡ€ΠΎΠ²Π°Π»ΠΈΡΡŒ Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΡ‹ Isolation Forest (ΠΈΠ·ΠΎΠ»ΠΈΡ€ΡƒΡŽΡ‰ΠΈΠΉ лСс), Local Outlier Factor (Π»ΠΎΠΊΠ°Π»ΡŒΠ½Ρ‹ΠΉ Ρ„Π°ΠΊΡ‚ΠΎΡ€ выброса), OneClass SVM (Π²Π°Ρ€ΠΈΠ°Π½Ρ‚ ΠΌΠ΅Ρ‚ΠΎΠ΄Π° ΠΎΠΏΠΎΡ€Π½Ρ‹Ρ… Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΎΠ²). ЭкспСримСнт ΠΏΠΎΠ΄Ρ‚Π²Π΅Ρ€Π΄ΠΈΠ» ΡΡ„Ρ„Π΅ΠΊΡ‚ΠΈΠ²Π½ΠΎΡΡ‚ΡŒ ΠΏΡ€Π΅Π΄Π»ΠΎΠΆΠ΅Π½Π½ΠΎΠΉ ΠΎΠΏΡ‚ΠΈΠΌΠΈΠ·Π°Ρ†ΠΈΠΎΠ½Π½ΠΎΠΉ стратСгии для опрСдСлСния подходящСго ΠΌΠ΅Ρ‚ΠΎΠ΄Π° обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ для Π·Π°Π΄Π°Π½Π½ΠΎΠΉ тСкстовой ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΈ. ΠŸΡ€ΠΈ поискС Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΈ Π² Ρ€Π°ΠΌΠΊΠ°Ρ… тСматичСской кластСризации ΡŽΡ€ΠΈΠ΄ΠΈΡ‡Π΅ΡΠΊΠΈ Π·Π½Π°Ρ‡ΠΈΠΌΡ‹Ρ… Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² эффСктивСн ΠΌΠ΅Ρ‚ΠΎΠ΄ ΠΈΠ·ΠΎΠ»ΠΈΡ€ΡƒΡŽΡ‰Π΅Π³ΠΎ лСса. ΠŸΡ€ΠΈ Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΈΠ·Π°Ρ†ΠΈΠΈ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² ΠΏΠΎ TF-IDF цСлСсообразно ΠΏΠΎΠ΄ΠΎΠ±Ρ€Π°Ρ‚ΡŒ ΠΎΠΏΡ‚ΠΈΠΌΠ°Π»ΡŒΠ½Ρ‹Π΅ ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€Ρ‹ словаря ΠΈ ΠΈΡΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Ρ‚ΡŒ ΠΌΠ΅Ρ‚ΠΎΠ΄ ΠΎΠΏΠΎΡ€Π½Ρ‹Ρ… Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΎΠ² с ΡΠΎΠΎΡ‚Π²Π΅Ρ‚ΡΡ‚Π²ΡƒΡŽΡ‰Π΅ΠΉ Ρ„ΡƒΠ½ΠΊΡ†ΠΈΠ΅ΠΉ прСобразования ΠΏΡ€ΠΈΠ·Π½Π°ΠΊΠΎΠ²ΠΎΠ³ΠΎ пространства

    ΠžΠΏΡ‚ΠΈΠΌΠΈΠ·Π°Ρ†ΠΈΠΎΠ½Π½Ρ‹ΠΉ ΠΏΠΎΠ΄Ρ…ΠΎΠ΄ ΠΊ Π²Ρ‹Π±ΠΎΡ€Ρƒ ΠΌΠ΅Ρ‚ΠΎΠ΄ΠΎΠ² обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π² ΠΎΠ΄Π½ΠΎΡ€ΠΎΠ΄Π½Ρ‹Ρ… тСкстовых коллСкциях

    Get PDF
    The problem of detecting anomalous documents in text collections is considered. The existing methods for detecting anomalies are not universal and do not show a stable result on different data sets. The accuracy of the results depends on the choice of parameters at each step of the problem solving algorithm process, and for different collections different sets of parameters are optimal. Not all of the existing algorithms for detecting anomalies work effectively with text data, which vector representation is characterized by high dimensionality with strong sparsity. The problem of finding anomalies is considered in the following statement: it is necessary to checking a new document uploaded to an applied intelligent information system for congruence with a homogeneous collection of documents stored in it. In such systems that process legal documents the following limitations are imposed on the anomaly detection methods: high accuracy, computational efficiency, reproducibility of results and explicability of the solution. Methods satisfying these conditions are investigated. The paper examines the possibility of evaluating text documents on the scale of anomaly by deliberately introducing a foreign document into the collection. A strategy for detecting novelty of the document in relation to the collection is proposed, which assumes a reasonable selection of methods and parameters. It is shown how the accuracy of the solution is affected by the choice of vectorization options, tokenization principles, dimensionality reduction methods and parameters of novelty detection algorithms. The experiment was conducted on two homogeneous collections of documents containing technical norms: standards in the field of information technology and railways. The following approaches were used: calculation of the anomaly index as the Hellinger distance between the distributions of the remoteness of documents to the center of the collection and to the foreign document; optimization of the novelty detection algorithms depending on the methods of vectorization and dimensionality reduction. The vector space was constructed using the TF-IDF transformation and ARTM topic modeling. The following algorithms have been tested: Isolation Forest, Local Outlier Factor and One-Class SVM (based on Support Vector Machine). The experiment confirmed the effectiveness of the proposed optimization strategy for determining the appropriate method for detecting anomalies for a given text collection. When searching for an anomaly in the context of topic clustering of legal documents, the Isolating Forest method is proved to be effective. When vectorizing documents using TF-IDF, it is advisable to choose the optimal dictionary parameters and use the One-Class SVM method with the corresponding feature space transformation function.РассматриваСтся Π·Π°Π΄Π°Ρ‡Π° обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΡŒΠ½Ρ‹Ρ… Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² Π² тСкстовых коллСкциях. Π‘ΡƒΡ‰Π΅ΡΡ‚Π²ΡƒΡŽΡ‰ΠΈΠ΅ ΠΌΠ΅Ρ‚ΠΎΠ΄Ρ‹ выявлСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π½Π΅ ΡƒΠ½ΠΈΠ²Π΅Ρ€ΡΠ°Π»ΡŒΠ½Ρ‹ ΠΈ Π½Π΅ ΠΏΠΎΠΊΠ°Π·Ρ‹Π²Π°ΡŽΡ‚ ΡΡ‚Π°Π±ΠΈΠ»ΡŒΠ½Ρ‹ΠΉ Ρ€Π΅Π·ΡƒΠ»ΡŒΡ‚Π°Ρ‚ Π½Π° Ρ€Π°Π·Π½Ρ‹Ρ… Π½Π°Π±ΠΎΡ€Π°Ρ… Π΄Π°Π½Π½Ρ‹Ρ…. Π’ΠΎΡ‡Π½ΠΎΡΡ‚ΡŒ Ρ€Π΅Π·ΡƒΠ»ΡŒΡ‚Π°Ρ‚ΠΎΠ² зависит ΠΎΡ‚ Π²Ρ‹Π±ΠΎΡ€Π° ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€ΠΎΠ² Π½Π° ΠΊΠ°ΠΆΠ΄ΠΎΠΌ ΠΈΠ· шагов Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΠ°, ΠΈ для Ρ€Π°Π·Π½Ρ‹Ρ… ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΉ ΠΎΠΏΡ‚ΠΈΠΌΠ°Π»ΡŒΠ½Ρ‹ Ρ€Π°Π·Π»ΠΈΡ‡Π½Ρ‹Π΅ Π½Π°Π±ΠΎΡ€Ρ‹ ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€ΠΎΠ². НС всС ΠΈΠ· ΡΡƒΡ‰Π΅ΡΡ‚Π²ΡƒΡŽΡ‰ΠΈΡ… Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΠΎΠ² обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ эффСктивно Ρ€Π°Π±ΠΎΡ‚Π°ΡŽΡ‚ с тСкстовыми Π΄Π°Π½Π½Ρ‹ΠΌΠΈ, Π²Π΅ΠΊΡ‚ΠΎΡ€Π½ΠΎΠ΅ прСдставлСниС ΠΊΠΎΡ‚ΠΎΡ€Ρ‹Ρ… характСризуСтся большой Ρ€Π°Π·ΠΌΠ΅Ρ€Π½ΠΎΡΡ‚ΡŒΡŽ ΠΏΡ€ΠΈ сильной разрСТСнности. Π—Π°Π΄Π°Ρ‡Π° поиска Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ рассматриваСтся Π² ΡΠ»Π΅Π΄ΡƒΡŽΡ‰Π΅ΠΉ постановкС: трСбуСтся ΠΏΡ€ΠΎΠ²Π΅Ρ€ΠΈΡ‚ΡŒ Π½ΠΎΠ²Ρ‹ΠΉ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚, Π·Π°Π³Ρ€ΡƒΠΆΠ°Π΅ΠΌΡ‹ΠΉ Π² ΠΏΡ€ΠΈΠΊΠ»Π°Π΄Π½ΡƒΡŽ ΠΈΠ½Ρ‚Π΅Π»Π»Π΅ΠΊΡ‚ΡƒΠ°Π»ΡŒΠ½ΡƒΡŽ ΠΈΠ½Ρ„ΠΎΡ€ΠΌΠ°Ρ†ΠΈΠΎΠ½Π½ΡƒΡŽ систСму (ПИИБ), Π½Π° соотвСтствиС хранящСйся Π² Π½Π΅ΠΉ ΠΎΠ΄Π½ΠΎΡ€ΠΎΠ΄Π½ΠΎΠΉ ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΈ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ². Π’ ПИИБ, ΠΎΠ±Ρ€Π°Π±Π°Ρ‚Ρ‹Π²Π°ΡŽΡ‰ΠΈΡ… ΡŽΡ€ΠΈΠ΄ΠΈΡ‡Π΅ΡΠΊΠΈ Π·Π½Π°Ρ‡ΠΈΠΌΡ‹Π΅ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚Ρ‹, Π½Π° ΠΌΠ΅Ρ‚ΠΎΠ΄Ρ‹ обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π½Π°ΠΊΠ»Π°Π΄Ρ‹Π²Π°ΡŽΡ‚ΡΡ ΡΠ»Π΅Π΄ΡƒΡŽΡ‰ΠΈΠ΅ ограничСния: высокая Ρ‚ΠΎΡ‡Π½ΠΎΡΡ‚ΡŒ, Π²Ρ‹Ρ‡ΠΈΡΠ»ΠΈΡ‚Π΅Π»ΡŒΠ½Π°Ρ ΡΡ„Ρ„Π΅ΠΊΡ‚ΠΈΠ²Π½ΠΎΡΡ‚ΡŒ, Π²ΠΎΡΠΏΡ€ΠΎΠΈΠ·Π²ΠΎΠ΄ΠΈΠΌΠΎΡΡ‚ΡŒ Ρ€Π΅Π·ΡƒΠ»ΡŒΡ‚Π°Ρ‚ΠΎΠ², Π° Ρ‚Π°ΠΊΠΆΠ΅ ΠΎΠ±ΡŠΡΡΠ½ΠΈΠΌΠΎΡΡ‚ΡŒ Ρ€Π΅ΡˆΠ΅Π½ΠΈΡ. Π˜ΡΡΠ»Π΅Π΄ΡƒΡŽΡ‚ΡΡ ΠΌΠ΅Ρ‚ΠΎΠ΄Ρ‹, ΡƒΠ΄ΠΎΠ²Π»Π΅Ρ‚Π²ΠΎΡ€ΡΡŽΡ‰ΠΈΠ΅ этим условиям. Π’ Ρ€Π°Π±ΠΎΡ‚Π΅ изучаСтся Π²ΠΎΠ·ΠΌΠΎΠΆΠ½ΠΎΡΡ‚ΡŒ ΠΎΡ†Π΅Π½ΠΊΠΈ тСкстовых Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² ΠΏΠΎ шкалС Π°Π½ΠΎΠΌΠ°Π»ΡŒΠ½ΠΎΡΡ‚ΠΈ ΠΏΡƒΡ‚Π΅ΠΌ внСдрСния Π² ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΡŽ Π·Π°Π²Π΅Π΄ΠΎΠΌΠΎ ΠΈΠ½ΠΎΡ€ΠΎΠ΄Π½ΠΎΠ³ΠΎ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚Π°. ΠŸΡ€Π΅Π΄Π»ΠΎΠΆΠ΅Π½Π° стратСгия обнаруТСния Π² Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚Π΅ Π½ΠΎΠ²ΠΈΠ·Π½Ρ‹ ΠΏΠΎ ΠΎΡ‚Π½ΠΎΡˆΠ΅Π½ΠΈΡŽ ΠΊ ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΈ, ΠΏΡ€Π΅Π΄ΠΏΠΎΠ»Π°Π³Π°ΡŽΡ‰Π°Ρ обоснованный ΠΏΠΎΠ΄Π±ΠΎΡ€ ΠΌΠ΅Ρ‚ΠΎΠ΄ΠΎΠ² ΠΈ ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€ΠΎΠ². Показано, ΠΊΠ°ΠΊ Π½Π° Ρ‚ΠΎΡ‡Π½ΠΎΡΡ‚ΡŒ Ρ€Π΅ΡˆΠ΅Π½ΠΈΡ влияСт Π²Ρ‹Π±ΠΎΡ€ Π²Π°Ρ€ΠΈΠ°Π½Ρ‚ΠΎΠ² Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΈΠ·Π°Ρ†ΠΈΠΈ, ΠΏΡ€ΠΈΠ½Ρ†ΠΈΠΏΠΎΠ² Ρ‚ΠΎΠΊΠ΅Π½ΠΈΠ·Π°Ρ†ΠΈΠΈ, ΠΌΠ΅Ρ‚ΠΎΠ΄ΠΎΠ² сниТСния размСрности ΠΈ ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€ΠΎΠ² Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΠΎΠ² поиска Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ. ЭкспСримСнт ΠΏΡ€ΠΎΠ²Π΅Π΄Π΅Π½ Π½Π° Π΄Π²ΡƒΡ… ΠΎΠ΄Π½ΠΎΡ€ΠΎΠ΄Π½Ρ‹Ρ… коллСкциях Π½ΠΎΡ€ΠΌΠ°Ρ‚ΠΈΠ²Π½ΠΎ-тСхничСских Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ²: стандартов Π² ΠΎΡ‚Π½ΠΎΡˆΠ΅Π½ΠΈΠΈ ΠΈΠ½Ρ„ΠΎΡ€ΠΌΠ°Ρ†ΠΈΠΎΠ½Π½Ρ‹Ρ… Ρ‚Π΅Ρ…Π½ΠΎΠ»ΠΎΠ³ΠΈΠΉ ΠΈ Π² сфСрС ΠΆΠ΅Π»Π΅Π·Π½Ρ‹Ρ… Π΄ΠΎΡ€ΠΎΠ³. Использовались ΠΏΠΎΠ΄Ρ…ΠΎΠ΄Ρ‹: вычислСниС индСкса Π°Π½ΠΎΠΌΠ°Π»ΡŒΠ½ΠΎΡΡ‚ΠΈ ΠΊΠ°ΠΊ расстояния Π₯Π΅Π»Π»ΠΈΠ½Π³Π΅Ρ€Π° ΠΌΠ΅ΠΆΠ΄Ρƒ распрСдСлСниями близости Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² ΠΊ Ρ†Π΅Π½Ρ‚Ρ€Ρƒ ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΈ ΠΈ ΠΊ ΠΈΠ½ΠΎΡ€ΠΎΠ΄Π½ΠΎΠΌΡƒ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚Ρƒ; оптимизация Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΠΎΠ² поиска Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ Π² зависимости ΠΎΡ‚ ΠΌΠ΅Ρ‚ΠΎΠ΄ΠΎΠ² Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΈΠ·Π°Ρ†ΠΈΠΈ ΠΈ сниТСния размСрности. Π’Π΅ΠΊΡ‚ΠΎΡ€Π½ΠΎΠ΅ пространство ΡΡ‚Ρ€ΠΎΠΈΠ»ΠΎΡΡŒ с ΠΏΠΎΠΌΠΎΡ‰ΡŒΡŽ прСобразования TF-IDF ΠΈ тСматичСского модСлирования ARTM. Π’Π΅ΡΡ‚ΠΈΡ€ΠΎΠ²Π°Π»ΠΈΡΡŒ Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΡ‹ Isolation Forest (ΠΈΠ·ΠΎΠ»ΠΈΡ€ΡƒΡŽΡ‰ΠΈΠΉ лСс), Local Outlier Factor (Π»ΠΎΠΊΠ°Π»ΡŒΠ½Ρ‹ΠΉ Ρ„Π°ΠΊΡ‚ΠΎΡ€ выброса), OneClass SVM (Π²Π°Ρ€ΠΈΠ°Π½Ρ‚ ΠΌΠ΅Ρ‚ΠΎΠ΄Π° ΠΎΠΏΠΎΡ€Π½Ρ‹Ρ… Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΎΠ²). ЭкспСримСнт ΠΏΠΎΠ΄Ρ‚Π²Π΅Ρ€Π΄ΠΈΠ» ΡΡ„Ρ„Π΅ΠΊΡ‚ΠΈΠ²Π½ΠΎΡΡ‚ΡŒ ΠΏΡ€Π΅Π΄Π»ΠΎΠΆΠ΅Π½Π½ΠΎΠΉ ΠΎΠΏΡ‚ΠΈΠΌΠΈΠ·Π°Ρ†ΠΈΠΎΠ½Π½ΠΎΠΉ стратСгии для опрСдСлСния подходящСго ΠΌΠ΅Ρ‚ΠΎΠ΄Π° обнаруТСния Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΉ для Π·Π°Π΄Π°Π½Π½ΠΎΠΉ тСкстовой ΠΊΠΎΠ»Π»Π΅ΠΊΡ†ΠΈΠΈ. ΠŸΡ€ΠΈ поискС Π°Π½ΠΎΠΌΠ°Π»ΠΈΠΈ Π² Ρ€Π°ΠΌΠΊΠ°Ρ… тСматичСской кластСризации ΡŽΡ€ΠΈΠ΄ΠΈΡ‡Π΅ΡΠΊΠΈ Π·Π½Π°Ρ‡ΠΈΠΌΡ‹Ρ… Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² эффСктивСн ΠΌΠ΅Ρ‚ΠΎΠ΄ ΠΈΠ·ΠΎΠ»ΠΈΡ€ΡƒΡŽΡ‰Π΅Π³ΠΎ лСса. ΠŸΡ€ΠΈ Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΈΠ·Π°Ρ†ΠΈΠΈ Π΄ΠΎΠΊΡƒΠΌΠ΅Π½Ρ‚ΠΎΠ² ΠΏΠΎ TF-IDF цСлСсообразно ΠΏΠΎΠ΄ΠΎΠ±Ρ€Π°Ρ‚ΡŒ ΠΎΠΏΡ‚ΠΈΠΌΠ°Π»ΡŒΠ½Ρ‹Π΅ ΠΏΠ°Ρ€Π°ΠΌΠ΅Ρ‚Ρ€Ρ‹ словаря ΠΈ ΠΈΡΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Ρ‚ΡŒ ΠΌΠ΅Ρ‚ΠΎΠ΄ ΠΎΠΏΠΎΡ€Π½Ρ‹Ρ… Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΎΠ² с ΡΠΎΠΎΡ‚Π²Π΅Ρ‚ΡΡ‚Π²ΡƒΡŽΡ‰Π΅ΠΉ Ρ„ΡƒΠ½ΠΊΡ†ΠΈΠ΅ΠΉ прСобразования ΠΏΡ€ΠΈΠ·Π½Π°ΠΊΠΎΠ²ΠΎΠ³ΠΎ пространства

    Novelty detection in video retrieval: finding new news in TV news stories

    Get PDF
    Novelty detection is defined as the detection of documents that provide "new" or previously unseen information. "New information" in a search result list is defined as the incremental information found in a document based on what the user has already learned from reviewing previous documents in a given ranked list of documents. It is assumed that, as a user views a list of documents, their information need changes or evolves, and their state of knowledge increases as they gain new information from the documents they see. The automatic detection of "novelty" , or newness, as part of an information retrieval system could greatly improve a searcher’s experience by presenting "documents" in order of how much extra information they add to what is already known, instead of how similar they are to a user’s query. This could be particularly useful in applications such as the search of broadcast news and automatic summary generation. There are many different aspects of information management, however, this thesis, presents research into the area of novelty detection within the content based video domain. It explores the benefits of integrating the many multi modal resources associated with video content those of low level feature detection evidences such as colour and edge, automatic concepts detections such as face, commercials, and anchor person, automatic speech recognition transcripts and manually annotated MPEG7 concepts into a novelty detection model. The effectiveness of this novelty detection model is evaluated on a collection of TV new data

    Intelligent Learning Automata-based Strategies Applied to Personalized Service Provisioning in Pervasive Environments

    Get PDF
    Doktorgradsavhandling i informasjons- og kommunikasjonsteknologi, Universitetet i Agder, Grimstad, 201
    corecore