11 research outputs found
The Nature of Novelty Detection
Sentence-level novelty detection aims at removing redundant sentences from a
sentence list: sentences appearing later in the list that carry no new
meaning are eliminated. Aiming at better accuracy in detecting redundancy,
this paper reveals the nature of the novelty detection task currently
overlooked by the Novelty community: Novelty is a combination of the partial
overlap (PO: two sentences share common facts) and complete overlap (CO: the
first sentence covers all the facts of the second) relations. By
formalizing novelty detection as a combination of these two relations between
sentences, new viewpoints on techniques for dealing with Novelty are proposed.
Among the methods discussed, the similarity, overlap, pool, and language
modeling approaches are the most commonly used. Furthermore, a novel approach, the
selected pool method, is provided, which follows immediately from the nature of the task.
Experimental results obtained on all three currently available novelty
datasets show that the selected pool method is significantly better than, or
no worse than, current methods. Knowledge about the nature of the task also
affects the evaluation methodology. We propose new evaluation measures for
Novelty according to the nature of the task, as well as possible directions
for future study.
Comment: This paper points out future directions for novelty detection
research. 37 pages, double-spaced version.
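As a rough illustration of the pool idea behind the CO relation above, the sketch below treats a sentence as redundant when its word set is largely covered by a sentence already kept. The 0.8 threshold and the simple word-overlap measure are illustrative assumptions, not the paper's actual selected pool method.

```python
# Hypothetical sketch: sentence-level redundancy filtering against a pool
# of already-accepted sentences (an approximation of the CO relation).

def word_overlap(candidate: set, accepted: set) -> float:
    """Fraction of the candidate's words already covered by an accepted sentence."""
    if not candidate:
        return 1.0
    return len(candidate & accepted) / len(candidate)

def filter_novel(sentences, threshold=0.8):
    """Keep each sentence only if no earlier kept sentence covers it."""
    kept = []
    pools = []  # word sets of kept sentences
    for s in sentences:
        words = set(s.lower().split())
        if all(word_overlap(words, p) < threshold for p in pools):
            kept.append(s)
            pools.append(words)
    return kept

docs = [
    "the cat sat on the mat",
    "the cat sat on the mat today",   # mostly redundant w.r.t. the first
    "stocks fell sharply on friday",
]
print(filter_novel(docs))  # ['the cat sat on the mat', 'stocks fell sharply on friday']
```

A real system would replace the word-overlap test with one of the similarity, overlap, or language modeling measures the paper compares.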
Novel and topical business news and their impact on stock market activities
We propose an indicator to measure the degree to which a particular news
article is novel, as well as an indicator to measure the degree to which a
particular news item attracts attention from investors. The novelty measure is
obtained by comparing the extent to which a particular news article is similar
to earlier news articles, and an article is regarded as novel if there was no
similar article before it. On the other hand, we say a news item receives a lot
of attention and thus is highly topical if it is simultaneously reported by
many news agencies and read by many investors who receive news from those
agencies. The topicality measure for a news item is obtained by counting the
number of news articles whose content is similar to an original news article
but which are delivered by other news agencies. To check the performance of the
indicators, we empirically examine how these indicators are correlated with
intraday financial market indicators such as the number of transactions and
price volatility. Specifically, we use a dataset consisting of over 90 million
business news articles reported in English and a dataset consisting of
minute-by-minute stock prices on the New York Stock Exchange and the NASDAQ
Stock Market from 2003 to 2014, and show that stock prices and transaction
volumes exhibit a significant response to a news article when it is novel and
topical.
Comment: 8 pages, 6 figures, 2 tables.
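The two indicators can be sketched with bag-of-words cosine similarity. The 0.6 threshold and toy inputs below are assumptions for illustration; the authors' actual similarity computation over 90 million articles is necessarily more involved.

```python
# Illustrative sketch of the novelty and topicality indicators described above.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def novelty(article: str, history: list) -> float:
    """1 minus the highest similarity to any earlier article."""
    vec = Counter(article.split())
    sims = [cosine(vec, Counter(h.split())) for h in history]
    return 1.0 - max(sims, default=0.0)

def topicality(article: str, other_agency_articles: list, threshold=0.6) -> int:
    """Count of similar articles delivered by other agencies."""
    vec = Counter(article.split())
    return sum(cosine(vec, Counter(o.split())) >= threshold
               for o in other_agency_articles)
```

An article with no similar predecessor scores novelty 1.0; one simultaneously mirrored by many agencies scores high topicality.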
Exploring the technical challenges of large-scale lifelogging
Ambiently and automatically maintaining a lifelog is an activity that may help individuals track their lifestyle, learning, health and productivity. In this paper we motivate and discuss the technical challenges of developing real-world lifelogging solutions, based on seven years of experience. The gathering, organisation, retrieval and presentation challenges of large-scale lifelogging are discussed, and we show how this can be achieved and the benefits that may accrue.
Novelty Detection by Latent Semantic Indexing
As a new topic in text mining, novelty detection is a natural extension of information retrieval systems, or search engines. Aiming at refining raw search results by filtering out old news and keeping only the novel messages, it saves modern people from the nightmare of information overload. One of the difficulties in novelty detection is the inherent ambiguity of language, which is the carrier of information. Among the sources of ambiguity, synonymy proves to be a notable factor. To address this issue, previous studies mainly employed WordNet, a lexical database which can be perceived as a thesaurus. Rather than borrowing a dictionary, we propose a statistical approach employing Latent Semantic Indexing (LSI) to learn semantic relationships automatically from external language resources.
To apply LSI, which involves matrix factorization, an immediate problem is that the dataset in novelty detection is dynamic and changes constantly. In imitation of a real-world scenario, texts are ranked in chronological order and examined one by one. Each text is compared only with those that appeared earlier, while later ones remain unknown. As a result, the data matrix starts as a one-row vector representing the first report, and gains a new row at the bottom every time we read a new document. Such a changing dataset makes it hard to employ matrix methods directly. Although LSI has long been acknowledged as an effective text mining method for capturing semantic structure, it has never been used in novelty detection, nor have other statistical treatments. We tried to change this situation by introducing an external text source to build the latent semantic space, onto which the incoming news vectors were projected.
We used the Reuters-21578 dataset and the TREC data as sources of latent semantic information. Topics were divided by year and type in order to take the differences between them into account. Results showed that LSI, though very effective in traditional information retrieval tasks, yielded only a slight improvement in performance for some data types. The extent of improvement depended on the similarity between the news data and the external information. A probe into the co-occurrence matrix attributed this limited performance to the unique features of microblogs: their short sentence lengths and restricted vocabulary make it very hard to recover and exploit latent semantic information via traditional data structures.
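The idea of building the latent space once from an external corpus and projecting each incoming document onto it can be sketched with a truncated SVD. The toy corpus, the two-dimensional latent space, and the 0.9 threshold below are illustrative assumptions standing in for the Reuters-21578/TREC setup.

```python
import numpy as np

# Toy external corpus standing in for Reuters-21578 / TREC (an assumption);
# it supplies the latent semantic space, built once via truncated SVD.
corpus = [
    "stocks rise on strong earnings",
    "shares climb after earnings report",
    "vaccine trial shows promise",
    "clinical trial results released",
]
vocab = sorted({w for d in corpus for w in d.split()})

def bow(doc: str) -> np.ndarray:
    """Bag-of-words vector over the external corpus vocabulary."""
    v = np.zeros(len(vocab))
    for w in doc.split():
        if w in vocab:
            v[vocab.index(w)] += 1.0
    return v

A = np.array([bow(d) for d in corpus])          # documents x terms
_, _, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:2].T                                    # terms x k latent directions

def cos(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

seen = []  # projections of documents examined so far, in arrival order

def is_novel(doc: str, threshold=0.9) -> bool:
    """A document is novel if no earlier document lies too close in LSI space."""
    v = bow(doc) @ Vk
    novel = all(cos(v, s) < threshold for s in seen)
    seen.append(v)
    return novel
```

Because the space is fixed in advance, the growing stream of documents never forces a re-factorization, which is exactly the difficulty with applying SVD to the changing data matrix directly.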
An Optimization Approach to Selecting Anomaly Detection Methods in Homogeneous Text Collections
The problem of detecting anomalous documents in text collections is considered. Existing anomaly detection methods are not universal and do not show stable results across different data sets. The accuracy of the results depends on the choice of parameters at each step of the algorithm, and different collections call for different optimal parameter sets. Not all existing anomaly detection algorithms work effectively with text data, whose vector representation is characterized by high dimensionality and strong sparsity. The problem of finding anomalies is considered in the following setting: a new document uploaded to an applied intelligent information system must be checked for congruence with the homogeneous collection of documents stored in it. In such systems, which process legally significant documents, the following requirements are imposed on anomaly detection methods: high accuracy, computational efficiency, reproducibility of results, and explainability of the solution. Methods satisfying these conditions are investigated. The paper examines the possibility of scoring text documents on an anomaly scale by deliberately introducing a foreign document into the collection. A strategy for detecting the novelty of a document relative to the collection is proposed, which assumes a reasoned selection of methods and parameters. It is shown how the accuracy of the solution is affected by the choice of vectorization options, tokenization principles, dimensionality reduction methods, and parameters of the anomaly detection algorithms. The experiment was conducted on two homogeneous collections of normative-technical documents: standards in the fields of information technology and railways.
The following approaches were used: calculating the anomaly index as the Hellinger distance between the distributions of document proximity to the center of the collection and to the foreign document; and optimizing the anomaly detection algorithms depending on the vectorization and dimensionality reduction methods. The vector space was constructed using the TF-IDF transformation and ARTM topic modeling. The algorithms tested were Isolation Forest, Local Outlier Factor, and One-Class SVM (a variant of the Support Vector Machine). The experiment confirmed the effectiveness of the proposed optimization strategy for determining a suitable anomaly detection method for a given text collection. When searching for anomalies in the context of topic clustering of legally significant documents, the Isolation Forest method proved effective. When vectorizing documents with TF-IDF, it is advisable to tune the dictionary parameters and use the One-Class SVM method with an appropriate feature space transformation function.
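The anomaly index in this work rests on the Hellinger distance between two proximity distributions. A minimal sketch of that distance, with illustrative histograms rather than the paper's data:

```python
import numpy as np

# Hellinger distance between two discrete probability distributions, as used
# for the anomaly index above (here applied to toy histograms of document
# proximities to the collection centre vs. to the injected foreign document).

def hellinger(p, q) -> float:
    """Hellinger distance between discrete distributions p and q (each sums to 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2))

# Identical distributions -> distance 0; disjoint supports -> distance 1.
print(hellinger([0.5, 0.5, 0.0], [0.5, 0.5, 0.0]))  # 0.0
print(hellinger([1.0, 0.0], [0.0, 1.0]))            # 1.0
```

The distance is bounded in [0, 1], which makes it convenient as a normalized anomaly scale.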
Novelty detection in video retrieval: finding new news in TV news stories
Novelty detection is defined as the detection of documents that provide "new" or previously unseen information. "New information" in a search result list is defined as the incremental information found in a document based on what the user has already learned from reviewing previous documents in a given ranked list of documents. It is assumed that, as a user views a list of documents, their information need changes or evolves, and their state of knowledge increases as they gain new information from the documents they see. The automatic detection of "novelty", or newness, as part of an information retrieval system could greatly improve a searcher's experience by presenting "documents" in order of how much extra information they add to what is already known, instead of how similar they are to a user's query. This could be particularly useful in applications such as the search of broadcast news and automatic summary generation.
There are many different aspects of information management; this thesis, however, presents research in the area of novelty detection within the content-based video domain. It explores the benefits of integrating the many multimodal resources associated with video content into a novelty detection model: low-level feature detection evidence such as colour and edge; automatic concept detections such as face, commercials, and anchor person; automatic speech recognition transcripts; and manually annotated MPEG-7 concepts. The effectiveness of this novelty detection model is evaluated on a collection of TV news data.
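One simple way such multimodal evidence could be combined is a weighted linear fusion of per-modality novelty scores. The modality names and weights below are illustrative assumptions, not the thesis's actual fusion model.

```python
# Hypothetical sketch: fuse per-modality novelty scores (each in [0, 1])
# into one score via a weighted average. Modalities and weights are
# illustrative stand-ins for the evidences discussed above.

def fused_novelty(scores: dict, weights: dict) -> float:
    """Weighted average of per-modality novelty scores."""
    total = sum(weights.values())
    return sum(weights[m] * scores.get(m, 0.0) for m in weights) / total

shot_scores = {"colour": 0.7, "edge": 0.6, "asr_text": 0.9, "face": 0.2}
weights     = {"colour": 1.0, "edge": 1.0, "asr_text": 2.0, "face": 0.5}
print(round(fused_novelty(shot_scores, weights), 3))  # 0.711
```

Weighting the text modality more heavily reflects the intuition that ASR transcripts usually carry more of a news story's "newness" than low-level visual features.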
Intelligent Learning Automata-based Strategies Applied to Personalized Service Provisioning in Pervasive Environments
Doctoral dissertation in information and communication technology, Universitetet i Agder, Grimstad, 201