
    Detecting psycho-anomalies on the world-wide web: current tools and challenges

    The rise of social media and the overall progress of technology have unfortunately opened new ways for criminals such as paedophiles, serial killers, and rapists to exploit the power that technology offers in order to lure potential victims. There is a pressing need to detect extreme criminal behaviours on the World Wide Web and to take measures to protect the general public from their effects. The aim of this chapter is to examine the data analysis tools and technologies currently used to detect extreme online criminal behaviour and the challenges associated with their use. Specific emphasis is given to extreme criminal behaviours such as paedophilia and serial killing, as these are considered the most dangerous. A number of conclusions are drawn regarding the use, and the challenges, of technological means to confront such criminal behaviours.

    Proceedings of the Third Dutch-Belgian Information Retrieval Workshop (DIR 2002)


    Document analysis by means of data mining techniques

    The huge amount of textual data produced every day by scientists, journalists, and Web users makes it possible to investigate many aspects of the information stored in published documents. Data mining and information retrieval techniques are exploited to manage and extract information from huge amounts of unstructured textual data. Text mining, also known as text data mining, is the process of extracting high-quality information (focusing on relevance, novelty, and interestingness) from text by identifying patterns. It typically involves structuring the input text by means of parsing and other linguistic analysis, or sometimes by removing extraneous data, and then finding patterns in the structured data; the patterns are finally evaluated and the output interpreted to accomplish the desired task. Recently, text mining has gained attention in several fields, such as security (analysis of Internet news), commerce (search and indexing), and academia (query answering). Beyond retrieving the documents that contain the words of a user query, text mining may provide direct answers to the user through content-based analysis (content meaning and context) on the Semantic Web. It can also support intelligence analysts and is used in some e-mail spam filters to screen out unwanted material. Text mining usually includes tasks such as clustering, categorization, sentiment analysis, entity recognition, entity relation modeling, and document summarization.

    In particular, summarization approaches are suitable for identifying the relevant sentences that describe the main concepts presented in a document collection. Furthermore, the knowledge contained in the most informative sentences can be employed to improve the understanding of user and/or community interests. Different approaches have been proposed to extract summaries from unstructured text documents. Some are based on the statistical analysis of linguistic features by means of supervised machine learning or data mining methods, such as Hidden Markov models, neural networks, and Naive Bayes methods. An appealing research direction is the extraction of summaries tailored to the major user interests; in this context, extracting useful information according to domain knowledge related to those interests is a challenging task. The main topics of this thesis are the study and design of novel data representations and data mining algorithms for managing and extracting knowledge from unstructured documents.

    This thesis describes an effort to apply data mining approaches that are firmly established for transactional data (e.g., frequent itemset mining) to textual documents. Frequent itemset mining is a widely used exploratory technique for discovering hidden correlations that frequently occur in the source data. Although its application to transactional data is well established, the use of frequent itemsets in textual document summarization had not been investigated before. This work exploits frequent itemsets for multi-document summarization: a novel multi-document summarizer, ItemSum (Itemset-based Summarizer), is presented, based on an itemset-based model, i.e., a framework comprising the frequent itemsets extracted from the document collection.
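    The abstract does not give ItemSum's algorithm in detail, so the following is only a minimal sketch of the underlying idea: treating each sentence as a transaction of words and mining the word combinations that recur across sentences. The toy sentences, the support threshold, and the cap on itemset size are all illustrative assumptions.

```python
from itertools import combinations
from collections import Counter

# Each sentence is treated as a transaction of its distinct words.
sentences = [
    "the court approved the merger of the two companies",
    "the merger of the companies was approved on friday",
    "shareholders welcomed the approved merger",
]
transactions = [set(s.split()) for s in sentences]

MIN_SUPPORT = 2  # assumption: "frequent" = occurs in at least 2 sentences
MAX_SIZE = 3     # assumption: mine itemsets only up to size 3, for brevity

# Brute-force support counting over all small word subsets.
counts = Counter()
for t in transactions:
    for k in range(1, MAX_SIZE + 1):
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1

frequent = {s: c for s, c in counts.items() if c >= MIN_SUPPORT}
for itemset, support in sorted(frequent.items(), key=lambda x: -x[1]):
    print(support, itemset)
```

    A real miner would use Apriori-style candidate pruning rather than brute-force enumeration; sentences that cover many itemsets of such a model are then natural candidates for the summary.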
    Highly representative and non-redundant sentences are selected for the summary by jointly considering sentence coverage of the concise and highly informative itemset-based model and a sentence relevance score based on tf-idf statistics. To evaluate ItemSum's performance, a suite of experiments on a collection of news articles has been performed. The results show that ItemSum significantly outperforms widely used previous summarizers in terms of precision, recall, and F-measure. We also validated our approach against a large number of approaches on the DUC'04 document collection; performance comparisons, in terms of precision, recall, and F-measure, have been performed by means of the ROUGE toolkit. In most cases, ItemSum significantly outperforms the considered competitors. Furthermore, the impact of both the main algorithm parameters and the adopted model coverage strategy on summarization performance is investigated as well.

    In some cases, the soundness and readability of the generated summaries are unsatisfactory, because the summaries do not cover all the semantically relevant data facets in an effective way. A step towards the generation of more accurate summaries has been made with semantics-based summarizers. Such approaches combine general-purpose summarization strategies with ad hoc linguistic analysis; the key idea is to also consider the semantics behind the document content, to overcome the limitations of general-purpose strategies in differentiating between sentences based on their actual meaning and context. Most of the previously proposed approaches perform the semantics-based analysis as a preprocessing step that precedes the main summarization process, so the generated summaries may not entirely reflect the actual meaning and context of the key document sentences. In contrast, we aim at tightly integrating ontology-based document analysis into the summarization process, so that the semantic meaning of the document content is taken into account during sentence evaluation and selection. With this in mind, we propose a new multi-document summarizer, the Yago-based Summarizer, which integrates an established ontology-based entity recognition and disambiguation step: Named Entity Recognition based on the Yago ontology is used for the text summarization task. The Named Entity Recognition (NER) task is concerned with marking occurrences of specific objects being mentioned and classifying these mentions into a set of predefined categories; standard categories include "person", "location", "geo-political organization", "facility", "organization", and "time". The use of NER improves the summarization process by increasing the rank of informative sentences. To demonstrate the effectiveness of the proposed approach, we compared its performance on the DUC'04 benchmark document collections with that of a large number of state-of-the-art summarizers, and we also performed a qualitative evaluation of the soundness and readability of the generated summaries against the results produced by the most effective competitors.
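    The exact scoring formula is not given in the abstract, so the sketch below only illustrates the general recipe described here: rank sentences by a tf-idf relevance score and raise the rank of sentences that mention recognized named entities. The hard-coded entity set stands in for an ontology-backed NER step, and the boost weight is invented.

```python
import math
from collections import Counter

sentences = [
    "Barack Obama visited Berlin last week",
    "the weather was unusually warm",
    "Obama met German officials in Berlin",
]
# Stand-in for an ontology-backed NER step (e.g., entities resolved in Yago);
# here the recognized entities are simply hard-coded.
entities = {"barack", "obama", "berlin"}
ENTITY_BOOST = 0.5  # invented weight added per entity mention

docs = [s.lower().split() for s in sentences]
n = len(docs)
df = Counter(term for d in docs for term in set(d))  # document frequencies

def score(doc):
    tf = Counter(doc)
    tfidf = sum((c / len(doc)) * math.log(n / df[t]) for t, c in tf.items())
    return tfidf + ENTITY_BOOST * sum(1 for t in doc if t in entities)

# Entity-bearing sentences float to the top of the ranking.
for s, d in sorted(zip(sentences, docs), key=lambda p: -score(p[1])):
    print(round(score(d), 2), s)
```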
    A parallel effort has been devoted to integrating semantics-based models and the knowledge acquired from social networks into a document summarization model named SociONewSum. This effort addresses the sentence-based generic multi-document summarization problem, which can be formulated as follows: given a collection of news articles on the same topic, extract a concise yet informative summary consisting of the most salient document sentences. An established ontological model is used to improve summarization performance by integrating a textual entity recognition and disambiguation step. Furthermore, the analysis of user-generated content (UGC) coming from Twitter is exploited to discover current social trends and improve the appeal of the generated summaries. An experimental evaluation of SociONewSum was conducted on real English-written news article collections and Twitter posts. The achieved results demonstrate the effectiveness of the proposed summarizer, in terms of different ROUGE scores, compared to state-of-the-art open-source summarizers as well as to a baseline version of SociONewSum that does not perform any UGC analysis. Furthermore, the readability of the generated summaries has also been analyzed.
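    Since ROUGE is the evaluation yardstick used throughout, a minimal sketch of ROUGE-1 recall (unigram overlap with clipped counts) may help; the actual ROUGE toolkit adds stemming, multiple references, and further n-gram and longest-common-subsequence variants.

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams covered by the candidate (clipped counts)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], c) for w, c in ref.items())
    return overlap / max(1, sum(ref.values()))

print(rouge1_recall("the merger was approved",
                    "the court approved the merger"))  # 0.6
```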

    Multimedia Retrieval


    Techniques for text classification: Literature review and current trends

    Automated classification of text into predefined categories has always been considered a vital method for managing and processing the vast number of documents in digital form that are widespread and continuously increasing. This kind of web information, popularly known as digital/electronic information, takes the form of documents, conference material, publications, journals, editorials, web pages, e-mail, etc. People largely access information from these online sources rather than being limited to archaic paper sources such as books, magazines, and newspapers. The main problem is that this enormous body of information lacks organization, which makes it difficult to manage. Text classification is recognized as one of the key techniques for organizing such digital data. In this paper we study the existing work in the area of text classification, which allows a fair evaluation of the progress made in this field to date. We have investigated the papers to the best of our knowledge and have tried to summarize all existing information in a comprehensive and succinct manner. The studies are summarized in tabular form by publication year, considering numerous key perspectives. The main emphasis is laid on the various steps involved in the text classification process, viz. document representation methods, feature selection methods, data mining methods, and the evaluation technique used by each study to obtain results on a particular dataset.
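    To make the surveyed pipeline concrete, here is a minimal end-to-end sketch of the steps the review emphasizes: document representation, feature selection, and a learning method, applied at the end to unseen text. The toy corpus and every parameter choice are illustrative only and not drawn from any of the surveyed studies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy labelled corpus (invented for illustration).
texts = ["stock markets fell sharply", "the team won the final match",
         "shares rallied after earnings", "the striker scored twice"]
labels = ["finance", "sports", "finance", "sports"]

clf = Pipeline([
    ("represent", TfidfVectorizer()),    # document representation
    ("select", SelectKBest(chi2, k=5)),  # feature selection
    ("learn", MultinomialNB()),          # data mining / learning method
])
clf.fit(texts, labels)

# Apply the trained pipeline to an unseen document.
print(clf.predict(["the striker saved the match"]))
```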

    Adaptive Visualization for Focused Personalized Information Retrieval

    The new trend on the Web has totally changed today's information access environment. The traditional information overload problem has evolved beyond quantitative growth to a qualitative level. The mode of producing and consuming information is changing, and we need a new paradigm for accessing information. Personalized search is one of the most promising answers to this problem. However, it still follows the old interaction model and representation method of classic information retrieval approaches. This limitation can harm the potential of personalized search, in which users are meant to interact with the system, learn and investigate the problem, and collaborate with the system to reach the final goal.

    This dissertation proposes to incorporate interactive visualization into personalized search in order to overcome this limitation. By combining personalized search and interactive visualization, we expect our approach to help users better explore the information space and locate relevant information more efficiently. We extended a well-known visualization framework called VIBE (Visual Information Browsing Environment) and implemented Adaptive VIBE, so that it fits the personalized search environment. We tested the effectiveness of this adaptive visualization method and investigated its strengths and weaknesses by conducting a full-scale user study. We also tried to enrich the user models with named entities, considering the possibility that traditional keyword-based user models could harm the effectiveness of the system in the context of interactive information retrieval.

    The results of the user study showed that Adaptive VIBE can improve the precision of the personalized search system and help users find a more diverse set of information. The named-entity-based user model integrated into Adaptive VIBE improved the precision of user annotations while maintaining the level of diverse discovery of information.
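    VIBE-style displays place each document at a position determined by its relative similarity to points of interest (POIs) pinned on the canvas. The sketch below shows only that ratio-based placement idea, with invented POI coordinates and similarity scores; Adaptive VIBE additionally derives POIs from the user model, which is not modeled here.

```python
import numpy as np

# Invented POI layout: query / user-model terms pinned on a 2-D canvas.
pois = {"python": np.array([0.0, 0.0]),
        "security": np.array([1.0, 0.0]),
        "tutorial": np.array([0.5, 1.0])}

# Invented document-to-POI similarities (tf-idf weights in a real system).
doc_sims = {"doc1": {"python": 0.9, "security": 0.1, "tutorial": 0.4},
            "doc2": {"python": 0.2, "security": 0.8, "tutorial": 0.2}}

def place(sims):
    """Position = similarity-weighted average of the POI positions."""
    total = sum(sims.values())
    return sum(w * pois[p] for p, w in sims.items()) / total

for doc, sims in doc_sims.items():
    print(doc, place(sims).round(2))  # documents drift toward their strongest POIs
```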

    Exploiting Class Label Frequencies for Text Classification

    Document classification is an example of Machine Learning (ML) applied to Natural Language Processing (NLP). By classifying text, we aim to assign one or more classes or categories to a document, making it easier to manage and sort. In the vast majority of document classification techniques, a document is represented as a bag of words consisting of all the individual terms making up the document together with the number of times each term appears in it. These term occurrence counts are known as local term frequencies, and it is very common to make use of them at the price of some added information in the classification model. In this work, we extend our previous work on medical article classification [1,2] by simplifying the weighting scheme in the ranking process, using class label frequencies to devise a simple weighting formula inspired by the traditional information retrieval task. We also evaluate the proposed approach using more experimental research data. The method we propose here, called CLF KNN, first uses a lexical approach to identify term frequencies in the document texts and then couples this information with the class label information in the corpus to devise a weighted ranking scheme for the classification decision process. Evaluation experiments on two collections, the Ohsumed collection of medical documents and the 20 Newsgroups message collection, show that the proposed method significantly outperforms traditional KNN classification.
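    The abstract does not spell out the CLF KNN weighting formula, so the sketch below is only a guess at the general shape of such a scheme: a cosine-similarity KNN whose votes are scaled by an idf-like function of each neighbor's class label frequency. Every constant and formula here is an assumption for illustration, not the published method.

```python
from collections import Counter
import math

# Toy training corpus: (bag-of-words, class label); invented for illustration.
train = [
    (Counter("heart attack symptoms".split()), "cardiology"),
    (Counter("heart surgery recovery".split()), "cardiology"),
    (Counter("skin rash treatment".split()), "dermatology"),
]
label_freq = Counter(lbl for _, lbl in train)  # class label frequencies

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def clf_knn(doc, k=3):
    # Assumed weighting: scale each neighbor's vote by log(1 + N / class freq),
    # an idf-like correction for skewed class distributions.
    n = len(train)
    neighbors = sorted(train, key=lambda ex: -cosine(doc, ex[0]))[:k]
    votes = Counter()
    for bow, lbl in neighbors:
        votes[lbl] += cosine(doc, bow) * math.log(1 + n / label_freq[lbl])
    return votes.most_common(1)[0][0]

print(clf_knn(Counter("heart symptoms".split())))  # -> cardiology (toy data)
```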