39,938 research outputs found

    Exploratory Analysis of Highly Heterogeneous Document Collections

    Full text link
    We present an effective multifaceted system for exploratory analysis of highly heterogeneous document collections. Our system is based on intelligently tagging individual documents in a purely automated fashion and exploiting these tags in a powerful faceted browsing framework. Tagging strategies employed include both unsupervised and supervised approaches based on machine learning and natural language processing. As one of our key tagging strategies, we introduce the KERA algorithm (Keyword Extraction for Reports and Articles). KERA extracts topic-representative terms from individual documents in a purely unsupervised fashion and is revealed to be significantly more effective than state-of-the-art methods. Finally, we evaluate our system in its ability to help users locate documents pertaining to military critical technologies buried deep in a large heterogeneous sea of information.Comment: 9 pages; KDD 2013: 19th ACM SIGKDD Conference on Knowledge Discovery and Data Minin

    An iterative approach for lexicon characterization in juridical context

    Get PDF
    In the juridical context, knowledge management applications have a central role. In order to improve the effectiveness of document management procedures, techniques for automatic comprehension of textual content are required. In this work, a methodology for semi-automatic derivation of knowledge from document collections is proposed. In order to extract relevant information from document text, a process integrating both statistical and lexical approaches is applied. Moreover, we propose a system for the evaluation of the extracted peculiar lexicon quality. The system is used for the processing of heterogeneous documents corpus issued by Italy’s juridical domain

    Living Knowledge

    Get PDF
    Diversity, especially manifested in language and knowledge, is a function of local goals, needs, competences, beliefs, culture, opinions and personal experience. The Living Knowledge project considers diversity as an asset rather than a problem. With the project, foundational ideas emerged from the synergic contribution of different disciplines, methodologies (with which many partners were previously unfamiliar) and technologies flowed in concrete diversity-aware applications such as the Future Predictor and the Media Content Analyser providing users with better structured information while coping with Web scale complexities. The key notions of diversity, fact, opinion and bias have been defined in relation to three methodologies: Media Content Analysis (MCA) which operates from a social sciences perspective; Multimodal Genre Analysis (MGA) which operates from a semiotic perspective and Facet Analysis (FA) which operates from a knowledge representation and organization perspective. A conceptual architecture that pulls all of them together has become the core of the tools for automatic extraction and the way they interact. In particular, the conceptual architecture has been implemented with the Media Content Analyser application. The scientific and technological results obtained are described in the following

    TRECVID: benchmarking the effectiveness of information retrieval tasks on digital video

    Get PDF
    Many research groups worldwide are now investigating techniques which can support information retrieval on archives of digital video and as groups move on to implement these techniques they inevitably try to evaluate the performance of their techniques in practical situations. The difficulty with doing this is that there is no test collection or any environment in which the effectiveness of video IR or video IR sub-tasks, can be evaluated and compared. The annual series of TREC exercises has, for over a decade, been benchmarking the effectiveness of systems in carrying out various information retrieval tasks on text and audio and has contributed to a huge improvement in many of these. Two years ago, a track was introduced which covers shot boundary detection, feature extraction and searching through archives of digital video. In this paper we present a summary of the activities in the TREC Video track in 2002 where 17 teams from across the world took part
    corecore