5 research outputs found

    Dagstuhl Annual Report January - December 2011

    The International Conference and Research Center for Computer Science is a non-profit organization. Its objective is to promote world-class research in computer science and to host research seminars which enable new ideas to be showcased, problems to be discussed and the course to be set for future development in this field. This report documents the work done to run this informatics center during the business year 2011.

    A Novel and Domain-Specific Document Clustering and Topic Aggregation Toolset for a News Organisation

    Large collections of documents are becoming increasingly common in the news gathering industry. A review of the literature shows there is a growing interest in data-driven journalism and specifically that the journalism profession needs better tools to understand and develop actionable knowledge from large document sets. On a daily basis, journalists are tasked with searching a diverse range of document sets including news gathering services, emails, freedom of information requests, court records, government reports, press releases and many other types of generally unstructured documents. Document clustering techniques can help address problems of understanding the ever-expanding quantities of documents available to journalists by finding patterns within documents. These patterns can be used to develop useful and actionable knowledge which can contribute to journalism. News articles in particular are fertile ground for document clustering principles. Term weighting schemes assign importance to terms within a document and are central to the study of document clustering methods. This study contributes a review of the dominant and most commonly used term frequency weighting functions put forward in research, establishes the merits and limitations of each approach, and proposes modifications to develop a news-centric document clustering and topic aggregation approach. Experimentation was conducted on a large unstructured collection of newspaper articles from the Irish Times to establish whether the newly proposed news-centric term weighting and document similarity approach improves document clustering accuracy and topic aggregation capabilities for news articles when compared to the traditional term weighting approach.
Whilst the experimentation shows that the developed approach is promising when compared to the manual document clustering effort undertaken by the three journalist expert users, it also highlights the challenges of natural language processing and document clustering methods in general. The results suggest that a blended approach of complementing automated methods with human-level supervision and guidance may yield the best results.

    Challenges in Document Mining (Dagstuhl Seminar 11171)

    This report documents the programme and outcomes of the Dagstuhl Seminar 11171 "Challenges in Document Mining". Our starting point was the observation that document mining techniques are often applied in an isolated manner, with the consequence that their potential is still to be fully realised. The goal of the seminar was to analyze this untapped potential. To this end, researchers from the main areas of document mining were invited to present their views, to synthesise an understanding of where and how the latest disciplinary achievements can be combined, and to develop a more integrative view on the state of the art and the prospects for future progress.

    Digital Object Identifier 10.4230/DagRep.1.4.65. Edited in cooperation with Melikka Khosh Niat. 1 Executive Summary

    This report documents the programme and outcomes of the Dagstuhl Seminar 11171 "Challenges in Document Mining". Our starting point was the observation that document mining techniques are often applied in an isolated manner, with the consequence that their potential is still to be fully realised. The goal of the seminar was to analyze this untapped potential. To this end, researchers from the main areas of document mining were invited to present their views, to synthesise an understanding of where and how the latest disciplinary achievements can be combined, and to develop a more integrative view on the state of the art and the prospects for future progress.