Dagstuhl Annual Report January - December 2011
The International Conference and Research Center for Computer Science is a non-profit organization. Its objective is to promote world-class research in computer science and to host research seminars which enable new ideas to be showcased, problems to be discussed and the course to be set for future development in this field. This report documents the work done to run this informatics center in the business year 2011.
A Novel and Domain-Specific Document Clustering and Topic Aggregation Toolset for a News Organisation
Large collections of documents are becoming increasingly common in the news gathering industry. A review of the literature shows a growing interest in data-driven journalism and, specifically, that the journalism profession needs better tools to understand and develop actionable knowledge from large document sets. On a daily basis, journalists are tasked with searching a diverse range of document sets including news gathering services, emails, freedom of information requests, court records, government reports, press releases and many other types of generally unstructured documents. Document clustering techniques can help address the problem of understanding the ever-expanding quantities of documents available to journalists by finding patterns within documents. These patterns can be used to develop useful and actionable knowledge which can contribute to journalism. News articles in particular are fertile ground for document clustering principles. Term weighting schemes assign importance to terms within a document and are central to the study of document clustering methods. This study contributes a review of the dominant and most commonly used term frequency weighting functions put forward in research, establishes the merits and limitations of each approach, and proposes modifications to develop a news-centric document clustering and topic aggregation approach. Experimentation was conducted on a large unstructured collection of newspaper articles from the Irish Times to establish whether the newly proposed news-centric term weighting and document similarity approach improves document clustering accuracy and topic aggregation capabilities for news articles when compared to the traditional term weighting approach.
Whilst the experimentation shows that the developed approach is promising when compared to the manual document clustering effort undertaken by the three journalist expert users, it also highlights the challenges of natural language processing and document clustering methods in general. The results suggest that a blended approach of complementing automated methods with human-level supervision and guidance may yield the best results.
Challenges in Document Mining (Dagstuhl Seminar 11171)
This report documents the programme and outcomes of the Dagstuhl Seminar 11171 "Challenges in Document Mining". Our starting point was the observation that document mining techniques are often applied in an isolated manner, with the consequence that their potential is still to be fully realised. The goal of the seminar was to analyze this untapped potential. To this end, researchers from the main areas of document mining were invited to present their views, to synthesise an understanding of where and how the latest disciplinary achievements can be combined, and to develop a more integrative view on the state of the art and the prospects for future progress.
Digital Object Identifier: 10.4230/DagRep.1.4.65. Edited in cooperation with Melikka Khosh Niat.