On enhancing the robustness of timeline summarization test collections
Timeline generation systems are a class of algorithms that produce a sequence of time-ordered sentences or text snippets extracted in real-time from high-volume streams of digital documents (e.g. news articles), focusing on retaining relevant and informative content for a particular information need (e.g. topic or event). These systems have a range of uses, such as producing concise overviews of events for end-users (human or artificial agents). To advance the field of automatic timeline generation, robust and reproducible evaluation methodologies are needed. To this end, several evaluation metrics and labeling methodologies have recently been developed, focusing on information nugget or cluster-based ground truth representations, respectively. These methodologies rely on human assessors manually mapping timeline items (e.g. sentences) to an explicit representation of what information a ‘good’ summary should contain. However, while these evaluation methodologies produce reusable ground truth labels, prior works have reported cases where such evaluations fail to accurately estimate the performance of new timeline generation systems due to label incompleteness. In this paper, we first quantify the extent to which timeline summarization test collections fail to generalize to new summarization systems, then we propose, evaluate and analyze new automatic solutions to this issue. In particular, using a depooling methodology over 19 systems and across three high-volume datasets, we quantify the degree of system ranking error caused by excluding those systems when labeling. We show that when lower-effectiveness systems are held out, the test collections are robust (the likelihood of systems being mis-ranked is low). However, we show that the risk of systems being mis-ranked increases as the effectiveness of the systems held out from the pool increases.
To reduce the risk of mis-ranking systems, we also propose a range of automatic ground truth label expansion techniques. Our results show that the proposed expansion techniques can be effective at increasing the robustness of the TREC-TS test collections, as they are able to generate large numbers of missing matches with high accuracy, reducing the number of mis-rankings by up to 50%.
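The depooling idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the toy data, the nugget-coverage score, and the names `score`, `depool` and `misrankings` are all assumptions made for illustration. The sketch holds one system out of the labeling pool, drops the labels that only that system contributed, re-scores every system, and counts how many pairwise orderings involving the held-out system flip.

```python
# Toy depooling sketch. All data and function names are hypothetical.

# Each system (run) returns a set of timeline item ids.
RUNS = {
    "A": {"s1", "s2", "s3"},
    "B": {"s1", "s4"},
    "C": {"s2", "s6", "s7", "s8"},
}

# Assessor labels: timeline item id -> information nugget it matches.
LABELS = {
    "s1": "n1", "s2": "n2", "s3": "n3", "s4": "n4",
    "s6": "n5", "s7": "n6", "s8": "n4",
}

N_NUGGETS = 6  # total nuggets known for the topic (assumed fixed)


def score(items, labels, n_nuggets):
    """Fraction of nuggets covered by a system's labeled items (toy metric)."""
    found = {labels[i] for i in items if i in labels}
    return len(found) / n_nuggets


def depool(runs, labels, held_out):
    """Keep only labels for items returned by at least one other system."""
    others = set().union(
        *(items for name, items in runs.items() if name != held_out)
    )
    return {item: nugget for item, nugget in labels.items() if item in others}


def misrankings(runs, labels, n_nuggets, held_out):
    """Count pairwise order flips between the held-out system and the rest."""
    full = {s: score(items, labels, n_nuggets) for s, items in runs.items()}
    reduced_labels = depool(runs, labels, held_out)
    reduced = {
        s: score(items, reduced_labels, n_nuggets) for s, items in runs.items()
    }
    swaps = 0
    for other in runs:
        if other == held_out:
            continue
        before = full[held_out] - full[other]
        after = reduced[held_out] - reduced[other]
        if before * after < 0:  # sign change means the pair swapped order
            swaps += 1
    return swaps


for name in RUNS:
    print(name, misrankings(RUNS, LABELS, N_NUGGETS, name))
```

In this toy data, holding out the strongest system (C) flips its ordering against both other systems, while holding out the weakest (B) flips none, which mirrors the trend the abstract reports.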
Leveraging digital forensics and data exploration to understand the creative work of a filmmaker: a case study of Stephen Dwoskin’s digital archive
This paper aims to establish digital forensics and data exploration as a methodology for supporting archival practice and research into a filmmaker's creative processes. We approach this by exploring the legacy hard drives of the late artist Stephen Dwoskin (1939-2012), who is recognised as an influential filmmaker at the forefront of the shift from analogue to digital film production. The findings of this case study show that digital forensics is effective in extracting a timeline of hard drive activities, data that can be explored to reveal clues about the artist’s personal and professional history, the stages of his creative processes, and his technical environment. The paper further demonstrates how this relates to current thinking around user-centred archival workflows and the understanding of creative processes. The broader impact of the work for advancing digital archiving and research into creative processes is highlighted, concluding with a discussion of how, going forward, the approach can be coupled with deeper content analysis to reveal what influences editing choices over time.