Scaling real-time event detection to massive streams

Abstract

In today’s world, the internet and social media are omnipresent and information is accessible to everyone. This has shifted the advantage from those who have access to information to those who access it first. Identifying new events as they emerge is of substantial value to financial institutions that consider real-time information in their decision-making processes, as well as to journalists who report on breaking news and government agencies that collect information and respond to emergencies. First Story Detection is the task of identifying, in a stream of documents, those documents that are the first to report a new event. This seemingly simple task is non-trivial because the computational effort increases with every processed document. Standard approaches to First Story Detection determine a document’s novelty by comparing it to previously seen documents. This yields the highest reported accuracy, but even the fastest current system scales to only 10% of the Twitter stream. In this thesis, we propose a new family of algorithms, called memory-based methods, that scales to the full Twitter stream on a single core. Our memory-based method computes a document’s novelty up to two orders of magnitude faster than state-of-the-art systems without sacrificing accuracy. This thesis additionally provides original work on how processing unbounded data streams affects detection accuracy. Our experiments reveal for the first time that the novelty scores of state-of-the-art comparison-based and memory-based methods decay over time. We show how to counteract this novelty decay and increase detection accuracy. Finally, we show that memory-based methods are applicable beyond First Story Detection by building the first real-time rumour detection system for social media streams.
