Large text streams are commonplace; news organisations are constantly producing stories
and people are constantly writing social media posts. These streams should be
analysed in real time so that useful information can be extracted and acted upon instantly.
When natural disasters occur, people want to be informed; when companies announce
new products, financial institutions want to know; and when celebrities do things, their
legions of fans want to feel involved. In all these examples, people care about getting
information in real time (with low latency).
These streams are massively varied, and people’s interests are typically defined by the
entities they care about. Organising a stream by the entity being referred to would
help people extract the information useful to them. This is a difficult task: fans of ‘Captain
America’ films will not want to be incorrectly told that ‘Chris Evans’ (the main
actor) was appointed to host ‘Top Gear’ when it was a different ‘Chris Evans’. People
who use local idiosyncrasies such as referring to their home county (‘Cornwall’) as
‘Kernow’ (the Cornish word for ‘Cornwall’, which has entered the local lexicon) should not be
forced to change their language when finding out information about their home.
This thesis addresses a core problem for real-time entity-specific NLP: streaming
cross-document coreference resolution (CDC), the task of automatically identifying all the
entities mentioned in a stream in real time.
This thesis addresses two significant problems for streaming CDC: there is no representative
dataset, and existing systems consume more resources over time. A new
technique for creating datasets is introduced and applied to social media (Twitter)
to create a large (6M mentions) and challenging new CDC dataset that contains a much
more varied range of entities than typical newswire streams. Existing systems are not
able to keep up with large data streams. This problem is addressed with a streaming
CDC system that stores a constant-sized set of mentions. New techniques for maintaining
this sample are introduced; they significantly outperform existing ones, maintaining 95%
of the performance of a non-streaming system while using only 20% of the memory.