PhD Thesis

Traditional methods for storing and analysing data are proving inadequate for processing
"Big Data". This is due to its volume and the rate at which it is being generated.
The limitations of current technologies are further exacerbated by the increased demand
for applications which allow users to access and interact with data as soon as
it is generated. Near real-time analysis such as this can be partially supported by
stream processing systems; however, they currently lack the ability to store data for
efficient historic processing, and many applications require a combination of near real-time
and historic data analysis. This thesis investigates this problem, and describes and
evaluates a novel approach for addressing it. Antares is a layered framework that has
been designed to exploit and extend the scalability of NoSQL databases to support low
latency querying and high throughput rates for both stream and historic data analysis
simultaneously.
Antares began as a company-funded project sponsored by Red Hat. The motivation was
to identify a new technology that could provide scalable analysis of both stream
and historic data, and to explore new methods for supporting scale
and efficiency, for example a layered approach. A layered approach would exploit the
scale of historic stores and the speed of in-memory processing. New technologies were
investigated to identify current mechanisms and suggest a means of improvement.
Antares supports a layered approach for analysis. The motivation for the platform was
to provide other researchers with scalable, low latency querying of Twitter data to help
automate analysis. Antares needed to provide temporal and spatial analysis of Twitter
data using the timestamp and geotag. The approach used Twitter as a use case and
derived requirements from social scientists for a broader research project called Tweet
My Street.
Many data streaming applications have a location-based aspect, using geospatial data
to enhance the functionality they provide. However, geospatial data is inherently difficult
to process at scale due to its multidimensional nature. To address these difficulties,
this thesis proposes Antares as a new solution for providing scalable and efficient mechanisms
for querying geospatial data. The thesis describes the design of Antares and
evaluates its performance on a range of scenarios taken from a real social media analytics
application. The results show significant performance gains when compared to
existing approaches, for particular types of analysis.
The approach is evaluated by executing experiments across Antares and similar systems
to show the improved results. Antares demonstrates that a layered approach can be
used to improve performance for inserts and searches, as well as to increase the ingestion
rate of the system.