Efficient processing of similarity queries with applications
Today, a myriad of data sources, from the Internet to business operations to scientific instruments, produce large volumes of diverse data. Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, call for identifying and processing similarities in big data. As a result, it is imperative to develop new similarity query processing approaches and systems that scale from low-dimensional to high-dimensional data, from a single machine to clusters of hundreds of machines, and from disk-based to memory-based processing. This dissertation introduces and studies several similarity-aware query operators, and analyzes and optimizes their performance.
The first contribution of this dissertation is an SQL-based Similarity Group-by operator (SGB, for short) that extends the semantics of the standard SQL Group-by operator to group data with similar but not necessarily equal values. We realize these SGB operators by extending the Standard SQL Group-by and introduce two new SGB operators for multi-dimensional data. We implement and test the new SGB operators and their algorithms inside an open-source centralized database server (PostgreSQL).
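To make the idea of grouping "similar but not necessarily equal" values concrete, the following is a minimal sketch for one-dimensional data. It assumes one possible grouping rule (a value joins the current group when it lies within `eps` of the previous sorted value); the function name `similarity_group_by` and the parameter `eps` are hypothetical and are not taken from the dissertation or from PostgreSQL.

```python
def similarity_group_by(values, eps):
    """Group 1-D values so that neighbouring sorted values closer than
    eps share a group. This is an illustrative rule only, not the
    dissertation's SGB semantics."""
    groups = []
    for v in sorted(values):
        # Extend the last group if v is within eps of its largest member.
        if groups and v - groups[-1][-1] <= eps:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups


if __name__ == "__main__":
    # A standard GROUP BY would produce five groups here; the
    # similarity-based variant merges values that are close together.
    print(similarity_group_by([10, 11, 50, 52, 90], eps=2))
```

Under this rule a chain of close values ends up in one group even if its two ends are far apart; a stricter rule would require every pair in a group to be within `eps`.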
In the second contribution of this dissertation, we study how to efficiently process Hamming-distance-based similarity queries (Hamming-distance select and Hamming-distance join) that are crucial to many applications. We introduce a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming distance range queries (namely, the selects and joins).
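As a point of reference for the two query types, here is a brute-force sketch of a Hamming-distance select and join over integer bit codes. It is a baseline only and does not reproduce the HA-Index, whose structure the abstract does not describe; the function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer codes differ."""
    return bin(a ^ b).count("1")


def hamming_select(codes, query, threshold):
    """Return every code within `threshold` bits of `query` (linear scan)."""
    return [c for c in codes if hamming_distance(c, query) <= threshold]


def hamming_join(left, right, threshold):
    """Return all (a, b) pairs within `threshold` bits (nested loop)."""
    return [(a, b) for a in left for b in right
            if hamming_distance(a, b) <= threshold]


if __name__ == "__main__":
    codes = [0b0000, 0b0011, 0b1111]
    print(hamming_select(codes, 0b0001, threshold=1))
    print(hamming_join([0b0000, 0b1111], [0b0001, 0b0111], threshold=1))
```

The select compares the query against every code and the join compares every pair, which is the redundant work an index for these queries aims to avoid.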
In the third and last contribution of this dissertation, we develop a system for similarity query processing and optimization in an in-memory and distributed setup for big spatial data. We propose a query scheduler and a distributed query optimizer that use a new cost model to optimize the cost of similarity query processing in this in-memory distributed setup. The scheduler and query optimizer generate query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on Bloom filters to forward queries to the appropriate local sites. The proposed query processing and optimization techniques are prototyped inside Spark, a distributed main-memory computation system.
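The idea of using Bloom filters to forward queries can be illustrated with a small sketch. It assumes that each site summarizes the grid cells it stores in a Bloom filter and that the scheduler forwards a query only to sites whose filter may contain the query's cell. The class `BloomFilter`, the function `route_query`, and the grid-cell encoding are all hypothetical; the dissertation's actual indexing technique and its Spark integration are not shown.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array stored in one Python integer

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def route_query(cell, site_filters):
    """Return the sites whose filter reports that `cell` may be present."""
    return [site for site, bf in site_filters.items()
            if bf.might_contain(cell)]


if __name__ == "__main__":
    filters = {"site_a": BloomFilter(), "site_b": BloomFilter()}
    filters["site_a"].add((3, 4))  # site_a holds grid cell (3, 4)
    filters["site_b"].add((9, 9))
    print(route_query((3, 4), filters))
```

A site that holds the cell is always returned. A site that does not hold it may occasionally be returned as well, so the local site still has to verify the result.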
Estimating Fire Weather Indices via Semantic Reasoning over Wireless Sensor Network Data Streams
Wildfires are frequent, devastating events in Australia that regularly cause
significant loss of life and widespread property damage. Fire weather indices
are a widely-adopted method for measuring fire danger and they play a
significant role in issuing bushfire warnings and in anticipating demand for
bushfire management resources. Existing systems that calculate fire weather
indices are limited due to low spatial and temporal resolution. Localized
wireless sensor networks, on the other hand, gather continuous sensor data
measuring variables such as air temperature, relative humidity, rainfall and
wind speed at high resolutions. However, using wireless sensor networks to
estimate fire weather indices is a challenge due to data quality issues, lack
of standard data formats and lack of agreement on thresholds and methods for
calculating fire weather indices. Within the scope of this paper, we propose a
standardized approach to calculating Fire Weather Indices (a.k.a. fire danger
ratings) and overcome a number of the challenges by applying Semantic Web
Technologies to the processing of data streams from a wireless sensor network
deployed in the Springbrook region of South East Queensland. This paper
describes the underlying ontologies, the semantic reasoning and the Semantic
Fire Weather Index (SFWI) system that we have developed to enable domain
experts to specify and adapt rules for calculating Fire Weather Indices. We
also describe the Web-based mapping interface that we have developed, which
enables users to improve their understanding of how fire weather indices vary
over time within a particular region. Finally, we discuss our evaluation results
that indicate that the proposed system outperforms state-of-the-art techniques
in terms of accuracy, precision and query performance.
Comment: 20 pages, 12 figures
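As an illustration of how a fire weather index can be computed from the sensor variables listed in the abstract, the sketch below uses the widely cited McArthur Forest Fire Danger Index in its common closed-form approximation. This is an assumption: the abstract does not say which index or which thresholds the SFWI system uses. The function names and the example thresholds are hypothetical, and the thresholds are passed in as data because the abstract notes that there is no agreement on them.

```python
import math


def forest_fire_danger_index(drought_factor, rel_humidity_pct,
                             air_temp_c, wind_speed_kmh):
    """McArthur FFDI, closed-form approximation.

    drought_factor is a dryness score (typically 0 to 10, must be > 0)
    that is itself derived from recent rainfall.
    """
    return 2.0 * math.exp(
        -0.450
        + 0.987 * math.log(drought_factor)
        - 0.0345 * rel_humidity_pct
        + 0.0338 * air_temp_c
        + 0.0234 * wind_speed_kmh
    )


def danger_rating(index, thresholds):
    """Map an index to the label of the highest lower bound it reaches.

    `thresholds` is a list of (lower_bound, label) pairs sorted by
    lower_bound, so a domain expert can swap in different rules.
    """
    label = thresholds[0][1]
    for lower_bound, name in thresholds:
        if index >= lower_bound:
            label = name
    return label


if __name__ == "__main__":
    # Illustrative thresholds only, not an official rating scale.
    example_thresholds = [(0, "low"), (12, "high"), (50, "extreme")]
    ffdi = forest_fire_danger_index(10, 20, 35, 30)
    print(round(ffdi, 1), danger_rating(ffdi, example_thresholds))
```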