One Table to Count Them All: Parallel Frequency Estimation on Single-Board Computers
Sketches are probabilistic data structures that can provide approximate results within mathematically proven error bounds while using orders of magnitude less memory than traditional approaches. They are tailored for streaming data analysis even on architectures with limited memory, such as the single-board computers widely used for IoT and edge computing. Since these devices offer multiple cores, they can handle high-volume data streams given efficient parallel sketching schemes. However, since their caches are relatively small, careful parallelization is required.
In this work, we focus on the frequency estimation problem and evaluate the performance of a high-end server, a 4-core Raspberry Pi, and an 8-core Odroid. As the sketch, we employ the widely used Count-Min Sketch. To hash the stream in parallel and in a cache-friendly way, we apply a novel tabulation approach and rearrange the auxiliary tables into a single one. To parallelize the process efficiently, we modify the workflow and apply a form of buffering between hash computations and sketch updates.
Today, many single-board computers have heterogeneous processors that combine slow and fast cores. To utilize all these cores to their full potential, we propose a dynamic load-balancing mechanism that significantly increases the performance of frequency estimation.
Comment: 12 pages, 4 figures, 3 algorithms, 1 table, submitted to EuroPar'1
Cross-Disciplinary Collaborations in Data Quality Research
Data Quality has been the target of research and development for over four decades, and due to its cross-disciplinary nature it has been approached by business analysts, solution architects, database experts and statisticians, to name a few. As data quality increases in importance and complexity, there is a need to motivate the exploitation of synergies across diverse research communities in order to form holistic solutions that span its organizational, architectural and computational aspects. As a first step towards bridging gaps between the various research communities, we undertook a comprehensive literature study of data quality research published in the last two decades. In this study we considered a broad range of Information System (IS) and Computer Science (CS) publication outlets. The main aims of the study were to understand the current landscape of data quality research, create better awareness of (lack of) synergies between various research communities, and, subsequently, direct attention towards holistic solutions. In this paper, we present a summary of the findings from the study that outline the overlaps and distinctions between the two communities from various points of view, including publication outlets, topics and themes of research, highly cited or influential contributors, and the strength and nature of co-authorship networks.
Adaptive Time Synchronization for Homogeneous WSNs
Wireless sensor networks (WSNs) are used to observe real-world phenomena. Sensor nodes (SNs) must be synchronized to a common time in order to precisely map the data they collect. Clock synchronization is very challenging in WSNs because sensor networks are resource-constrained. Clock synchronization protocols designed for WSNs must therefore be lightweight, i.e., SNs must be synchronized with as few synchronization message exchanges as possible. In this paper, we propose a clock synchronization protocol for WSNs in which cluster heads (CHs) are first synchronized with the sink, and the cluster nodes (CNs) are then synchronized with their respective CHs. CNs are synchronized with the help of a time synchronization node (TSN) chosen by their respective CH. Simulation results show that the proposed protocol requires considerably fewer synchronization messages than the reference broadcast synchronization (RBS) protocol and the minimum variance unbiased estimation (MVUE) method. The clock skew correction mechanism applied in the proposed protocol guarantees long-term stability and hence decreases the re-synchronization frequency, thereby conserving more energy.
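The skew-correction idea the abstract mentions can be illustrated with a generic linear clock model: a node's local clock drifts at a roughly constant rate, so fitting reference time as a linear function of local time lets the node stay synchronized between message exchanges. This is a minimal sketch under that common assumption, not the paper's actual protocol or estimator.

```python
def estimate_skew_offset(local_times, ref_times):
    """Least-squares fit: ref ~= skew * local + offset, from paired timestamps.

    Estimating skew (drift rate), not just offset, is what allows long-term
    stability and a lower re-synchronization frequency.
    """
    n = len(local_times)
    mean_l = sum(local_times) / n
    mean_r = sum(ref_times) / n
    num = sum((l - mean_l) * (r - mean_r)
              for l, r in zip(local_times, ref_times))
    den = sum((l - mean_l) ** 2 for l in local_times)
    skew = num / den
    offset = mean_r - skew * mean_l
    return skew, offset

def to_reference(t_local, skew, offset):
    # Map a local clock reading to estimated reference (sink/CH) time.
    return skew * t_local + offset
```

In a cluster hierarchy like the one described, a CN would collect a few (local, reference) timestamp pairs from its TSN, fit the model once, and then translate local readings without further message exchanges until drift accumulates.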
Enhanced Stream Processing in a DBMS Kernel
Continuous query processing has emerged as a promising query processing paradigm with numerous applications. A recent development is the need to handle both streaming queries and typical one-time queries in the same application. For example, data warehousing can greatly benefit from the integration of stream semantics, i.e., online analysis of incoming data and combination with existing data. This is especially useful to provide low latency in data-intensive analysis in big data warehouses that are augmented with new data on a daily basis.
However, state-of-the-art database technology cannot handle streams efficiently due to their "continuous" nature. At the same time, state-of-the-art stream technology is purely focused on stream applications. The research efforts are mostly geared towards the creation of specialized stream management systems built with a different philosophy than a DBMS. The drawback of this approach is the limited opportunities to exploit successful past data processing technology, e.g., query optimization techniques.
For this new problem, we need to combine the best of both worlds. Here we take a completely different route by designing a stream engine on top of an existing relational database kernel. This includes reuse of both its storage/execution engine and its optimizer infrastructure. The major challenge then becomes the efficient support for specialized stream features. This paper focuses on incremental window-based processing, arguably the most crucial stream-specific requirement. In order to maintain and reuse the generic storage and execution model of the DBMS, we elevate the problem to the query plan level. Proper op
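The core of incremental window-based processing can be illustrated with a count-based sliding window: instead of rescanning the whole window for every new tuple, the aggregate is updated by adding the arriving value and subtracting the expired one. This is a generic illustration of the technique, not the paper's plan-level design inside a DBMS kernel.

```python
from collections import deque

class SlidingSum:
    """Incremental sum over a count-based sliding window of `size` tuples.

    Each insert updates the aggregate in O(1): add the new value and, once
    the window is full, subtract the value that falls out of the window.
    """

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0

    def insert(self, value):
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total
```

Recomputing the sum from scratch would cost O(w) per tuple for a window of w tuples; the incremental form is what makes window aggregates viable at stream rates, and lifting this delta-maintenance into the query plan is the direction the paper pursues.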