Graph Summarization
The continuous and rapid growth of highly interconnected datasets, which are
both voluminous and complex, calls for the development of adequate processing
and analytical techniques. One method for condensing and simplifying such
datasets is graph summarization. It denotes a series of application-specific
algorithms designed to transform graphs into more compact representations while
preserving structural patterns, query answers, or specific property
distributions. As this problem is common to several areas studying graph
topologies, different approaches, such as clustering, compression, sampling, or
influence detection, have been proposed, primarily based on statistical and
optimization methods. Our chapter pinpoints the main graph
summarization methods, with particular emphasis on the most recent approaches
and novel research trends on this topic that previous surveys have not yet covered.
Comment: To appear in the Encyclopedia of Big Data Technologie
Data stream processing meets the Advanced Metering Infrastructure: possibilities, challenges and applications
Distribution of electricity is changing. Energy production is increasingly distributed, weather dependent and located in the distribution network, close to consumers. Energy consumption is increasing throughout society and the electrification of transportation is driving distribution networks closer to their limits. Operating the networks closer to their limits also increases the risk of faults. Continuous monitoring of the distribution network closest to the customers is needed in order to mitigate this risk. The Advanced Metering Infrastructure (AMI) introduced smart meters throughout the distribution network. Data stream processing is a computing paradigm that offers low-latency results from analysis of large volumes of data. This thesis investigates the possibilities and challenges for continuous monitoring that are created when the Advanced Metering Infrastructure and data stream processing meet. The challenges addressed in the thesis are efficient processing of unordered (also called out-of-order) data and efficient usage of the computational resources present in the Advanced Metering Infrastructure. Contributions towards more efficient processing of out-of-order data are made with eChIDNA and TinTiN. Both are systems that utilize knowledge about smart meter data to produce results directly where possible and to store only the data that is relevant for late data, so that updated results can be produced when such late data arrives.
eChIDNA is integrated in the streaming query itself, while TinTiN is a streaming middleware that can be applied to streaming queries in order to make them resilient against out-of-order data. Eventual determinism is defined in order to formally investigate the deterministic properties of output produced by such systems. Contributions towards efficient usage of the computational resources of the Advanced Metering Infrastructure are made with the application LoCoVolt. LoCoVolt implements a monitoring algorithm that can run on equipment located in the communication infrastructure of the Advanced Metering Infrastructure and can take advantage of the overlap between the communication and distribution networks. All contributions are evaluated on hardware that is available in current AMI systems, using large scale data obtained from a real production AMI.
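The idea of emitting results immediately and revising them when late data arrives can be illustrated with a minimal Python sketch. It is not the eChIDNA or TinTiN design, which the abstract does not detail. The class name `UpdatingWindowSum`, the fixed-size time windows, and the sample readings are all assumptions made for illustration. Only one running sum per window is kept, rather than the raw readings.

```python
from collections import defaultdict


class UpdatingWindowSum:
    """Hypothetical aggregator: sums readings per fixed time window.

    Every reading yields an output right away. A late (out-of-order)
    reading simply produces an updated sum for its older window.
    """

    def __init__(self, window_seconds: int) -> None:
        self.window_seconds = window_seconds
        # Only the running sum per window is stored, not the raw readings.
        self.sums: dict[int, float] = defaultdict(float)

    def add(self, timestamp: int, value: float) -> tuple[int, float]:
        # Map the reading to the start of the window it belongs to.
        window_start = timestamp - timestamp % self.window_seconds
        self.sums[window_start] += value
        # Emit the (possibly revised) result for that window.
        return window_start, self.sums[window_start]


# Illustrative readings; the third one arrives late for the first window.
agg = UpdatingWindowSum(window_seconds=3600)
readings = [(10, 1.0), (3700, 2.0), (20, 0.5)]
outputs = [agg.add(t, v) for t, v in readings]
```

In this sketch the late reading at timestamp 20 triggers a second, updated output for the window starting at 0, while the window starting at 3600 is left untouched.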
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several follow-up works. This article
provides a comprehensive survey of a family of approaches and
large scale data processing mechanisms that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.
Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
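The programming model described in this abstract can be illustrated with the classic word-count example. The following is a minimal single-process Python sketch, not any of the surveyed systems. The helper names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, and the distribution, scheduling, and fault tolerance that a real framework handles are omitted entirely.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[tuple[str, int]]],
    reducer: Callable[[str, list[int]], tuple[str, int]],
) -> dict[str, int]:
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle" phase, done locally
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


word_counts = run_mapreduce(["a b a", "b c"], map_words, reduce_counts)
```

The application code consists only of the two small functions; everything in `run_mapreduce` stands in for what the framework would do across a cluster.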
Cognitive science and epistemic openness
Recent findings in cognitive science suggest that the epistemic subject is more complex and epistemically porous than is generally pictured. Human knowers are open to the world via multiple channels, each operating for particular purposes and according to its own logic. These findings need to be understood and addressed by the philosophical community. The current essay argues that one consequence of the new findings is to invalidate certain arguments for epistemic anti-realism
Accelerating Aggregation Queries on Unstructured Streams of Data
Analysts and scientists are interested in querying streams of video, audio,
and text to extract quantitative insights. For example, an urban planner may
wish to measure congestion by querying the live feed from a traffic camera.
Prior work has used deep neural networks (DNNs) to answer such queries in the
batch setting. However, much of this work is not suited for the streaming
setting because it requires access to the entire dataset before a query can be
submitted or is specific to video. Thus, to the best of our knowledge, no prior
work addresses the problem of efficiently answering queries over multiple
modalities of streams.
In this work we propose InQuest, a system for accelerating aggregation
queries on unstructured streams of data with statistical guarantees on query
accuracy. InQuest leverages inexpensive approximation models ("proxies") and
sampling techniques to limit the execution of an expensive high-precision model
(an "oracle") to a subset of the stream. It then uses the oracle predictions to
compute an approximate query answer in real-time. We theoretically analyze
InQuest and show that the expected error of its query estimates converges on
stationary streams at a rate inversely proportional to the oracle budget. We
evaluate our algorithm on six real-world video and text datasets and show that
InQuest achieves the same root mean squared error (RMSE) as two streaming
baselines with up to 5.0x fewer oracle invocations. We further show that
InQuest can achieve up to 1.9x lower RMSE at a fixed number of oracle
invocations than a state-of-the-art batch setting algorithm.
Comment: 14 pages, 11 figures, to be published in Proceedings of the VLDB
Endowment, Vol. 16, No. 1
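The general idea of using a cheap proxy to decide where to spend a limited oracle budget can be sketched with plain stratified sampling over one buffered segment of a stream. This is not the InQuest algorithm, whose details the abstract does not give. The function `estimate_mean`, its parameters, and the toy proxy and oracle below are all assumptions made for illustration.

```python
import random
from statistics import fmean
from typing import Callable, Sequence


def estimate_mean(
    records: Sequence[int],
    proxy: Callable[[int], float],
    oracle: Callable[[int], float],
    budget: int,
    num_strata: int = 2,
    seed: int = 0,
) -> tuple[float, int]:
    """Estimate the mean oracle value with a bounded number of oracle calls.

    Records are ranked by the cheap proxy score and split into strata.
    The oracle runs only on a uniform sample drawn from each stratum.
    """
    if not records:
        raise ValueError("records must not be empty")
    rng = random.Random(seed)
    ranked = sorted(records, key=proxy)
    size = -(-len(ranked) // num_strata)  # ceiling division
    strata = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    per_stratum = max(1, budget // len(strata))

    estimate, oracle_calls = 0.0, 0
    for stratum in strata:
        sample = rng.sample(stratum, min(per_stratum, len(stratum)))
        oracle_calls += len(sample)
        weight = len(stratum) / len(ranked)
        estimate += weight * fmean(oracle(r) for r in sample)
    return estimate, oracle_calls


# Toy data: the proxy separates the two oracle classes perfectly.
records = list(range(100))
estimate, oracle_calls = estimate_mean(
    records,
    proxy=lambda r: r / 100,
    oracle=lambda r: 1.0 if r >= 50 else 0.0,
    budget=10,
)
```

When the proxy correlates well with the oracle, each stratum is nearly homogeneous, so a small sample per stratum already gives a low-variance estimate.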