Thinking spatial
The systems community in both academia and industry has had tremendous success in building widely used general-purpose systems for various types of data and applications. Examples include database systems, big data systems, data streaming systems, and machine learning systems. The vast majority of these systems are ill-equipped to support spatial data. The main reason is that system builders mostly think of spatial data as just one more type of data. Spatial support is treated as an afterthought that can be provided via on-top functions or spatial cartridges added to already built systems. This article advocates that spatial data and applications need to be natively supported in special-purpose systems, where spatial data is considered a first-class citizen and spatial operations are built inside the engine rather than on top of it. System builders should consider spatial data while building their systems. The article gives examples of five categories of systems, namely database systems, big data systems, machine learning systems, recommender systems, and social network systems, that would benefit tremendously, in terms of both accuracy and performance, from treating spatial data as an integral part of the system engine.
A Prospective Analysis of Security Vulnerabilities within Link Traversal-Based Query Processing (Extended Version)
The societal and economic consequences surrounding Big Data-driven
platforms have increased the call for decentralized solutions. However,
retrieving and querying data in more decentralized environments requires
fundamentally different approaches, whose properties are not yet well
understood. Link Traversal-based Query Processing (LTQP) is a technique for
querying over decentralized data networks, in which a client-side query engine
discovers data by traversing links between documents. Since decentralized
environments are potentially unsafe due to their non-centrally controlled
nature, there is a need for client-side LTQP query engines to be resistant
against security threats aimed at the query engine's host machine or the query
initiator's personal data. As such, we have performed an analysis of potential
security vulnerabilities of LTQP. This article provides an overview of security
threats in related domains, which are used as inspiration for the
identification of 10 LTQP security threats. Each threat is explained, together
with an example, and one or more avenues for mitigations are proposed. We
conclude with several concrete recommendations for LTQP query engine developers
and data publishers as a first step to mitigate some of these issues. With this
work, we start filling the unknowns for enabling querying over decentralized
environments. Aside from future work on security, wider research is needed to
uncover missing building blocks for enabling true decentralization.
Comment: This is an extended version of an article with the same title
published in the proceedings of the QuWeDa workshop at ISWC 2022. Next to
more details in the related work and conclusions sections, this extension
introduces concrete mitigations of each vulnerabilit
Ontology-Based Data Access to Big Data
Recent approaches to ontology-based data access (OBDA) have extended the focus from relational database systems to other types of backends such as cluster frameworks in order to cope with the four Vs associated with big data: volume, veracity, variety and velocity (stream processing). The abstraction that an ontology provides is a benefit from the end-user point of view, but it represents a challenge for developers because high-level queries must be transformed into queries executable on the backend level. In this paper, we discuss and evaluate an OBDA system that uses STARQL (Streaming and Temporal ontology Access with a Reasoning-based Query Language) as a high-level query language to access data stored in a SPARK cluster framework. The development of the STARQL-SPARK engine shows that there is a need to provide a homogeneous interface to access static, temporal, and streaming data, because cluster frameworks usually lack such an interface. The experimental evaluation shows that building a scalable OBDA system that runs with SPARK is more than plug-and-play, as one needs to know the data formats and the data organisation in the cluster framework quite well.
Internal combustion engine sensor network analysis using graph modeling
In recent years there has been a rapid development in technologies for smart monitoring applied to many different areas (e.g. building automation, photovoltaic systems, etc.). An intelligent monitoring system employs multiple sensors distributed within a network to extract useful information for decision-making. The management and analysis of the raw data derived from the sensor network involve a number of specific challenges that remain unresolved, related to the different communication standards, the heterogeneous structure and the huge volume of data.
In this paper we propose to apply a method based on complex network theory to evaluate the performance of an Internal Combustion Engine. Data are gathered from the OBD sensor subset and from the emission analyzer. The method models the sensor network as a graph, where the nodes represent the sensors and the edges are evaluated with non-linear statistical correlation functions applied to the time series pairs.
The resulting functional graph is then analyzed with the topological metrics of the network, to define characteristic properties that serve as useful indicators for maintenance and diagnosis.
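The graph construction described in this abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the abstract does not say which non-linear correlation function is used, so Spearman rank correlation stands in for it; the sensor names, sample values, and the 0.8 threshold are hypothetical; and node degree is used as the simplest example of a topological metric.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def build_graph(series, threshold=0.8):
    """Connect two sensors when |Spearman correlation| reaches the threshold.

    The threshold is an assumed illustrative value, not one from the paper.
    """
    edges = {}
    for a, b in combinations(sorted(series), 2):
        weight = abs(spearman(series[a], series[b]))
        if weight >= threshold:
            edges[(a, b)] = weight
    return edges


def degrees(series, edges):
    """Node degree, used here as the simplest topological metric."""
    degree = {name: 0 for name in series}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


# Hypothetical sensor time series (names and values are made up).
sensor_series = {
    "rpm": [1, 2, 3, 4, 5],
    "speed": [1, 4, 9, 16, 25],  # monotonic but non-linear in rpm
    "co": [3, 1, 4, 1, 5],       # only weakly related to the others
}
graph_edges = build_graph(sensor_series)
node_degrees = degrees(sensor_series, graph_edges)
```

With these made-up series, only the rpm and speed sensors end up connected, because a rank-based measure captures their monotonic but non-linear relationship. A real analysis would substitute the correlation function and the topological metrics actually used in the paper.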
Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture
We present the architecture behind Twitter's real-time related query
suggestion and spelling correction service. Although these tasks have received
much attention in the web search literature, the Twitter context introduces a
real-time "twist": after significant breaking news events, we aim to provide
relevant results within minutes. This paper provides a case study illustrating
the challenges of real-time data processing in the era of "big data". We tell
the story of how our system was built twice: our first implementation was built
on a typical Hadoop-based analytics stack, but was later replaced because it
did not meet the latency requirements necessary to generate meaningful
real-time results. The second implementation, which is the system deployed in
production, is a custom in-memory processing engine specifically designed for
the task. This experience taught us that the current typical usage of Hadoop as
a "big data" platform, while great for experimentation, is not well suited to
low-latency processing, and points the way to future work on data analytics
platforms that can handle "big" as well as "fast" data
Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources
Apache Calcite is a foundational software framework that provides query
processing, optimization, and query language support to many popular
open-source data processing systems such as Apache Hive, Apache Storm, Apache
Flink, Druid, and MapD. Calcite's architecture consists of a modular and
extensible query optimizer with hundreds of built-in optimization rules, a
query processor capable of processing a variety of query languages, an adapter
architecture designed for extensibility, and support for heterogeneous data
models and stores (relational, semi-structured, streaming, and geospatial).
This flexible, embeddable, and extensible architecture is what makes Calcite an
attractive choice for adoption in big-data frameworks. It is an active project
that continues to introduce support for the new types of data sources, query
languages, and approaches to query processing and optimization.
Comment: SIGMOD'1
Ringo: Interactive Graph Analytics on Big-Memory Machines
We present Ringo, a system for analysis of large graphs. Graphs provide a way
to represent and analyze systems of interacting objects (people, proteins,
webpages) with edges between the objects denoting interactions (friendships,
physical interactions, links). Mining graphs provides valuable insights about
individual objects as well as the relationships among them.
In building Ringo, we take advantage of the fact that machines with large
memory and many cores are widely available and also relatively affordable. This
allows us to build an easy-to-use interactive high-performance graph analytics
system. Graphs also need to be built from input data, which often resides in
the form of relational tables. Thus, Ringo provides rich functionality for
manipulating raw input data tables into various kinds of graphs. Furthermore,
Ringo also provides over 200 graph analytics functions that can then be applied
to constructed graphs.
We show that a single big-memory machine provides a very attractive platform
for performing analytics on all but the largest graphs as it offers excellent
performance and ease of use as compared to alternative approaches. With Ringo,
we also demonstrate how to integrate graph analytics with an iterative process
of trial-and-error data exploration and rapid experimentation, common in data
mining workloads.
Comment: 6 pages, 2 figure