A parametric prototype for spatiotemporal databases
The main goal of this project is to design and implement the parametric database (ParaDB). Conceptually, ParaDB consists of the parametric data model (ParaDM) and the parametric structured query language (ParaSQL). ParaDM is a data model for multi-dimensional databases such as temporal, spatial, spatiotemporal, or multi-level secure databases. The main difference from the classical relational data model is that ParaDM models an object as a single tuple, with each attribute defined as a function over parametric elements. The set of parametric elements is closed under union, intersection, and complementation; these operations are the counterparts of "or", "and", and "not" in a natural language such as English. The closure properties therefore provide a flexible way to query objects without the additional self-join operations frequently required in other multi-dimensional database models.
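As a rough illustration of the closure idea described above (the representation and names here are our own, not ParaDB's actual design), parametric elements over a finite discrete time domain can be modeled as plain sets, whose closure under union, intersection, and complement lets a query combine conditions without self-joins:

```python
# Sketch: parametric elements over a discrete time domain (illustrative only;
# ParaDB's actual internal representation may differ).
DOMAIN = frozenset(range(10))  # assumed finite time domain 0..9

def union(p, q):        # counterpart of "or"
    return p | q

def intersect(p, q):    # counterpart of "and"
    return p & q

def complement(p):      # counterpart of "not", relative to the domain
    return DOMAIN - p

# An object is a single tuple; each attribute maps parametric elements to
# values, e.g. a salary of 50k during times 0..4 and 60k during times 5..9.
salary = {frozenset(range(5)): 50_000, frozenset(range(5, 10)): 60_000}

# "When was the salary 50k or 60k?" is answered by set union over the
# parametric elements -- no self-join over a timestamped table is needed.
when_either = union(*salary.keys())
print(sorted(when_either))  # the full domain, 0 through 9
```

Because the result of every such operation is again a parametric element, compound conditions compose freely, which is the flexibility the closure properties buy.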
Enhanced Encryption and Fine-Grained Authorization for Database Systems
The aim of this research is to enhance fine-grained authorization and encryption
so that database systems are equipped with the controls necessary to help
enterprises adhere to zero-trust security more effectively. For fine-grained
authorization, this thesis has extended database systems with three new
concepts: Row permissions, column masks and trusted contexts. Row
permissions and column masks provide data-centric security so the security
policy cannot be bypassed as it can be with database views, for example. They also
coexist in harmony with the rest of the database core tenets so that enterprises
are not forced to compromise either security or database functionality. Trusted
contexts provide applications in multitiered environments with a secure and
controlled manner to propagate user identities to the database and therefore
enable such applications to delegate the security policy to the database system
where it is enforced more effectively. Trusted contexts also protect against
application bypass so the application credentials cannot be abused to make
database changes outside the scope of the application's business logic. For
encryption, this thesis has introduced a holistic database encryption solution to
address the limitations of traditional database encryption methods. It too coexists
in harmony with the rest of the database core tenets so that enterprises are not
forced to choose between security and performance as with column encryption,
for example. Lastly, row permissions, column masks, trusted contexts and holistic
database encryption have all been implemented in IBM DB2, where they are relied
upon by thousands of organizations from around the world to protect critical data
and adhere to zero-trust security more effectively.
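To make the column-mask idea concrete (in DB2 these controls are defined in SQL; the function names and masking format below are hypothetical, purely for illustration), the key point is that the mask is applied centrally on every read, so no query path can route around it the way a privileged user can route around a view:

```python
# Hypothetical sketch of data-centric column masking (not DB2's actual API).

def mask_ssn(value: str, role: str) -> str:
    """Return the full SSN only to the PAYROLL role; others see a masked form."""
    if role == "PAYROLL":
        return value
    return "XXX-XX-" + value[-4:]

def query_employees(rows, role):
    # Every read passes through the mask -- data-centric security, unlike a
    # view, which a sufficiently privileged user could bypass by querying
    # the base table directly.
    return [(name, mask_ssn(ssn, role)) for name, ssn in rows]

rows = [("Ada", "123-45-6789"), ("Grace", "987-65-4321")]
print(query_employees(rows, "HR"))       # SSNs masked for non-payroll roles
print(query_employees(rows, "PAYROLL"))  # full SSNs for the authorized role
```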
On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark
Querying very large RDF data sets in an efficient manner requires a
sophisticated distribution strategy. Several innovative solutions have recently
been proposed for optimizing data distribution with predefined query workloads.
This paper presents an in-depth analysis and experimental comparison of five
representative and complementary distribution approaches. To ensure fair
experimental results, we use Apache Spark as a common parallel computing
framework, rewriting the algorithms concerned with the Spark API. Spark
provides guarantees in terms of fault tolerance, high availability and
scalability which are essential in such systems. Our different implementations
aim to highlight the fundamental implementation-independent characteristics of
each approach in terms of data preparation, load balancing, data replication,
and, to some extent, query answering cost and performance. The presented
measures are obtained by testing each system on one synthetic and one
real-world data set over query workloads with differing characteristics and
different partitioning constraints.
Comment: 16 pages, 3 figures
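One common baseline among the distribution strategies such comparisons cover is subject-based hash partitioning, which co-locates all triples sharing a subject so that subject-star queries need no cross-worker join. A minimal sketch (the details here are our own, not taken from any of the compared algorithms):

```python
# Sketch: subject-hash partitioning of RDF triples (illustrative baseline).
from collections import defaultdict
import zlib

def partition(triples, n_workers):
    """Assign each (s, p, o) triple to a worker by a stable hash of its subject."""
    parts = defaultdict(list)
    for s, p, o in triples:
        worker = zlib.crc32(s.encode()) % n_workers  # stable across runs
        parts[worker].append((s, p, o))
    return parts

triples = [
    (":alice", ":knows", ":bob"),
    (":alice", ":age",   "30"),
    (":bob",   ":knows", ":carol"),
]
parts = partition(triples, n_workers=4)
# Both :alice triples land on the same worker, so a star query on :alice
# is answered locally; the trade-off is skew when one subject dominates.
```

The same skeleton maps naturally onto Spark's `partitionBy` with a custom key function, which is one reason a common framework makes the approaches comparable.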
Analytical Queries: A Comprehensive Survey
Modern hardware heterogeneity brings efficiency and performance opportunities
for analytical query processing. In the presence of continuous data volume and
complexity growth, bridging the gap between recent hardware advancements and
the data processing tools ecosystem is paramount for improving the speed of ETL
and model development. In this paper, we present a comprehensive overview of
existing analytical query processing approaches as well as the use and design
of systems that use heterogeneous hardware for the task. We then analyze
state-of-the-art solutions and identify missing pieces. The last two chapters
discuss the identified problems and present our view on how the ecosystem
should evolve.
A Survey of Distributed Data Stream Processing Frameworks
Big data processing systems are evolving to be more stream oriented, where each data record is processed as it arrives by distributed and low-latency computational frameworks on a continuous basis. As stream processing technology matures and more organizations invest in digital transformations, new applications of stream analytics will be identified and implemented across a wide spectrum of industries. One of the challenges in developing a streaming analytics infrastructure is the difficulty of selecting the right stream processing framework for different use cases. With a view to addressing this issue, in this paper we present a taxonomy, a comparative study of distributed data stream processing and analytics frameworks, and a critical review of representative open source (Storm, Spark Streaming, Flink, Kafka Streams) and commercial (IBM Streams) distributed data stream processing frameworks. The paper also reports on our ongoing work toward a multilevel streaming analytics architecture that can serve as a guide for organizations and individuals planning to implement a real-time data stream processing and analytics framework.
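The record-at-a-time model shared by the frameworks surveyed above can be sketched in a few lines of plain Python (this illustrates the model only, not the API of Storm, Flink, or any other framework): each record is handled as it arrives, and state is emitted whenever a tumbling window closes.

```python
# Minimal sketch of record-at-a-time processing with tumbling windows
# (plain Python; not any framework's actual API).
from collections import Counter

def tumbling_counts(stream, window_size):
    """Process each (timestamp, key) record as it arrives; emit a Counter
    of key frequencies for every completed window of `window_size` units."""
    window, counts = None, Counter()
    for ts, key in stream:
        w = ts // window_size
        if window is not None and w != window:
            yield window, counts       # window closed: emit and reset state
            counts = Counter()
        window = w
        counts[key] += 1
    if window is not None:
        yield window, counts           # flush the final, still-open window

events = [(0, "a"), (1, "b"), (2, "a"), (5, "b"), (6, "b")]
results = list(tumbling_counts(events, window_size=5))
# -> [(0, Counter({'a': 2, 'b': 1})), (1, Counter({'b': 2}))]
```

Real frameworks add exactly what this sketch lacks: distribution of keys across workers, fault-tolerant state, and out-of-order (event-time) handling, which is where the selection criteria compared in the paper come in.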
An Educator's Perspective of the Tidyverse
Computing makes up a large and growing component of data science and statistics courses. Many of those courses, especially when taught by faculty who are statisticians by training, teach R as the programming language. A number of instructors have opted to build much of their teaching around use of the tidyverse. The tidyverse, in the words of its developers, "is a collection of R packages that share a high-level design philosophy and low-level grammar and data structures, so that learning one package makes it easier to learn the next" (Wickham et al. 2019). These shared principles have led to the widespread adoption of the tidyverse ecosystem. A large part of this usage is because the tidyverse tools have been intentionally designed to ease the learning process and make it easier for users to learn new functions as they engage with additional pieces of the larger ecosystem. Moreover, the functionality offered by the packages within the tidyverse spans the entire data science cycle, which includes data import, visualisation, wrangling, modeling, and communication. We believe the tidyverse provides an effective and efficient pathway for undergraduate students at all levels and majors to gain computational skills and thinking needed throughout the data science cycle. In this paper, we introduce the tidyverse from an educator's perspective. We provide a brief introduction to the tidyverse, demonstrate how foundational statistics and data science tasks are accomplished with the tidyverse, and discuss the strengths of the tidyverse, particularly in the context of teaching and learning.