9 research outputs found
Efficient dictionary compression for processing RDF big data using Google BigQuery
The Resource Description Framework (RDF) data model is used on the Web to express billions of structured statements on a wide range of topics, including government, publications, and the life sciences. Consequently, processing and storing this data requires high-specification systems, both in terms of storage and computational capabilities. On the other hand, cloud-based big data services such as Google BigQuery can be used to store and query this data without any upfront investment. Google BigQuery pricing is based on the size of the data being stored or queried, but given that RDF statements contain long Uniform Resource Identifiers (URIs), the cost of querying and storing RDF big data can increase rapidly. In this paper we present and evaluate a novel and efficient dictionary compression algorithm that is faster, generates small dictionaries that can fit in memory, and achieves a better compression rate than other large-scale RDF dictionary compression approaches. Consequently, our algorithm also reduces the BigQuery storage and query cost.
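To make the underlying idea concrete, the sketch below shows plain dictionary encoding of RDF triples in Python: each long URI is replaced by a small integer identifier and stored once in a dictionary. This is a minimal illustration of the general technique, not the algorithm proposed in the paper; the example triples and the `encode_triples` helper are invented for illustration.

```python
# Minimal illustration of RDF dictionary encoding: long URIs are mapped to
# compact integer identifiers, so the triples table stores small integers
# while the (much smaller) dictionary stores each URI exactly once.
# This is NOT the paper's algorithm, only the general technique.

def encode_triples(triples):
    dictionary = {}          # URI/literal -> integer id
    encoded = []
    for s, p, o in triples:
        ids = []
        for term in (s, p, o):
            if term not in dictionary:
                dictionary[term] = len(dictionary)
            ids.append(dictionary[term])
        encoded.append(tuple(ids))
    return dictionary, encoded

triples = [
    ("http://example.org/person/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/person/bob"),
    ("http://example.org/person/bob", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/person/carol"),
]

dictionary, encoded = encode_triples(triples)
print(encoded)      # [(0, 1, 2), (2, 1, 3)] -- compact rows for storage/query
print(dictionary)   # URI -> id mapping, stored once alongside the data
```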
Investigating elastic cloud based RDF processing
The Semantic Web was proposed as an extension of the traditional Web that gives Web data context and meaning by using the Resource Description Framework (RDF) data model. The recent growth in the adoption of RDF, in addition to the massive growth of RDF data, has led numerous efforts to focus on the challenges of processing this data. To this end, many approaches have focused on vertical scalability by utilising powerful hardware, or on horizontal scalability by utilising always-on physical computer clusters or peer-to-peer networks. However, these approaches rely on fixed, high-specification computer clusters that require considerable upfront and ongoing investment to deal with the data growth. In recent years, cloud computing has seen wide adoption due to its unique elasticity and utility billing features.
This thesis addresses some of the issues related to the processing of large RDF datasets by utilising cloud computing. Initially, the thesis reviews the background literature of related distributed RDF processing work and issues, in particular distributed rule-based reasoning and dictionary encoding, followed by a review of the cloud computing paradigm and related literature. Then, in order to fully utilise features that are specific to cloud computing, such as elasticity, the thesis designs and fully implements a Cloud-based Task Execution framework (CloudEx), a generic framework for efficiently distributing and executing tasks on cloud environments. Subsequently, some of the large-scale RDF processing issues are addressed by using the CloudEx framework to develop algorithms for processing RDF using cloud computing. These algorithms perform efficient dictionary encoding and forward reasoning using cloud-based columnar databases, and are collectively implemented as an Elastic Cost Aware Reasoning Framework (ECARF), a cloud-based RDF triple store. This thesis presents original results and findings that advance the state of the art of distributed cloud-based RDF processing and forward reasoning.
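As a rough illustration of the forward reasoning the thesis refers to (not the ECARF implementation, which operates over cloud-based columnar databases rather than in-memory Python sets), the sketch below repeatedly applies the RDFS subclass rule until no new triples can be derived.

```python
# Minimal sketch of forward (materialisation-based) RDFS reasoning:
# repeatedly apply the rdfs9 rule
#   (?x rdf:type ?c1) and (?c1 rdfs:subClassOf ?c2)  =>  (?x rdf:type ?c2)
# until a fixpoint is reached. Illustration of the technique only.

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def forward_reason(triples):
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        # Index subclass axioms for quick lookup.
        subclass_of = {}
        for s, p, o in closure:
            if p == SUBCLASS:
                subclass_of.setdefault(s, set()).add(o)
        # Apply the rule to every type assertion currently known.
        for s, p, o in list(closure):
            if p == RDF_TYPE:
                for superclass in subclass_of.get(o, ()):
                    inferred = (s, RDF_TYPE, superclass)
                    if inferred not in closure:
                        closure.add(inferred)
                        changed = True
    return closure

triples = {
    ("ex:alice", RDF_TYPE, "ex:Student"),
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
}
print(forward_reason(triples))
# Derives (ex:alice rdf:type ex:Person) and (ex:alice rdf:type ex:Agent).
```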
An Empirical Evaluation of Columnar Storage Formats
Columnar storage is one of the core components of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support for open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed significantly.
In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions that are advantageous on modern hardware and with real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. Our analysis identifies important considerations that may guide future formats to better fit modern technology trends.
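To make the design knobs discussed above concrete, the following sketch writes a small Parquet file with pyarrow, enabling dictionary encoding and skipping block compression. The column names and data are invented, and the options shown are simply those the pyarrow writer exposes; this is not the paper's benchmark setup.

```python
# Small sketch of two Parquet design knobs: dictionary encoding and optional
# block compression. The data and column names are invented.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "city": ["Oslo", "Bergen", "Oslo", "Oslo", "Bergen"],   # low-cardinality,
    "temp": [3.1, 4.5, 2.8, 3.0, 4.9],                      # dictionary-friendly
})

# Keep dictionary encoding on and disable block compression, trading file
# size for faster decoding on modern hardware.
pq.write_table(table, "readings.parquet",
               use_dictionary=True,      # dictionary-encode columns
               compression="none")       # skip block compression

print(pq.read_table("readings.parquet").to_pydict())
```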
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up work since its introduction. This article provides a comprehensive survey of the family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
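For readers unfamiliar with the model, the canonical word-count example below sketches the map, shuffle and reduce phases as a single-process Python simulation; real MapReduce frameworks run these phases across a cluster, with partitioning and fault tolerance handled by the runtime.

```python
# Canonical MapReduce word count, as a single-process simulation:
# map emits (word, 1) pairs, a shuffle groups pairs by key, and reduce sums
# the counts for each word.
from collections import defaultdict

def map_phase(document):
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Shuffle: group intermediate (key, value) pairs by key.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)   # {'the': 3, 'quick': 2, 'dog': 2, ...}
```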
Prototyping and Evaluation of Sensor Data Integration in Cloud Platforms
The SFI Smart Ocean centre has initiated a long-running project that consists of developing a wireless and autonomous marine observation system for monitoring underwater environments and structures. The increasing popularity of integrating the Internet of Things (IoT) with cloud computing has led to promising infrastructures that could realize Smart Ocean's goals. The project will utilize underwater wireless sensor networks (UWSNs) for collecting data in marine environments and develop a cloud-based platform for retrieving, processing, and storing all the sensor data. Currently, the project is in its early stages and the collaborating partners are researching approaches and technologies that can potentially be utilized. This thesis contributes to the centre's ongoing research, focusing on how sensor data can be integrated into three different cloud platforms: Microsoft Azure, Amazon Web Services, and the Google Cloud Platform. The goals were to develop prototypes that could successfully send data to the chosen cloud platforms and to evaluate their applicability in the context of the Smart Ocean project. In order to determine the most suitable option, each platform was evaluated based on a set of defined criteria, focusing on its sensor data integration capabilities. The thesis has also investigated the cloud platforms' supported protocol bindings, as well as several candidate technologies for metadata standards, and compared them in surveys. Our evaluation results show that all three cloud platforms handle sensor data integration in very similar ways, offering a set of cloud services relevant for creating diverse IoT solutions. However, the Google Cloud Platform ranks at the bottom due to the lack of IoT focus on its platform, with fewer service options, features, and capabilities compared to the other two. Both Microsoft Azure and Amazon Web Services rank very close to each other, as they provide many of the same sensor data integration capabilities, making them the most applicable options.
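As a rough sketch of how a single sensor reading might be pushed to a cloud platform over MQTT, one of the protocol bindings such platforms support, the example below uses the paho-mqtt convenience helper. The broker hostname, topic and payload fields are placeholders, and the platform-specific authentication each cloud provider requires (certificates, SAS tokens, TLS) is omitted; this is not taken from the thesis prototypes.

```python
# Minimal sketch of publishing one sensor reading to a cloud MQTT endpoint.
# Hostname, topic and payload are placeholders; real deployments on Azure
# IoT Hub, AWS IoT Core or Google Cloud add their own authentication.
import json
import time
import paho.mqtt.publish as publish

reading = {
    "sensor_id": "uwsn-node-01",       # hypothetical sensor node id
    "timestamp": time.time(),
    "temperature_c": 7.4,
    "salinity_psu": 34.9,
}

publish.single(
    topic="smartocean/sensors/uwsn-node-01",   # placeholder topic
    payload=json.dumps(reading),
    qos=1,
    hostname="broker.example.com",             # placeholder broker endpoint
    port=1883,
)
```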
The Multimodal Tutor: Adaptive Feedback from Multimodal Experiences
This doctoral thesis describes the journey of ideation, prototyping and empirical testing of the Multimodal Tutor, a system designed to provide digital feedback that supports psychomotor skills acquisition using learning and multimodal data capturing. The feedback is given in real time through machine-driven assessment of the learner's task execution. The predictions are tailored by supervised machine learning models trained with human-annotated samples. The main contributions of this thesis are: a literature survey on multimodal data for learning, a conceptual model (the Multimodal Learning Analytics Model), a technological framework (the Multimodal Pipeline), a data annotation tool (the Visual Inspection Tool) and a case study in Cardiopulmonary Resuscitation training (CPR Tutor). The CPR Tutor generates real-time, adaptive feedback using kinematic and myographic data and neural networks.
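As a toy illustration of the supervised-learning idea behind such feedback (not the CPR Tutor's actual models or data), the sketch below trains a small neural network on synthetic windows of concatenated kinematic and myographic features and classifies a new window, which in a running system would drive the adaptive feedback.

```python
# Toy sketch: a small neural network classifies fixed-length windows of
# kinematic + myographic features as correct/incorrect executions.
# Synthetic random data stands in for annotated training samples.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_windows, kin_dim, emg_dim = 200, 12, 8

# Each row: one time window of concatenated kinematic and myographic features.
X = np.hstack([rng.normal(size=(n_windows, kin_dim)),
               rng.normal(size=(n_windows, emg_dim))])
y = rng.integers(0, 2, size=n_windows)   # annotated labels (synthetic here)

model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)

# At run time, each incoming window is classified and the prediction
# drives the feedback shown to the learner.
new_window = rng.normal(size=(1, kin_dim + emg_dim))
print(model.predict(new_window))
```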