9 research outputs found
Efficient dictionary compression for processing RDF big data using Google BigQuery
The Resource Description Framework (RDF) data model is used on the Web to express billions of structured statements on a wide range of topics, including government, publications, and the life sciences. Consequently, processing and storing this data requires high-specification systems, both in terms of storage and computational capabilities. On the other hand, cloud-based big data services such as Google BigQuery can be used to store and query this data without any upfront investment. Google BigQuery pricing is based on the size of the data being stored or queried, but given that RDF statements contain long Uniform Resource Identifiers (URIs), the cost of querying and storing RDF big data can increase rapidly. In this paper we present and evaluate a novel and efficient dictionary compression algorithm that is faster, generates small dictionaries that can fit in memory, and achieves a better compression rate than other large-scale RDF dictionary compression approaches. Consequently, our algorithm also reduces the BigQuery storage and query cost.
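To make the underlying idea concrete, the sketch below shows plain dictionary encoding of RDF triples in Python: each long URI is replaced by a small integer identifier and stored once in a dictionary. This is a minimal illustration of the general technique, not the algorithm proposed in the paper; the example triples and the `encode_triples` helper are invented for illustration.

```python
# Minimal illustration of RDF dictionary encoding: long URIs are mapped to
# compact integer identifiers, so the triples table stores small integers
# while the (much smaller) dictionary stores each URI exactly once.
# This is NOT the paper's algorithm, only the general technique.

def encode_triples(triples):
    dictionary = {}          # URI/literal -> integer id
    encoded = []
    for s, p, o in triples:
        ids = []
        for term in (s, p, o):
            if term not in dictionary:
                dictionary[term] = len(dictionary)
            ids.append(dictionary[term])
        encoded.append(tuple(ids))
    return dictionary, encoded

triples = [
    ("http://example.org/person/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/person/bob"),
    ("http://example.org/person/bob", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/person/carol"),
]

dictionary, encoded = encode_triples(triples)
print(encoded)      # [(0, 1, 2), (2, 1, 3)] -- compact rows for storage/query
print(dictionary)   # URI -> id mapping, stored once alongside the data
```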
Investigating elastic cloud based RDF processing
The Semantic Web was proposed as an extension of the traditional Web that gives Web data context and meaning by using the Resource Description Framework (RDF) data model. The recent growth in the adoption of RDF, in addition to the massive growth of RDF data, has led numerous efforts to focus on the challenges of processing this data. To this end, many approaches have focused on vertical scalability by utilising powerful hardware, or on horizontal scalability by utilising always-on physical computer clusters or peer-to-peer networks. However, these approaches rely on fixed, high-specification computer clusters that require considerable upfront and ongoing investment to deal with the data growth. In recent years, cloud computing has seen wide adoption due to its unique elasticity and utility billing features.
This thesis addresses some of the issues related to the processing of large RDF datasets by utilising cloud computing. Initially, the thesis reviews the background literature of related distributed RDF processing work and issues, in particular distributed rule-based reasoning and dictionary encoding, followed by a review of the cloud computing paradigm and related literature. Then, in order to fully utilise features that are specific to cloud computing, such as elasticity, the thesis designs and fully implements a Cloud-based Task Execution framework (CloudEx), a generic framework for efficiently distributing and executing tasks on cloud environments. Subsequently, some of the large-scale RDF processing issues are addressed by using the CloudEx framework to develop algorithms for processing RDF using cloud computing. These algorithms perform efficient dictionary encoding and forward reasoning using cloud-based columnar databases, and are collectively implemented as an Elastic Cost Aware Reasoning Framework (ECARF), a cloud-based RDF triple store. This thesis presents original results and findings that advance the state of the art of distributed cloud-based RDF processing and forward reasoning.
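As a rough illustration of the forward reasoning the thesis refers to (not the ECARF implementation, which operates over cloud-based columnar databases rather than in-memory Python sets), the sketch below repeatedly applies the RDFS subclass rule until no new triples can be derived.

```python
# Minimal sketch of forward (materialisation-based) RDFS reasoning:
# repeatedly apply the rdfs9 rule
#   (?x rdf:type ?c1) and (?c1 rdfs:subClassOf ?c2)  =>  (?x rdf:type ?c2)
# until a fixpoint is reached. Illustration of the technique only.

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def forward_reason(triples):
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        # Index subclass axioms for quick lookup.
        subclass_of = {}
        for s, p, o in closure:
            if p == SUBCLASS:
                subclass_of.setdefault(s, set()).add(o)
        # Apply the rule to every type assertion currently known.
        for s, p, o in list(closure):
            if p == RDF_TYPE:
                for superclass in subclass_of.get(o, ()):
                    inferred = (s, RDF_TYPE, superclass)
                    if inferred not in closure:
                        closure.add(inferred)
                        changed = True
    return closure

triples = {
    ("ex:alice", RDF_TYPE, "ex:Student"),
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
}
print(forward_reason(triples))
# Derives (ex:alice rdf:type ex:Person) and (ex:alice rdf:type ex:Agent).
```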
An Empirical Evaluation of Columnar Storage Formats
Columnar storage is one of the core components of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support for open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed significantly.
In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions that are advantageous on modern hardware and with real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. Our analysis identifies important considerations that may guide future formats to better fit modern technology trends.
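To make the design knobs discussed above concrete, the following sketch writes a small Parquet file with pyarrow, enabling dictionary encoding and skipping block compression. The column names and data are invented, and the options shown are simply those the pyarrow writer exposes; this is not the paper's benchmark setup.

```python
# Small sketch of two Parquet design knobs: dictionary encoding and optional
# block compression. The data and column names are invented.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "city": ["Oslo", "Bergen", "Oslo", "Oslo", "Bergen"],   # low-cardinality,
    "temp": [3.1, 4.5, 2.8, 3.0, 4.9],                      # dictionary-friendly
})

# Keep dictionary encoding on and disable block compression, trading file
# size for faster decoding on modern hardware.
pq.write_table(table, "readings.parquet",
               use_dictionary=True,      # dictionary-encode columns
               compression="none")       # skip block compression

print(pq.read_table("readings.parquet").to_pydict())
```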
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up work since its introduction. This article provides a comprehensive survey of the family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
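For readers unfamiliar with the model, the canonical word-count example below sketches the map, shuffle and reduce phases as a single-process Python simulation; real MapReduce frameworks run these phases across a cluster, with partitioning and fault tolerance handled by the runtime.

```python
# Canonical MapReduce word count, as a single-process simulation:
# map emits (word, 1) pairs, a shuffle groups pairs by key, and reduce sums
# the counts for each word.
from collections import defaultdict

def map_phase(document):
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Shuffle: group intermediate (key, value) pairs by key.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)   # {'the': 3, 'quick': 2, 'dog': 2, ...}
```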
Prototyping and Evaluation of Sensor Data Integration in Cloud Platforms
The SFI Smart Ocean centre has initiated a long-running project that consists of developing a wireless and autonomous marine observation system for monitoring underwater environments and structures. The increasing popularity of integrating the Internet of Things (IoT) with cloud computing has led to promising infrastructures that could realize Smart Ocean's goals. The project will utilize underwater wireless sensor networks (UWSNs) for collecting data in marine environments and develop a cloud-based platform for retrieving, processing, and storing all the sensor data. Currently, the project is in its early stages and the collaborating partners are researching approaches and technologies that can potentially be utilized. This thesis contributes to the centre's ongoing research, focusing on how sensor data can be integrated into three different cloud platforms: Microsoft Azure, Amazon Web Services, and the Google Cloud Platform. The goals were to develop prototypes that could successfully send data to the chosen cloud platforms and to evaluate their applicability in the context of the Smart Ocean project. In order to determine the most suitable option, each platform was evaluated based on a set of defined criteria, focusing on its sensor data integration capabilities. The thesis has also investigated the cloud platforms' supported protocol bindings, as well as several candidate technologies for metadata standards, and compared them in surveys. Our evaluation results show that all three cloud platforms handle sensor data integration in very similar ways, offering a set of cloud services relevant for creating diverse IoT solutions. However, the Google Cloud Platform ranks at the bottom due to the lack of IoT focus on its platform, with fewer service options, features, and capabilities compared to the other two. Both Microsoft Azure and Amazon Web Services rank very close to each other, as they provide many of the same sensor data integration capabilities, making them the most applicable options.
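As a rough sketch of how a single sensor reading might be pushed to a cloud platform over MQTT, one of the protocol bindings such platforms support, the example below uses the paho-mqtt convenience helper. The broker hostname, topic and payload fields are placeholders, and the platform-specific authentication each cloud provider requires (certificates, SAS tokens, TLS) is omitted; this is not taken from the thesis prototypes.

```python
# Minimal sketch of publishing one sensor reading to a cloud MQTT endpoint.
# Hostname, topic and payload are placeholders; real deployments on Azure
# IoT Hub, AWS IoT Core or Google Cloud add their own authentication.
import json
import time
import paho.mqtt.publish as publish

reading = {
    "sensor_id": "uwsn-node-01",       # hypothetical sensor node id
    "timestamp": time.time(),
    "temperature_c": 7.4,
    "salinity_psu": 34.9,
}

publish.single(
    topic="smartocean/sensors/uwsn-node-01",   # placeholder topic
    payload=json.dumps(reading),
    qos=1,
    hostname="broker.example.com",             # placeholder broker endpoint
    port=1883,
)
```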
The Multimodal Tutor: Adaptive Feedback from Multimodal Experiences
This doctoral thesis describes the journey of ideation, prototyping and empirical testing of the Multimodal Tutor, a system designed to provide digital feedback that supports psychomotor skills acquisition using learning and multimodal data capturing. The feedback is given in real time through machine-driven assessment of the learner's task execution. The predictions are tailored by supervised machine learning models trained with human-annotated samples. The main contributions of this thesis are: a literature survey on multimodal data for learning, a conceptual model (the Multimodal Learning Analytics Model), a technological framework (the Multimodal Pipeline), a data annotation tool (the Visual Inspection Tool) and a case study in Cardiopulmonary Resuscitation training (CPR Tutor). The CPR Tutor generates real-time, adaptive feedback using kinematic and myographic data and neural networks.
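As a toy illustration of the supervised-learning idea behind such feedback (not the CPR Tutor's actual models or data), the sketch below trains a small neural network on synthetic windows of concatenated kinematic and myographic features and classifies a new window, which in a running system would drive the adaptive feedback.

```python
# Toy sketch: a small neural network classifies fixed-length windows of
# kinematic + myographic features as correct/incorrect executions.
# Synthetic random data stands in for annotated training samples.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_windows, kin_dim, emg_dim = 200, 12, 8

# Each row: one time window of concatenated kinematic and myographic features.
X = np.hstack([rng.normal(size=(n_windows, kin_dim)),
               rng.normal(size=(n_windows, emg_dim))])
y = rng.integers(0, 2, size=n_windows)   # annotated labels (synthetic here)

model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)

# At run time, each incoming window is classified and the prediction
# drives the feedback shown to the learner.
new_window = rng.normal(size=(1, kin_dim + emg_dim))
print(model.predict(new_window))
```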