6 research outputs found

    A systematic review of SQL-on-Hadoop by using compact data formats

    Article also submitted for publication in Baltic J. Modern Computing (BJMC) on October 5, 2016. There are huge volumes of raw data generated every day, and the question is how to store these data in order to provide faster access. Research on Big Data projects that use Hadoop technology, MapReduce-style frameworks, and compact data formats shows that two data formats, Avro and Parquet, support schema evolution and compression in order to use less storage space. In this paper, a systematic review of SQL-on-Hadoop using Avro and Parquet is performed over the past six years (2010–2015), based on conference proceedings and journal publications from IEEE Xplore, the ACM Digital Library, and ScienceDirect. Following the search strategy, 94 research papers were identified, of which 17 were analyzed as relevant. The review concludes that a direct comparison of Avro and Parquet in terms of compactness and query speed does not yet exist in the data science literature.
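    As a hedged illustration of the two properties the review highlights (schema evolution and compression), the sketch below writes records with Avro and Parquet and reads the Avro data back under an extended schema. It assumes the fastavro and pyarrow packages; all file names, record names, and fields are invented for the example.

```python
# Sketch: schema evolution with Avro and compressed columnar storage with Parquet.
# Assumes the fastavro and pyarrow packages; names and fields are illustrative only.
import fastavro
import pyarrow as pa
import pyarrow.parquet as pq

records = [{"id": 1, "city": "Vilnius"}, {"id": 2, "city": "Riga"}]

# Writer schema (version 1) and a later reader schema (version 2) that adds a
# field with a default value -- the kind of schema evolution Avro supports.
schema_v1 = fastavro.parse_schema({
    "type": "record", "name": "Visit",
    "fields": [{"name": "id", "type": "long"},
               {"name": "city", "type": "string"}],
})
schema_v2 = fastavro.parse_schema({
    "type": "record", "name": "Visit",
    "fields": [{"name": "id", "type": "long"},
               {"name": "city", "type": "string"},
               {"name": "country", "type": "string", "default": "unknown"}],
})

with open("visits.avro", "wb") as out:
    fastavro.writer(out, schema_v1, records, codec="deflate")  # compressed binary

with open("visits.avro", "rb") as inp:
    for rec in fastavro.reader(inp, reader_schema=schema_v2):
        print(rec)  # old rows gain country="unknown" via the schema default

# Parquet: columnar layout plus per-column compression.
table = pa.table({"id": [1, 2], "city": ["Vilnius", "Riga"]})
pq.write_table(table, "visits.parquet", compression="snappy")
```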

    A comparison of HDFS compact data formats: Avro versus Parquet

    In this paper, the Avro and Parquet file formats are compared with plain text formats to evaluate data query performance. Different data query patterns have been evaluated. Cloudera’s open-source Apache Hadoop distribution CDH 5.4 was chosen for the experiments presented in this article. The results show that the compact data formats (Avro and Parquet) take up less storage space than plain text formats because of their binary encoding and compression support. Furthermore, data queries against the column-based Parquet format are faster than against text formats or Avro. Article in English. Lithuanian summary (translated): A comparison of HDFS compact data formats: Avro versus Parquet. The article evaluates data query performance by comparing the Avro and Parquet file formats with a text file format. Various forms of data queries were used in the experiments, running on Cloudera’s open-source Apache Hadoop CDH 5.4 software. The results confirm that the compact data formats (Avro and Parquet) save storage space thanks to binary encoding and compression. Queries are shown to execute faster with Parquet than with Avro or text file formats. Keywords: big data, Hadoop, HDFS, Hive, Avro, Parquet.
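    A rough feel for these two effects (smaller files, cheaper single-column reads) can be had with a small local sketch. It is not the paper's CDH 5.4 setup; it assumes fastavro and pyarrow, and the data and file names are invented.

```python
# Sketch: comparing on-disk size and a column-projection read for text, Avro,
# and Parquet. Assumes fastavro and pyarrow; data and file names are illustrative.
import csv
import os
import fastavro
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

rows = [{"id": i, "value": float(i) * 0.5} for i in range(100_000)]

# Plain text (CSV).
with open("data.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["id", "value"])
    w.writeheader()
    w.writerows(rows)

# Avro (row-oriented binary, compressed).
schema = fastavro.parse_schema({
    "type": "record", "name": "Row",
    "fields": [{"name": "id", "type": "long"},
               {"name": "value", "type": "double"}],
})
with open("data.avro", "wb") as f:
    fastavro.writer(f, schema, rows, codec="deflate")

# Parquet (column-oriented binary, compressed).
table = pa.table({"id": [r["id"] for r in rows],
                  "value": [r["value"] for r in rows]})
pq.write_table(table, "data.parquet", compression="snappy")

for path in ("data.csv", "data.avro", "data.parquet"):
    print(path, os.path.getsize(path), "bytes")

# A query touching one column can read just that column from Parquet,
# which is where the columnar layout pays off.
values = pq.read_table("data.parquet", columns=["value"])["value"]
print("mean value:", pc.mean(values).as_py())
```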

    Operation of Modular Smart Grid Applications Interacting through a Distributed Middleware

    IoT functionality can broaden the scope of distribution system automation in terms of functionality and communication. However, it also poses risks regarding resource consumption and security. This article presents a field-approved IoT-enabled smart grid middleware, which allows for flexible deployment and management of applications within smart grid operation. In the first part of the work, the resource consumption of the middleware is analyzed and current memory bottlenecks are identified. The bottlenecks can be resolved by introducing a new entity that makes it possible to dynamically load multiple applications within one JVM. This approach was tested experimentally, and the results suggest that it can significantly reduce the applications' memory footprint on the physical device. The second part of the study identifies and discusses potential security threats, with a focus on attacks stemming from malicious software applications within the framework. To prevent such attacks, a proxy-based prevention mechanism is developed and demonstrated.
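    The proxy-based idea can be sketched conceptually: an application never calls the underlying grid services directly, only through a proxy that enforces a per-application permission list. The minimal Python sketch below is only an analogy of that pattern under stated assumptions; the actual middleware is JVM-based, and every class, method, and permission name here is hypothetical.

```python
# Conceptual sketch of a proxy-based prevention mechanism: applications call
# grid services only through a proxy that checks per-application permissions.
# The real middleware is JVM-based; every name here is hypothetical.

class GridService:
    """Stand-in for a privileged smart grid operation interface."""

    def read_measurements(self, feeder: str) -> list:
        return [230.1, 229.8]  # dummy voltage samples

    def open_breaker(self, breaker_id: str) -> None:
        print(f"breaker {breaker_id} opened")


class ServiceProxy:
    """Forwards only whitelisted operations; blocks everything else."""

    def __init__(self, service: GridService, allowed: set):
        self._service = service
        self._allowed = allowed

    def __getattr__(self, name):
        if name not in self._allowed:
            raise PermissionError(f"operation '{name}' not permitted for this app")
        return getattr(self._service, name)


# A monitoring app may read measurements but must not switch breakers.
proxy = ServiceProxy(GridService(), allowed={"read_measurements"})
print(proxy.read_measurements("feeder-7"))   # allowed
try:
    proxy.open_breaker("B-42")               # blocked by the proxy
except PermissionError as err:
    print("blocked:", err)
```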

    Time Series Management Systems: A Survey

    The collection of time series data increases as more monitoring and automation are being deployed. These deployments range in scale from an Internet of Things (IoT) device located in a household to enormous distributed Cyber-Physical Systems (CPSs) producing large volumes of data at high velocity. To store and analyze these vast amounts of data, specialized Time Series Management Systems (TSMSs) have been developed to overcome the limitations of general-purpose Database Management Systems (DBMSs) for time series management. In this paper, we present a thorough analysis and classification of TSMSs developed through academic or industrial research and documented through publications. Our classification is organized into categories based on the architectures observed during our analysis. In addition, we provide an overview of each system with a focus on the motivational use case that drove its development, the functionality it implements for storing and querying time series, the components it is composed of, and its capabilities with regard to stream processing and Approximate Query Processing (AQP). Lastly, we provide a summary of research directions proposed by other researchers in the field and present our vision for a next-generation TSMS. (20 pages, 15 figures, 2 tables; accepted for publication in IEEE TKD.)
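    To make the AQP notion concrete, the following is a small sketch, in plain Python with invented data, of the general trade-off such systems expose: answering an aggregate query from a coarse pre-aggregated summary instead of the raw points, trading some accuracy for a much cheaper scan. Real TSMSs use far more elaborate models than a fixed-size bucket mean.

```python
# Sketch of approximate query processing (AQP) on a time series: answer an
# average query from a coarse pre-aggregated summary instead of raw points.
# Pure Python with invented data; illustrative only.
import random

random.seed(0)
raw = [20.0 + random.gauss(0, 0.5) for _ in range(86_400)]  # 1 day at 1 Hz

BUCKET = 600  # pre-aggregate into 10-minute means
summary = [sum(raw[i:i + BUCKET]) / BUCKET for i in range(0, len(raw), BUCKET)]

def avg_exact(start: int, end: int) -> float:
    """Scan every raw sample in [start, end)."""
    return sum(raw[start:end]) / (end - start)

def avg_approx(start: int, end: int) -> float:
    """Use only the buckets overlapping [start, end) -- cheaper, approximate."""
    first, last = start // BUCKET, (end - 1) // BUCKET
    buckets = summary[first:last + 1]
    return sum(buckets) / len(buckets)

print("exact :", avg_exact(3_000, 50_000))
print("approx:", avg_approx(3_000, 50_000))
```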

    Model-Based Time Series Management at Scale


    On the Importance of Infrastructure-Awareness in Large-Scale Distributed Storage Systems

    Big data applications put significant latency and throughput demands on distributed storage systems. Meeting these demands requires storage systems to use a significant amount of infrastructure resources, such as network capacity and storage devices. Resource demands largely depend on the workloads and can vary significantly over time; moreover, demand hotspots can move rapidly between different infrastructure locations. Existing storage systems are largely infrastructure-oblivious, as they are designed to support a broad range of hardware and deployment scenarios. Most only use basic configuration information about the infrastructure to make important placement and routing decisions. In the case of cloud-based storage systems, cloud services have their own infrastructure-specific limitations, such as minimum request sizes and maximum numbers of concurrent requests. By ignoring infrastructure-specific details, these storage systems are unable to react to changes in resource demand and may incur additional inefficiencies from performing redundant network operations. As a result, provisioning enough resources for these systems to address all possible workloads and scenarios would be cost-prohibitive.

    This thesis studies the performance problems in commonly used distributed storage systems and introduces novel infrastructure-aware design methods to improve their performance. First, it addresses the problem of slow reads due to network congestion induced by disjoint replica and path selection: selecting a read replica separately from the network path can perform poorly if all paths to the pre-selected endpoints are congested. Second, the thesis examines scalability limitations of consensus protocols that are commonly used in geo-distributed key-value stores and distributed ledgers. Owing to their network-oblivious designs, existing protocols communicate redundantly over highly oversubscribed WAN links, which uses network resources poorly and limits consistent replication at large scale. Finally, the thesis addresses the need for a cloud-specific real-time storage system for capital market use cases. Public cloud infrastructures provide feature-rich and cost-effective storage services, but existing real-time time series databases are not built to take advantage of cloud storage services and therefore do not use them effectively to provide high performance while minimizing deployment cost.

    This thesis presents three systems that address these problems by using infrastructure-aware design methods. Our performance evaluation of these systems shows that infrastructure-aware design is highly effective in improving the performance of large-scale distributed storage systems.
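    The first problem can be illustrated with a toy example: if the replica is fixed before path conditions are considered, a read may be forced onto a congested path even though another replica is reachable over an idle one. The short Python sketch below contrasts the two policies; all replica names, paths, and latency estimates are invented for illustration and are not taken from the thesis.

```python
# Toy illustration of disjoint vs. joint replica/path selection for a read.
# All replicas, paths, and latency estimates are invented for the example.

# Estimated latency (ms) for each (replica, network path) pair.
latency = {
    ("replica-A", "path-1"): 120.0,   # paths toward replica-A are congested
    ("replica-A", "path-2"): 110.0,
    ("replica-B", "path-1"): 95.0,
    ("replica-B", "path-2"): 8.0,     # an idle path toward replica-B
}
replica_load = {"replica-A": 0.2, "replica-B": 0.4}  # lower looks "better" alone

# Disjoint selection: pick the replica first (by load), then the best path to it.
replica = min(replica_load, key=replica_load.get)
path = min((p for r, p in latency if r == replica),
           key=lambda p: latency[(replica, p)])
print("disjoint:", replica, path, latency[(replica, path)], "ms")

# Joint, infrastructure-aware selection: pick the replica and path together.
(replica, path), cost = min(latency.items(), key=lambda kv: kv[1])
print("joint   :", replica, path, cost, "ms")
```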