
    Network Challenges of Novel Sources of Big Data

    Networks and networking technologies are key components of Big Data systems. Modern and future wireless sensor networks (WSNs) act as one of the major sources of data for Big Data systems, and wireless networking technologies allow the traffic generated by WSNs to be offloaded to Internet access points for further delivery to cloud storage systems. In this thesis we concentrate on a detailed analysis of two networking aspects of future Big Data systems: (i) efficient data collection algorithms in WSNs and (ii) wireless data delivery to Internet access points. The performance evaluation and optimization models developed in the thesis are based on probability theory, the theory of stochastic processes, Markov chain theory, stochastic and integral geometry, and queuing theory.

    The introductory part discusses the major components of Big Data systems, identifies networking aspects as the subject of interest, and formulates the tasks for the thesis. Different challenges of Big Data systems are then presented in detail, with several competing architectures highlighted. After that, we investigate data collection approaches in modern and future WSNs and support the feasibility of the proposed techniques with the associated performance evaluation results. We also study the delivery of collected data to the Internet backbone access point and demonstrate that the capacity of conventional cellular systems may not be sufficient for a set of WSN applications, including both video monitoring at the macro scale and sensor data delivery from the nano/micro scales. Seeking a wireless technology for data offloading from WSNs, we study the millimeter wave and terahertz bands and show that the interference structure and signal propagation there are fundamentally different, owing to the required use of highly directional antennas, human-body blockage, and molecular absorption. Finally, to characterize the transmission of collected data from a number of WSNs over millimeter wave or terahertz backhauls, we formulate and solve a queuing system with multiple autocorrelated inputs and a service distribution corresponding to the transmission time over a wireless channel, with the hybrid automatic repeat request (HARQ) mechanism taken into account.
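    The flavor of the final modeling step can be conveyed with a toy simulation. The Python sketch below simulates a discrete-time single-server queue fed by a two-state Markov-modulated Bernoulli arrival stream (an autocorrelated input) and served with a HARQ-style geometric service time (each slot, the head-of-line packet succeeds with a fixed probability, else it is retransmitted). All parameter values here are illustrative assumptions, not figures from the thesis, and the model is far simpler than the multi-input system the thesis solves analytically.

```python
import random

def simulate_harq_queue(slots=50_000, p_succ=0.8, seed=1):
    """Discrete-time single-server queue with a two-state Markov-modulated
    Bernoulli arrival process (autocorrelated input) and HARQ-style service:
    in each slot the head-of-line packet's transmission succeeds with
    probability p_succ, otherwise it is retransmitted next slot."""
    rng = random.Random(seed)
    arr_prob = {0: 0.05, 1: 0.45}  # per-slot arrival prob: quiet vs. bursty state
    stay = 0.95                    # prob of staying in the current modulating state
    state, queue, area = 0, 0, 0
    for _ in range(slots):
        if rng.random() > stay:              # modulating Markov chain transition
            state = 1 - state
        if rng.random() < arr_prob[state]:   # autocorrelated arrival
            queue += 1
        if queue and rng.random() < p_succ:  # HARQ attempt succeeds -> departure
            queue -= 1
        area += queue
    return area / slots                      # time-averaged queue length

print(simulate_harq_queue())
```

    Because the modulating chain changes state rarely (stay = 0.95), arrivals cluster into bursts, which inflates the average queue length relative to an independent arrival stream with the same mean rate.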

    Understanding Spark System Performance for Image Processing in a Heterogeneous Commodity Cluster

    In recent years, Apache Spark has seen widespread adoption in industry and academia due to its caching mechanism for faster Big Data analytics. However, the speed advantage Spark provides, especially in a heterogeneous cluster environment, is not obtainable out of the box; it requires the right combination of configuration parameters from the myriad of parameters provided by the Spark developers. Recognizing this challenge, this thesis undertakes a study to provide insight into Spark performance, particularly the impact of choice parameter settings: those parameters that are critical to fast job completion and effective utilization of resources. To this end, the study focuses on two example applications, flowerCounter and imageClustering, which process still-image datasets of Canola plants collected during the summer of 2016 from selected plot fields using timelapse cameras, in a heterogeneous Spark cluster environment. These applications were of initial interest to the Plant Phenotyping and Imaging Research Centre (P2IRC) at the University of Saskatchewan, which is responsible for developing systems that will aid fast analysis of large-scale seed breeding to help ensure global food security. The flowerCounter application estimates the count of flowers in the images, while the imageClustering application clusters images based on physical plant attributes. Two clusters are used for the experiments: a 12-node cluster and a 3-node cluster (including a master node), with the Hadoop Distributed File System (HDFS) as the storage medium for the image datasets.

    Experiments with the two case-study applications demonstrate that increasing the number of tasks does not always speed up job processing, owing to increased communication overhead. Findings from other experiments show that numerous tasks with one core per executor and a small memory allocation limit parallelism within an executor and result in inefficient use of cluster resources. Executors with large CPU and memory allocations, on the other hand, do not speed up analytics either, due to processing delays and thread concurrency. Further experimental results indicate that application processing time depends on where the input data is stored in conjunction with locality levels, and that executor run time is largely dominated by disk I/O time, especially the read-time cost. With respect to horizontal node scaling, Spark scales with an increasing number of homogeneous computing nodes, but the speed-up degrades with heterogeneous nodes. Finally, this study shows that the effectiveness of speculative task execution in mitigating the impact of slow nodes varies across the applications.
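    The relationship between per-executor core allocation and cluster parallelism described above can be illustrated with a back-of-the-envelope helper. The Python sketch below is illustrative only: the executor, core, and task counts are hypothetical (11 workers loosely matches a 12-node cluster minus its master) and are not results from the thesis.

```python
def executor_parallelism(num_executors, cores_per_executor, task_count):
    """Return the maximum number of tasks the cluster can run concurrently
    (one Spark task per executor core) and the number of sequential
    'waves' of tasks needed to finish the whole stage."""
    slots = num_executors * cores_per_executor
    waves = -(-task_count // slots)  # ceiling division
    return slots, waves

# One core per executor: 11 workers run only 11 tasks at a time,
# so 88 tasks take 8 waves; 4 cores per executor quadruples concurrency.
print(executor_parallelism(11, 1, 88))  # (11, 8)
print(executor_parallelism(11, 4, 88))  # (44, 2)
```

    This simple slot count ignores the memory side of the trade-off the abstract highlights: each executor's heap is shared by its concurrent tasks, so adding cores without adding memory can reintroduce the inefficiency it removes.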