    HadoopSec: Sensitivity-aware Secure Data Placement Strategy for Big Data/Hadoop Platform using Prescriptive Analytics

    Hadoop has become one of the key players in offering data analytics and data processing support for any organization that handles different shades of data management. Considering the current security offerings of Hadoop, companies are concerned about building a single large cluster and onboarding multiple projects onto the same common Hadoop cluster. Security vulnerabilities and privacy invasion due to malicious attackers or inner users are the main points of contention in any Hadoop implementation. In particular, various types of security vulnerability arise from the way data is placed in a Hadoop cluster. When sensitive information is accessed by an unauthorized user, or misused by an authorized one, privacy can be compromised. In this paper, we address the placement of data across distributed DataNodes in a secure way by considering the sensitivity and security of the underlying data. Our data placement strategy aims to adaptively distribute the data across the cluster using advanced machine learning techniques to realize a more secure data infrastructure. The data placement strategy discussed in this paper is highly extensible and scalable to suit different sorts of sensitivity/security requirements.
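
    The abstract does not spell out the placement algorithm itself, so the following is only a hypothetical Python sketch of the general idea, not the authors' design: blocks scored as highly sensitive are restricted to a hardened subset of DataNodes, with replicas spread across racks. The names, the 0.7 threshold, and the "hardened" flag are all illustrative assumptions.

        from dataclasses import dataclass

        @dataclass
        class DataNode:
            name: str
            rack: str
            hardened: bool  # assumed flag: node meets extra security controls

        def place_replicas(sensitivity, nodes, replication=3):
            # Above the (assumed) threshold, only hardened nodes are eligible.
            candidates = [n for n in nodes if n.hardened] if sensitivity > 0.7 else list(nodes)
            chosen, racks_used = [], set()
            # First pass: at most one replica per rack, for failure isolation.
            for node in candidates:
                if len(chosen) < replication and node.rack not in racks_used:
                    chosen.append(node)
                    racks_used.add(node.rack)
            # Second pass: fill any remaining slots regardless of rack.
            for node in candidates:
                if len(chosen) == replication:
                    break
                if node not in chosen:
                    chosen.append(node)
            return chosen

        nodes = [DataNode("dn1", "r1", True), DataNode("dn2", "r1", False),
                 DataNode("dn3", "r2", True), DataNode("dn4", "r3", False)]
        print([n.name for n in place_replicas(0.9, nodes)])  # ['dn1', 'dn3']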

    Enabling Distributed Applications Optimization in Cloud Environment

    The past few years have seen dramatic growth in the popularity of public clouds, such as Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Container-as-a-Service (CaaS). In both commercial and scientific fields, quick environment setup and application deployment have become mandatory requirements. As a result, more and more organizations choose cloud environments instead of setting up environments themselves from scratch. Cloud computing resources such as server engines, orchestration, and the underlying server resources are served to users as a service by a cloud provider. Most applications that run in public clouds are distributed applications, also called multi-tier applications, which require a set of servers, a service ensemble, that cooperate and communicate to jointly provide a certain service or accomplish a task. However, few research efforts have been conducted on providing an overall solution for distributed-application optimization in the public cloud. In this dissertation, we present three systems that enable distributed-application optimization: (1) the first part introduces DocMan, a toolset for detecting containerized applications' dependencies in CaaS clouds; (2) the second part introduces a system to deal with hot/cold blocks in distributed applications; (3) the third part introduces FP4S, a novel fragment-based parallel state recovery mechanism that can handle many simultaneous failures for a large number of concurrently running stream applications.
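
    The abstract does not describe how hot and cold blocks are distinguished; one common approach, sketched below purely for illustration (the decay half-life and threshold are invented, not taken from the dissertation), is to keep an exponentially decayed access count per block.

        import time

        DECAY_HALF_LIFE_S = 3600.0  # assumed: a block's score halves per idle hour
        HOT_THRESHOLD = 5.0         # assumed score above which a block is "hot"

        class BlockStats:
            def __init__(self):
                self.score = 0.0
                self.last_access = time.time()

            def _decayed(self, now):
                elapsed = now - self.last_access
                return self.score * 0.5 ** (elapsed / DECAY_HALF_LIFE_S)

            def record_access(self, now=None):
                now = time.time() if now is None else now
                # Decay the old count, then add this access.
                self.score = self._decayed(now) + 1.0
                self.last_access = now

            def is_hot(self, now=None):
                now = time.time() if now is None else now
                return self._decayed(now) >= HOT_THRESHOLD

        b = BlockStats()
        for _ in range(6):
            b.record_access()
        print(b.is_hot())  # True: six accesses in quick succession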

    Nomadic fog storage

    Mobile services increasingly demand more processing and storage. However, mobile devices are known for their constraints in terms of processing, storage, and energy. Early proposals addressed these constraints by having mobile devices access remote clouds, but such proposals suffer from long latencies and backhaul bandwidth limitations when retrieving data. To mitigate these issues, edge clouds have been proposed. Under this paradigm, intermediate nodes are placed between the mobile devices and the remote cloud. These intermediate nodes should fulfill the end users' resource requests, namely data and processing capability, and reduce the energy consumption on the mobile devices' batteries. Meanwhile, mobile traffic demand is increasing exponentially, and the resources available on mobile devices are evolving faster than ever. This motivates using the mobile nodes' spare capabilities to fulfill the requirements imposed by new mobile applications. In this new scenario, mobile devices should become both consumers and providers of the emerging services. This work investigates that possibility by designing, implementing, and testing a novel nomadic fog storage system that uses fog and mobile nodes to support the upcoming applications. In addition, a novel resource allocation algorithm has been developed that considers the available energy on mobile devices and the network topology. The system also includes a replica management module based on data popularity. The comprehensive evaluation of the fog proposal shows that it is responsive, offloads traffic from the backhaul links, and enables fair energy depletion among mobile nodes by storing content in neighbor nodes with higher battery autonomy.
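
    The thesis names the inputs to its resource allocation (battery level, network topology, content popularity) but the abstract gives no formula; the sketch below is a hypothetical scoring scheme in that spirit, with the weight alpha and the replica rule chosen purely for illustration.

        def score_node(battery_frac, hop_count, alpha=0.7):
            # Higher is better: favor charged nodes that are few hops away.
            # alpha weights energy against proximity (assumed value).
            return alpha * battery_frac + (1 - alpha) / (1 + hop_count)

        def pick_storage_node(nodes):
            # nodes: list of (name, battery_frac, hop_count) tuples
            return max(nodes, key=lambda n: score_node(n[1], n[2]))[0]

        def replica_count(popularity, max_replicas=4):
            # Popularity-driven replication: more requests, more replicas.
            return min(max_replicas, 1 + int(popularity * (max_replicas - 1)))

        nodes = [("phone-a", 0.9, 2), ("phone-b", 0.4, 1), ("fog-gw", 0.8, 1)]
        print(pick_storage_node(nodes))       # phone-a: best energy/proximity mix
        print(replica_count(popularity=0.8))  # 3 replicas for popular content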

    RackBlox: A Software-Defined Rack-Scale Storage System with Network-Storage Co-Design

    Software-defined networking (SDN) and software-defined flash (SDF) have been serving as the backbone of modern data centers. They are managed separately to handle I/O requests. At first glance, this is a reasonable design that follows rack-scale hierarchical design principles. However, it suffers from suboptimal end-to-end performance due to the lack of coordination between SDN and SDF. In this paper, we co-design the SDN and SDF stacks by redefining the functions of their control and data planes and splitting them up within a new architecture named RackBlox. RackBlox decouples the storage management functions of flash-based solid-state drives (SSDs) and allows the SDN to track and manage the states of SSDs in a rack. Therefore, we can enable state sharing between SDN and SDF and facilitate global storage resource management. RackBlox has three major components: (1) coordinated I/O scheduling, in which it dynamically adjusts I/O scheduling in the storage stack using measured and predicted network latency, so that it can coordinate I/O scheduling across the network and storage stacks to achieve predictable end-to-end performance; (2) coordinated garbage collection (GC), in which it coordinates GC activities across the SSDs in a rack to minimize their impact on incoming I/O requests; (3) rack-scale wear leveling, in which it enables global wear leveling among SSDs in a rack by periodically swapping data, achieving improved device lifetime for the entire rack. We implement RackBlox using programmable SSDs and a programmable switch. Our experiments demonstrate that RackBlox can reduce the tail latency of I/O requests by up to 5.8x over state-of-the-art rack-scale storage systems. Comment: 14 pages. Published in the ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP'23).
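
    As a rough illustration of the coordinated-scheduling idea (not RackBlox's actual code), the sketch below routes a read to the replica with the lowest estimated end-to-end latency, combining the network latency tracked by the SDN with device-side state such as an in-progress GC. The constants and field names are assumptions.

        from dataclasses import dataclass

        GC_PENALTY_US = 2000  # assumed cost of hitting an SSD mid-GC, in microseconds

        @dataclass
        class Replica:
            ssd: str
            net_latency_us: float  # measured/predicted by the SDN control plane
            ssd_busy_us: float     # queued work already on the device
            gc_active: bool        # shared SDF state: device is garbage collecting

        def estimate_us(r: Replica) -> float:
            # End-to-end estimate: network delay plus device-side delays.
            return r.net_latency_us + r.ssd_busy_us + (GC_PENALTY_US if r.gc_active else 0)

        def route_read(replicas):
            return min(replicas, key=estimate_us)

        reps = [Replica("ssd-0", 80, 50, True), Replica("ssd-1", 150, 40, False)]
        print(route_read(reps).ssd)  # ssd-1: slower network, but no GC in the way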

    GreenHDFS: data-centric and cyber-physical energy management system for big data clouds

    The explosion in Big Data has led to a rapid increase in the popularity of Big Data analytics. With the increase in the sheer volume of data that needs to be stored and processed, the storage and computing demands of Big Data analytics workloads are growing exponentially, leading to a surge in extremely large-scale Big Data cloud platforms and to burgeoning energy costs and environmental impact. The sheer size of Big Data lends it significant data movement inertia, and that, coupled with the network bandwidth constraints inherent in the cloud's cost-efficient, scale-out economic paradigm, makes data-locality a necessity for high performance in Big Data environments. Instead of sending data to the computations, as has been the norm, computations are sent to the data to take advantage of higher data-local performance. State-of-the-art run-time energy management techniques are job-centric in nature and rely on thermal- and energy-aware job placement, job consolidation, or job migration to derive energy cost savings. Unfortunately, the data-locality requirement of this compute model limits their applicability: these techniques are inherently data-placement-agnostic and provide energy savings only at significant performance cost in the Big Data environment.

    Big Data analytics clusters have moved away from the shared network-attached storage (NAS) or storage area network (SAN) model to a completely clustered, commodity storage model that allows a direct access path between storage servers and clients in the interest of high scalability and performance. The underlying storage system distributes file chunks and replicas across the servers for high performance, load-balancing, and resiliency. However, with files distributed across all servers, any server may be participating in the reading, writing, or computation of a file chunk at any time. Such a storage model complicates scale-down-based power management by making it hard to generate significant periods of idleness in Big Data analytics clusters.

    GreenHDFS is based on the observation that data needs to be a first-class object in energy management in Big Data environments to allow high data access performance. GreenHDFS takes a novel data-centric, cyber-physical approach to reduce compute (i.e., server) and cooling operating energy costs. On the physical side, GreenHDFS is cognizant that all servers are not alike in the Big Data analytics cloud and is aware of the variations in the thermal profiles of the servers. On the cyber side, GreenHDFS is aware that all data is not alike and knows the differences in the data semantics (i.e., computational job arrival rate, size, popularity, and evolution life spans) of the data placed in the Big Data analytics cloud. Armed with this cyber-physical knowledge, and coupled with its insights, predictive data models, and run-time information, GreenHDFS performs proactive, cyber-physical, thermal- and energy-aware file placement and data-classification-driven scale-down, which implicitly results in thermal- and energy-aware job placement in the Big Data analytics cloud compute model. GreenHDFS's data-centric energy and thermal management reduces energy costs without any associated performance impact, allows scale-down of a subset of servers in spite of the unique challenges the Big Data analytics cloud poses to scale-down, and ensures the thermal reliability of the servers in the cluster.

    GreenHDFS evaluation results with one-month-long real-world traces from a production Big Data analytics cluster at Yahoo! show up to a 59% reduction in cooling energy costs while performing 9x better than state-of-the-art data-agnostic cooling techniques, up to a 26% reduction in server operating energy costs, and a significant reduction in the total cost of ownership (TCO) of the Big Data analytics cluster. GreenHDFS provides a software-based mechanism to increase energy-proportionality even with non-energy-proportional server components.

    Free cooling, or air- and water-side economization (i.e., using outside air or natural water resources to cool the data center), is gaining popularity as it can yield significant cooling energy cost savings. There is also a drive toward increasing the cooling set point of cooling systems to make them more efficient. If the ambient temperature of the outside air or the cooling set point temperature is high, the inlet temperatures of the servers rise, which reduces their ability to dissipate computational heat and results in an increase in server temperatures. Servers are rated to operate safely only within a certain temperature range, beyond which failure rates increase. GreenHDFS considers the differences in the thermal-reliability-driven load-tolerance upper bounds of the servers in its predictive thermal-aware file placement and places file chunks in a manner that ensures that server temperatures do not exceed their upper bounds. Thus, by ensuring thermal reliability at all times and by lowering the overall temperature of the servers, GreenHDFS enables data centers to enjoy the energy-saving economizer mode for longer periods of time and also enables an increase in the cooling set point.

    A substantial number of data centers still rely fully on traditional air conditioning. These data centers cannot always be retrofitted with economizer modes or hot- and cold-aisle air containment, as incorporating an economizer and air containment may require space for ductwork and heat exchangers that may not be available in the data center. Existing data centers may also not be favorably located geographically; air-side economization is more viable in locations where ambient air temperatures are low for most of the year and humidity is in the tolerable range. GreenHDFS provides a software-based approach to enhance the cooling efficiency of such traditional data centers, as it lowers the overall temperature in the cluster, makes the thermal profile much more uniform, and reduces hot-air recirculation, resulting in lowered cooling energy costs.
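
    GreenHDFS's predictive models are not reproduced in the abstract; the following is a minimal hypothetical sketch of the thermal-aware placement constraint it describes: never place a chunk on a server whose predicted temperature would exceed its reliability-driven upper bound, and otherwise prefer the server with the most thermal headroom. All field names and values are illustrative.

        def thermal_headroom(server):
            # Difference between the server's reliability-driven temperature
            # upper bound and its predicted temperature if it takes the chunk.
            return server["temp_limit_c"] - server["predicted_temp_c"]

        def place_chunk(servers):
            eligible = [s for s in servers if thermal_headroom(s) > 0]
            if not eligible:
                raise RuntimeError("no server can take the chunk within its bound")
            return max(eligible, key=thermal_headroom)["name"]

        servers = [
            {"name": "s1", "predicted_temp_c": 34.0, "temp_limit_c": 35.0},
            {"name": "s2", "predicted_temp_c": 28.0, "temp_limit_c": 35.0},
        ]
        print(place_chunk(servers))  # s2: most headroom, keeps the profile uniform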

    Towards High-Performance Big Data Processing Systems

    The amount of data generated and stored has been growing rapidly. It is estimated that 2.5 quintillion bytes of data are generated every day, and that 90% of the data in the world today was created in the last two years. How to solve these big data issues has become a hot topic in both industry and academia. Due to the complexity of the big data platform, we stratify it into four layers: the storage layer, resource management layer, computing layer, and methodology layer. This dissertation proposes brand-new approaches to address the performance of big data platforms like Hadoop and Spark at these four layers. We first present an improved HDFS design called SMARTH, which optimizes the storage layer. It utilizes asynchronous multi-pipeline data transfers instead of a single-pipeline stop-and-wait mechanism. SMARTH records the actual transfer speed of data blocks and sends this information to the namenode along with periodic heartbeat messages. The namenode sorts datanodes according to their past performance and tracks this information continuously. When a client initiates an upload request, the namenode sends it a list of "high performance" datanodes that it thinks will yield the highest throughput for the client. By choosing higher-performance datanodes relative to each client and by taking advantage of the multi-pipeline design, our experiments show that SMARTH significantly improves the performance of data write operations compared to HDFS. Specifically, SMARTH improves data transfer throughput by 27-245% in a heterogeneous virtual cluster on Amazon EC2. Secondly, we propose an optimized Hadoop extension called MRapid, which significantly speeds up the execution of short jobs at the resource management layer. It is completely backward compatible with Hadoop and imposes negligible overhead. Our experiments on the Microsoft Azure public cloud show that MRapid can improve performance by up to 88% compared to the original Hadoop. Thirdly, we introduce an efficient 3-level sampling performance model, called Hedgehog, which focuses on the relationship between resources and performance. This design is a brand-new white-box model for Spark, which is more complex and challenging than Hadoop. In our tool, we employ ASM, a Java bytecode manipulation and analysis framework, to reduce the profiling overhead dramatically. Fourthly, at the computing layer, we optimize the current implementation of SGD in Spark's MLlib by reusing each data partition multiple times within a single iteration to find better candidate weights in a more efficient way. Whether to use multiple local iterations within each partition is decided dynamically by the 68-95-99.7 rule. We also design a variant of the momentum algorithm to optimize the step size in every iteration, using a new adaptive rule that decreases the step size whenever neighboring gradients show significantly differing directions. Experiments show that our adaptive algorithm is more efficient and can be 7 times faster than the original MLlib SGD. Finally, at the application layer, we present a scalable and distributed geographic information system, called Dart, based on Hadoop and HBase. Dart provides a hybrid table schema to store spatial data in HBase so that the Reduce process can be omitted for operations like calculating the mean center and the median center. It employs reasonable pre-splitting and hash techniques to avoid data imbalance and hot-region problems. It also supports massive spatial data analysis such as K-Nearest Neighbors (KNN) and Geometric Median Distribution. In our experiments, we evaluate the performance of Dart by processing 160 GB of Twitter data on an Amazon EC2 cluster. The experimental results show that Dart is very scalable and efficient.
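
    As a toy illustration of SMARTH's datanode ranking (the dissertation's exact bookkeeping is not given in the abstract), the sketch below keeps a running estimate of each datanode's reported transfer speed and hands a client the fastest nodes first. The use of an exponentially weighted moving average and its smoothing weight are assumptions.

        class ThroughputTracker:
            def __init__(self, smoothing=0.3):
                self.smoothing = smoothing  # EWMA weight (assumed)
                self.speed = {}             # datanode -> MB/s estimate

            def report(self, datanode, mb_per_s):
                # Called when a heartbeat carries a measured transfer speed.
                old = self.speed.get(datanode, mb_per_s)
                self.speed[datanode] = (1 - self.smoothing) * old + self.smoothing * mb_per_s

            def pick_pipeline(self, k=3):
                # Return the k datanodes with the best speed estimates.
                ranked = sorted(self.speed, key=self.speed.get, reverse=True)
                return ranked[:k]

        t = ThroughputTracker()
        for dn, s in [("dn1", 40), ("dn2", 120), ("dn3", 90), ("dn2", 100)]:
            t.report(dn, s)
        print(t.pick_pipeline())  # ['dn2', 'dn3', 'dn1']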

    Improving capacity-performance tradeoffs in the storage tier

    Data-set sizes are growing, and new techniques are emerging to organize and analyze these data-sets. A key access pattern is emerging with these new techniques: large sequential file accesses. The trend toward bigger files exists to help amortize the cost of data accesses from the storage layer, as many workloads are recognized to be I/O bound and the storage layer is widely recognized as the slowest layer in the system. This work focuses on the tradeoff one can make with storage capacity to improve system performance.

    Capacity can be leveraged for improved availability or improved performance. This tradeoff is key in the storage layer, as it allows for data loss prevention and bandwidth aggregation. Typically these tradeoffs do not allow much choice in how capacity is used. This work leverages replication as the enabling mechanism to improve the capacity-performance tradeoff in the storage tier, while still providing for availability.

    This capacity-performance tradeoff can be made at both the local and the distributed file system level. I propose two techniques that allow for an improved use of capacity. The local file system can be employed on scale-out or scale-up infrastructures to improve performance. The distributed file system is targeted at distributed frameworks, such as MapReduce, to improve cluster performance. The local file system design is MorphStore, and the distributed file system is BoostDFS.

    MorphStore is a file system that significantly improves performance when accessing large files by using two innovations. MorphStore combines (a) load-adaptive I/O access scheduling to dynamically optimize throughput (aggregation), and (b) utility-driven replication to best use capacity for performance. Additionally, adaptive access scheduling can be utilized to optimize the scheduling of requests (for throughput) on systems with a large number of storage devices. Replication is utilized to make high-utility files available and then optimize the throughput of these high-utility files based on system load.

    BoostDFS is a distributed file system that allows a better capacity-performance tradeoff via inter-node file replication. BoostDFS is built on the observation that distributed file systems currently use inter-node replication for availability but provide no mechanism to further improve performance. Replication for availability provides diminishing returns on performance due to saturation of locality. BoostDFS exploits the common case of data-local tasks by improving their I/O performance. This is done via intra-node replication, leveraging MorphStore as the local file system. This technique allows capacity to be traded for availability as well as performance, with a small capacity overhead at constant availability.

    Both MorphStore and BoostDFS utilize replication, which allows for both bandwidth aggregation and availability. This work primarily focuses on the performance utility of replication, but does not sacrifice availability in the process. These techniques provide an improved capacity-performance tradeoff while allowing the desired level of availability.
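
    The abstract does not define MorphStore's utility metric, so the sketch below is only a hypothetical rendering of utility-driven replication: spend a fixed capacity budget on extra replicas of the highest-utility files, greedily by utility. The metric and the example values are invented for illustration.

        def plan_replicas(files, budget_gb):
            # files: list of (name, size_gb, utility) tuples, where utility is
            # an assumed metric, e.g. expected read-throughput gain per GB spent.
            plan = []
            for name, size_gb, utility in sorted(files, key=lambda f: -f[2]):
                if size_gb <= budget_gb:
                    plan.append(name)
                    budget_gb -= size_gb
            return plan

        files = [("logs.parquet", 40, 0.9), ("model.bin", 10, 0.7), ("tmp.dat", 60, 0.1)]
        print(plan_replicas(files, budget_gb=55))  # ['logs.parquet', 'model.bin']

    A greedy pass like this trades capacity for read bandwidth on the hottest files first, which matches the abstract's stated goal of using spare capacity for performance without giving up the availability that replication already provides.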