284 research outputs found

    On the Usability of Shortest Remaining Time First Policy in Shared Hadoop Clusters

    Get PDF
    International audienceHadoop has been recently used to process a diverse variety of applications, sharing the same execution infrastructure. A practical problem facing the Hadoop community is how to reduce job makespans by reducing job waiting times and ex- ecution times. Previous Hadoop schedulers have focused on improving job execution times, by improving data locality but not considering job waiting times. Even worse, enforcing data locality according to the job input sizes can be ineffi- cient: it can lead to long waiting times for small yet short jobs when sharing the cluster with jobs with smaller input sizes but higher execution complexity. This paper presents hSRTF, an adaption of the well-known Shortest Remaining Time First scheduler (i.e., SRTF) in shared Hadoop clus- ters. hSRTF embraces a simple model to estimate the re- maining time of a job and a preemption primitive (i.e., kill) to free the resources when needed. We have implemented hSRTF and performed extensive evaluations with Hadoop on the Grid’5000 testbed. The results show that hSRTF can significantly reduce the waiting times of small jobs and therefore improves their makespans, but at the cost of a rel- atively small increase in the makespans of large jobs. For instance, a time-based proportional share mode of hSRTF (i.e., hSRTF-Pr) speeds up small jobs by (on average) 45% and 26% while introducing a performance degradation for large jobs by (on average) 10% and 0.2% compared to Fifo and Fair schedulers, respectively

    Secure entity authentication

    Get PDF
    According to Wikipedia, authentication is the act of confirming the truth of an attribute of a single piece of a datum claimed true by an entity. Specifically, entity authentication is the process by which an agent in a distributed system gains confidence in the identity of a communicating partner (Bellare et al.). Legacy password authentication is still the most popular one, however, it suffers from many limitations, such as hacking through social engineering techniques, dictionary attack or database leak. To address the security concerns in legacy password-based authentication, many new authentication factors are introduced, such as PINs (Personal Identification Numbers) delivered through out-of-band channels, human biometrics and hardware tokens. However, each of these authentication factors has its own inherent weaknesses and security limitations. For example, phishing is still effective even when using out-of-band-channels to deliver PINs (Personal Identification Numbers). In this dissertation, three types of secure entity authentication schemes are developed to alleviate the weaknesses and limitations of existing authentication mechanisms: (1) End user authentication scheme based on Network Round-Trip Time (NRTT) to complement location based authentication mechanisms; (2) Apache Hadoop authentication mechanism based on Trusted Platform Module (TPM) technology; and (3) Web server authentication mechanism for phishing detection with a new detection factor NRTT. In the first work, a new authentication factor based on NRTT is presented. Two research challenges (i.e., the secure measurement of NRTT and the network instabilities) are addressed to show that NRTT can be used to uniquely and securely identify login locations and hence can support location-based web authentication mechanisms. The experiments and analysis show that NRTT has superior usability, deploy-ability, security, and performance properties compared to the state-of-the-art web authentication factors. In the second work, departing from the Kerb eros-centric approach, an authentication framework for Hadoop that utilizes Trusted Platform Module (TPM) technology is proposed. It is proven that pushing the security down to the hardware level in conjunction with software techniques provides better protection over software only solutions. The proposed approach provides significant security guarantees against insider threats, which manipulate the execution environment without the consent of legitimate clients. Extensive experiments are conducted to validate the performance and the security properties of the proposed approach. Moreover, the correctness and the security guarantees are formally proved via Burrows-Abadi-Needham (BAN) logic. In the third work, together with a phishing victim identification algorithm, NRTT is used as a new phishing detection feature to improve the detection accuracy of existing phishing detection approaches. The state-of-art phishing detection methods fall into two categories: heuristics and blacklist. The experiments show that the combination of NRTT with existing heuristics can improve the overall detection accuracy while maintaining a low false positive rate. In the future, to develop a more robust and efficient phishing detection scheme, it is paramount for phishing detection approaches to carefully select the features that strike the right balance between detection accuracy and robustness in the face of potential manipulations. In addition, leveraging Deep Learning (DL) algorithms to improve the performance of phishing detection schemes could be a viable alternative to traditional machine learning algorithms (e.g., SVM, LR), especially when handling complex and large scale datasets

    Cloud Computing cost and energy optimization through Federated Cloud SoS

    Get PDF
    2017 Fall.Includes bibliographical references.The two most significant differentiators amongst contemporary Cloud Computing service providers have increased green energy use and datacenter resource utilization. This work addresses these two issues from a system's architectural optimization viewpoint. The proposed approach herein, allows multiple cloud providers to utilize their individual computing resources in three ways by: (1) cutting the number of datacenters needed, (2) scheduling available datacenter grid energy via aggregators to reduce costs and power outages, and lastly by (3) utilizing, where appropriate, more renewable and carbon-free energy sources. Altogether our proposed approach creates an alternative paradigm for a Federated Cloud SoS approach. The proposed paradigm employs a novel control methodology that is tuned to obtain both financial and environmental advantages. It also supports dynamic expansion and contraction of computing capabilities for handling sudden variations in service demand as well as for maximizing usage of time varying green energy supplies. Herein we analyze the core SoS requirements, concept synthesis, and functional architecture with an eye on avoiding inadvertent cascading conditions. We suggest a physical architecture that diminishes unwanted outcomes while encouraging desirable results. Finally, in our approach, the constituent cloud services retain their independent ownership, objectives, funding, and sustainability means. This work analyzes the core SoS requirements, concept synthesis, and functional architecture. It suggests a physical structure that simulates the primary SoS emergent behavior to diminish unwanted outcomes while encouraging desirable results. The report will analyze optimal computing generation methods, optimal energy utilization for computing generation as well as a procedure for building optimal datacenters using a unique hardware computing system design based on the openCompute community as an illustrative collaboration platform. Finally, the research concludes with security features cloud federation requires to support to protect its constituents, its constituents tenants and itself from security risks

    Designing, Building, and Modeling Maneuverable Applications within Shared Computing Resources

    Get PDF
    Extending the military principle of maneuver into war-fighting domain of cyberspace, academic and military researchers have produced many theoretical and strategic works, though few have focused on researching actual applications and systems that apply this principle. We present our research in designing, building and modeling maneuverable applications in order to gain the system advantages of resource provisioning, application optimization, and cybersecurity improvement. We have coined the phrase “Maneuverable Applications” to be defined as distributed and parallel application that take advantage of the modification, relocation, addition or removal of computing resources, giving the perception of movement. Our work with maneuverable applications has been within shared computing resources, such as the Clemson University Palmetto cluster, where multiple users share access and time to a collection of inter-networked computers and servers. In this dissertation, we describe our implementation and analytic modeling of environments and systems to maneuver computational nodes, network capabilities, and security enhancements for overcoming challenges to a cyberspace platform. Specifically we describe our work to create a system to provision a big data computational resource within academic environments. We also present a computing testbed built to allow researchers to study network optimizations of data centers. We discuss our Petri Net model of an adaptable system, which increases its cybersecurity posture in the face of varying levels of threat from malicious actors. Lastly, we present work and investigation into integrating these technologies into a prototype resource manager for maneuverable applications and validating our model using this implementation

    Recurring Query Processing on Big Data

    Get PDF
    The advances in hardware, software, and networks have enabled applications from business enterprises, scientific and engineering disciplines, to social networks, to generate data at unprecedented volume, variety, velocity, and varsity not possible before. Innovation in these domains is thus now hindered by their ability to analyze and discover knowledge from the collected data in a timely and scalable fashion. To facilitate such large-scale big data analytics, the MapReduce computing paradigm and its open-source implementation Hadoop is one of the most popular and widely used technologies. Hadoop\u27s success as a competitor to traditional parallel database systems lies in its simplicity, ease-of-use, flexibility, automatic fault tolerance, superior scalability, and cost effectiveness due to its use of inexpensive commodity hardware that can scale petabytes of data over thousands of machines. Recurring queries, repeatedly being executed for long periods of time on rapidly evolving high-volume data, have become a bedrock component in most of these analytic applications. Efficient execution and optimization techniques must be designed to assure the responsiveness and scalability of these recurring queries. In this dissertation, we thoroughly investigate topics in the area of recurring query processing on big data. In this dissertation, we first propose a novel scalable infrastructure called Redoop that treats recurring query over big evolving data as first class citizens during query processing. This is in contrast to state-of-the-art MapReduce/Hadoop system experiencing significant challenges when faced with recurring queries including redundant computations, significant latencies, and huge application development efforts. Redoop offers innovative window-aware optimization techniques for recurring query execution including adaptive window-aware data partitioning, window-aware task scheduling, and inter-window caching mechanisms. Redoop retains the fault-tolerance of MapReduce via automatic cache recovery and task re-execution support as well. Second, we address the crucial need to accommodate hundreds or even thousands of recurring analytics queries that periodically execute over frequently updated data sets, e.g., latest stock transactions, new log files, or recent news feeds. For many applications, such recurring queries come with user-specified service-level agreements (SLAs), commonly expressed as the maximum allowed latency for producing results before their merits decay. On top of Redoop, we built a scalable multi-query sharing engine tailored for recurring workloads in the MapReduce infrastructure, called Helix. Helix deploys new sliced window-alignment techniques to create sharing opportunities among recurring queries without introducing additional I/O overheads or unnecessary data scans. Furthermore, Helix introduces a cost/benefit model for creating a sharing plan among the recurring queries, and a scheduling strategy for executing them to maximize the SLA satisfaction. Third, recurring analytics queries tend to be expensive, especially when query processing consumes data sets in the hundreds of terabytes or more. Time sensitive recurring queries, such as fraud detection, often come with tight response time constraints as query deadlines. Data sampling is a popular technique for computing approximate results with an acceptable error bound while reducing high-demand resource consumption and thus improving query turnaround times. In this dissertation, we propose the first fast approximate query engine for recurring workloads in the MapReduce infrastructure, called Faro. Faro introduces two key innovations: (1) a deadline-aware sampling strategy that builds samples from the original data with reduced sample sizes compared to uniform sampling, and (2) adaptive resource allocation strategies that maximally improve the approximate results while assuring to still meet the response time requirements specified in recurring queries. In our comprehensive experimental study of each part of this dissertation, we demonstrate the superiority of the proposed strategies over state-of-the-art techniques in scalability, effectiveness, as well as robustness

    Optimizing the performance of optimization in the cloud environment–An intelligent auto-scaling approach

    Get PDF
    The cloud computing paradigm has gained wide acceptance in the scientific community, taking a significant share from fields previously reserved exclusively for High Performance Computing (HPC). On-demand access to a large amount of computing resources provided by Cloud makes it ideal for executing large-scale optimizations using evolutionary algorithms without the need for owning any computing infrastructure. In this regard, we extended WoBinGO, an existing parallel software framework for genetic algorithm based optimization, to be used in Cloud. With these extensions, the framework is capable of elastically and frugally utilizing the underlying cloud computing infrastructure for performing computationally expensive fitness evaluations. We studied two issues that are pertinent when dealing with large-scale optimization in the elastic cloud environment: the computing instance launching overhead and the price of engaging Cloud for solving optimization problems, in terms of the instances’ cumulative uptime. To explain the usability limits of WoBinGO framework running in the IaaS environment, a comprehensive analysis of the framework’s performance was given. Optimization of both total optimization time and total cumulative uptime, leads to minimizing the cost of cloud resources utilization. In this way, we are proposing an intelligent decision support engine based on artificial neural networks and metaheuristics to provide the user with an assessment of the framework’s behavior on the underlying infrastructure in terms of optimization duration and the cost of resource consumption. According to a given assessment, the user can decide upon faster delivery of results or lower infrastructure costs. The proposed software framework has been used to solve a complex real-world optimization problem of a subsurface rock mass model calibration. The results obtained from the private OpenStack deployment show that by using the proposed decision support engine, significant savings can be achieved in both optimization time and optimization cost

    HPC techniques for large scale data analysis

    Get PDF
    In the present work we apply High-Performance Computing techniques to two Big Data problems. The frst one deals with the analysis of large graphs by using a parallel distributed architecture, whereas the second one consists in the design and implementation of a scalable solution for fast indexing and searching of large datasets of heterogeneous documents
    corecore