3,695 research outputs found

    Min-energy scheduling for aligned jobs in accelerate model

    Get PDF
    AbstractA dynamic voltage scaling technique provides the capability for processors to adjust the speed and control the energy consumption. We study the pessimistic accelerate model where the acceleration rate of the processor speed is at most K and jobs cannot be executed during the speed transition period. The objective is to find a min-energy (optimal) schedule that finishes every job within its deadline. The job set we study in this paper is aligned jobs where earlier released jobs have earlier deadlines. We start by investigating a special case where all jobs have a common arrival time and design an O(n2) algorithm to compute the optimal schedule based on some nice properties of the optimal schedule. Then, we study the general aligned jobs and obtain an O(n2) algorithm to compute the optimal schedule by using the algorithm for the common arrival time case as a building block. Because our algorithm relies on the computation of the optimal schedule in the ideal model (K=∞), in order to achieve O(n2) complexity, we improve the complexity of computing the optimal schedule in the ideal model for aligned jobs from the currently best known O(n2logn) to O(n2)

    Strengthening Construction Management in the Rural Rehab Line of Business

    Get PDF
    The Five Key ObservationsObservation#1: Rural rehab success emanated from positive thinking and persistent implementationObservation #2: Almost every RHRO would benefit from a substantial increase in the per unit funding available, especially in light of the forthcoming HUD HOME requirement to establish written rehab standards in ten subcategories.Observation #3: A smartphone and tablet with 20 to 40 apps is the rehab specialist's Swiss Army knife. They are our, GPS, calculator, spec writer, office lifeline in case of danger, camera, clock, cost estimator calendar and a hundred other single-purpose but very important uses.Observation #4: NeighborWorks¼ Rural Initiative could provide a clearinghouse for success techniques targeted to rural rehab. Each month it might focus on a specific aspect of rehab management; inspection checklists in January, green specs in February, feasibility checklist in March, contractor qualification questionnaires in April and so on.Observation #5: Even with most components of in-house contractor success formula in place, per the Statistic Research Institute 53% of construction firms go out of business with in the first 4 years. It remains a very risky model that requires significant; funding, staff experience, administrative support and risk tolerance.Three Rehab Production Models And Their AlternativesThis middle section restates the introduction and methodology and offers a detailed review of the Traditional Rehab Specialist, Construction Management Of Subcontractor and the In-House General Contractor production models .for each model the article provides: definition and staffing pattern, design roles and tasks for each major player, benefits and challenges, alternative models and finally recommendations for successful implementationFocus TopicsDuring our interview process, three ideas surfaced that were best served with a mini discussion of the topic rather than being embedded in the already large middle section.The three topics are; software and technology, management of community relations – marketing and quality control, and budget solution

    AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

    Full text link
    CPU-FPGA heterogeneous architectures are attracting ever-increasing attention in an attempt to advance computational capabilities and energy efficiency in today's datacenters. These architectures provide programmers with the ability to reprogram the FPGAs for flexible acceleration of many workloads. Nonetheless, this advantage is often overshadowed by the poor programmability of FPGAs whose programming is conventionally a RTL design practice. Although recent advances in high-level synthesis (HLS) significantly improve the FPGA programmability, it still leaves programmers facing the challenge of identifying the optimal design configuration in a tremendous design space. This paper aims to address this challenge and pave the path from software programs towards high-quality FPGA accelerators. Specifically, we first propose the composable, parallel and pipeline (CPP) microarchitecture as a template of accelerator designs. Such a well-defined template is able to support efficient accelerator designs for a broad class of computation kernels, and more importantly, drastically reduce the design space. Also, we introduce an analytical model to capture the performance and resource trade-offs among different design configurations of the CPP microarchitecture, which lays the foundation for fast design space exploration. On top of the CPP microarchitecture and its analytical model, we develop the AutoAccel framework to make the entire accelerator generation automated. AutoAccel accepts a software program as an input and performs a series of code transformations based on the result of the analytical-model-based design space exploration to construct the desired CPP microarchitecture. Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for a broad class of computation kernels

    HPC Cloud for Scientific and Business Applications: Taxonomy, Vision, and Research Challenges

    Full text link
    High Performance Computing (HPC) clouds are becoming an alternative to on-premise clusters for executing scientific applications and business analytics services. Most research efforts in HPC cloud aim to understand the cost-benefit of moving resource-intensive applications from on-premise environments to public cloud platforms. Industry trends show hybrid environments are the natural path to get the best of the on-premise and cloud resources---steady (and sensitive) workloads can run on on-premise resources and peak demand can leverage remote resources in a pay-as-you-go manner. Nevertheless, there are plenty of questions to be answered in HPC cloud, which range from how to extract the best performance of an unknown underlying platform to what services are essential to make its usage easier. Moreover, the discussion on the right pricing and contractual models to fit small and large users is relevant for the sustainability of HPC clouds. This paper brings a survey and taxonomy of efforts in HPC cloud and a vision on what we believe is ahead of us, including a set of research challenges that, once tackled, can help advance businesses and scientific discoveries. This becomes particularly relevant due to the fast increasing wave of new HPC applications coming from big data and artificial intelligence.Comment: 29 pages, 5 figures, Published in ACM Computing Surveys (CSUR

    An energy-aware scheduling approach for resource-intensive jobs using smart mobile devices as resource providers

    Get PDF
    The ever-growing adoption of smart mobile devices is a worldwide phenomenon that positions smart-phones and tablets as primary devices for communication and Internet access. In addition to this, the computing capabilities of such devices, often underutilized by their owners, are in continuous improvement. Today, smart mobile devices have multi-core CPUs, several gigabytes of RAM, and ability to communicate through several wireless networking technologies. These facts caught the attention of researchers who have proposed to leverage smart mobile devices aggregated computing capabilities for running resource intensive software. However, such idea is conditioned by key features, named singularities in the context of this thesis, that characterize resource provision with smart mobile devices.These are the ability of devices to change location (user mobility), the shared or non-dedicated nature of resources provided (lack of ownership) and the limited operation time given by the finite energy source (exhaustible resources).Existing proposals materializing this idea differ in the singularities combinations they target and the way they address each singularity, which make them suitable for distinct goals and resource exploitation opportunities. The latter are represented by real life situations where resources provided by groups of smart mobile devices can be exploited, which in turn are characterized by a social context and a networking support used to link and coordinate devices. The behavior of people in a given social context configure a special availability level of resources, while the underlying networking support imposes restrictionson how information flows, computational tasks are distributed and results are collected. The latter constitutes one fundamental difference of proposals mainly because each networking support ?i.e., ad-hoc and infrastructure based? has its own application scenarios. Aside from the singularities addressed and the networking support utilized, the weakest point of most of the proposals is their practical applicability. The performance achieved heavily relies on the accuracy with which task information, including execution time and/or energy required for execution, is provided to feed the resource allocator.The expanded usage of wireless communication infrastructure in public and private buildings, e.g., shoppings, work offices, university campuses and so on, constitutes a networking support that can be naturally re-utilized for leveraging smart mobile devices computational capabilities. In this context, this thesisproposal aims to contribute with an easy-to-implement  scheduling approach for running CPU-bound applications on a cluster of smart mobile devices. The approach is aware of the finite nature of smart mobile devices energy, and it does not depend on tasks information to operate. By contrast, it allocatescomputational resources to incoming tasks using a node ranking-based strategy. The ranking weights nodes combining static and dynamic parameters, including benchmark results, battery level, number of queued tasks, among others. This node ranking-based task assignment, or first allocation phase, is complemented with a re-balancing phase using job stealing techniques. The second allocation phase is an aid to the unbalanced load provoked as consequence of the non-dedicated nature of smart mobile devices CPU usage, i.e., the effect of the owner interaction, tasks heterogeneity, and lack of up-to-dateand accurate information of remaining energy estimations. The evaluation of the scheduling approach is through an in-vitro simulation. A novel simulator which exploits energy consumption profiles of real smart mobile devices, as well as, fluctuating CPU usage built upon empirical models, derived from real users interaction data, is another major contribution. Tests that validate the simulation tool are provided and the approach is evaluated in scenarios varying the composition of nodes, tasks and nodes characteristics including different tasks arrival rates, tasks requirements and different levels of nodes resource utilization.Fil: Hirsch Jofré, Matías Eberardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Tandil. Instituto Superior de Ingeniería del Software. Universidad Nacional del Centro de la Provincia de Buenos Aires. Instituto Superior de Ingeniería del Software; Argentin

    Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015) Krakow, Poland

    Get PDF
    Proceedings of: Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015). Krakow (Poland), September 10-11, 2015

    Tiling Optimization For Nested Loops On Gpus

    Get PDF
    Optimizing nested loops has been considered as an important topic and widely studied in parallel programming. With the development of GPU architectures, the performance of these computations can be significantly boosted with the massively parallel hardware. General matrix-matrix multiplication is a typical example where executing such an algorithm on GPUs outperforms the performance obtained on other multicore CPUs. However, achieving ideal performance on GPUs usually requires a lot of human effort to manage the massively parallel computation resources. Therefore, the efficient implementation of optimizing nested loops on GPUs became a popular topic in recent years. We present our work based on the tiling strategy in this dissertation to address three kinds of popular problems. Different kinds of computations bring in different latency issues where dependencies in the computation may result in insufficient parallelism and the performance of computations without dependencies may be degraded due to intensive memory accesses. In this thesis, we tackle the challenges for each kind of problem and believe that other computations performed in nested loops can also benefit from the presented techniques. We improve a parallel approximation algorithm for the problem of scheduling jobs on parallel identical machines to minimize makespan with a high-dimensional tiling method. The algorithm is designed and optimized for solving this kind of problem efficiently on GPUs. Because the algorithm is based on a higher-dimensional dynamic programming approach, where dimensionality refers to the number of variables in the dynamic programming equation characterizing the problem, the existing implementation suffers from the pain of dimensionality and cannot fully utilize GPU resources. We design a novel data-partitioning technique to accelerate the higher-dimensional dynamic programming component of the algorithm. Both the load imbalance and exceeding memory capacity issues are addressed in our GPU solution. We present performance results to demonstrate how our proposed design improves the GPU utilization and makes it possible to solve large higher-dimensional dynamic programming problems within the limited GPU memory. Experimental results show that the GPU implementation achieves up to 25X speedup compared to the best existing OpenMP implementation. In addition, we focus on optimizing wavefront parallelism on GPUs. Wavefront parallelism is a well-known technique for exploiting the concurrency of applications that execute nested loops with uniform data dependencies. Recent research on such applications, which range from sequence alignment tools to partial differential equation solvers, has used GPUs to benefit from the massively parallel computing resources. Wavefront parallelism faces the load imbalance issue because the parallelism is passing along the diagonal. The tiling method has been introduced as a popular solution to address this issue. However, the use of hyperplane tiles increases the cost of synchronization and leads to poor data locality. In this paper, we present a highly optimized implementation of the wavefront parallelism technique that harnesses the GPU architecture. A balanced workload and maximum resource utilization are achieved with an extremely low synchronization overhead. We design the kernel configuration to significantly reduce the minimum number of synchronizations required and also introduce an inter-block lock to minimize the overhead of each synchronization. We evaluate the performance of our proposed technique for four different applications: Sequence Alignment, Edit Distance, Summed-Area Table, and 2DSOR. The performance results demonstrate that our method achieves speedups of up to six times compared to the previous best-known hyperplane tiling-based GPU implementation. Finally, we extend the hyperplane tiling to high order 2D stencil computations. Unlike wavefront parallelism that has dependence in the spatial dimension, dependence remains only across two adjacent time steps along the temporal dimension in stencil computations. Even if the no-dependence property significantly increases the parallelism obtained in the spatial dimensions, full parallelism may not be efficient on GPUs. Due to the limited cache capacity owned by each streaming multiprocessor, full parallelism can be obtained on global memory only, which has high latency to access. Therefore, the tiling technique can be applied to improve the memory efficiency by caching the small tiled blocks. Because the widely studied tiling methods, like overlapped tiling and split tiling, have considerable computation overhead caused by load imbalance or extra operations, we propose a time skewed tiling method, which is designed upon the GPU architecture. We work around the serialized computation issue and coordinate the intra-tile parallelism and inter-tile parallelism to minimize the load imbalance caused by pipelined processing. Moreover, we address the high-order stencil computations in our development, which has not been comprehensively studied. The proposed method achieves up to 3.5X performance improvement when the stencil computation is performed on a Moore neighborhood pattern
    • 

    corecore