    A performance comparison of the contiguous allocation strategies in 3D mesh connected multicomputers

    The performance of contiguous allocation strategies can be significantly affected by the distribution of job execution times. In this paper, the performance of the existing contiguous allocation strategies for 3D mesh multicomputers is revisited in the context of heavy-tailed distributions (e.g., a Bounded Pareto distribution). The strategies are evaluated and compared using simulation experiments for both First-Come-First-Served (FCFS) and Shortest-Service-Demand (SSD) scheduling strategies under a variety of system loads and system sizes. The results show that the performance of the allocation strategies degrades considerably when job execution times follow a heavy-tailed distribution. Moreover, SSD copes much better than the FCFS scheduling strategy in the presence of heavy-tailed job execution times. The results also show that the strategies that depend on a list of allocated sub-meshes for both allocation and deallocation have lower allocation overhead and deliver good system performance in terms of average turnaround time and mean system utilization.
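
    As a concrete illustration of the setup described in this abstract (not the paper's actual simulator), the sketch below draws job execution times from a Bounded Pareto distribution via inverse-transform sampling and compares FCFS with SSD ordering on a single server with all jobs arriving at time zero; the shape and bound parameters are illustrative choices.

```python
import random

def bounded_pareto(alpha, low, high, rng=random):
    """Draw one job execution time from a Bounded Pareto(alpha) on [low, high]
    via inverse-transform sampling."""
    u = rng.random()
    return low / (1.0 - u * (1.0 - (low / high) ** alpha)) ** (1.0 / alpha)

def mean_turnaround(service_times, policy):
    """Serve jobs that all arrive at t = 0 on a single server and return the
    mean turnaround time. SSD serves the shortest demand first; FCFS serves
    jobs in arrival (list) order."""
    order = sorted(service_times) if policy == "SSD" else list(service_times)
    clock = total = 0.0
    for s in order:
        clock += s           # completion time of this job
        total += clock       # turnaround = completion - arrival (arrival = 0)
    return total / len(order)

jobs = [bounded_pareto(alpha=1.2, low=1.0, high=1e4) for _ in range(1000)]
print("FCFS mean turnaround:", mean_turnaround(jobs, "FCFS"))
print("SSD  mean turnaround:", mean_turnaround(jobs, "SSD"))
```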

    Autotuning and Self-Adaptability in Concurrency Libraries

    Autotuning is an established technique for optimizing the performance of parallel applications. However, programmers must prepare applications for autotuning, which is tedious and error-prone coding work. We demonstrate how applications become ready for autotuning with few or no modifications by extending Threading Building Blocks (TBB), a library for parallel programming, with autotuning. The extended TBB library optimizes all application-independent tuning parameters fully automatically. We compare manual effort, autotuning overhead, and performance gains on 17 examples. While some examples benefit only slightly, others speed up by 28% over standard TBB. Comment: Presented at the 1st Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014) (arXiv:1405.2281).
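
    The following sketch is a generic, hand-rolled autotuning loop in Python, not the paper's TBB extension: it times a parallel map under several candidate chunk sizes and keeps the fastest, with chunk size standing in for the kind of application-independent tuning parameter the extended library optimizes automatically. The workload and candidate values are made up for illustration.

```python
"""Minimal search-based autotuning sketch (illustrative, not TBB's API)."""
import time
from concurrent.futures import ProcessPoolExecutor

def work(x):
    # stand-in for a per-item parallel task
    return sum(i * i for i in range(x % 1000))

def run_once(data, chunksize):
    """Time one parallel map over the data with the given chunk size."""
    start = time.perf_counter()
    with ProcessPoolExecutor() as pool:
        list(pool.map(work, data, chunksize=chunksize))
    return time.perf_counter() - start

def autotune(data, candidates=(1, 8, 64, 512)):
    """Measure every candidate chunk size and return the fastest one."""
    timings = {c: run_once(data, c) for c in candidates}
    best = min(timings, key=timings.get)
    return best, timings

if __name__ == "__main__":
    best, timings = autotune(list(range(20000)))
    print("best chunksize:", best, timings)
```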

    Online Adaptive Code Generation and Tuning

    Efficient processor allocation strategies for mesh-connected multicomputers

    Efficient processor allocation and job scheduling algorithms are critical if the full computational power of large-scale multicomputers is to be harnessed effectively. Processor allocation is responsible for selecting the set of processors on which parallel jobs are executed, whereas job scheduling is responsible for determining the order in which the jobs are executed. Many processor allocation strategies have been devised for mesh-connected multicomputers, and these can be divided into two main categories: contiguous and non-contiguous. In contiguous allocation, jobs are allocated distinct contiguous processor sub-meshes for the duration of their execution. Such a strategy can lead to high processor fragmentation, which degrades system performance in terms of, for example, turnaround time and system utilisation. In non-contiguous allocation, a job can execute on multiple disjoint smaller sub-meshes rather than waiting until a single sub-mesh of the requested size and shape is available. Although non-contiguous allocation increases message contention inside the network, lifting the contiguity condition can reduce processor fragmentation and increase system utilisation. Processor fragmentation can be of two types: internal and external. The former occurs when more processors are allocated to a job than it requires, while the latter occurs when there are enough free processors to satisfy another job request but they are not allocated to it because they are not contiguous. Much effort has been devoted to reducing fragmentation, and a number of contiguous allocation strategies have been devised to recognize complete sub-meshes during allocation. Most of these strategies have been suggested for 2D mesh-connected multicomputers. However, although the 3D mesh has been the underlying network topology for a number of important multicomputers, there has been relatively little activity with regard to designing similar strategies for such a network. The very few contiguous allocation strategies suggested for the 3D mesh achieve complete sub-mesh recognition ability only at the expense of a high allocation overhead (i.e., allocation and de-allocation time). Furthermore, the allocation overhead in the existing contiguous strategies often grows with system size. The main challenge is therefore to devise an efficient contiguous allocation strategy that can exhibit good performance (e.g., a low job turnaround time and high system utilisation) with a low allocation overhead.

    The first part of the research presents a new contiguous allocation strategy, referred to as Turning Busy List (TBL), for 3D mesh-connected multicomputers. The TBL strategy considers only those free sub-meshes that border the already allocated sub-meshes from the left or that have their left boundaries aligned with that of the whole mesh network. Moreover, TBL uses an efficient scheme to facilitate the detection of such available sub-meshes while maintaining a low allocation overhead. This is achieved by maintaining a list of allocated sub-meshes in order to efficiently determine the processors that can form an allocation sub-mesh for a new allocation request. The new strategy is able to identify a free sub-mesh of the requested size as long as one exists in the mesh. Results from extensive simulations under various operating loads reveal that TBL delivers competitive performance (i.e., low turnaround times and high system utilisation) with a much lower allocation overhead than other well-known existing strategies.

    Most existing non-contiguous allocation strategies suggested for the mesh suffer from several problems, including internal fragmentation, external fragmentation, and message contention inside the network. Furthermore, in these existing strategies the allocation of processors to job requests is not based on free contiguous sub-meshes. The second part of this research proposes a new non-contiguous allocation strategy, referred to as Greedy Available Busy List (GABL), that eliminates both internal and external fragmentation and alleviates contention in the network. GABL combines the desirable features of both contiguous and non-contiguous allocation strategies, as it adopts the contiguous allocation used in our TBL strategy. Moreover, GABL is flexible in that it can be applied to either the 2D or the 3D mesh. However, for the sake of the present study, the new non-contiguous allocation strategy is discussed for the 2D mesh and its performance is compared against that of well-known non-contiguous allocation strategies suggested for this network. One of the desirable features of GABL is that it can maintain a high degree of contiguity between processors compared to the previous allocation strategies. This, in turn, decreases the number of sub-meshes allocated to a job, and thus decreases message distances, resulting in low inter-processor communication overhead. The performance analysis indicates that the new proposed strategy has a lower turnaround time than the previous non-contiguous allocation strategies for most of the cases considered. Moreover, in the presence of high message contention due to heavy network traffic, GABL exhibits superior performance in terms of turnaround time over the previous contiguous and non-contiguous allocation strategies. Furthermore, GABL achieves high system utilisation as it manages to eliminate both internal and external fragmentation.

    The performance of many allocation strategies, including the ones suggested above, has been evaluated under the assumption that job execution times follow an exponential distribution. However, many measurement studies have convincingly demonstrated that the execution times of certain computational applications are best characterized by heavy-tailed distributions; that is, many jobs have short execution times and comparatively few have very long execution times. Motivated by this observation, the final part of this thesis reviews the performance of several contiguous allocation strategies, including TBL, in the context of heavy-tailed distributions. This research is the first to analyze the performance impact of heavy-tailed job execution times on the allocation strategies suggested for mesh-connected multicomputers. The results show that the performance of the contiguous allocation strategies degrades sharply when the distribution of job execution times is heavy-tailed. Further, adopting an appropriate scheduling strategy, such as Shortest-Service-Demand (SSD) as opposed to First-Come-First-Served (FCFS), can significantly reduce the detrimental effects of heavy-tailed distributions. Finally, while the new contiguous allocation strategy (TBL) is as good as the best of the previous contiguous allocation strategies in terms of job turnaround time and system utilisation, it is substantially more efficient in terms of allocation overhead.
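
    To make the contiguous-allocation setting concrete, the toy allocator below performs first-fit allocation of rectangular sub-meshes in a 2D mesh and illustrates external fragmentation; it is a simplified stand-in, not the TBL or GABL algorithm itself, and all sizes are illustrative.

```python
"""Toy contiguous allocator for a 2D mesh: scan for the first free w x h
sub-mesh, mark it busy, and release it later. The first-fit scan is a
simplified stand-in for the busy-list strategies described above."""

class MeshAllocator:
    def __init__(self, width, height):
        self.w, self.h = width, height
        self.busy = [[False] * width for _ in range(height)]

    def _free(self, x, y, w, h):
        return all(not self.busy[j][i]
                   for j in range(y, y + h) for i in range(x, x + w))

    def allocate(self, w, h):
        """Return the base (x, y) of a free w x h sub-mesh, or None if no
        contiguous sub-mesh exists (external fragmentation)."""
        for y in range(self.h - h + 1):
            for x in range(self.w - w + 1):
                if self._free(x, y, w, h):
                    for j in range(y, y + h):
                        for i in range(x, x + w):
                            self.busy[j][i] = True
                    return (x, y)
        return None

    def release(self, base, w, h):
        x, y = base
        for j in range(y, y + h):
            for i in range(x, x + w):
                self.busy[j][i] = False

mesh = MeshAllocator(8, 8)
a = mesh.allocate(4, 4)   # succeeds
b = mesh.allocate(5, 5)   # fails although 48 processors are still free
print(a, b)
```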

    Multi-Quality Auto-Tuning by Contract Negotiation

    A characteristic challenge of software development is the management of omnipresent change. Classically, this constant change is driven by customers changing their requirements. The wish to optimally leverage available resources opens another source of change: the software system's environment. Software is tailored to specific platforms (e.g., hardware architectures), resulting in many variants of the same software optimized for different environments. If the environment changes, a different variant should be used, i.e., the system has to reconfigure to the variant optimized for the new situation. The automation of such adjustments is the subject of the research community of self-adaptive systems. The basic principle is a control loop, as known from control theory: the system (and its environment) is continuously monitored, the collected data is analyzed, and decisions for or against a reconfiguration are computed and realized. Central problems in this field, which are addressed in this thesis, are the management of interdependencies between non-functional properties of the system, the handling of multiple decision criteria, and scalability.

    In this thesis, a novel approach to self-adaptive software, Multi-Quality Auto-Tuning (MQuAT), is presented, which provides design and operation principles for software systems that automatically provide the best possible utility to the user while producing the least possible cost. For this purpose, a component model has been developed, enabling the software developer to design and implement self-optimizing software systems in a model-driven way. This component model allows for the specification of the structure as well as the behavior of the system and is capable of covering its runtime state. The notion of quality contracts is utilized to cover the non-functional behavior and, especially, the dependencies between non-functional properties of the system. At runtime, this model of the system's current state is used in combination with the contracts to generate optimization problems in different formalisms: Integer Linear Programming (ILP), Pseudo-Boolean Optimization (PBO), Ant Colony Optimization (ACO), and Multi-Objective Integer Linear Programming (MOILP). Standard solvers are applied to derive solutions to these problems, which represent reconfiguration decisions if the identified configuration differs from the current one.

    Each approach is empirically evaluated in terms of its scalability, showing the feasibility of all approaches except ACO, the superiority of ILP over PBO, and the limits of each approach: 100 component types for ILP, 30 for PBO, 10 for ACO, and 30 for 2-objective MOILP. In the presence of more than two objective functions, the MOILP approach is shown to be infeasible.
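
    The sketch below illustrates the kind of reconfiguration decision described above: choose one implementation variant per component so that total utility is maximized within a resource budget. It uses exhaustive search as a stand-in for the ILP/PBO/MOILP formulations the thesis generates, and the component names, utilities, and demands are invented for illustration.

```python
"""Toy variant-selection problem: maximize total utility subject to a
resource budget, solved by exhaustive search instead of an ILP solver."""
from itertools import product

# component -> { variant -> (utility, resource demand) }  (made-up numbers)
variants = {
    "encoder": {"cpu": (5, 4), "gpu": (9, 7)},
    "storage": {"local": (3, 2), "remote": (6, 5)},
}
budget = 10

def best_configuration(variants, budget):
    """Return the feasible variant choice per component with maximum utility."""
    components = list(variants)
    best, best_utility = None, float("-inf")
    for choice in product(*(variants[c] for c in components)):
        utility = sum(variants[c][v][0] for c, v in zip(components, choice))
        demand = sum(variants[c][v][1] for c, v in zip(components, choice))
        if demand <= budget and utility > best_utility:
            best, best_utility = dict(zip(components, choice)), utility
    return best, best_utility

print(best_configuration(variants, budget))
```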

    Analyzing the Combined Effects of Measurement Error and Perturbation Error on Performance Measurement

    Dynamic performance analysis of executing programs commonly relies on statistical profiling techniques to provide performance measurement results. When a program execution is sampled, we learn something about the examined program, but we also change, to some extent, the program's interaction with the underlying system and thus its behavior. The amount we learn diminishes (statistically) with each sample taken, while the change we effect with the intrusive sampling risks growing larger. Effectively sampling programs is challenging largely because of the opposing effects of decreasing sampling error and increasing perturbation error. Achieving the highest overall level of confidence in measurement results requires striking an appropriate balance between the tensions inherent in these two types of errors. Despite the popularity of statistical profiling, published material typically explains only in general qualitative terms the motivation for the sampling rates used. Given the importance of sampling, we argue in favor of the general principle of deliberate sample size selection and have developed and tested a technique for doing so. We present our idea of sample rate selection based on abstract mathematical performance-measurement models we developed that incorporate the effect of sampling on both measurement accuracy and perturbation. Our mathematical model predicts the sample size at which the combination of the residual measurement error and the accumulating perturbation error is minimized. Our evaluation of the model with simulation, calibration programs, and selected programs from the SPEC CPU 2006 and SPEC OMP 2001 benchmark suites indicates that this idea has promise. Our results show that the predicted sample size is generally close to the best sampling rate and effectively avoids bad choices. Most importantly, adaptive sample rate selection is shown to perform better than a single selected rate in most cases.
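
    A simple stand-in error model (not the paper's actual model) makes the trade-off concrete: if sampling error shrinks like a/sqrt(n) while perturbation error grows like b*n, the combined error is minimized at n* = (a/(2b))^(2/3); the sketch below checks this numerically with illustrative coefficients.

```python
"""Illustrative stand-in for the combined error model described above:
measurement error falls like a / sqrt(n), perturbation error grows like
b * n, and we pick the sample count n minimizing their sum."""

def total_error(n, a=100.0, b=0.01):
    return a / n ** 0.5 + b * n   # sampling error + perturbation error

# closed form: d/dn (a * n**-0.5 + b * n) = 0  =>  n* = (a / (2 * b)) ** (2 / 3)
a, b = 100.0, 0.01
n_star = (a / (2 * b)) ** (2 / 3)

# numeric check over a grid of candidate sample counts
best_n = min(range(1, 100001), key=total_error)
print(round(n_star), best_n, total_error(best_n))
```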

    Automatische Performanzoptimierung paralleler Architekturen

    Parallelizing programs and optimizing them pose great challenges for software developers. This thesis therefore addresses problems in the area of automatic performance optimization (auto-tuning) of parallel architectures. To this end, a method for designing parallel, tunable architectures is presented; the concept of a search-based auto-tuner rounds off the work. The evaluation results prove to be highly promising.
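
    As a rough illustration of what a search-based auto-tuner does (not the method developed in this thesis), the sketch below hill-climbs over a small discrete parameter space, with a made-up cost function standing in for measured runtimes.

```python
"""Minimal search-based auto-tuner sketch: greedy hill climbing over a
discrete (tile size, unroll factor) grid; the cost function is synthetic."""
import random

def measure(params):
    # stand-in for running and timing the program with the given parameters
    tile, unroll = params
    return abs(tile - 64) / 64 + abs(unroll - 4) / 4 + random.uniform(0, 0.05)

def neighbors(params, tiles=(16, 32, 64, 128), unrolls=(1, 2, 4, 8)):
    """Yield configurations one grid step away from the current one."""
    tile, unroll = params
    ti, ui = tiles.index(tile), unrolls.index(unroll)
    for di in (-1, 1):
        if 0 <= ti + di < len(tiles):
            yield (tiles[ti + di], unroll)
        if 0 <= ui + di < len(unrolls):
            yield (tile, unrolls[ui + di])

def hill_climb(start=(16, 1), steps=20):
    best, best_cost = start, measure(start)
    for _ in range(steps):
        cand = min(neighbors(best), key=measure)
        cost = measure(cand)
        if cost >= best_cost:
            break                 # local optimum reached
        best, best_cost = cand, cost
    return best, best_cost

print(hill_climb())
```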