11 research outputs found

    BSP Functional Programming: Examples of a Cost Based Methodology

    Bulk-Synchronous Parallel ML (BSML) is a functional data-parallel language for the implementation of Bulk-Synchronous Parallel (BSP) algorithms. It makes it possible to estimate the execution time (cost) of a program. This paper presents some general examples of BSML programs and compares their predicted costs with the measured execution times on a parallel machine.
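The BSP cost model behind this kind of prediction is standard: each superstep costs the maximum local work, plus an h-relation communication term, plus the barrier latency. A minimal sketch, where the machine parameters g and l and all workloads are made-up illustrative values, not taken from the paper:

```python
# Sketch of the standard BSP cost model that underlies BSML-style cost
# prediction. g is the per-word communication gap, l the barrier latency;
# all numbers below are hypothetical.

def superstep_cost(local_work, words_sent, g, l):
    """One superstep: slowest local computation, plus the h-relation
    communication term h * g, plus the synchronization latency l."""
    w = max(local_work)   # the slowest processor dominates
    h = max(words_sent)   # h-relation: max words communicated by any processor
    return w + h * g + l

def program_cost(supersteps, g, l):
    """Predicted total cost: the sum of the superstep costs."""
    return sum(superstep_cost(w, s, g, l) for w, s in supersteps)

steps = [([100, 120, 90], [10, 8, 12]),   # superstep 1
         ([60, 60, 60], [5, 5, 5])]       # superstep 2
print(program_cost(steps, g=2, l=50))     # → 314
```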

    Adding Data Parallelism to Streaming Pipelines for Throughput Optimization

    The streaming model is a popular model for writing high-throughput parallel applications. A streaming application is represented by a graph of computation stages that communicate with each other via FIFO channels. In this report, we consider the problem of mapping streaming pipelines (streaming applications where the graph is a linear chain) in order to maximize throughput. In a parallel setting, subsets of stages, called components, can be mapped onto different computing resources. The throughput of an application is determined by the throughput of the slowest component. Therefore, if some stage is much slower than the others, it may be useful to replicate that stage's code and divide its workload among two or more replicas in order to increase throughput. However, pipelines may consist of some replicable and some non-replicable stages. In this paper, we address the problem of mapping these partially replicable streaming pipelines onto both homogeneous platforms, where all resources are identical, and heterogeneous platforms, where resources may have different speeds. In both cases, we consider two network topologies: the unidirectional chain and the clique. We provide polynomial-time algorithms for mapping partially replicable pipelines onto unidirectional chains for both homogeneous and heterogeneous platforms. For homogeneous platforms, the algorithm for unidirectional chains generalizes to clique topologies. For heterogeneous platforms, however, mapping these pipelines onto clique topologies is NP-complete. We provide heuristics that generate solutions for cliques by applying our chain algorithms to a series of chains sampled from the clique. Our empirical results show that these heuristics rapidly converge to near-optimal solutions.
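The throughput bound described in the abstract (the slowest component dominates, and replication divides a stage's load) can be sketched as follows; the stage workloads and resource speeds are hypothetical, and this is not the report's mapping algorithm:

```python
# Illustrative sketch: pipeline throughput under a mapping is set by the
# slowest component; replicating a slow stage divides its workload.

def component_rate(stage_work, speed=1.0):
    """Rate at which one resource processes its assigned stages."""
    return speed / sum(stage_work)

def pipeline_throughput(components, speeds=None):
    """The pipeline advances at the pace of its slowest component."""
    speeds = speeds or [1.0] * len(components)
    return min(component_rate(c, s) for c, s in zip(components, speeds))

# Stage workloads [4, 10, 2], one unit-speed resource per stage:
base = pipeline_throughput([[4], [10], [2]])      # bottleneck: 1/10
# Replicating the slow stage over two resources halves its load:
repl = pipeline_throughput([[4], [5], [5], [2]])  # bottleneck: 1/5
print(base, repl)
```

Replication doubles the achievable throughput here precisely because the replicated stage was the unique bottleneck.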

    Sharing resources for performance and energy optimization of concurrent streaming applications

    We aim to find optimal mappings for concurrent streaming applications. Each application consists of a linear chain with several stages and processes successive data sets in pipeline mode. The objective is to minimize the energy consumption of the whole platform while satisfying given performance-related bounds on the period and latency of each application. The problem is to decide which processors to enroll, at which speed (or mode) to run them, and which stages they should execute. Processors can be identical (with the same modes) or heterogeneous. We also distinguish two mapping categories: interval mappings and general mappings. For interval mappings, a processor is assigned a set of consecutive stages of the same application, so there is no resource sharing across applications. On the contrary, the assignment is fully arbitrary for general mappings, so a processor can be reused for several applications. On the theoretical side, we establish complexity results for this tri-criteria mapping problem (energy, period, latency), classifying polynomial versus NP-complete instances. Furthermore, we derive an integer linear program that provides the optimal solution in the most general case. On the experimental side, we design polynomial-time heuristics and assess their absolute performance thanks to the linear program. One main goal is to assess the impact of processor sharing on the quality of the solution.
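The tension between the three criteria can be illustrated with a toy model: running processors faster meets period bounds but, under a superlinear power model, costs more energy. A minimal sketch, where the cubic power model and all numbers are assumptions of this sketch rather than the paper's model:

```python
# Illustrative sketch of the energy/period trade-off. Dynamic power is
# assumed to grow as speed**3 (a common convention, hypothetical here).

ALPHA = 3

def mapping_energy(assignments):
    """assignments: list of (work_on_processor, chosen_speed) pairs.
    Energy = execution time * power = (work / speed) * speed**ALPHA."""
    return sum((w / s) * s**ALPHA for w, s in assignments)

def meets_period(assignments, period):
    """Each enrolled processor must finish its share within the period."""
    return all(w / s <= period for w, s in assignments)

fast = [(10, 2.0), (6, 2.0)]  # high speed: meets a period of 5, costs more
slow = [(10, 1.0), (6, 1.0)]  # low speed: cheaper, but misses the period
print(mapping_energy(fast), meets_period(fast, 5))  # 64.0 True
print(mapping_energy(slow), meets_period(slow, 5))  # 16.0 False
```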

    Locality-Aware Concurrency Platforms

    Modern computing systems from all domains are becoming increasingly parallel. Manufacturers are taking advantage of the increasing number of available transistors by packaging more and more computing resources together on a single chip or within a single system. These platforms generally contain many levels of private and shared caches in addition to physically distributed main memory. Some memory is therefore more expensive to access than other memory, and high-performance software must treat memory locality as a first-order consideration. Memory locality is often difficult for application developers to address directly, however, since many NUMA effects are invisible to the application programmer and show up only as low performance. Moreover, on parallel platforms, performance depends on both locality and load balance, and these two metrics are often at odds with each other. Directly considering locality and load balance at the application level may therefore make the application much more complex to program. In this work, we develop locality-conscious concurrency platforms for several structured parallel programming models, including streaming applications, task graphs, and parallel for loops. Throughout this work, the idea is to disrupt the application programming model as little as possible, so that the application developer is either unaffected or must only provide high-level hints to the runtime system. The runtime system then schedules the application to provide good locality of access while also providing good load balance. In particular, we address cache locality for streaming applications through static partitioning and develop an extensible platform to execute partitioned streaming applications. For task graphs, we extend a task-graph scheduling library to guide scheduling decisions toward better NUMA locality with the help of user-provided locality hints. CilkPlus parallel for loops use a randomized dynamic scheduler to distribute work, which, in many loop-based applications, results in poor locality at all levels of the memory hierarchy. We address this issue with a novel parallel for loop implementation that achieves good cache and NUMA locality while providing support to maintain good load balance dynamically.
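As a rough illustration of why contiguity helps locality, a static blocked partition keeps each worker on one contiguous iteration range, so consecutive iterations reuse cache lines; this sketch is purely illustrative and is not the thesis's scheduler:

```python
# Hedged sketch (not the thesis's loop scheduler): a static blocked
# partition of a parallel for loop. Each worker gets one contiguous
# range, in contrast to randomized work distribution, which scatters
# iterations across workers and across the memory hierarchy.

def blocked_ranges(n, workers):
    """Split iterations [0, n) into `workers` contiguous blocks,
    differing in size by at most one iteration (load balance)."""
    base, extra = divmod(n, workers)
    ranges, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

print(blocked_ranges(10, 3))  # → [range(0, 4), range(4, 7), range(7, 10)]
```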

    A Model-based Design Framework for Application-specific Heterogeneous Systems

    The increasing heterogeneity of computing systems enables higher performance and power efficiency. However, these improvements come at the cost of increasing the overall complexity of designing such systems. These complexities include constructing implementations for various types of processors, setting up and configuring communication protocols, and efficiently scheduling the computational work. The process for developing such systems is iterative and time-consuming, with no well-defined performance goal. Current performance estimation approaches use source code implementations that require experienced developers and time to produce. We present a framework to aid in the design of heterogeneous systems and the performance tuning of applications. Our framework supports system construction: integrating custom hardware accelerators with existing cores into processors, integrating processors into cohesive systems, and mapping computations to processors to achieve overall application performance and efficient hardware usage. It also facilitates effective design space exploration using processor models (for both existing and future processors) that do not require source code implementations to estimate performance. We evaluate our framework on a variety of applications and implement them in systems ranging from low-power embedded systems-on-chip (SoCs) to high-performance systems consisting of commercial-off-the-shelf (COTS) components. We show how the design process is improved, reducing the number of design iterations and unnecessary source code development, ultimately leading to higher-performing, more efficient systems.

    Mapping Pipeline Skeletons onto Heterogeneous Platforms

    Mapping applications onto parallel platforms is a challenging problem that becomes even more difficult when platforms are heterogeneous, nowadays a standard assumption. A high-level approach to parallel programming not only eases the application developer's task, but also provides additional information which can help realize an efficient mapping of the application. In this paper, we discuss the mapping of pipeline skeletons onto different types of platforms: Fully Homogeneous platforms with identical processors and interconnection links; Communication Homogeneous platforms, with identical links but different-speed processors; and finally, Fully Heterogeneous platforms. We assume that a pipeline stage must be mapped onto a single processor, and we establish new theoretical complexity results for different mapping policies: a mapping can be required to be one-to-one (a processor is assigned at most one stage), or interval-based (a processor is assigned an interval of consecutive stages), or fully general. In particular, we show that determining the optimal interval-based mapping is NP-hard for Communication Homogeneous platforms, and this result establishes the complexity of the well-known chains-to-chains problem for different-speed processors. We provide several efficient polynomial heuristics for the most important policy/platform combination, namely interval-based mappings on Communication Homogeneous platforms. For small problem instances, these heuristics are compared to the optimal result obtained by formulating the problem as an integer linear program.
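On Fully Homogeneous platforms, if communication costs are ignored, finding the interval-based mapping that minimizes the period reduces to the classic chains-to-chains partitioning problem. A minimal dynamic-programming sketch of that homogeneous special case (not the paper's heuristics for the heterogeneous variants):

```python
# Hedged sketch: chains-to-chains partitioning for identical processors.
# Split the chain of stage workloads into at most p consecutive intervals
# so that the largest interval sum (the pipeline period) is minimized.
from functools import lru_cache

def min_period(work, p):
    """Minimum achievable period for `work` on at most p identical processors."""
    n = len(work)
    prefix = [0]
    for w in work:
        prefix.append(prefix[-1] + w)

    @lru_cache(maxsize=None)
    def best(i, k):
        # Best achievable period for stages i..n-1 on k processors.
        if i == n:
            return 0
        if k == 1:
            return prefix[n] - prefix[i]
        return min(max(prefix[j] - prefix[i], best(j, k - 1))
                   for j in range(i + 1, n + 1))

    return best(0, p)

print(min_period([4, 1, 3, 2, 5], 3))  # → 5, via the split [4,1] | [3,2] | [5]
```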

    The Anesthesia Continuing Education Market and the Value Creation From a Sustainable Unified Platform

    Practicing anesthesia professionals in the United States are all governed by various profession-specific regulatory bodies that mandate continuing education (CE) requirements. To date, no unified resource exists for anesthesia professionals (i.e., Anesthesiologists, Certified Registered Nurse Anesthetists, and Anesthesiologist Assistants) to explore the CE offerings available within the marketplace. This study endeavored to convey the potential value of a unified anesthesia CE resource. It investigated how to cultivate a sustainable platform to potentially improve how anesthesia professionals search the available CE offerings and to potentially enhance how anesthesia CE providers reach anesthesia professionals. This qualitative study was conducted using an integrative review of the literature. The key concepts identified and investigated were network effect, segmentation, first to market, best of breed, search costs, transaction costs, minimum viable product, evolutionary phases of platforms, platform theory, platform business model, platform economy, and types of platforms. Inductive content analysis was chosen as the organizational method for the resultant qualitative data. The goal of the analysis was to create a conceptual, practical, and strategically applicable platform paradigm for the anesthesia CE marketplace, driven by the insights and amalgamations from the literature. The analyzed concepts, dimensions, and indicators of platform success and their applications potentially facilitate anesthesia professionals' CE explorations and CE providers' marketing efforts, as well as contextualize the overarching impacts and implications for the anesthesia CE industry and beyond. The conclusion portrays these impacts and implications.

    Multi-criteria Mapping and Scheduling of Workflow Applications on Heterogeneous Platforms

    The results summarized in this thesis deal with the mapping and scheduling of workflow applications on heterogeneous platforms. In this context, we focus on three different types of streaming applications.

    * Replica placement in tree networks * In this kind of application, clients issue requests to servers, and the question is where to place replicas in the network so that all requests can be processed. We discuss and compare several policies for placing replicas in tree networks, subject to server capacity, Quality of Service (QoS), and bandwidth constraints. The client requests are known beforehand, while the number and location of the servers have to be determined. The standard approach in the literature is to enforce that all requests of a client be served by the closest server in the tree. We introduce and study two new policies. One major contribution of this work is to assess the impact of these new policies on the total replication cost. Another important goal is to assess the impact of server heterogeneity, from both a theoretical and a practical perspective. We establish several new complexity results and provide several efficient polynomial heuristics for NP-complete instances of the problem.

    * Pipeline workflow applications * We consider workflow applications that can be expressed as linear pipeline graphs. An example of this application type is digital image processing, where images are treated in steady-state mode. Several antagonistic criteria should be optimized, such as throughput and latency (or a combination of both), as well as latency and reliability (i.e., the probability that the computation will be successful) of the application. While simple polynomial algorithms can be found for fully homogeneous platforms, the problem becomes NP-hard when tackling heterogeneous platforms. We present an integer linear programming formulation for this latter problem. Furthermore, we provide several efficient polynomial bi-criteria heuristics, whose relative performance is evaluated through extensive simulation. As a case study, we provide simulation and MPI experimental results for the JPEG encoder application pipeline on a cluster of workstations.

    * Complex streaming applications * We consider the execution of applications structured as trees of operators, i.e., the application of one or several operator trees, in steady state, to multiple data objects that are continuously updated at various locations in a network. A first goal is to provide the user with a set of processors that should be bought or rented in order to ensure that the application achieves a minimum steady-state throughput, with the objective of minimizing platform cost. We then extend the model to multiple applications: several concurrent applications are executed at the same time in a network, and one has to ensure that all of them can reach their required throughput. Another contribution of this work is complexity results for different instances of the basic problem, as well as integer linear program formulations of various problem instances. The third contribution is the design of several polynomial-time heuristics for both application models. One of the primary objectives of the heuristics for concurrent applications is to reuse intermediate results shared by multiple applications.
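The closest-server baseline policy for replica placement can be sketched as a walk from each client toward the root; the node names and the capacity-free load accounting below are hypothetical illustrations, not the thesis's algorithms:

```python
# Illustrative sketch of the standard "closest server" policy in a tree:
# each client's requests climb toward the root and are served by the
# first replica encountered on the path. Load accounting only; the
# thesis's new policies and cost optimization are not reproduced here.

def closest_server_load(parent, replicas, requests):
    """parent: child -> parent node (the root maps to None);
    replicas: set of nodes holding a replica;
    requests: client node -> number of requests.
    Returns the load per replica, or None if some client is unserved."""
    load = {r: 0 for r in replicas}
    for client, reqs in requests.items():
        node = client
        while node is not None and node not in replicas:
            node = parent[node]
        if node is None:
            return None  # no replica on this client's path to the root
        load[node] += reqs
    return load

# Tiny tree: root -> {a, b}, a -> {c}; replicas at root and a.
parent = {"root": None, "a": "root", "b": "root", "c": "a"}
print(closest_server_load(parent, {"root", "a"}, {"c": 3, "b": 2}))
```

Here c's three requests stop at the replica on a, while b's two requests climb to the root.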

    International Workshop on MicroFactories (IWMF 2012): 17th-20th June 2012, Tampere Hall, Tampere, Finland

    This Workshop provides a forum for researchers and practitioners in industry working on the diverse issues of micro and desktop factories, as well as on technologies and processes applicable to micro and desktop factories. Micro and desktop factories decrease the need for factory floor space, reduce energy consumption, and improve material and resource utilization, thus strongly supporting the new sustainable manufacturing paradigm. They can also be seen as a proper solution for manufacturing customized and personalized products near the point of need.