
    EVALUATION OF ENERGY CONSUMPTION WHEN USING MOBILE DEVICES FOR CLOUD COMPUTING

    Modern computational tasks demand ever-growing computing power, which drives the creation and manufacture of new hardware for cloud computing. At the same time, personal mobile devices already number in the billions, and even partial use of them could reduce the demand for new manufacturing. Moreover, mobile hardware is more energy-efficient, which promises substantial energy savings. The article investigates the qualitative and quantitative evaluation of the efficiency of using mobile devices for computation compared with traditional desktop solutions. The aim of the work is to substantiate the following hypothesis: cloud computing based on mobile devices consumes substantially less energy than computing on desktop hardware. To this end, it is shown that computation on a mobile graphics processor is more energy-efficient than computation on a desktop processor. To establish the qualitative advantage, public sources and benchmarks were analysed, and efficiency indicators were computed for a variety of mobile and desktop graphics processors. It is argued that mobile solutions mostly consume substantially less energy than desktop ones. To quantify the advantage, an experiment was conducted on two platforms, one mobile and one desktop: the same computational task was implemented with Apple Metal and NVidia CUDA, and energy-efficiency indicators were computed for the mobile and the desktop graphics processor. The study found a substantial advantage for the mobile graphics processor in terms of energy efficiency. The result is relevant because the two platforms were released in the same year, a few months apart, so they can be considered contemporaries. The presented approaches do not account for the consumption of any system components other than the graphics processors; including the motherboard, power supply, and so on could tilt the balance even further in favour of the mobile processor. For distributed computing, however, the network connection is essential, and it may consume a substantial amount of energy on a mobile device. Future work will address a more comprehensive accounting of the energy consumed by the various subsystems of a computer.
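
    The abstract's core metric is energy per unit of work rather than raw speed. As a minimal illustration of that accounting in Go (not the paper's measurement setup, and with placeholder numbers rather than its data), the sketch below compares two hypothetical GPUs running one identical task:

        package main

        import "fmt"

        // device holds measurements for one platform running the same fixed
        // workload. All values used below are illustrative placeholders.
        type device struct {
            name     string
            seconds  float64 // wall-clock time for the workload
            avgWatts float64 // average power draw of the GPU during the run
        }

        // joules returns the total energy consumed by the run.
        func (d device) joules() float64 { return d.seconds * d.avgWatts }

        func main() {
            // Hypothetical mobile and desktop GPUs executing one identical task.
            mobile := device{name: "mobile GPU", seconds: 42.0, avgWatts: 8.0}
            desktop := device{name: "desktop GPU", seconds: 30.0, avgWatts: 180.0}

            for _, d := range []device{mobile, desktop} {
                fmt.Printf("%s: %.0f J for the task\n", d.name, d.joules())
            }
            // The desktop card finishes sooner, yet the mobile part can still
            // win on energy: efficiency is work per joule, not work per second.
            fmt.Printf("energy ratio desktop/mobile: %.1fx\n",
                desktop.joules()/mobile.joules())
        }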

    Efficient structural outlooks for vertex product networks

    In this thesis, a new classification for a large set of interconnection networks, referred to as "Vertex Product Networks" (VPN), is provided, and a number of related issues are discussed, including the design and evaluation of efficient structural outlooks for algorithm development on this class of networks. The importance of studying the VPN can be attributed to two main reasons: first, an unlimited number of new networks can be defined under the umbrella of the VPN; and second, some known networks can be studied and analysed more deeply. Examples of the VPN include the newly proposed arrangement-star and the existing Optical Transpose Interconnection Systems (OTIS-networks). Over the past two decades many interconnection networks have been proposed in the literature, including the star, hyperstar, hypercube, arrangement, and OTIS-networks. Most existing research on these networks has focused on analysing their topological properties, and relatively little work has been devoted to designing efficient parallel algorithms for important parallel applications. In an attempt to fill this gap, this research proposes efficient structural outlooks for algorithm development. These structural outlooks are based on grid and pipeline views, popular structures that support a vast body of applications encountered in many areas of science and engineering, including matrix computation, divide-and-conquer algorithms, sorting, and Fourier transforms. The proposed structural outlooks are applied to the VPN, notably the arrangement-star and OTIS-networks. In this research, we argue that the proposed arrangement-star is a viable candidate as an underlying topology for future high-speed parallel computers. Not only does the arrangement-star bring a solution to the scalability limitations from which the existing star graph suffers, but it also enables the development of parallel algorithms based on the proposed structural outlooks, such as matrix computation, linear algebra, divide-and-conquer algorithms, sorting, and Fourier transforms. Results from a performance study conducted in this thesis reveal that the proposed arrangement-star efficiently supports applications based on the grid or pipeline structural outlooks. OTIS-networks are another example of the VPN. This type of network has the important advantage of combining both optical and electronic interconnect technology. A number of studies have recently explored the topological properties of OTIS-networks. Although there has been some work on designing parallel algorithms for image processing and sorting, hardly any work has considered the suitability of these networks for an important class of scientific problems such as matrix computation, sorting, and Fourier transforms. In this study, we present and evaluate two structural outlooks for algorithm development on OTIS-networks. The proposed structural outlooks are general in the sense that no specific factor network or problem domain is assumed. Timing models for measuring the performance of the proposed structural outlooks are provided. Through these models, the performance of various algorithms on OTIS-networks is evaluated and compared with their counterparts on conventional electronic interconnection systems. The obtained results reveal that OTIS-networks are an attractive candidate for future parallel computers due to their superior performance characteristics over networks using traditional electronic interconnects.
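
    For readers unfamiliar with OTIS-networks: in the usual definition, a node is addressed by a pair <g, p> (group, position), electronic links follow the factor network within a group, and the single optical link connects <g, p> to its transpose <p, g>. The Go sketch below captures that adjacency rule; the 4-node ring factor network is a toy stand-in, since the thesis deliberately assumes no specific factor network:

        package main

        import "fmt"

        // node addresses a processor in an OTIS network built over some
        // factor network: g is the group index, p is the position in-group.
        type node struct{ g, p int }

        // opticalNeighbor returns the single transpose (optical) neighbor:
        // <g, p> is optically linked to <p, g> whenever g != p.
        func opticalNeighbor(n node) (node, bool) {
            if n.g == n.p {
                return node{}, false // diagonal nodes have no optical link
            }
            return node{g: n.p, p: n.g}, true
        }

        // electronicNeighbors returns the intra-group neighbors of n, given
        // the adjacency function of an abstract factor network.
        func electronicNeighbors(n node, factorAdj func(int) []int) []node {
            var out []node
            for _, q := range factorAdj(n.p) {
                out = append(out, node{g: n.g, p: q})
            }
            return out
        }

        func main() {
            // Toy factor network: a 4-node ring.
            ring := func(p int) []int { return []int{(p + 1) % 4, (p + 3) % 4} }

            n := node{g: 2, p: 0}
            if m, ok := opticalNeighbor(n); ok {
                fmt.Printf("optical: <%d,%d> -> <%d,%d>\n", n.g, n.p, m.g, m.p)
            }
            fmt.Println("electronic:", electronicNeighbors(n, ring))
        }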

    Locality Transformations and Prediction Techniques for Optimizing Multicore Memory Performance

    Chip Multiprocessors (CMPs) are here to stay for the foreseeable future. In terms of programmability, what distinguishes these processors from legacy multiprocessors is that sharing among the different cores (processors) is less expensive than it was in the past. Previous research suggested that sharing is a desirable feature to be incorporated in new codes, and for some programs more cache leads to more beneficial sharing, since sharing starts to kick in for large on-chip caches. This work tries to answer the question of whether we can (or should) write code differently when the underlying chip microarchitecture is a Chip Multiprocessor. We use a set of three graph benchmarks, each with three input problems varying in size and connectivity, to characterize how we partition the problem space among cores and how that partitioning can happen at multiple levels of the cache. Good partitioning improves performance both through better utilization of the caches at the lowest level and through increased sharing of data items at the shared cache level (L2 in our case), which effectively acts as prefetching among the compute cores. The thesis has two thrusts. The first is exploring the design space represented by different parallelization schemes (we devise some tweaks on top of existing techniques) and different graph partitionings (a locality optimization technique suited to graph problems). The combination of parallelization strategy and graph partitioning provides a large and complex space that we characterize using detailed simulation results, to see how much gain we can obtain over a baseline legacy parallelization technique with a partition sized to fit in the L1 cache. We show that the legacy parallelization is not the best alternative in most cases and that other parallelization techniques perform better. We also show that there is a search problem in determining the partition size, and in most cases the best partition size is smaller than the baseline one. The second thrust of the thesis is exploring how we can predict the combination of parallelization and partitioning that performs best for any given benchmark under any given input data set. We use a PIN-based reuse-distance profiling tool to build an execution-time prediction model that can rank-order the different combinations of parallelization strategies and partition sizes. We report the amount of gain that we can capture using the PIN prediction relative to what detailed simulation results deem best under a given benchmark and input size. In some cases the prediction is 100% accurate, while in others it projects worse performance than the baseline case. We report the difference between the best-performing combination in simulation and the PIN-predicted ones, as well as other statistics that evaluate how good the predictions are. We show that the PIN prediction method performs much better at predicting the partition size than at predicting the parallelization strategy; consequently, the accuracy of the overall scheme can be greatly improved by using only the partition size predicted by the PIN scheme and then searching for the best parallelization strategy for that partition size. In this thesis, we use a detailed performance model to scan a large solution space for the best parameters for locality optimization of a set of graph problems.
Using the M5 performance simulator, we show gains of up to 20% over a naively picked baseline case. Our prediction scheme can achieve up to 100% of the best performance gains obtained using a search method on real hardware or in performance simulation, without running on the target hardware at all, and up to 48% on average across all of our benchmarks and input sizes. There are several interesting aspects to this work. We are the first to devise a performance model of this kind and verify it against detailed simulation results. We suggest, and quantify, that locality optimization and problem partitioning can increase sharing synergistically to achieve better performance overall. We have shown a new use for coherent reuse-distance profiles as a tool that helps program developers and compilers optimize a program's performance.
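
    The prediction thrust rests on reuse-distance (stack-distance) profiles: for each memory access, the number of distinct addresses touched since the previous access to the same address. The Go sketch below is a didactic quadratic-time version of that measure, not the PIN tool's implementation (production profilers use tree-based structures to stay near O(N log N)):

        package main

        import "fmt"

        // reuseDistances computes, for each access in a trace, the number of
        // distinct addresses touched since the previous access to the same
        // address, or -1 for a first touch (a cold miss).
        func reuseDistances(trace []uint64) []int {
            lastPos := map[uint64]int{} // address -> index of its last access
            out := make([]int, len(trace))
            for i, addr := range trace {
                prev, seen := lastPos[addr]
                if !seen {
                    out[i] = -1 // first touch: infinite reuse distance
                } else {
                    distinct := map[uint64]struct{}{}
                    for _, a := range trace[prev+1 : i] {
                        distinct[a] = struct{}{}
                    }
                    out[i] = len(distinct)
                }
                lastPos[addr] = i
            }
            return out
        }

        func main() {
            trace := []uint64{0x10, 0x20, 0x10, 0x30, 0x20}
            fmt.Println(reuseDistances(trace)) // [-1 -1 1 -1 2]
        }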

    A unified model for inter- and intra-processor concurrency

    Although concurrency is generally perceived to be a 'hard' subject, it can in fact be very simple, provided that the underlying model is simple. The occam-pi parallel processing language provides such a simple yet powerful concurrency model, based on CSP and the pi-calculus. This thesis presents pony, the occam-pi Network Environment. Together, occam-pi and pony provide a new, unified concurrency model that bridges inter- and intra-processor concurrency, enabling the development of distributed applications in a transparent, dynamic and highly scalable way. The author specified the layout of the pony system as presented in this thesis and carried out about 90% of the implementation. This thesis is structured into three main parts, as well as an introduction and an appendix. The introduction examines in detail the need for a unified concurrency model and presents the pony environment as a solution that provides such a model. The first part of the thesis is concerned with using the pony environment to develop distributed applications; it presents the interface between pony and user-level code, as well as pony's configuration and a sample application. The second part presents the design and implementation of the pony environment: the internal structure of pony, the implementation of pony's components and public processes, and the integration of pony into the KRoC compiler. The third part evaluates pony's performance and contains the final conclusions; it presents a number of performance tests and closes with a discussion of the work presented in this thesis, along with an outline of possible future research.
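
    For a flavour of the programming model: in CSP-style languages such as occam-pi, a process communicates only through channels, so the code need not change when its peer moves to another core or, under an environment like pony, to another machine. Go's channels descend from the same CSP lineage, so the following sketch is offered purely as an analogy (Go channels are in-process only; pony's network transparency applies to occam-pi channel bundles, not to Go):

        package main

        import "fmt"

        // worker communicates only through channels, in the CSP style. The
        // process does not know whether its peer runs on the same core,
        // another core, or (under something like pony) another machine:
        // the channel is the whole interface.
        func worker(in <-chan int, out chan<- int) {
            for v := range in {
                out <- v * v // some per-message computation
            }
            close(out)
        }

        func main() {
            in := make(chan int)
            out := make(chan int)
            go worker(in, out) // intra-processor concurrency: a local process

            go func() {
                for i := 1; i <= 5; i++ {
                    in <- i
                }
                close(in)
            }()

            for v := range out {
                fmt.Println(v)
            }
        }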

    Analysis of parallelism patterns from the perspective of scientific computing applications on clusters of multicore nodes

    Because the subject of parallelism patterns is still developing, this study can be expected to yield grounds for changes to the current patterns, which could be specialised, extended, reduced, or modified to suit the kind of applications under consideration. A further objective of the thesis is therefore to contribute to the set of elements that should be taken into account when developing patterns specific to the applications under study. In particular, a matter of vital importance is efficiency in the use of computational resources, since there is no point in parallelising an application if its performance does not improve in proportion to the hardware resources used; parallelism patterns specific to the domain under study should therefore include some kind of guidance regarding efficiency. The scope and relevance of the parallelism patterns presented in this work have been corroborated on several scientific computing algorithms through specific experiments, which consisted of running the programs that result from applying different patterns to each of the algorithms chosen as a testbed. The results, presented in chapter 4, made it possible to verify the effect of applying the patterns on the level of performance obtained. As a consequence of these experiments, carried out in search of better results for the parallel algorithms, an additional contribution of the thesis emerged: the presentation, in a preliminary version, of a new parallelism pattern named the "Partial Computing Pattern", which seeks to optimise parallel computation by exploiting the availability of partial data during processors' idle times.
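
    The abstract gives only a preliminary description of the "Partial Computing Pattern". On a loose reading of it (advance the computation on whatever partial data is already available, rather than idling until the full input arrives), a minimal Go sketch of the idea might look as follows; the chunked workload and staggered arrival times are hypothetical, not taken from the thesis:

        package main

        import (
            "fmt"
            "time"
        )

        // partialSum accumulates chunks as they arrive instead of waiting
        // for the full data set: idle time is used to advance the
        // computation on the partial data already available.
        func partialSum(chunks <-chan []float64) float64 {
            total := 0.0
            for c := range chunks { // compute as soon as a chunk lands
                for _, v := range c {
                    total += v
                }
            }
            return total
        }

        func main() {
            chunks := make(chan []float64)
            go func() {
                for i := 0; i < 4; i++ {
                    time.Sleep(10 * time.Millisecond) // simulated staggered arrival
                    chunks <- []float64{1, 2, 3}
                }
                close(chunks)
            }()
            fmt.Println("sum:", partialSum(chunks)) // sum: 24
        }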