99 research outputs found

    Two Tools for Organizing Lab Sessions [original title: Dos herramientas para la organización de los laboratorios de prácticas]

    This article describes two software tools for organizing lab sessions and giving access to grade information, and reports on the experience of using them. The first lets students use e-mail to sign up for lab groups and obtain their assessment grades automatically. The second uses web forms that allow students for whom the default lab schedule is inconvenient to reach an agreement and swap slots. Compared with other tools, both have the advantage of not requiring the installation and maintenance of a web portal and the software it usually entails. Informal comments, in the case of the first tool, and a survey, in the case of the second, show that students rate them positively. Peer reviewed.

    Easy Dataflow Programming in Clusters with UPC++ DepSpawn

    Final accepted version of: https://doi.org/10.1109/TPDS.2018.2884716. This version of the article has been accepted for publication, after peer review. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The Version of Record is available online at: https://doi.org/10.1109/TPDS.2018.2884716
    [Abstract]: The Partitioned Global Address Space (PGAS) programming model is one of the most relevant proposals to improve the ability of developers to exploit distributed memory systems. However, despite its important advantages with respect to the traditional message-passing paradigm, PGAS has not yet been widely adopted. We think that PGAS libraries are more promising than languages because they avoid the requirement to (re)write the applications using them, with the implied uncertainties related to portability and interoperability with the vast amount of APIs and libraries that exist for widespread languages. Nevertheless, the need to embed these libraries within a host language can limit their expressiveness, and very useful features can be missing. This paper contributes to the advance of PGAS by enabling the simple development of arbitrarily complex task-parallel codes following a dataflow approach on top of the PGAS UPC++ library, implemented in C++. In addition, our proposal, called UPC++ DepSpawn, relies on an optimized multithreaded runtime that provides very competitive performance, as our experimental evaluation shows.
    This research was supported by the Ministerio de Economía, Industria y Competitividad of Spain and FEDER funds of the EU (TIN2016-75845-P), and by the Xunta de Galicia co-funded by the European Regional Development Fund (ERDF) under the Consolidation Programme of Competitive Reference Groups (ED431C 2017/04) as well as under the Centro Singular de Investigación de Galicia accreditation 2016-2019 (ED431G/01). We also acknowledge the Centro de Supercomputación de Galicia (CESGA) for the use of their computers.

    A Software Cache Autotuning Strategy for Dataflow Computing with UPC++ DepSpawn

    This is the accepted version of the following article: B. B. Fraguela, D. Andrade. A software cache autotuning strategy for dataflow computing with UPC++ DepSpawn. Computational and Mathematical Methods, 3(6), e1148. November 2021, which has been published in final form at http://dx.doi.org/10.1002/cmm4.1148. This article may be used for noncommercial purposes in accordance with the Wiley Self-Archiving Policy [http://www.wileyauthors.com/self-archiving].
    [Abstract] Dataflow computing allows computations to start as soon as all their dependencies are satisfied. This is particularly useful in applications with irregular or complex dependency patterns, which would otherwise require either coarse-grain synchronizations that degrade performance or high programming costs. A recent proposal for the easy development of performant dataflow algorithms in hybrid shared/distributed memory systems is UPC++ DepSpawn. Among the many techniques it applies to provide good performance is a software cache that minimizes the communications among the processes involved. In this article we provide the details of the implementation and operation of this cache, and we present an autotuning strategy that simplifies its usage by freeing the user from having to estimate an adequate size for this cache. Rather, the runtime is now able to define reasonably sized caches that provide near-optimal behavior.
    This research was funded by the Ministry of Science and Innovation of Spain (TIN2016-75845-P and PID2019-104184RB-I00, AEI/FEDER/EU, 10.13039/501100011033), and by the Xunta de Galicia co-funded by the European Regional Development Fund (ERDF) under the Consolidation Programme of Competitive Reference Groups (ED431C 2017/04). The authors also acknowledge the support from the Centro Singular de Investigación de Galicia “CITIC,” funded by Xunta de Galicia and the European Union (European Regional Development Fund- Galicia 2014-2020 Program), by grant ED431G 2019/01. They also acknowledge the Centro de Supercomputación de Galicia (CESGA) for the use of its computers.

    Accelerating the HyperLogLog Cardinality Estimation Algorithm

    [Abstract] In recent years, vast amounts of data of different kinds are being generated, from pictures and videos from our cameras to software logs from sensor networks and Internet routers operating day and night. This has led to new big data problems, which require new algorithms that can handle these large volumes of data and that are, as a result, very computationally demanding. In this paper, we parallelize one of these new algorithms, namely the HyperLogLog algorithm, which estimates the number of different items in a large data set with minimal memory usage, as it lowers the typical memory usage of this type of calculation from O(n) to O(1). We have implemented parallelizations based on OpenMP and OpenCL and evaluated them in a standard multicore system, an Intel Xeon Phi, and two GPUs from different vendors. The results obtained in our experiments, in which we reach a speedup of 88.6 with respect to an optimized sequential implementation, are very positive, particularly taking into account the need to run this kind of algorithm on large amounts of data.
    This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds (80%) of the EU (Projects TIN2013-42148-P and TIN2016-75845-P) as well as by the Xunta de Galicia (Centro Singular de Investigación de Galicia accreditation 2016–2019) and the European Union (European Regional Development Fund, ERDF) under Grant Ref. ED431G/01.
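As a rough illustration of the kind of estimator this abstract refers to, the following is a minimal sequential sketch of HyperLogLog in Python. It is not the authors' OpenMP/OpenCL implementation: the function name `hll_estimate`, the precision parameter `p`, and the use of SHA-1 as the hash are assumptions made here for illustration. The point it shows is that memory is a fixed array of `2**p` small registers, regardless of how many items are processed.

```python
import hashlib
import math


def hll_estimate(items, p=10):
    """Estimate the number of distinct items with 2**p registers (assumes p >= 7)."""
    m = 1 << p                      # number of registers, fixed in advance
    registers = [0] * m
    tail_bits = 64 - p
    for item in items:
        # Illustrative 64-bit hash: the first 8 bytes of SHA-1 of the item's string form.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> tail_bits                  # top p bits select a register
        tail = h & ((1 << tail_bits) - 1)       # remaining bits
        # Rank: position of the leftmost 1-bit in the tail (tail_bits + 1 if all zero).
        rank = tail_bits - tail.bit_length() + 1
        if rank > registers[index]:
            registers[index] = rank
    alpha = 0.7213 / (1 + 1.079 / m)            # bias-correction constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        # Small-range correction (linear counting) while many registers are still empty.
        return m * math.log(m / zeros)
    return raw
```

Because each register only keeps a maximum, registers computed independently over different chunks of the data can be merged with an element-wise maximum, which is what makes this algorithm a natural candidate for parallelization.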

    High-performance dataflow computing in hybrid memory systems with UPC++ DepSpawn

    [Abstract]: Dataflow computing is a very attractive paradigm for high-performance computing, given its ability to trigger computations as soon as their inputs are available. UPC++ DepSpawn is a novel task-based library that supports this model in hybrid shared/distributed memory systems on top of a Partitioned Global Address Space environment. While the initial version of the library provided good results, it suffered from a key restriction that heavily limited its performance and scalability. Namely, each process had to consider all the tasks in the application rather than only those of interest to it, an overhead that naturally grows with both the number of processes and tasks in the system. In this paper, this restriction is lifted, enabling our library to provide higher levels of performance. This way, in experiments using 768 cores the performance improved up to 40.1%, the average improvement being 16.1%.
    Funding: Ministerio de Ciencia e Innovación; TIN2016-75845-P. Ministerio de Ciencia e Innovación; PID2019-104184RB-I00. Ministerio de Ciencia e Innovación; 10.13039/501100011033. Xunta de Galicia; ED431C 2017/0

    A general and efficient divide-and-conquer algorithm framework for multi-core clusters

    This is a post-peer-review, pre-copyedit version of an article published in Cluster Computing. The final authenticated version is available online at: https://doi.org/10.1007/s10586-017-0766-y
    [Abstract] Divide-and-conquer is one of the most important patterns of parallelism, being applicable to a large variety of problems. In addition, the most powerful parallel systems available nowadays are computer clusters composed of distributed-memory nodes that contain an increasing number of cores that share a common memory. The optimal exploitation of these systems often requires resorting to a hybrid model that mimics the underlying hardware by combining a distributed and a shared memory parallel programming model. This results in longer development times and increased maintenance costs. In this paper we present a very general skeleton library that allows parallelizing any divide-and-conquer problem in hybrid distributed-shared memory systems with little effort while providing much flexibility and good performance. Our proposal combines a message-passing paradigm at the process level and a threaded model inside each process, hiding the related complexity from the user. The evaluation shows that this skeleton provides performance comparable to, and often better than, that of manually optimized codes while requiring considerably less effort when parallelizing applications on multi-core clusters.
    Funding: Ministerio de Economía y Competitividad; TIN2013-42148-P. Ministerio de Economía y Competitividad; TIN2016-75845-P. Xunta de Galicia; GRC2013/05
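To make the idea of a divide-and-conquer skeleton concrete, here is a small Python sketch in which the user supplies only four functions (base-case test, base-case solver, divide, combine) and the skeleton decides how to run the subproblems. All names (`dac_sequential`, `dac_parallel`, `merge_sort`) are hypothetical and are not the library's API. The sketch uses threads inside a single process and parallelizes only the first level of division; it does not reproduce the hybrid message-passing plus threads design described in the abstract, and Python threads illustrate the structure rather than deliver a speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def dac_sequential(problem, is_base, solve_base, divide, combine):
    """Plain recursive divide-and-conquer driven by four user functions."""
    if is_base(problem):
        return solve_base(problem)
    parts = [dac_sequential(sub, is_base, solve_base, divide, combine)
             for sub in divide(problem)]
    return combine(parts)


def dac_parallel(problem, is_base, solve_base, divide, combine, workers=2):
    """Same contract, but the top-level subproblems run as separate tasks."""
    if is_base(problem):
        return solve_base(problem)
    subproblems = divide(problem)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each top-level subproblem is solved sequentially inside its own task.
        parts = list(pool.map(
            lambda sub: dac_sequential(sub, is_base, solve_base, divide, combine),
            subproblems))
    return combine(parts)


def merge_sort(values):
    """Example client: merge sort expressed purely through the skeleton."""
    return dac_parallel(
        list(values),
        is_base=lambda xs: len(xs) <= 1,
        solve_base=lambda xs: xs,
        divide=lambda xs: [xs[:len(xs) // 2], xs[len(xs) // 2:]],
        combine=lambda parts: list(heapq.merge(*parts)),
    )
```

The client code in `merge_sort` contains no threading logic at all, which is the property a skeleton is meant to provide: the parallel execution strategy can change without touching the problem description.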

    A framework for argument-based task synchronization with automatic detection of dependencies

    [Abstract] Synchronization in parallel applications can be achieved either implicitly or explicitly. Implicit synchronization is typical of programming environments that provide predefined, and often simple, patterns of parallelism such as data-parallel libraries and languages and skeletal operations. Nevertheless, more flexible approaches that allow expressing arbitrary task-level parallel computations without a predefined structure require in turn that the user explicitly specify the synchronization needed among the parallel tasks. In this paper we present a library-based approach that enables arbitrary patterns of parallelism with minimal effort for the user. Our proposal is the first generic approach to express parallelism we know of that requires neither explicit synchronizations nor a detailed specification of the dependencies of the parallel tasks. Our strategy relies on expressing the parallel tasks as functions that convey their dependencies implicitly by means of their arguments. These function arguments are analyzed by our library, called DepSpawn, when a parallel task is spawned in order to enforce its dependencies. Our experiments indicate that DepSpawn is very competitive, both in terms of performance and programmability, with respect to a widespread high-level approach like OpenMP.
    Funding: Xunta de Galicia; INCITE08PXIB105161PR. Ministerio de Ciencia e Innovación; TIN2010-16735. Ministerio de Educación de España; AP2009-475

    Guiding the Optimization of Parallel Codes on Multicores Using an Analytical Cache Model

    Final accepted version of: https://doi.org/10.1007/978-3-319-93713-7_32. This is a post-peer-review, pre-copyedit version of an article published in Lecture Notes in Computer Science (ICCS 2018 proceedings). The final authenticated version is available online at: http://dx.doi.org/10.1007/978-3-319-93713-7_32
    [Abstract]: Cache performance is particularly hard to predict in modern multicore processors, as several threads can be executing concurrently and private cache levels are combined with shared ones. This paper presents an analytical model able to evaluate the cache performance of the whole cache hierarchy for parallel applications in less than one second, taking as input their source code and the cache configuration. While the model does not tackle some advanced hardware features, it can help optimizers to make reasonably good decisions in a very short time. This is supported by an evaluation based on two modern architectures and three different case studies, in which the model predictions differ on average by just 5.05% from the results of a detailed hardware simulator and correctly guide different optimization decisions.
    This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds (80%) of the EU (TIN2016-75845-P), and by the Government of Galicia (Xunta de Galicia) co-funded by the European Regional Development Fund (ERDF) under the Consolidation Programme of Competitive Reference Groups (ED431C 2017/04) as well as under the Centro Singular de Investigación de Galicia accreditation 2016-2019 (ED431G/01). We also acknowledge the Centro de Supercomputación de Galicia (CESGA) for the use of their computers.