Multi-threading a state-of-the-art maximum clique algorithm
We present a threaded parallel adaptation of a state-of-the-art maximum clique
algorithm for dense, computationally challenging graphs. We show that near-linear speedups
are achievable in practice and that superlinear speedups are common. We include results for
several previously unsolved benchmark problems.
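The abstract describes a branch-and-bound search parallelized across threads. As a rough illustration of the sequential core such work builds on, here is a minimal branch-and-bound maximum clique sketch; the paper's actual algorithm uses much stronger bounds (e.g. colouring-based) and partitions the search tree across worker threads sharing the incumbent, none of which is shown here.

```python
def max_clique(adj):
    """Minimal branch-and-bound maximum clique (illustrative sketch).

    adj: dict mapping each vertex to the set of its neighbours.
    Returns one maximum clique as a set. A threaded version would hand
    top-level branches to a pool of workers sharing the best-so-far.
    """
    best = set()

    def expand(clique, candidates):
        nonlocal best
        # Bound: even adding every candidate cannot beat the incumbent.
        if len(clique) + len(candidates) <= len(best):
            return
        if not candidates:
            best = set(clique)
            return
        for v in list(candidates):
            # Branch: include v, keeping only candidates adjacent to v.
            expand(clique | {v}, candidates & adj[v])
            # Then exclude v from further branches at this level.
            candidates = candidates - {v}

    expand(set(), set(adj))
    return best
```

The pruning bound is what makes search-order and work-splitting choices matter, which is also where superlinear parallel speedups can arise: one thread's improved incumbent prunes the others' subtrees.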
A Survey on Compiler Autotuning using Machine Learning
Since the mid-1990s, researchers have been trying to use machine-learning
based approaches to solve a number of different compiler optimization problems.
These techniques primarily enhance the quality of the obtained results and,
more importantly, make it feasible to tackle two main compiler optimization
problems: optimization selection (choosing which optimizations to apply) and
phase-ordering (choosing the order of applying optimizations). The compiler
optimization space continues to grow due to the advancement of applications,
increasing number of compiler optimizations, and new target architectures.
Generic optimization passes in compilers cannot fully leverage newly introduced
optimizations and, therefore, cannot keep up with the pace of increasing
options. This survey summarizes and classifies the recent advances in using
machine learning for the compiler optimization field, particularly on the two
major problems of (1) selecting the best optimizations and (2) the
phase-ordering of optimizations. The survey highlights the approaches taken so
far, the obtained results, the fine-grain classification among different
approaches and, finally, the influential papers of the field.
Comment: version 5.0 (updated September 2018). Preprint version of the survey accepted at ACM CSUR 2018 (42 pages). The survey will be updated quarterly (send the authors newly published papers to be added in subsequent versions). History: received November 2016; revised August 2017; revised February 2018; accepted March 2018.
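The optimization-selection problem the survey describes amounts to learning a mapping from program features to a known-good set of compiler flags. As a hedged, deliberately tiny sketch of that idea, here is a hypothetical 1-nearest-neighbour selector; real autotuners in the surveyed literature use richer features and models (decision trees, neural networks), and all names below are invented for illustration.

```python
def select_optimizations(features, training):
    """Pick compiler flags for a program via 1-nearest-neighbour lookup.

    features: tuple of numeric program features (e.g. loop count,
              average trip count) -- hypothetical feature choices.
    training: list of (feature_tuple, flag_set) pairs collected from
              earlier autotuning runs on other programs.
    Returns the flag set of the most similar training program.
    """
    def dist(a, b):
        # Squared Euclidean distance between feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, flags = min(training, key=lambda pair: dist(pair[0], features))
    return flags
```

The same skeleton extends to phase-ordering by predicting a sequence of passes instead of a flag set, which is the harder of the two problems the survey covers.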
A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing
Data Grids have been adopted as the platform for scientific communities that
need to share, access, transport, process and manage large data collections
distributed worldwide. They combine high-end computing technologies with
high-performance networking and wide-area storage management techniques. In
this paper, we discuss the key concepts behind Data Grids and compare them with
other data sharing and distribution paradigms such as content delivery
networks, peer-to-peer networks and distributed databases. We then provide
comprehensive taxonomies that cover various aspects of architecture, data
transportation, data replication and resource allocation and scheduling.
Finally, we map the proposed taxonomy to various Data Grid systems not only to
validate the taxonomy but also to identify areas for future exploration.
Through this taxonomy, we aim to categorise existing systems to better
understand their goals and their methodology. This would help evaluate their
applicability for solving similar problems. This taxonomy also provides a "gap
analysis" of this area through which researchers can potentially identify new
issues for investigation. Finally, we hope that the proposed taxonomy and
mapping also helps to provide an easy way for new practitioners to understand
this complex area of research.
Comment: 46 pages, 16 figures, Technical Report
Vcluster: A Portable Virtual Computing Library For Cluster Computing
Message passing has been the dominant parallel programming model in cluster computing, and libraries like the Message Passing Interface (MPI) and Parallel Virtual Machine (PVM) have proven their utility and efficiency through numerous applications in diverse areas. However, as clusters of Symmetric Multi-Processor (SMP) and heterogeneous machines become popular, conventional message passing models must be adapted to support this new kind of cluster efficiently. In addition, the Java programming language, with features such as an object-oriented architecture, platform-independent bytecode, and native support for multithreading, is an attractive alternative language for cluster computing. This research presents a new parallel programming model and a library called VCluster that implements this model on top of a Java Virtual Machine (JVM). The programming model is based on virtual migrating threads to support clusters of heterogeneous SMP machines efficiently. VCluster is implemented in 100% Java, utilizing the portability of Java to address the problems of heterogeneous machines. VCluster virtualizes computational and communication resources such as threads, computation states, and communication channels across multiple separate JVMs, which makes mobile threads possible. Equipped with virtual migrating threads, it is feasible to balance the load of computing resources dynamically. Several large-scale parallel applications have been developed using VCluster to compare its performance and usage with other libraries. The results of the experiments show that VCluster makes it easier to develop multithreaded parallel applications than conventional libraries like MPI, while its performance is comparable to MPICH, a widely used MPI library, combined with popular threading libraries like POSIX Threads and OpenMP.
In the next phase of our work, we implemented thread groups and thread migration to demonstrate the feasibility of dynamic load balancing in VCluster. Our experiments show that the load can be dynamically balanced in VCluster, resulting in better performance. Thread groups also make it possible to implement collective communication functions between threads, which have proven useful in process-based libraries.
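The key idea behind a migratable thread is keeping the computation's state in an explicit, serializable form rather than on the native stack, so it can be checkpointed on one node and resumed on another. The following Python sketch illustrates only that idea; the real VCluster library runs on the JVM in Java, and every name here is hypothetical.

```python
import pickle

class VirtualThread:
    """Illustrative migratable unit of computation (hypothetical design).

    State lives in a picklable dict, so the thread can be checkpointed,
    shipped to another process or node, and resumed there -- the essence
    of a virtual migrating thread, though not VCluster's actual API.
    """
    def __init__(self, state):
        self.state = state  # e.g. {"i": 0, "acc": 0, "n": 100}

    def step(self):
        # One unit of work: fold the current index into the accumulator.
        s = self.state
        if s["i"] < s["n"]:
            s["acc"] += s["i"]
            s["i"] += 1
        return s["i"] < s["n"]  # True while work remains

    def checkpoint(self):
        return pickle.dumps(self.state)

    @classmethod
    def resume(cls, blob):
        return cls(pickle.loads(blob))
```

Because the whole state is a byte string after `checkpoint()`, a load balancer could move a busy node's threads elsewhere mid-computation, which is what enables the dynamic load balancing described above.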
Hardware-Aware Algorithm Designs for Efficient Parallel and Distributed Processing
The introduction and widespread adoption of the Internet of Things, together with emerging new industrial applications, bring new requirements in data processing. Specifically, the need for timely processing of data that arrives at high rates creates a challenge for the traditional cloud computing paradigm, where data collected at various sources is sent to the cloud for processing. As an approach to this challenge, processing algorithms and infrastructure are distributed from the cloud to multiple tiers of computing, closer to the sources of data. This creates a wide range of devices for algorithms to be deployed on and software designs to adapt to.
In this thesis, we investigate how hardware-aware algorithm designs on a variety of platforms lead to algorithm implementations that efficiently utilize the underlying resources. We design, implement and evaluate new techniques for representative applications that involve the whole spectrum of devices, from resource-constrained sensors in the field to highly parallel servers. At each tier of processing capability, we identify key architectural features that are relevant for applications and propose designs that make use of these features to achieve high-rate, timely and energy-efficient processing.
In the first part of the thesis, we focus on high-end servers and utilize two main approaches to achieve high-throughput processing: vectorization and thread parallelism. We employ vectorization for pattern matching algorithms used in security applications. We show that re-thinking the design of algorithms to better utilize the resources available on the platforms they are deployed on, such as vector processing units, can bring significant speedups in processing throughput. We then show how thread-aware data distribution and proper inter-thread synchronization allow scalability, especially for the problem of high-rate network traffic monitoring.
We design a parallelization scheme for sketch-based algorithms that summarize traffic information, which allows them to handle incoming data at high rates and answer queries on that data efficiently, without overheads.
In the second part of the thesis, we target the intermediate tier of computing devices and focus on typical examples of the hardware found there. We show how single-board computers with embedded accelerators can be used to handle the computationally heavy part of applications, and showcase this specifically for pattern matching in security-related processing. We further identify key hardware features that affect the performance of pattern matching algorithms on such devices, present a co-evaluation framework to compare algorithms, and design a new algorithm that efficiently utilizes these hardware features.
In the last part of the thesis, we shift the focus to the low-power, resource-constrained tier of processing devices. We target wireless sensor networks and study distributed data processing algorithms where the processing happens on the same devices that generate the data. Specifically, we focus on a continuous monitoring algorithm (geometric monitoring) that aims to minimize communication between nodes. By deploying that algorithm under realistic conditions, we demonstrate that the interplay between the network protocol and the application plays an important role at this tier of devices. Based on that observation, we co-design a continuous monitoring application with a modern network stack and augment it further with an in-network aggregation technique. In this way, we show that awareness of the underlying network stack is important to realize the full potential of the continuous monitoring algorithm.
The techniques and solutions presented in this thesis contribute to better utilization of hardware characteristics across a wide spectrum of platforms.
We employ these techniques on problems that are representative of current and upcoming applications, and conclude with an outlook on emerging possibilities that can build on the results of the thesis.
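The sketch-based traffic summaries mentioned above are typically built on compact probabilistic structures; a standard example is the count-min sketch, shown below in a minimal single-threaded form. This is only an assumption about the family of structures involved: the thesis's contribution is the thread-aware parallelization, which this sketch does not attempt to reproduce.

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch for per-flow traffic counts.

    width and depth trade memory for accuracy; estimates can overcount
    (due to hash collisions) but never undercount. A parallel version
    would partition or replicate rows across threads with care taken
    over synchronization, as the thesis describes.
    """
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # Derive one hash function per row from a keyed digest.
        h = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        # The minimum over rows bounds the overcount from collisions.
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.depth))
```

A monitoring node would call `add` per packet (keyed by flow identifier) and answer heavy-hitter queries from `estimate`, which is the high-rate ingest / efficient query split the abstract refers to.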
Faculty Publications and Creative Works 2004
Faculty Publications & Creative Works is an annual compendium of scholarly and creative activities of University of New Mexico faculty during the noted calendar year. Published by the Office of the Vice President for Research and Economic Development, it serves to illustrate the robust and active intellectual pursuits conducted by the faculty in support of teaching and research at UNM.
Fault-tolerance and malleability in parallel message-passing applications
This Thesis focuses on exploring fault-tolerance and malleability solutions, based on checkpoint and restart techniques, for parallel message-passing applications. In the fault-tolerance field, this Thesis contributes to improving the most important overhead factor in checkpointing performance, that is, the I/O cost of the state file dumping, through the proposal of different techniques to reduce the checkpoint file size. In addition, a process migration mechanism based on checkpointing is also proposed, which allows for proactively migrating processes from nodes that are about to fail, avoiding the complete restart of the execution and, thus, improving the application resilience. Finally, this Thesis also includes a proposal to transparently transform MPI applications into malleable jobs, that is, parallel programs that are able to adapt their execution to the number of available processors at runtime, which provides important benefits for the end users and the whole system, such as higher productivity, a better response time, or a greater resilience to node failures.
All the solutions proposed in this Thesis have been implemented at the application level, and they are independent of the hardware architecture, the operating system, the MPI implementation used, and of any higher-level frameworks, such as job submission frameworks.
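Application-level checkpointing, as the abstract describes it, saves only the variables the application itself needs, which is what keeps state files small and the approach independent of hardware, OS and MPI implementation. The following is an illustrative sketch of that pattern for a trivially simple iterative job, not the thesis's actual library; the file layout and names are invented.

```python
import os
import pickle

def run_with_checkpoint(path, n):
    """Iterative job with application-level checkpoint/restart.

    The state file at `path` holds just the loop index and partial sum.
    On startup we restart from the file if it exists; during the run we
    dump state periodically so a failure loses at most a few iterations.
    """
    # Restart from the checkpoint if one exists, else start fresh.
    if os.path.exists(path):
        with open(path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"i": 0, "acc": 0}

    while state["i"] < n:
        state["acc"] += state["i"]
        state["i"] += 1
        if state["i"] % 10 == 0:  # periodic checkpoint
            with open(path, "wb") as f:
                pickle.dump(state, f)
    return state["acc"]
```

Shrinking what goes into that dump (the `state` dict here) is exactly the I/O-cost lever the thesis targets, and migrating a process amounts to restarting from the same file on a different node.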