
    Kernel-Level Measurement for Integrated Parallel Performance Views: the KTAU Project

    The effect of the operating system on application performance is an increasingly important consideration in high performance computing. OS kernel measurement is key to understanding the performance influences and the interrelationship of system- and user-level performance factors. The KTAU (Kernel TAU) methodology and Linux-based framework provide parallel kernel performance measurement from both a kernel-wide and a process-centric perspective. The first characterizes overall aggregate kernel performance for the entire system; the second characterizes kernel performance when the kernel runs in the context of a particular process. KTAU extends the TAU performance system with kernel-level monitoring while leveraging TAU's measurement and analysis capabilities. We explain the rationale and motivations behind our approach, describe the KTAU design and implementation, and show working examples on multiple platforms demonstrating the versatility of KTAU in integrated system/application monitoring.
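    The two measurement perspectives can be illustrated with a toy aggregation over per-process kernel timing samples. This is a minimal sketch; the sample records and kernel function names below are hypothetical, not KTAU's actual data format.

```python
# Sketch: the two aggregation views a KTAU-style kernel measurement provides.
# The sample records below are hypothetical, not KTAU's actual data format.
from collections import defaultdict

# (pid, kernel_function, time_spent_ns) samples, e.g. gathered by kernel probes
samples = [
    (101, "schedule",      1200),
    (101, "do_page_fault",  800),
    (202, "schedule",       400),
    (202, "sys_read",       600),
]

# Kernel-wide view: aggregate time per kernel function across all processes
kernel_wide = defaultdict(int)
for pid, func, ns in samples:
    kernel_wide[func] += ns

# Process-centric view: time per kernel function in the context of one process
per_process = defaultdict(lambda: defaultdict(int))
for pid, func, ns in samples:
    per_process[pid][func] += ns

print(dict(kernel_wide))       # {'schedule': 1600, 'do_page_fault': 800, 'sys_read': 600}
print(dict(per_process[101]))  # {'schedule': 1200, 'do_page_fault': 800}
```

The same raw samples feed both views; only the grouping key differs, which is why the two perspectives can be kept consistent with each other.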

    Synchronization of distributed traces using low-level events

    Event tracing has proven to be a valuable tool for identifying functional and performance problems. It has helped to identify problems at the application, operating system, and device-driver levels. To extend its benefits to distributed systems, one approach is to record traces individually on each node and to analyze them in a post-processing step. For this to be meaningful, the timestamps in the traces have to be synchronized with precision and accuracy. This dissertation focuses on offline synchronization of traces recorded on distributed systems. The objective is to identify a trace synchronization algorithm that is precise, runs in linear time with respect to the number of events in the traces, guarantees the absence of message inversions, and identifies accuracy bounds. The method put forward uses kernel tracing to record network events with low intrusiveness. A global timebase is built by analyzing the strict ordering relationships between events that correspond to the emission and reception of messages. We concentrate on the synchronization of pairs of traces using linear clock correction functions.
    The parameters of these functions are identified using the convex-hull algorithm. It guarantees the absence of message inversions, and it has been extended to identify accuracy bounds at any point in the tracing interval. This algorithm was contributed to a trace analysis tool and was used on a variety of long-running and large traces recorded on real systems. The experiments conducted show that offline synchronization accuracy is improved by using a network with lower latency and by using a higher message rate. With a constant message rate, lengthening the trace duration reduces precision and gives a false impression of improving accuracy. This is detected using metrics based on message propagation delays after synchronization. Synchronization factors can be found in linear time. The time to find accuracy bounds is quadratic in the worst case, but it scales almost linearly on practical traces. During our experiments, we achieved a synchronization accuracy of ±15 µs and an estimated precision of 9 µs on a network with an estimated minimum propagation delay of 39 µs. Identifying strict accuracy bounds during offline synchronization of traces represents a scientific advance, whereas the efficient implementation of a trace synchronization algorithm in a practical tool represents a technical advance.
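    The core idea of deriving clock-correction bounds from message orderings can be sketched in simplified form. This sketch assumes zero clock drift and bounds only a constant offset; the thesis's convex-hull algorithm additionally fits the drift term of the linear correction. All timestamps below are synthetic.

```python
# Simplified sketch of bounding a clock offset from network message events.
# Model: clock_B = clock_A + offset, and every message has a positive delay.
#   A -> B: recv_B = send_A + offset + delay  =>  offset < recv_B - send_A
#   B -> A: recv_A = send_B - offset + delay  =>  offset > send_B - recv_A
a_to_b = [(10.0, 15.3), (20.0, 25.1), (30.0, 35.4)]   # (send_A, recv_B) pairs
b_to_a = [(12.0, 7.2), (22.0, 17.3)]                  # (send_B, recv_A) pairs

upper = min(recv - send for send, recv in a_to_b)   # tightest upper bound
lower = max(send - recv for send, recv in b_to_a)   # tightest lower bound

estimate = (lower + upper) / 2   # midpoint estimate of the offset
accuracy = (upper - lower) / 2   # +/- bound, analogous to the accuracy bounds above

print(f"offset in ({lower:.2f}, {upper:.2f}), estimate {estimate:.2f} +/- {accuracy:.2f}")
```

Any correction chosen inside the (lower, upper) interval preserves the send-before-receive order of every recorded message, which is the property that rules out message inversions.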

    Spectral analysis of executions of computer programs and its applications on performance analysis

    This work is motivated by the growing intricacy of high performance computing infrastructures. For example, the MareNostrum supercomputer (installed in 2005 at BSC) has 10,240 processors, and there are now machines with more than 100,000 processors. The complexity of these systems increases the complexity of the manual performance analysis of parallel applications, so automatic tools and methodologies become mandatory. The performance analysis group of BSC and UPC has long experience in analyzing parallel applications. Its approach consists mainly in the analysis of tracefiles (obtained from executions of parallel applications) using performance analysis and visualization tools such as Paraver. Given the general characteristics of current systems, this method can be very expensive in terms of time, and inefficient. To overcome these problems, this thesis makes several contributions. The first is an automatic system able to detect the internal structure of executions of high performance computing applications. This system can rule out non-significant regions of executions, detect redundancies, and select small but significant execution regions. The automatic detection process is based on spectral analysis (wavelet transform, Fourier transform, etc.) and works by detecting the most important frequencies of the application's execution. These main frequencies are strongly related to the internal loops of the application's source code. Automatic detection of small but significant execution regions remarkably reduces the complexity of the performance analysis process. The second contribution is an automatic methodology able to show general but non-trivial performance trends, which can be very useful to the analyst when carrying out a performance analysis of the application. The methodology is based on an analytical model.
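    The frequency-detection step of the first contribution can be sketched with a naive discrete Fourier transform over a synthetic signal. The signal here is a stand-in for a performance metric sampled over time from a tracefile; the period and sample counts are illustrative.

```python
# Sketch: detecting the dominant frequency of an execution's iterative
# structure with a discrete Fourier transform. The synthetic signal stands
# in for a metric sampled over time from a tracefile.
import cmath
import math

N = 256
period = 16  # samples per iteration of the hypothetical main loop
signal = [math.sin(2 * math.pi * k / period) for k in range(N)]

# Naive DFT magnitude at frequency bin f: |sum_k x[k] * exp(-2*pi*i*f*k/N)|
def dft_magnitude(x, f):
    n = len(x)
    return abs(sum(x[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n)))

# Scan the positive frequency bins and keep the strongest one
dominant = max(range(1, N // 2), key=lambda f: dft_magnitude(signal, f))
print(dominant)  # bin 16: the signal completes 16 cycles in 256 samples
```

The dominant bin maps directly back to the loop period (N / dominant samples per iteration), which is how the main frequencies relate to the internal loops of the source code.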
    The analytical model consists of several performance factors that modify the value of the linear speedup in order to fit the real speedup. That is, if the real speedup is far from the linear one, we immediately detect which of the performance factors is undermining the scalability of the application. The second main characteristic of the analytical model is that it can be used to predict the performance of high performance computing applications: from several executions on a few processors, we extract the model's performance factors and extrapolate these values to executions on a higher number of processors, obtaining a speedup prediction from the analytical model. The third contribution is the automatic detection of the optimal sampling frequency of applications. We show that this frequency can be extracted using spectral analysis. For sequential applications, we show that using this frequency improves existing results of recognized techniques focused on reducing a serial application's instruction execution stream (SimPoint, SMARTS, etc.). For parallel benchmarks, we show that the optimal frequency is very useful for extracting significant performance information efficiently and accurately. In summary, this thesis proposes a set of techniques based on signal processing. Their main focus is to perform an automatic analysis of applications, reporting an initial diagnosis of their performance and showing their internal iterative structure. These methods also provide a reduced tracefile from which it is easy to start a manual fine-grain performance analysis. The contributions of the thesis are not limited to proposals and publications: the research carried out over these last years has produced a tool for analyzing an application's structure.
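    A multiplicative efficiency model in the spirit of the analytical model described above can be sketched as follows. The factor definitions and all numbers are illustrative, not the thesis's exact model.

```python
# Sketch of a multiplicative performance-factor model: parallel efficiency
# is decomposed into factors whose product explains the gap between linear
# and real speedup. Factor definitions and numbers are illustrative.

useful = [9.0, 8.0, 7.5, 7.0]   # hypothetical per-process useful-computation time (s)
runtime = 10.0                  # wall-clock time of the parallel run (s)
serial_time = sum(useful)       # assume the serial run does the same useful work

p = len(useful)
load_balance = (sum(useful) / p) / max(useful)  # 1.0 when perfectly balanced
comm_eff = max(useful) / runtime                # fraction of runtime not lost waiting
efficiency = load_balance * comm_eff            # equals avg(useful) / runtime

real_speedup = serial_time / runtime            # 31.5 / 10 = 3.15
linear_speedup = float(p)                       # 4.0

# When real speedup falls short of linear, the smallest factor points at the culprit
print(load_balance, comm_eff, real_speedup)
```

Because the factors multiply out to the observed efficiency, extracting them from small runs and extrapolating each factor separately is what makes a speedup prediction possible.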
    Moreover, the methodology is general and can be adapted to many performance analysis methods, remarkably improving their efficiency, flexibility, and generality.