74 research outputs found

    An Inhomogeneous Model for Laser Welding of Industrial Interest

    An innovative non-homogeneous dynamic model is presented for recovering the temperature field during the industrial laser welding of Al-Si alloy plates. It accounts for the fact that, metallurgically, the alloy melts through a mixed solid/liquid phase until it is fully molten and then resolidifies through the reverse process. A polynomial substitute thermal capacity of the alloy is chosen on the basis of experimental evidence so that the volumetric solid-state fraction can be identified. In addition to the usual radiative/convective boundary conditions, the contribution of the plates resting on the workbench is taken into account, endowing the model with Cauchy–Stefan–Boltzmann boundary conditions. Having verified the well-posedness of the problem, a Galerkin-FEM approach is implemented to recover the temperature maps, with the laser heat sources modeled by formulations that depend on the laser sliding speed. The results show good agreement with the experimental evidence, opening up interesting future scenarios for technology transfer.
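
    The entry above describes a substitute (apparent) heat-capacity treatment of the solid/liquid transition solved with a Galerkin-FEM scheme. As a rough illustration of the substitute-heat-capacity idea only, the sketch below solves a 1D transient heat equation with an explicit finite-difference step instead of the authors' FEM formulation; every material value, the laser power density, and the sliding speed are placeholder assumptions, not data from the paper.

    # Illustrative sketch only, not the authors' model: a 1D explicit finite-difference
    # solver for transient heat conduction in which the solid/liquid transition is folded
    # into a polynomial "apparent" (substitute) heat capacity. All values are placeholders.
    import numpy as np

    RHO = 2700.0                    # assumed density [kg/m^3]
    K = 160.0                       # assumed thermal conductivity [W/(m K)]
    T_SOL, T_LIQ = 850.0, 900.0     # assumed solidus/liquidus temperatures [K]

    def apparent_heat_capacity(T):
        """Substitute heat capacity [J/(kg K)]: a polynomial base c_p plus a bump
        between solidus and liquidus that stands in for the latent heat."""
        c_base = 900.0 + 0.3 * T
        latent_bump = np.where((T > T_SOL) & (T < T_LIQ), 8000.0, 0.0)
        return c_base + latent_bump

    def step(T, dx, dt, q_laser):
        """One explicit step of rho * c_app(T) * dT/dt = k * d2T/dx2 + q_laser."""
        lap = np.zeros_like(T)
        lap[1:-1] = (T[2:] - 2.0 * T[1:-1] + T[:-2]) / dx**2
        return T + dt * (K * lap + q_laser) / (RHO * apparent_heat_capacity(T))

    # Usage: a 10 cm strip heated by a Gaussian source sliding at 0.2 m/s (assumed)
    x = np.linspace(0.0, 0.1, 201)
    dx = x[1] - x[0]
    T = np.full_like(x, 300.0)
    for n in range(2000):
        laser_pos = 0.02 + 2e-5 * n
        q = 1e11 * np.exp(-((x - laser_pos) / 2e-3) ** 2)   # placeholder power density
        T = step(T, dx, dt=1e-4, q_laser=q)
    print("peak temperature [K]:", T.max())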

    Vector processing-aware advanced clock-gating techniques for low-power fused multiply-add

    The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need retailoring for the mobile market they are now entering. The floating-point (FP) fused multiply-add (FMA) unit, being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially in the active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and “real-world” application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming an active VFU operating at peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together using “real-world” benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion. The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA 321253 and is supported in part by the European Union (FEDER funds) under contract TTIN2015-65316-P. The work of I. Ratkovic was supported by an FPU research grant from the Spanish MECD.
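
    As a purely illustrative companion to the abstract above (not the authors' evaluation flow, which targets synthesized RTL), the first-order model below estimates the dynamic-power saving obtained by clock-gating the FMA lanes whose vector-mask bits are off. The lane count, the residual power of a gated lane, and the mask patterns are assumptions chosen for the example.

    # First-order illustration, not the paper's methodology: dynamic-power saving from
    # clock-gating the vector FMA lanes whose mask bits are off. Numbers are assumptions.
    from dataclasses import dataclass

    @dataclass
    class VfuConfig:
        lanes: int = 8               # physical FMA lanes in the VFU (assumed)
        p_lane_active: float = 1.0   # normalized dynamic power of an active lane
        p_lane_gated: float = 0.05   # residual power of a clock-gated lane (assumed)

    def vfu_power(mask_bits, cfg=VfuConfig()):
        """Power of one vector issue: active lanes toggle, masked lanes are gated."""
        active = sum(mask_bits)
        gated = cfg.lanes - active
        return active * cfg.p_lane_active + gated * cfg.p_lane_gated

    # Usage: fully active vector vs. a vector with every other element masked off
    baseline = vfu_power([1] * 8)
    gated = vfu_power([1, 0] * 4)
    print(f"estimated saving with mask-aware gating: {100 * (1 - gated / baseline):.1f}%")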

    The HPCG benchmark: analysis, shared memory preliminary improvements and evaluation on an Arm-based platform

    The High-Performance Conjugate Gradient (HPCG) benchmark complements the LINPACK benchmark in the performance evaluation coverage of large High-Performance Computing (HPC) systems. Due to its lower arithmetic intensity and higher memory pressure, HPCG is recognized as a more representative benchmark for data-center and irregular memory access pattern workloads, and its popularity and acceptance are therefore rising within the HPC community. As only a small fraction of the reference version of the HPCG benchmark is parallelized with shared memory techniques (OpenMP), we introduce in this report two OpenMP parallelization methods. Due to the increasing importance of the Arm architecture in the HPC scenario, we evaluate our HPCG code at scale on a state-of-the-art HPC system based on the Cavium ThunderX2 SoC. We consider our work a contribution to the Arm ecosystem: along with this technical report, we plan to release our code to help boost the tuning of the HPCG benchmark within the Arm community.
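
    For readers unfamiliar with the benchmark, the sketch below shows the kernel structure that makes HPCG memory-bound: sparse matrix-vector products, dot-product reductions, and vector updates inside a conjugate gradient loop. It is a plain, unpreconditioned CG in Python, not the HPCG reference code, and it omits the multigrid preconditioner, halo exchanges, and the OpenMP parallelization discussed in the report; the 1D Poisson matrix merely stands in for HPCG's 27-point stencil problem.

    # Sketch of the kernels that make HPCG memory-bound (SpMV, dot products, vector
    # updates); this is plain unpreconditioned CG, *not* the HPCG reference code, and it
    # omits the multigrid preconditioner, halo exchanges and any OpenMP parallelism.
    import numpy as np
    import scipy.sparse as sp

    def cg(A, b, tol=1e-8, max_iter=5000):
        x = np.zeros_like(b)
        r = b - A @ x                    # SpMV + vector update
        p = r.copy()
        rs_old = r @ r                   # dot product (global reduction in HPCG)
        for _ in range(max_iter):
            Ap = A @ p                   # SpMV: the memory-bound hot loop
            alpha = rs_old / (p @ Ap)
            x += alpha * p               # AXPY-style vector updates
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    # Usage: a 1D Poisson matrix stands in for HPCG's 27-point stencil problem
    n = 1000
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)
    x = cg(A, b)
    print("residual norm:", np.linalg.norm(b - A @ x))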

    Book Reviews


    Automated analysis of the causes of process blocking from an execution trace

    The current trend towards parallelization increasingly confronts developers with performance bugs whose cause is difficult to identify. These bugs are frequently due to unexpected interactions between software components that execute concurrently. The tighter integration and the complexity of debugging large systems worsen this problem. One type of bug that is particularly difficult to locate is a performance problem whose root cause is separated from its symptom by a chain of blockings. Current tools provide little help with these problems. The aim of this work is therefore to design a tool that helps with debugging performance problems involving chains of blockings. This thesis introduces this new approach and discusses its implementation in the LTTV Delay Analyzer. The Linux Trace Toolkit (LTTng) is used for trace recording and most of the instrumentation, allowing production systems to be traced with great precision and minimal performance impact. The approach uses kernel instrumentation exclusively and does not require recompiling applications. The analysis tool produces a report that shows in detail how time was spent in a process between two given events. For each category, another report lists the time spans during which the process was in that state. Finally, in cases where the process was blocked, the complete chain of blockings is displayed. The LTTV Delay Analyzer was used to quickly analyze and fix complex performance problems, something impossible with existing tools. Analysis time grows linearly with trace size. Memory usage during the analysis of large traces also grows linearly with trace size, but a strategy to make it constant is described. The method could serve as a starting point for future work, including the analysis of blocking chains that span several computers or that involve physical machines together with the virtual machines they host.
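
    To make the idea of a blocking chain concrete, the sketch below walks a toy list of scheduler events, measures how long a task was blocked, and records who ultimately woke whom. It is only a conceptual illustration: the event format is invented for the example and does not reproduce the LTTng trace format or the LTTV Delay Analyzer's actual algorithm.

    # Conceptual sketch, not the LTTV Delay Analyzer: given simplified scheduler events,
    # report how long a task was blocked and follow the "who woke whom" links that form
    # the chain of blockings. The event format is invented for this example.
    from collections import namedtuple

    Event = namedtuple("Event", "time kind task arg")
    # kind is "block" (task goes to sleep) or "wakeup" (arg = the task being woken)

    def blocking_chains(events, root_task):
        """Return (total time root_task was blocked, list of (blocked, waker, duration))."""
        block_start = {}                 # task -> time it went to sleep
        blocked_total = 0.0
        chain = []
        for ev in sorted(events, key=lambda e: e.time):
            if ev.kind == "block":
                block_start[ev.task] = ev.time
            elif ev.kind == "wakeup" and ev.arg in block_start:
                start = block_start.pop(ev.arg)
                if ev.arg == root_task:
                    blocked_total += ev.time - start
                chain.append((ev.arg, ev.task, ev.time - start))
        return blocked_total, chain

    # Usage: A blocks waiting on B, B blocks waiting on C; C wakes B, then B wakes A
    trace = [
        Event(0.0, "block", "A", None),
        Event(0.5, "block", "B", None),
        Event(2.0, "wakeup", "C", "B"),
        Event(2.1, "wakeup", "B", "A"),
    ]
    total, chain = blocking_chains(trace, "A")
    print("A blocked for", total, "s; blocking chain:", chain)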

    Qucs workbook

    This document is intended to be a workbook for RF and microwave designers. Our intention is not to provide an RF course, but to cover a number of tricky RF topics. The goal is to emphasize design rules and the workflow for RF design using CAD programs. This workflow is developed through a series of different subjects.

    Functional Verification of Processor Execution Units

    The thesis deals with the integration of functional verification into the design cycle of processor execution units within Codasip, a hardware-software co-design environment. The aim of the thesis is to design and implement a verification environment in SystemVerilog in order to verify the automatically generated hardware representation of these execution units. The introduction discusses the benefits and common practices of functional verification and the principles of the Codasip system. The following chapters describe the design and implementation of the verification environment for an arithmetic-logic unit, together with an analysis of the verification runs and their results. Finally, the achieved results are reviewed and improvements for further development of the verification environment are proposed.
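
    The core principle described above, checking a generated hardware model against a trusted reference under randomized stimulus, can be illustrated independently of SystemVerilog. The Python sketch below is only a conceptual stand-in, not the Codasip flow or its verification environment: the DUT function is a placeholder where a real flow would drive the simulated RTL, and the opcode set is an assumption.

    # Conceptual stand-in, not the Codasip SystemVerilog environment: random stimulus is
    # applied to a placeholder DUT model and every result is compared with a golden
    # reference ALU. The opcode set and the DUT hook are assumptions for the sketch.
    import random

    MASK32 = 0xFFFFFFFF

    def golden_alu(op, a, b):
        """Reference model: the behaviour the generated hardware is expected to match."""
        if op == "ADD":
            return (a + b) & MASK32
        if op == "SUB":
            return (a - b) & MASK32
        if op == "AND":
            return a & b
        if op == "XOR":
            return a ^ b
        raise ValueError(op)

    def dut_alu(op, a, b):
        """Placeholder for the device under test; a real flow would drive the simulated
        RTL here instead of calling the reference model."""
        return golden_alu(op, a, b)

    def run_verification(num_tests=10_000, seed=1):
        rng = random.Random(seed)
        ops = ["ADD", "SUB", "AND", "XOR"]
        for _ in range(num_tests):
            op = rng.choice(ops)
            a, b = rng.getrandbits(32), rng.getrandbits(32)
            assert dut_alu(op, a, b) == golden_alu(op, a, b), f"mismatch on {op} {a:#x} {b:#x}"
        print(f"{num_tests} random ALU operations matched the reference model")

    run_verification()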

    A quantitative method to decide where and when it is profitable to use models for integration and testing

    Industrial trends show that the lead time and costs of integrating and testing high-tech multi-disciplinary systems are becoming critical factors for commercial success. In our research, we developed a method for early, model-based integration and testing to reduce this criticality. Although its benefits have been demonstrated in industrial practice, the method requires certain investments to achieve them, e.g. the time needed for modeling. Making the necessary trade-off between investments and potential benefits to decide when modeling is profitable is a difficult task that is often based on personal intuition and experience. In this paper, we describe how integration and test sequencing techniques can be used to quantitatively determine where and when the integration and testing process can profit from models. An industrial case study shows that it is feasible to quantify the costs and benefits of using models in terms of risk, time, and costs, such that the profitability can be determined.
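
    The trade-off the paper quantifies can be sketched with a toy expected-cost calculation: the cost of an integration step is the testing effort plus the risk (fault probability times fix cost), and modeling is profitable when the modeling investment plus the cheaper early fix beats the expensive late fix. The figures and the simple formula below are assumptions for illustration and do not reproduce the paper's sequencing technique.

    # Toy illustration, not the paper's method: expected cost of an integration step is
    # the testing effort plus risk (fault probability times fix cost); modeling pays off
    # when its investment plus the cheaper early fix beats the late fix. Numbers assumed.

    def expected_cost(p_fault, fix_cost, test_cost, model_cost=0.0):
        """Expected cost of one integration-and-test step, in arbitrary cost units."""
        return model_cost + test_cost + p_fault * fix_cost

    p_fault = 0.3       # assumed probability that the interface under test has a fault
    late_fix = 100.0    # assumed cost of fixing it after full system integration
    early_fix = 10.0    # assumed cost of fixing it against a model, before integration

    without_model = expected_cost(p_fault, late_fix, test_cost=5.0)
    with_model = expected_cost(p_fault, early_fix, test_cost=5.0, model_cost=8.0)

    print(f"expected cost without model: {without_model:.1f}")
    print(f"expected cost with model:    {with_model:.1f}")
    print("modeling is profitable" if with_model < without_model else "modeling is not profitable")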

    Annual reports of the town officers. Brookfield, New Hampshire, 1979. For the fiscal year ending December 31, 1979. Vital statistics for 1979.

    This is an annual report containing vital statistics for a town/city in the state of New Hampshire.

    A Case for Fine-Grain Adaptive Cache Coherence

    As transistor density continues to grow geometrically, processor manufacturers are already able to place a hundred cores on a chip (e.g., Tilera TILE-Gx 100), with massive multicore chips on the horizon. Programmers now need to invest more effort in designing software capable of exploiting multicore parallelism. The shared memory paradigm provides a convenient layer of abstraction to the programmer, but will current memory architectures scale to hundreds of cores? This paper directly addresses the question of how to enable scalable memory systems for future multicores. We develop a scalable, efficient shared memory architecture that enables seamless adaptation between private and logically shared caching at the fine granularity of cache lines. Our data-centric approach relies on in-hardware runtime profiling of the locality of each cache line and allows private caching only for data blocks with high spatio-temporal locality. This lets us better exploit on-chip cache capacity and enable low-latency memory access in large-scale multicores.
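
    As a conceptual illustration of per-line locality profiling (not the protocol proposed in the paper), the sketch below keeps a reuse counter per cache line and classifies a line as privately cacheable once the same core has reused it a threshold number of times; the threshold and the classification policy are assumptions made for the example.

    # Conceptual sketch, not the protocol proposed in the paper: a per-cache-line reuse
    # counter classifies a line as privately cacheable once the same core has reused it
    # a threshold number of times; otherwise the line stays remote/shared.
    from collections import defaultdict

    PRIVATE_THRESHOLD = 4   # assumed number of reuses before private caching is allowed

    class LocalityProfiler:
        def __init__(self):
            # line address -> (last core to touch it, consecutive reuse count)
            self.state = defaultdict(lambda: (None, 0))

        def access(self, core, line_addr):
            """Record an access and return the caching decision for this line."""
            last_core, count = self.state[line_addr]
            count = count + 1 if core == last_core else 1
            self.state[line_addr] = (core, count)
            return "private" if count >= PRIVATE_THRESHOLD else "remote"

    # Usage: core 0 repeatedly reuses one line; core 1 touches a contended line once
    prof = LocalityProfiler()
    for _ in range(6):
        decision = prof.access(core=0, line_addr=0x1000)
    print("line reused by core 0 ->", decision)
    print("line touched once by core 1 ->", prof.access(core=1, line_addr=0x2000))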