An Inhomogeneous Model for Laser Welding of Industrial Interest
An innovative non-homogeneous dynamic model is presented for recovering the temperature during the industrial laser welding of Al-Si alloy plates. It accounts for the fact that, metallurgically, the alloy melts during welding through coexisting solid/liquid phases until it is fully molten, and afterwards resolidifies through the reverse process. Further, a polynomial substitute thermal capacity of the alloy is chosen on the basis of experimental evidence, so that the volumetric solid-state fraction is identifiable. Moreover, in addition to the usual radiative/convective boundary conditions, the contribution due to the positioning of the plates on the workbench is considered (endowing the model with Cauchy–Stefan–Boltzmann boundary conditions). Having verified the well-posedness of the problem, a Galerkin-FEM approach is implemented to recover the temperature maps, with the laser heat sources modeled by formulations depending on the laser sliding speed. The results show good agreement with the experimental evidence, opening up interesting future scenarios for technology transfer.
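The substitute-capacity idea can be sketched numerically. The following toy 1D model is illustrative only: all material values, the solidus/liquidus interval, the polynomial bump, and the laser parameters are invented placeholders, and fixed-temperature boundaries stand in for the paper's radiative/convective conditions.

```python
import math

# Hypothetical mushy-zone interval standing in for the real alloy data.
T_SOL, T_LIQ = 850.0, 880.0   # solidus/liquidus temperatures [K]

def substitute_capacity(T, c_base=900.0, latent=4.0e5):
    """Polynomial bump over [T_SOL, T_LIQ] absorbing the latent heat."""
    if T_SOL <= T <= T_LIQ:
        s = (T - T_SOL) / (T_LIQ - T_SOL)   # 0..1 across the mushy zone
        bump = 6.0 * s * (1.0 - s)          # polynomial that integrates to 1
        return c_base + latent / (T_LIQ - T_SOL) * bump
    return c_base

def step(T, dx, dt, k=160.0, rho=2700.0, q=None):
    """One explicit finite-difference step of rho*c(T)*dT/dt = k*T'' + q."""
    new = T[:]                               # boundaries stay fixed (Dirichlet)
    for i in range(1, len(T) - 1):
        lap = (T[i-1] - 2.0*T[i] + T[i+1]) / dx**2
        src = q[i] if q else 0.0
        new[i] = T[i] + dt * (k * lap + src) / (rho * substitute_capacity(T[i]))
    return new

n, dx, dt = 51, 2.0e-4, 1.0e-4               # 10 mm rod, stable time step
T = [300.0] * n
for stepno in range(200):
    x_laser = 1.0e-3 + 0.02 * stepno * dt    # laser slides at 20 mm/s
    q = [5.0e10 * math.exp(-((i*dx - x_laser) / 5.0e-4)**2) for i in range(n)]
    T = step(T, dx, dt, q=q)
print(max(T))                                # peak temperature after 20 ms
```

The moving Gaussian source plays the role of the speed-dependent laser formulations; a Galerkin-FEM discretization would replace the finite differences in the actual method.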
Vector processing-aware advanced clock-gating techniques for low-power fused multiply-add
The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need retailoring for the mobile market that they are entering now. The floating-point (FP) fused multiply-add (FMA) unit, being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially in the active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and "real-world" application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming an active VFU operating at peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together using "real-world" benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.
The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA 321253 and is supported in part by the European Union (FEDER funds) under contract TTIN2015-65316-P. The work of I. Ratkovic was supported by an FPU research grant from the Spanish MECD.
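The intuition behind mask- and lane-aware gating can be captured in a back-of-the-envelope power model. This sketch is not from the paper: the per-lane power and the residual fraction for a gated lane are hypothetical numbers chosen only to show the shape of the saving.

```python
# Relative dynamic power of a vector FMA unit under per-lane clock gating:
# lanes whose mask bit is 0 are gated and contribute only the residual
# power of the clock tree upstream of the gate.
def vfu_dynamic_power(mask, p_lane=1.0, gated_fraction=0.05):
    """Relative dynamic power for one vector op given a per-lane mask.

    p_lane: power of an active lane (normalized); gated lanes keep only
    the gated_fraction residual. Both numbers are invented placeholders.
    """
    return sum(p_lane if bit else p_lane * gated_fraction for bit in mask)

full = vfu_dynamic_power([1] * 8)                 # all 8 lanes active
half = vfu_dynamic_power([1, 0, 1, 0, 1, 0, 1, 0])  # half-masked vector
print(1.0 - half / full)                          # fraction of power saved
```

With half the mask bits clear, the model saves a bit under 50% rather than exactly 50%, because gated lanes still pay the residual clock-tree power.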
The HPCG benchmark: analysis, shared memory preliminary improvements and evaluation on an Arm-based platform
The High-Performance Conjugate Gradient (HPCG) benchmark complements the LINPACK benchmark in the performance evaluation coverage of large High-Performance Computing (HPC) systems. Due to its lower arithmetic intensity and higher memory pressure, HPCG is recognized as a more representative benchmark for data-center and irregular memory access pattern workloads; its popularity and acceptance are therefore rising within the HPC community. As only a small fraction of the reference version of the HPCG benchmark is parallelized with shared memory techniques (OpenMP), we introduce in this report two OpenMP parallelization methods. Due to the increasing importance of the Arm architecture in the HPC scenario, we evaluate our HPCG code at scale on a state-of-the-art HPC system based on the Cavium ThunderX2 SoC. We consider our work a contribution to the Arm ecosystem: along with this technical report, we plan to release our code to help the tuning of the HPCG benchmark within the Arm community.
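The kernel HPCG stresses is a sparse conjugate gradient solve, whose inner loops (the sparse matrix-vector product and the vector updates) are exactly what the OpenMP work targets. A minimal plain-Python sketch follows; the tridiagonal matrix and sizes are toy placeholders, not HPCG's 27-point stencil.

```python
# Conjugate gradient on a small SPD tridiagonal system A = tridiag(-1, 2, -1).
def matvec(n, x):
    """y = A x; in HPCG this sparse product is the memory-bound hot loop."""
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y

def cg(n, b, tol=1e-10, max_iter=1000):
    """Solve A x = b starting from x0 = 0."""
    x = [0.0] * n
    r = b[:]                      # residual b - A x0
    p = r[:]
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        Ap = matvec(n, p)
        alpha = rs / sum(pi * ai for pi, ai in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol * tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

n = 50
b = [1.0] * n
x = cg(n, b)
residual = max(abs(ri - bi) for ri, bi in zip(matvec(n, x), b))
print(residual)
```

In an OpenMP C version, the loops inside `matvec` and the `zip` updates become `#pragma omp parallel for` loops, with the dot products as reductions; their low flop-per-byte ratio is what makes HPCG memory-bound.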
Automated analysis of the causes of process blocking from an execution trace
ABSTRACT
The current trend towards parallelization increasingly puts developers in situations where they are confronted with performance bugs whose cause is difficult to identify. These are frequently due to unexpected interactions between software components that execute concurrently. The tighter integration and the complexity of debugging large systems worsen this problem. One type of bug that is particularly difficult to locate is a performance problem whose root cause is separated from its symptom by a chain of blockings. Current tools provide little help with these problems.
The aim of this work is therefore to design a tool that helps debugging performance
problems involving chains of blockings. This thesis introduces this new approach
and discusses its implementation in the LTTV Delay Analyzer. The Linux
Trace Toolkit (LTTng) is used for trace recording and most of the instrumentation,
allowing the tracing of production systems with great precision and minimal
performance impact. This approach uses solely kernel instrumentation and does
not require the recompilation of applications. The analysis tool produces a report
that shows in detail in what way time was spent in a process between two given
events. For each category, another report shows the list of time spans during which
the process was in that state. Finally, in cases where the process was blocked, the
complete chain of blockings is displayed.
The LTTV Delay Analyzer was used to quickly analyze and fix complex performance problems, something impossible with existing tools. Analysis time grows linearly with trace size. Memory usage during the analysis of large traces also grows linearly with trace size, but a strategy to make it constant is described.
This new method could serve as a starting point for future work, including the analysis of blocking chains that span several computers, or that involve physical machines as well as the virtual machines they host.
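The core step of the approach, walking from a symptom process down its chain of blockings to the root cause, can be illustrated with a toy trace. The event tuples below are invented and do not reflect LTTng's actual trace format; each records which resource a process waits on and, where applicable, which process holds it.

```python
# Toy reconstruction of a blocking chain from kernel-style trace events.
events = [  # (timestamp, blocked_pid, holder_pid, resource) -- invented data
    (10.0, 101, 202, "futex A"),
    (10.2, 202, 303, "file lock B"),
    (10.5, 303, None, "disk I/O"),   # 303 waits on I/O, not on a process
]

def blocking_chain(pid, events):
    """Follow holder links from `pid` until a process blocks on no one."""
    waits = {e[1]: e for e in events}     # latest block event per process
    chain = []
    while pid in waits:
        _, blocked, holder, resource = waits[pid]
        chain.append((blocked, resource))
        if holder is None:                # reached the root cause
            break
        pid = holder
    return chain

print(blocking_chain(101, events))
```

Here process 101 waits on a futex held by 202, which waits on a file lock held by 303, which is blocked on disk I/O: the I/O, not the futex, is the root cause the symptom process never sees directly.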
Qucs workbook
This document is intended to be a workbook for RF and microwave designers. Our intention is not to provide an RF course, but to address some tricky RF topics. The goal is to insist on design rules and workflow for RF design using CAD programs. This workflow is illustrated through different subjects.
Functional Verification of Processor Execution Units
The thesis deals with the integration of functional verification into the design cycle of execution units in the hardware/software co-design environment of the Codasip system. The aim of the thesis is to design and implement a verification environment in SystemVerilog in order to verify the automatically generated hardware representation of the execution units. In the introduction, the advantages and basic methods of functional verification and the principles of the Codasip system are discussed. The following chapters describe the design and implementation of the verification environment for the arithmetic-logic unit, as well as the analysis of the verification runs and results. Finally, the accomplished goals are reviewed and improvements for further development of the verification environment are suggested.
A quantitative method to decide where and when it is profitable to use models for integration and testing
Industrial trends show that the lead time and costs of integrating and testing high-tech multi-disciplinary systems are becoming critical factors for commercial success. In our research, we developed a method for early, model-based integration and testing to reduce this criticality. Although its benefits have been demonstrated in industrial practice, the method requires certain investments to achieve them, e.g. the time needed for modeling. Making the necessary trade-off between investments and potential benefits to decide when modeling is profitable is a difficult task that is often based on personal intuition and experience. In this paper, we describe how integration and test sequencing techniques can be used to quantitatively determine where and when the integration and testing process can profit from models. An industrial case study shows that it is feasible to quantify the costs and benefits of using models in terms of risk, time, and costs, such that the profitability can be determined.
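The trade-off being quantified can be sketched with a simple expected-cost comparison. This is a hypothetical illustration, not the paper's actual method: all probabilities, costs, and the "early catch" fraction are invented numbers.

```python
# Expected cost of one integration/test step, with and without a model:
# a model costs effort up front but lets a fault be caught earlier,
# avoiding part of the downstream fix cost.
def expected_cost(p_fault, fix_cost, test_cost,
                  model_cost=0.0, early_catch=0.0):
    """early_catch: fraction of the fault's fix cost avoided thanks to
    model-based testing (0.0 = no model benefit)."""
    return model_cost + test_cost + p_fault * fix_cost * (1.0 - early_catch)

without_model = expected_cost(p_fault=0.3, fix_cost=100.0, test_cost=5.0)
with_model = expected_cost(p_fault=0.3, fix_cost=100.0, test_cost=5.0,
                           model_cost=8.0, early_catch=0.6)
print(with_model < without_model)   # modeling pays off for this step
```

Summing such per-step costs over candidate integration and test sequences is, in spirit, how one decides where in the sequence modeling is profitable; the paper's sequencing techniques make that comparison systematic.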
Annual reports of the town officers. Brookfield, New Hampshire, 1979. For the fiscal year ending December 31, 1979. Vital statistics for 1979.
This is an annual report containing vital statistics for a town/city in the state of New Hampshire.
A Case for Fine-Grain Adaptive Cache Coherence
As transistor density continues to grow geometrically, processor manufacturers are already able to place a hundred cores on a chip (e.g., the Tilera TILE-Gx100), with massive multicore chips on the horizon. Programmers now need to invest more effort in designing software capable of exploiting multicore parallelism. The shared memory paradigm provides a convenient layer of abstraction to the programmer, but will current memory architectures scale to hundreds of cores? This paper directly addresses the question of how to enable scalable memory systems for future multicores. We develop a scalable, efficient shared memory architecture that enables seamless adaptation between private and logically shared caching at the fine granularity of cache lines. Our data-centric approach relies on in-hardware runtime profiling of the locality of each cache line and only allows private caching for data blocks with high spatio-temporal locality. This allows us to better exploit on-chip cache capacity and enables low-latency memory access in large-scale multicores.
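The data-centric decision can be sketched in software, even though the paper does it in hardware. In this toy profiler the reuse counter, the threshold, and the two-way "private vs. remote-shared" decision are invented simplifications of the per-line locality tracking described in the abstract.

```python
from collections import defaultdict

# Illustrative per-cache-line locality profiler: a line earns private
# caching only after showing enough reuse (high spatio-temporal locality);
# cold lines stay logically shared to avoid wasting private cache capacity.
class LocalityProfiler:
    def __init__(self, threshold=4):
        self.reuses = defaultdict(int)   # per-line reuse counter
        self.threshold = threshold       # min reuses to earn private caching

    def access(self, core, line_addr):
        """Record one access; return the caching decision for this line.
        `core` is where per-core state would attach; unused in this sketch."""
        self.reuses[line_addr] += 1
        if self.reuses[line_addr] >= self.threshold:
            return "private"
        return "remote-shared"

prof = LocalityProfiler()
decisions = [prof.access(core=0, line_addr=0x40) for _ in range(6)]
print(decisions)   # early accesses stay shared; the hot line turns private
```

A hardware implementation would keep such counters alongside the directory state and would also age or reset them, so that a line that loses locality falls back to shared caching.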