
    Euro-Par 2006 Parallel Processing


    Wireless Communication Solution for Distributed Structural Health Monitoring

    This paper describes the design of a wireless distributed SHM (Structural Health Monitoring) system, with particular emphasis on the comparison of wireless communication standards. The presented solution is being deployed in the TULCOEMPA project. Several wireless communication standards are compared in terms of their benefits, disadvantages and typical areas of application. The choice of a suitable ISM (Industrial, Scientific and Medical) band and the reasons for using Wireless Sensor Networks are also discussed. The last part of the paper presents the proposed structure and the designed prototype: the chosen system architecture and the program algorithm used for communication and measurements are described.

    Release and Verification of an Operating System for Testing e-Flash on Microcontrollers for Automotive Applications based on Multicore Architecture

    Modern cars contain an increasing number of electronic devices for active driving assistance, safety controls, energy efficiency, passenger comfort and entertainment. Safety is the keyword, and it requires highly reliable electronic components. Infineon's microcontroller division works to improve the reliability and guarantee the quality of microcontroller flash memories. The goal of this thesis is to verify the operating system used to test the microcontrollers' flash memories.

    Analytic and Machine Learning Based Design of Monolithic Transistor-Antenna for Plasmonic Millimeter-Wave Detectors

    This thesis reports an advanced analysis of a monolithic transistor-antenna, in which a ring-type asymmetric FET is itself designed as the receiving antenna element that receives millimeter waves in a lossless manner with plasmonic amplification for millimeter-wave (mmW) detectors. The proposed transistor-antenna device combines the plasmonic and the electromagnetic (EM) aspects in a single place. As a result, it can absorb the incoming mmW radiation and transfer power directly to the ring-type asymmetric channel without any feeding line or separate antenna element. Both the charge asymmetry in the device channel and the antenna coupling contribute to the enhanced photoresponse; of the two, the improved antenna coupling is the more dominant factor in the performance enhancement of our proposed design. Moreover, by characterizing its impedance exactly, our transistor-antenna device achieves a uniformly enhanced responsivity for every pixel, as required for real-time mmW imaging. The operation principle of the proposed device is discussed, focusing on how signal transmission through the ring-type structure is possible without any feeding line between the antenna and the detector. To determine the antenna geometry for a desired resonant frequency, we present an efficient design procedure based on periodic bandgap analysis combined with parametric electromagnetic simulations. From a fabricated ring-type FET-based monolithic antenna device, we demonstrate a highly enhanced optical responsivity and a reduced optical noise-equivalent power, which are comparable to those of reported state-of-the-art CMOS-based antenna-integrated direct detectors. Another part of the thesis focuses on developing machine learning models to enable fast, accurate design and verification of electromagnetic structures. We propose a novel Bayesian learning algorithm, named Bayesian clique learning, for searching for optimal electromagnetic design parameters by exploiting the structural properties of the EM simulation data set. Along with this, we also present an inverse-problem approach for designing electromagnetic structures, which goes in the opposite direction, determining the design parameters from the characteristics of the desired output.
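
    To make the inverse-design idea at the end of this abstract concrete, a deliberately generic sketch is given below. It is not the thesis's Bayesian clique learning; it only shows a forward surrogate trained on simulated samples being used backwards, and the parameter names (ring_radius_um, gap_um) and the toy frequency model are invented stand-ins for real parametric EM simulations.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in for a parametric EM simulation: geometry -> resonant frequency (GHz).
# In practice each sample would come from a full-wave solver run.
def toy_em_simulation(ring_radius_um, gap_um):
    return 30000.0 / (ring_radius_um + 0.5 * gap_um)

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(50, 300, 400),    # hypothetical ring radius (um)
                     rng.uniform(1, 20, 400)])     # hypothetical gap width (um)
y = np.array([toy_em_simulation(r, g) for r, g in X])

# Forward surrogate: predicts the resonant frequency from the geometry.
surrogate = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Inverse step: search the geometry space for the design whose predicted
# response is closest to the desired resonant frequency.
target_ghz = 200.0
candidates = np.column_stack([rng.uniform(50, 300, 5000),
                              rng.uniform(1, 20, 5000)])
best = candidates[np.argmin(np.abs(surrogate.predict(candidates) - target_ghz))]
print("suggested (ring_radius_um, gap_um):", best)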

    Point-to-point and congestion bandwidth estimation: experimental evaluation on PlanetLab

    In large-scale Internet platforms, measuring the available bandwidth between nodes of the platform is difficult and costly. However, having access to this information makes it possible to design clever algorithms that optimize resource usage for some collective communications, such as broadcasting a message or organizing master/slave computations. In this paper, we analyze the feasibility of providing estimates, based on a limited number of measurements, of the point-to-point available bandwidth values and of the congestion that occurs when several communications take place at the same time. We present a dataset obtained with both types of measurements performed on a set of nodes from the PlanetLab platform. We show that matrix factorization techniques are quite efficient at predicting point-to-point available bandwidth, but are not well suited to congestion analysis. However, a LastMile model of the platform allows congestion to be predicted with a reasonable level of accuracy, even with a small amount of information, despite the variability of the measured platform.
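
    As a rough illustration of the matrix-factorization approach mentioned above, the sketch below fits a generic low-rank model to the measured entries by stochastic gradient descent and uses it to predict the missing point-to-point values. The rank, learning rate, and the toy "last-mile" bandwidth matrix are illustrative assumptions, not the setup used in the paper.

import numpy as np

def factorize_bandwidth(B, measured, rank=3, lr=0.05, epochs=2000, seed=0):
    """Fit B ~ U @ V.T from the measured entries only, then predict all pairs.

    B        : n x n array of point-to-point available bandwidth (partly known)
    measured : boolean n x n mask, True where a measurement exists
    """
    rng = np.random.default_rng(seed)
    n = B.shape[0]
    scale = B[measured].max()            # work on normalized values for stability
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((n, rank))
    rows, cols = np.nonzero(measured)
    for _ in range(epochs):
        for i, j in zip(rows, cols):     # SGD over the observed entries only
            err = B[i, j] / scale - U[i] @ V[j]
            U[i], V[j] = U[i] + lr * err * V[j], V[j] + lr * err * U[i]
    return scale * (U @ V.T)             # predicted bandwidth for every node pair

# Toy usage: a "last-mile"-like matrix where each pair is limited by the
# sender's uplink and the receiver's downlink, with about 60% of entries measured.
up = np.array([90.0, 80.0, 50.0, 100.0, 40.0, 70.0])
down = np.array([85.0, 95.0, 45.0, 60.0, 100.0, 75.0])
true_B = np.minimum.outer(up, down)
mask = np.random.default_rng(1).random(true_B.shape) < 0.6
predicted_B = factorize_bandwidth(true_B, mask)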

    Energy aware task allocation algorithms for wireless sensor networks

    Complex wireless sensor network (WSN) applications, such as those in the Internet of Things or in-network processing, are drastically pushing the requirements for energy efficiency and long-term operation of the network. Energy-aware task allocation becomes crucial for extending the network lifetime by efficiently distributing the tasks of applications among sensor nodes. Although task allocation has been studied in depth for wired systems, the resulting approaches are insufficient for WSNs due to the limited battery resources and computing capability of WSN nodes, as well as the special characteristics of wireless communication. This work focuses on designing energy-aware task allocation algorithms that extend the network lifetime of WSNs. More precisely, this work first proposes a centralized static task allocation algorithm (CSTA) for cluster-based WSNs. Since a WSN application can be modeled by a directed acyclic graph (DAG), the task allocation problem is formulated as partitioning the modeled DAG into two subgraphs: one for the slave node and the other for the master node. By using a binary vector variable to represent the partition cut, CSTA formulates the problem of maximizing network lifetime as a binary integer linear programming (BILP) problem. It provides one fixed, time-invariant partition cut (task allocation solution) for each slave node to balance the workload distribution of tasks. Moreover, motivated by the fact that using multiple partition cuts can achieve a more balanced workload distribution, this work extends CSTA to a centralized dynamic task allocation algorithm, CDTA. By using a probability vector variable to represent partition cuts with different weights, CDTA formulates the dynamic task allocation problem as a linear programming (LP) problem. Due to the high complexity of the centralized algorithms, this work further proposes a very lightweight distributed optimal on-line task allocation algorithm (DOOTA). Through an in-depth analysis, it proves that the optimal task allocation solution consists of at most two partition cuts for each slave node. Based on this analysis, DOOTA enables each slave node to calculate its own optimal task allocation solution by negotiating with the master node within a very short time. These contributions significantly improve application performance not only for WSNs but also for other domains, e.g., mobile edge/fog computing. Furthermore, the proposed task allocation algorithms are extended to different task scenarios and network structures, i.e., applications with conditional tasks, joint local and global applications, and multi-hop mesh networks. A condition-triggered application is modeled by a DAG with conditional branches. This conditional DAG is further decomposed into multiple stationary DAGs without conditional branches according to the satisfaction probability of each condition. Based on this modeling, static and dynamic condition-triggered task allocation algorithms (SCTTA and DCTTA) are proposed that consider the multiple stationary DAGs simultaneously. Targeting joint local and global applications, this work designs static and dynamic joint task allocation algorithms, SJTA and DJTA, based on BILP and LP, respectively. The modeling of the local task allocation problem does not change, while the global task allocation problem is modeled by dividing the global DAG into different subgraphs mapped to the slave and master nodes.
    Besides the extensions for different task scenarios, this work also presents a dynamic task allocation algorithm for multi-hop mesh networks (DTA-mhop). The corresponding task allocation problem is modeled by dividing the DAG of each sensor node into multiple subgraphs mapped to the node itself, the routing nodes and the sink node. By using the sum of the tasks assigned to each node, DTA-mhop formulates lifetime maximization as an LP problem. The proposed task allocation algorithms are first evaluated using simulations and real WSN applications, in terms of network lifetime increase and algorithm runtime. In order to investigate the algorithms' performance in realistic scenarios, the CSTA, CDTA and DOOTA algorithms are implemented in a real WSN based on the OpenMote platform. Both the simulation and implementation results show that the network lifetime can be dramatically extended. Remarkably, the network lifetime improvements are more significant for complex applications. The proposed task allocation algorithms are therefore suitable for WSNs, and they can also be easily adapted to other wireless domains.
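
    For intuition, a cut-based lifetime-maximization problem of the kind described above can be written as a small integer program; the notation below is illustrative and is not the thesis's exact BILP formulation. Let $x_k \in \{0,1\}$ indicate that task $k$ of the application DAG runs on the slave node, let $e_k^{s}$ and $e_k^{m}$ be its execution energy on the slave and the master, let $c_{kl}$ be the radio energy paid when edge $(k,l)$ crosses the cut, and let $E^{s}$ and $E^{m}$ be the battery budgets. Minimizing the worst normalized per-round energy $z$ keeps every constraint linear, and the achieved lifetime is $1/z^{*}$ rounds:

\begin{align}
  \min_{x,\,y,\,z}\; & z \\
  \text{s.t.}\;\; & \sum_{k} e_k^{s}\,x_k \;+\; \sum_{(k,l)\in E} c_{kl}\,y_{kl} \;\le\; z\,E^{s}, \\
  & \sum_{k} e_k^{m}\,(1-x_k) \;+\; \sum_{(k,l)\in E} c_{kl}\,y_{kl} \;\le\; z\,E^{m}, \\
  & y_{kl} \ge x_k - x_l,\quad y_{kl} \ge x_l - x_k \quad \forall (k,l)\in E, \\
  & x_k \in \{0,1\},\quad y_{kl} \ge 0,
\end{align}

    where $y_{kl}$ linearizes $|x_k - x_l|$ and therefore equals one exactly when edge $(k,l)$ crosses the partition cut.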

    Multicore platforms and three-dimensional integration: architectural analysis and optimization

    Modern embedded systems embrace many-core shared-memory designs. Due to constrained power and area budgets, most of them feature software-managed scratchpad memories instead of data caches to increase data locality. It is therefore the programmer's responsibility to explicitly manage memory transfers, and this makes programming these platforms cumbersome. Moreover, complex modern applications must be adequately parallelized before the parallel potential of the platform can be turned into actual performance. To support this, programming languages have been proposed that work at a high level of abstraction and rely on a runtime whose cost hinders performance, especially in embedded systems, where resources and the power budget are constrained. This dissertation explores the applicability of the shared-memory paradigm to modern many-core systems, focusing on ease of programming. It focuses on OpenMP, the de-facto standard for shared-memory programming. In a first part, the cost of algorithms for synchronization and data partitioning is analyzed, and these algorithms are adapted to modern embedded many-cores. Then, the original design of an OpenMP runtime library is presented, which supports complex forms of parallelism such as multi-level and irregular parallelism. In the second part of the thesis, the focus is on heterogeneous systems, where hardware accelerators are coupled to (many-)cores to implement key functional kernels with orders-of-magnitude gains in speed and energy efficiency compared to the “pure software” version. However, three main issues arise, namely i) platform design complexity, ii) architectural scalability and iii) programmability. To tackle them, a template for a generic hardware processing unit (HWPU) is proposed, which shares the memory banks with the cores, and a template for a scalable architecture is shown, which integrates the HWPUs through the shared-memory system. Then, a full software stack and toolchain are developed to support platform design and to let programmers exploit the accelerators of the platform; the OpenMP frontend is extended to interact with them.
    [Italian abstract, translated:] Modern embedded systems are many-core architectures in which the memory space is often shared among the processors. To reduce power consumption, many of these architectures replace data caches with software-managed scratchpad memories, in order to maximize data locality to the CPUs and increase performance. This means that data must be moved manually by the programmer. Moreover, translating the enormous parallel potential of many-core platforms into performance is not simple. To support programming, several programming models have been proposed; since they work at a high level of abstraction, they rely on runtime libraries that provide basic services such as synchronization, memory allocation and threading. These libraries have a cost which, in embedded systems, is too high and prevents full performance from being reached. This thesis analyzes how a programming model with a high level of abstraction, OpenMP, can be supported efficiently if its software stack is adapted to best exploit the underlying platform. In a first part, I study several synchronization and communication mechanisms among parallel threads, ported to many-core platforms. I then use them to write an OpenMP support runtime that is as efficient and lightweight as possible and that supports the multi-level and irregular parallelism paradigms often found in modern applications. A second part of the thesis explores heterogeneous architectures, i.e., those with hardware accelerators. These architectures suffer from issues i) in the platform design process, ii) in the scalability of the platform itself (increasing the number of accelerators and processors), and iii) in programmability. The thesis proposes solutions to all three problems. The programming language used is OpenMP, both for its great expressiveness at the semantic level and because it is the de-facto standard for programming shared-memory systems.

    Parallel Transferable Uniform Multi-Round Algorithm for Minimizing Makespan

    In parallel computing systems using the master/worker model for distributed grid computing, as the size of the data being handled grows, the increase in data transmission time degrades performance. For divisible-workload applications, multi-round scheduling algorithms have therefore been developed to mitigate the adverse effect of longer data transmission times by dividing the data into chunks that are sent out in multiple rounds, thus overlapping computation and transmission. However, a standard multi-round scheduling algorithm, Uniform Multi-Round (UMR), adopts a sequential transmission model in which the master communicates with one worker at a time, so the transmission capacity of the link attached to the master cannot be fully utilized because of the limits of the worker-side capacity. In the present study, a Parallel Transferable Uniform Multi-Round algorithm (PTUMR) is proposed. It efficiently utilizes the data transmission capacity of network links by allowing chunks to be transmitted to workers in parallel. The algorithm divides the workers into groups in a way that fully uses the link bandwidth of the master under certain constraints, and treats each group of workers as one virtual worker. In particular, introducing a Grouping Threshold effectively deals with workers that are very heterogeneous in both data transmission and computation capacities. The master then schedules sequential data transmissions to the virtual workers in an optimal way, as in UMR. The performance evaluations show that, regardless of the heterogeneity of the workers, the proposed algorithm achieves turnaround times (i.e., makespan) that are significantly shorter than those of UMR and close to the theoretical lower limits.
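
    The grouping step can be illustrated with the simplified greedy sketch below, which forms "virtual workers" whose aggregate link bandwidth approaches, but does not exceed, the master's link capacity. The grouping_threshold value and the packing rule are illustrative assumptions; the actual PTUMR grouping and chunk scheduling are more involved.

def group_workers(worker_bw, master_bw, grouping_threshold=0.1):
    """Greedily pack workers into groups ("virtual workers").

    worker_bw          : list of per-worker link bandwidths
    master_bw          : bandwidth of the link attached to the master
    grouping_threshold : hypothetical cutoff -- workers whose links are slower
                         than grouping_threshold * master_bw get aggregated
    """
    groups, current, current_bw = [], [], 0.0
    for bw in sorted(worker_bw, reverse=True):
        if bw >= grouping_threshold * master_bw:
            groups.append([bw])          # fast worker: a virtual worker on its own
            continue
        if current and current_bw + bw > master_bw:
            # adding this worker would exceed the master's link: close the group
            groups.append(current)
            current, current_bw = [], 0.0
        current.append(bw)
        current_bw += bw
    if current:
        groups.append(current)
    # Each group is treated as one virtual worker with the summed bandwidth;
    # chunks are then dispatched to the virtual workers sequentially, as in UMR.
    return [(sum(g), g) for g in groups]

# Example: one fast worker, one medium worker, and four slow ones.
virtual_workers = group_workers([12.0, 5.0, 1.0, 0.8, 0.6, 0.5], master_bw=20.0)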

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible with these devices can be very high for some application paradigms. Software languages and systems such as NVIDIA's CUDA and the Khronos consortium's Open Computing Language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very-many-core GPUs for data parallelism is necessary. We describe our use of hybrid applications that use threading approaches on multi-core CPUs to control independent GPU devices. We present speed-up data, discuss multi-threading software issues for the application-level programmer, and offer some suggested areas for language development and integration between coarse-grained and fine-grained multi-thread systems. We discuss results from three common simulation algorithmic areas: partial differential equations; graph cluster metric calculations; and random number generation. We report on programming experiences and selected performance for these algorithms on single and multiple GPUs, multi-core CPUs, a Cell BE, and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scientific applications developers.
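
    The host-side structure described here (one CPU thread driving each GPU device) can be sketched roughly as follows. The original work uses CUDA and OpenCL from compiled code, so this PyCUDA version is only an assumed, simplified illustration of the threading pattern, with the per-device kernel work omitted.

import threading
import numpy as np
import pycuda.driver as cuda

def device_worker(device_id, host_chunk, results):
    """Each host thread owns one GPU: it creates its own context, moves its
    share of the data to the device, and copies the (here unmodified) result back."""
    ctx = cuda.Device(device_id).make_context()   # per-thread, per-device context
    try:
        gpu_buf = cuda.mem_alloc(host_chunk.nbytes)
        cuda.memcpy_htod(gpu_buf, host_chunk)
        # ... launch this device's kernels on gpu_buf here ...
        out = np.empty_like(host_chunk)
        cuda.memcpy_dtoh(out, gpu_buf)
        results[device_id] = out
    finally:
        ctx.pop()

cuda.init()
num_gpus = cuda.Device.count()
chunks = np.array_split(np.random.rand(1_000_000).astype(np.float32), num_gpus)
results = [None] * num_gpus
threads = [threading.Thread(target=device_worker, args=(i, chunks[i], results))
           for i in range(num_gpus)]
for t in threads:
    t.start()
for t in threads:
    t.join()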