1,187 research outputs found

    Improving multithreading performance for clustered VLIW architectures.

    Very Long Instruction Word (VLIW) processors are very popular in the embedded and mobile computing domain. Uses of VLIW processors range from Digital Signal Processors (DSPs), found in a plethora of communication and multimedia devices, to Graphics Processing Units (GPUs) used in gaming and high-performance computing devices. The advantage of VLIWs is their low-complexity, low-power design, which enables high performance at a low cost. The scalability of VLIWs is limited by the scalability of register file ports: it is not viable to build a VLIW processor around a single large register file because of the area and power consumption implications of such a file. Clustered VLIWs solve the register file scalability issue by partitioning the register file into multiple clusters, each with a set of functional units attached to that cluster's register file. Using a clustered approach, a higher issue width can be achieved while keeping the cost of the register file within reasonable limits. Several commercial VLIW processors have been designed using the clustered VLIW model. VLIW processors can be used to run a large set of applications. Many of these applications have good Instruction Level Parallelism (ILP) which can be efficiently exploited. However, several applications, especially those dominated by control code, do not exhibit good ILP, and the processor is underutilized. Cache misses are another major source of resource underutilization. Multithreading is a popular technique to improve processor utilization. Interleaved MultiThreading (IMT) hides cache miss latencies by scheduling a different thread each cycle, but cannot fill unused instruction slots. Simultaneous MultiThreading (SMT) can also remove ILP under-utilization by issuing multiple threads to fill the empty instruction slots. However, SMT has a higher implementation cost than IMT. The thesis presents Cluster-level Simultaneous MultiThreading (CSMT), which supports a limited form of SMT where VLIW instructions from different threads are merged at cluster-level granularity. This lowers the hardware implementation cost to a level comparable to the cheap IMT technique. The more complex SMT combines VLIW instructions at the individual operation-level granularity, which is quite expensive, especially for a mobile solution. We refer to SMT at operation level as OpSMT to reduce ambiguity. While previous studies restricted OpSMT on a VLIW to 2 threads, CSMT has better scalability and up to 8 threads can be supported at a reasonable cost. The thesis proposes several other techniques to further improve CSMT performance. In particular, cluster renaming remaps the clusters used by instructions of different threads to reduce resource conflicts; it is quite effective in reducing issue-slot under-utilization and significantly improves CSMT performance. The thesis also proposes: a hybrid between IMT and CSMT which increases the number of supported threads; heterogeneous instruction merging, where some instructions are combined using OpSMT and the rest using CSMT; and finally split-issue, a technique that allows an instruction to be issued partially, making it easier to combine with others.
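
    As an illustration of the cluster-level merging described above, the following C++ sketch (not taken from the thesis; the cluster count, mask encoding, and renaming policy are assumptions made for the example) models each thread's next VLIW instruction as a bitmask of the clusters it occupies, merges instructions in a cycle only when their masks do not overlap, and uses a simple rotation to stand in for cluster renaming.

        // Illustrative sketch of cluster-level instruction merging (CSMT-style).
        #include <cstdint>
        #include <vector>
        #include <iostream>

        constexpr int kClusters = 4;        // assumed 4-cluster VLIW
        using ClusterMask = uint8_t;        // bit i set => cluster i is occupied

        // Hypothetical cluster renaming: rotate the clusters a thread uses so that
        // threads with overlapping cluster usage can still be merged.
        ClusterMask rename(ClusterMask m, int rotation) {
            return static_cast<ClusterMask>(
                ((m << rotation) | (m >> (kClusters - rotation))) & ((1u << kClusters) - 1));
        }

        // Merge the next instruction of each ready thread into one issue cycle:
        // two instructions can be combined only if their cluster masks do not overlap.
        ClusterMask mergeCycle(const std::vector<ClusterMask>& nextInstr,
                               std::vector<bool>& issued) {
            ClusterMask used = 0;
            for (size_t t = 0; t < nextInstr.size(); ++t) {
                if ((used & nextInstr[t]) == 0) {   // no cluster conflict
                    used |= nextInstr[t];
                    issued[t] = true;
                }
            }
            return used;
        }

        int main() {
            // Thread 0 and thread 1 both want clusters {0,1}: without renaming they
            // conflict; renaming thread 1 by two positions maps it to clusters {2,3}.
            std::vector<ClusterMask> next = {0b0011, 0b0011};
            next[1] = rename(next[1], 2);
            std::vector<bool> issued(next.size(), false);
            ClusterMask used = mergeCycle(next, issued);
            for (size_t t = 0; t < issued.size(); ++t)
                std::cout << "thread " << t << (issued[t] ? " issued\n" : " stalled\n");
            std::cout << "cluster occupancy mask this cycle: "
                      << static_cast<int>(used) << "\n";
        }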

    Network Virtual Machine (NetVM): A New Architecture for Efficient and Portable Packet Processing Applications

    A challenge facing network device designers, besides increasing the speed of network gear, is improving its programmability in order to simplify the implementation of new applications (see, for example, active networks, content networking, etc.). This paper presents our work on designing and implementing a virtual network processor, called NetVM, which has an instruction set optimized for packet processing applications, i.e., for handling network traffic. Just as a Java Virtual Machine virtualizes a CPU, a NetVM virtualizes a network processor. The NetVM is expected to provide a compatibility layer for networking tasks (e.g., packet filtering, packet counting, string matching) performed by various packet processing applications (firewalls, network monitors, intrusion detectors) so that they can be executed on any network device, ranging from expensive routers to small appliances (e.g., smartphones). Moreover, the NetVM will provide efficient mapping of the elementary functionalities used to realize the above-mentioned networking tasks onto specific hardware functional units (e.g., ASICs, FPGAs, and network processing elements) included in special-purpose hardware systems possibly deployed to implement network devices.
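
    As a rough illustration of a virtual machine whose instruction set is specialized for packet handling, the sketch below implements a tiny, hypothetical filter bytecode in C++; the opcodes and the example program are invented for this illustration and are not the actual NetVM instruction set.

        // Illustrative sketch of a minimal packet-filter bytecode interpreter.
        #include <cstdint>
        #include <cstddef>
        #include <vector>
        #include <iostream>

        enum class Op : uint8_t { LoadByte, CmpEq, JumpIfFalse, Accept, Reject };
        struct Instr { Op op; uint32_t arg; };

        // Returns true if the filter program accepts the packet.
        bool run(const std::vector<Instr>& prog, const uint8_t* pkt, size_t len) {
            uint32_t reg = 0;        // single accumulator register
            bool flag = false;       // result of the last comparison
            for (size_t pc = 0; pc < prog.size(); ++pc) {
                const Instr& i = prog[pc];
                switch (i.op) {
                    case Op::LoadByte:    reg = (i.arg < len) ? pkt[i.arg] : 0; break;
                    case Op::CmpEq:       flag = (reg == i.arg); break;
                    case Op::JumpIfFalse: if (!flag) pc = i.arg - 1; break;
                    case Op::Accept:      return true;
                    case Op::Reject:      return false;
                }
            }
            return false;
        }

        int main() {
            // Filter: accept the frame if bytes 12-13 equal 0x0800
            // (the IPv4 EtherType in an Ethernet header).
            std::vector<Instr> ipv4 = {
                {Op::LoadByte, 12}, {Op::CmpEq, 0x08}, {Op::JumpIfFalse, 7},
                {Op::LoadByte, 13}, {Op::CmpEq, 0x00}, {Op::JumpIfFalse, 7},
                {Op::Accept, 0},    {Op::Reject, 0},
            };
            uint8_t frame[64] = {};
            frame[12] = 0x08; frame[13] = 0x00;
            std::cout << (run(ipv4, frame, sizeof frame) ? "accept\n" : "reject\n");
        }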

    Massive Parallelization of Branching Algorithms

    Optimization and search problems are often NP-complete, and brute-force techniques must typically be employed to find exact solutions. Problems such as clustering genes in bioinformatics or finding optimal routes in delivery networks can be solved in exponential time using recursive branching strategies. Nevertheless, these algorithms become impractical above certain instance sizes due to the large number of scenarios that need to be explored, and parallelization techniques are necessary to improve performance. In previous works, centralized and decentralized techniques have been implemented to scale up parallelism in branching algorithms while attempting to reduce communication overhead, which plays a significant role in massively parallel implementations because of the messages passed between processes. Our work thus consists of the development of a fully generic C++ library, named GemPBA, to speed up almost any branching algorithm with massive parallelization, along with a novel and simple dynamic load balancing tool that reduces the number of passed messages by sending high-priority tasks first. Our approach uses a hybrid centralized-decentralized strategy, which relies on a central process in charge of assigning worker roles through messages of only a few bits, so that tasks do not need to pass through a central processor. Also, a working processor spawns new tasks if and only if there are processors available to receive them, thus guaranteeing their transfer and notably decreasing the communication overhead. We performed our experiments on the Minimum Vertex Cover problem, which showed remarkable results, being capable of solving even the toughest DIMACS graphs with a simple MVC algorithm.
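
    The sketch below illustrates, in C++, the branching structure of a simple Minimum Vertex Cover solver together with the "spawn a task only if a receiver is available" rule described above; it is a shared-memory stand-in using std::async and an assumed idle-worker counter, not the GemPBA API or its MPI-based implementation.

        // Illustrative sketch of branch-and-bound MVC with availability-gated spawning.
        #include <atomic>
        #include <future>
        #include <vector>
        #include <iostream>

        using Graph = std::vector<std::vector<int>>;   // adjacency lists

        std::atomic<int> best{1 << 30};       // best cover size found so far
        std::atomic<int> idleWorkers{3};      // hypothetical idle-worker count

        // Pick any remaining edge (u, v); return false if none is left.
        bool pickEdge(const Graph& g, const std::vector<bool>& removed, int& u, int& v) {
            for (u = 0; u < (int)g.size(); ++u) {
                if (removed[u]) continue;
                for (int w : g[u])
                    if (!removed[w]) { v = w; return true; }
            }
            return false;
        }

        void branch(const Graph& g, std::vector<bool> removed, int coverSize) {
            if (coverSize >= best.load()) return;      // bound
            int u, v;
            if (!pickEdge(g, removed, u, v)) {         // no edge left: valid cover
                int cur = best.load();
                while (coverSize < cur && !best.compare_exchange_weak(cur, coverSize)) {}
                return;
            }
            // Branch 1: put u in the cover.  Hand it to another worker only if one
            // is idle (the "spawn if and only if a receiver exists" rule).
            auto left = removed; left[u] = true;
            std::future<void> handle;
            if (idleWorkers.fetch_sub(1) > 0) {
                handle = std::async(std::launch::async, [&, left] {
                    branch(g, left, coverSize + 1);
                    idleWorkers.fetch_add(1);
                });
            } else {
                idleWorkers.fetch_add(1);              // no receiver: keep it local
                branch(g, left, coverSize + 1);
            }
            // Branch 2: put v in the cover (explored by the current worker).
            removed[v] = true;
            branch(g, removed, coverSize + 1);
            if (handle.valid()) handle.wait();
        }

        int main() {
            Graph g = {{1, 2}, {0, 2}, {0, 1, 3}, {2}};   // small example graph
            branch(g, std::vector<bool>(g.size(), false), 0);
            std::cout << "minimum vertex cover size: " << best.load() << "\n";
        }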

    ParBiBit: Parallel tool for binary biclustering on modern distributed-memory systems

    [Abstract]: Biclustering techniques are gaining attention in the analysis of large-scale datasets as they identify two-dimensional submatrices where both rows and columns are correlated. In this work we present ParBiBit, a parallel tool to accelerate the search for interesting biclusters in binary datasets, which are very popular in different fields such as genetics, marketing, or text mining. It is based on the state-of-the-art sequential Java tool BiBit, which has been proved accurate by several studies, especially in scenarios that result in many large biclusters. ParBiBit uses the same methodology as BiBit (grouping the binary information into patterns) and provides the same results. Nevertheless, our tool significantly improves performance thanks to an efficient implementation based on C++11 that includes support for threads and MPI processes in order to exploit the compute capabilities of modern distributed-memory systems, which provide several multicore CPU nodes interconnected through a network. Our performance evaluation with 18 representative input datasets on two different eight-node systems shows that our tool is significantly faster than the original BiBit. Source code in C++ and MPI running on Linux systems, as well as a reference manual, are available at https://sourceforge.net/projects/parbibit/. This work was supported by the Ministry of Economy, Industry and Competitiveness of Spain and FEDER funds of the European Union [grant TIN2016-75845-P (AEI/FEDER/UE)], as well as by Xunta de Galicia (Centro Singular de Investigacion de Galicia accreditation 2016-2019, ref. EDG431G/01).
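
    A minimal sketch of the BiBit-style pattern grouping mentioned above is shown below (C++11, not the ParBiBit source): two rows are ANDed to form a candidate column pattern, and every row containing that pattern joins the bicluster. The column count and data are invented for the example; in ParBiBit the seed pairs would additionally be distributed across threads and MPI processes.

        // Illustrative sketch of growing a bicluster from the AND of two binary rows.
        #include <bitset>
        #include <vector>
        #include <iostream>

        constexpr int kCols = 16;                      // assumed number of columns
        using Row = std::bitset<kCols>;

        struct Bicluster { Row pattern; std::vector<int> rows; };

        // Seed a pattern from rows i and j, then add every row that contains it.
        Bicluster grow(const std::vector<Row>& data, int i, int j, size_t minOnes) {
            Bicluster b;
            b.pattern = data[i] & data[j];             // candidate column pattern
            if (b.pattern.count() < minOnes) return b; // too small to be interesting
            for (int r = 0; r < (int)data.size(); ++r)
                if ((data[r] & b.pattern) == b.pattern)   // row r covers the pattern
                    b.rows.push_back(r);
            return b;
        }

        int main() {
            std::vector<Row> data = {
                Row("1111000011110000"), Row("1111111100000000"),
                Row("1111000000001111"), Row("0000111100001111"),
            };
            Bicluster b = grow(data, 0, 1, /*minOnes=*/3);
            std::cout << "pattern: " << b.pattern << ", rows:";
            for (int r : b.rows) std::cout << ' ' << r;
            std::cout << '\n';
        }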

    Implementation of MD5 Framework for Privacy-Preserving Support for Mobile Healthcare

    The improvement of science and technology has made life so easy and fast that smartphones and other touch-screen minicomputers have become the most trusted personal storage and communication devices. Together with the rapid advances in wireless body sensor networks, it is valuable for medical treatment to become highly adaptable and flexible, delivered by means of smartphones over 2G and 3G network bearers. This has made treatment affordable even for the common individual. In this paper, we introduce privacy-preserving support for mobile healthcare based on message digests, where we use the MD5 algorithm instead of AES. MD5 minimizes memory consumption by reducing the large amount of personal health information (PHI) of the medical user (patient) to a fixed-size digest, which in turn increases the speed at which the data can be sent to the trusted authority (TA) without delay. This study implements a secure and privacy-preserving opportunistic computing framework (SPOC) for mobile healthcare emergencies. Using smartphones and SPOC, resources such as computing power and energy can be pooled to reliably process the intensive PHI of the medical user when he or she is in a critical situation, with minimal privacy disclosure. With this, healthcare authorities can treat patients (medical users) remotely, whether the patients are at home or at other places they visit. This kind of treatment falls under mHealth (mobile healthcare), despite the fact that the m-healthcare service administration still has numerous security and data protection issues to overcome. The main aim of this paper is to bring medical help to patients in remote locations by providing the basic triage of an emergency, sustaining the patient until they can reach a proper medical facility, in addition to providing emergency care at minimal cost.
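
    The following sketch shows the digest step the paper relies on, assuming OpenSSL is available (compile with -lcrypto): it reduces a sample PHI record to a fixed 16-byte MD5 digest. The record contents and field names are invented, and the sketch is not the paper's SPOC implementation; the one-shot MD5() call is deprecated in OpenSSL 3 but keeps the example short.

        // Illustrative sketch: reduce an arbitrarily large PHI record to a fixed
        // 16-byte MD5 digest before transmission.
        #include <openssl/md5.h>
        #include <cstdio>
        #include <string>

        int main() {
            std::string phi = "patient-id:42;bp:120/80;pulse:72;spo2:98";  // sample PHI
            unsigned char digest[MD5_DIGEST_LENGTH];                        // 16 bytes
            MD5(reinterpret_cast<const unsigned char*>(phi.data()), phi.size(), digest);

            // The fixed-size digest is what would accompany the record sent to the
            // trusted authority (TA), regardless of how large the PHI record is.
            for (unsigned char byte : digest) std::printf("%02x", byte);
            std::printf("\n");
            return 0;
        }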

    Big Data for the Real-Time Analysis of the Cherenkov Telescope Array Observatory

    The aim of this thesis work is to design and develop a framework that supports real-time analysis in the context of the Cherenkov Telescope Array (CTA). CTA is an international consortium comprising 1,420 members from over 200 institutes in 31 countries. CTA aims to be the largest and most sensitive next-generation ground-based gamma-ray observatory, able to handle a large volume of data and a high throughput, between 0.5 and 10 GB/s, with a nominal acquisition rate of 6 kHz. To this end, the RTAlib library was developed to provide a simple, high-performance API for storing or caching the data generated during the reconstruction and analysis phase. To cope with CTA's high data rates, RTAlib exploits multiprocessing, multithreading, transactions, and transparent access to MySQL or Redis in order to address different use cases. All of these features were tested, with results within the stated requirements. In particular, the developed library can cache data with Redis, with writer and reader processes working in parallel, at a rate of 8 kHz for writes and 30 kHz for reads. The team I worked with based its software development process on the principles of the Scrum and DevOps approaches, from unit testing up to continuous integration, using publicly accessible tools on GitHub or via Jenkins. Thanks to this approach, high code quality was pursued from the start of the project, and this proved to be one of the most important factors in achieving the results obtained.
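
    As a rough illustration of the Redis caching path described above, the sketch below uses the hiredis C client (assumed available; link with -lhiredis) to push and pop events on a Redis list; the key name and event format are invented, and RTAlib's multiprocessing, transactions, and transparent MySQL/Redis switching are not modelled.

        // Illustrative sketch: buffering events in Redis with a simple writer/reader.
        #include <hiredis/hiredis.h>
        #include <cstdio>
        #include <string>

        int main() {
            redisContext* ctx = redisConnect("127.0.0.1", 6379);
            if (ctx == nullptr || ctx->err) {
                std::fprintf(stderr, "cannot connect to Redis\n");
                return 1;
            }

            // Writer side: push each event onto a list acting as a cache/queue.
            for (int i = 0; i < 1000; ++i) {
                std::string event = "event_id=" + std::to_string(i) + ";energy=1.2";
                redisReply* r = static_cast<redisReply*>(
                    redisCommand(ctx, "LPUSH rta_events %s", event.c_str()));
                if (r) freeReplyObject(r);
            }

            // Reader side (in the same process for brevity): pop one event back.
            redisReply* r = static_cast<redisReply*>(
                redisCommand(ctx, "RPOP rta_events"));
            if (r && r->type == REDIS_REPLY_STRING)
                std::printf("read back: %s\n", r->str);
            if (r) freeReplyObject(r);

            redisFree(ctx);
            return 0;
        }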