    LU Decomposition on Cell Broadband Engine: An Empirical Study to Exploit Heterogeneous Chip Multiprocessors

    To meet the needs of high-performance computing, the Cell Broadband Engine has many features that differ from traditional processors, such as a large number of synergistic processor elements, large register files, and the ability to hide main-storage latency by overlapping computation with DMA transfers. Exploiting those features requires the programmer to carefully tailor programs and simultaneously deal with various performance factors, including locality, load balance, communication overhead, and multi-level parallelism. These factors, unfortunately, depend on each other; an optimization that enhances one factor may degrade another. This paper presents our experience in optimizing LU decomposition, one of the commonly used algebra kernels in scientific computing, on the Cell Broadband Engine. The optimizations exploit task-level, data-level, and communication-level parallelism. We study the effects of different task distribution strategies, prefetching, and software caching, and explore the tradeoffs among different performance factors, stressing the interactions between different optimizations. This work offers some insights into optimization on heterogeneous multi-core processors, including the selection of programming models, considerations in task distribution, and the holistic perspective required in optimization.
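
For readers unfamiliar with the kernel named above, the following is a minimal sketch of LU decomposition in plain Python. It is not the paper's Cell implementation: it assumes no pivoting is needed (every pivot is nonzero), and the names `lu_decompose` and `matmul` are hypothetical helpers introduced only for illustration. The inner row updates are independent of each other, which is the property that task-level and data-level parallelisation of this kernel relies on.

```python
def lu_decompose(a):
    """Return (L, U) such that A = L * U, with L unit lower triangular.

    Assumption: no pivoting, so every pivot u[k][k] must be nonzero.
    """
    n = len(a)
    u = [row[:] for row in a]  # working copy that is reduced to U
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        if u[k][k] == 0:
            raise ZeroDivisionError("zero pivot; pivoting would be required")
        for i in range(k + 1, n):
            factor = u[i][k] / u[k][k]
            l[i][k] = factor
            # Each row i is updated independently of the other rows.
            for j in range(k, n):
                u[i][j] -= factor * u[k][j]
    return l, u


def matmul(x, y):
    """Plain triple-loop matrix product, used here only to check the result."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


a = [[4.0, 3.0, 2.0],
     [2.0, 4.0, 1.0],
     [2.0, 1.0, 5.0]]
l, u = lu_decompose(a)
product = matmul(l, u)  # should reproduce a up to rounding error
```

A production kernel would work on blocks of the matrix rather than single rows, so that each block fits in a small local memory; that refinement is omitted here.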

    Dynamic code mapping for limited local memory systems

    This paper presents heuristics for dynamic management of application code on the limited local memories present in high-performance multi-core processors. Previous techniques formulate the problem using call graphs, which do not capture the temporal ordering of functions. In addition, they use only a conservative estimate of the interference cost between functions to obtain a mapping. As a result, previous techniques are unable to achieve efficient code mapping. The techniques proposed in this paper overcome both limitations and achieve superior code mapping. Experimental results from executing MiBench benchmarks on the Cell processor in the Sony PlayStation 3 demonstrate up to 29% and on average 12% performance improvement, at tolerable compile-time overhead.

    Trace-based Performance Analysis for Hardware Accelerators

    This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. The increasing awareness of the power consumption of computing devices has led to an interest in hybrid computing architectures as well. High-end computers, workstations, and mobile devices have started to employ hardware accelerators to offload computationally intense and parallel tasks, while retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous, so that the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the situation of one host using one device very well. Yet they do not address the needs of the high-performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API-based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. Next, the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. To overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. This novel approach uses program states to display similarities and differences between the potentially very large number of event streams and thus provides a fast way to spot load imbalances.
The thesis concludes with an in-depth case study of PIConGPU, a highly parallel, multi-hybrid plasma physics simulation that benefited greatly from the developed performance analysis methods.

    Perception-motivated parallel algorithms for haptics

    In recent years haptic feedback has been used in several applications, from mobile phones to rehabilitation, from video games to robot-aided surgery. Haptic devices, the interfaces that create the stimulation and reproduce the physical interaction with virtual or remote environments, have been studied, analyzed and developed in many ways. Every innovation in the mechanics, electronics and technical design of the device is valuable; however, it is important to keep the focus of the haptic interaction on the human being, who is the only user of force feedback.
In this thesis we worked on two main topics that are relevant to this aim: a perception-based manipulation of the force signal and the use of modern multicore architectures for the implementation of the haptic controller. With the help of a dedicated experimental setup and using a 6-DOF haptic device, we designed a psychophysical experiment aimed at identifying the differential thresholds of forces and torques applied to the hand-arm system. On the basis of the results obtained we determined a set of task-dependent scaling functions, one for each degree of freedom, that can be used to enhance the human ability to discriminate different stimuli. The perception-based manipulation of the force feedback requires a fast, stable and configurable controller for the haptic interface. One solution is to use the newly available multicore architectures for the implementation of the controller, but many consolidated algorithms have to be ported to these parallel systems. Focusing on a specific problem, matrix pseudoinversion, which is part of many dynamics and kinematics computations in robotics, we showed that it is possible to migrate algorithms that were already implemented in hardware, in particular old algorithms that were inherently parallel and thus not competitive on sequential processors. The main question that remains open is how much effort is required to write these algorithms, usually described in VLSI or schematics, in a modern programming language. We show that careful task decomposition and design permit a direct mapping of the code onto the available cores. In addition, the use of data parallelism on SIMD machines can give good performance when simple vector instructions such as add and shift operations are used. Since these instructions are also present in hardware implementations, the migration can be performed easily. We tested our approach on a Sony PlayStation 3 game console equipped with an IBM Cell Broadband Engine processor.

    Profile-driven parallelisation of sequential programs

    Traditional parallelism detection in compilers is performed by means of static analysis, more specifically data and control dependence analysis. The information that is available at compile time, however, is inherently limited and therefore restricts the parallelisation opportunities. Furthermore, applications written in C – which represent the majority of today’s scientific, embedded and system software – utilise many low-level features and an intricate programming style that forces the compiler to make even more conservative assumptions. Despite the numerous proposals to handle this uncertainty at compile time using speculative optimisation and parallelisation, the software industry still lacks any pragmatic approach that extracts coarse-grain parallelism to exploit the multiple processing units of modern commodity hardware. This thesis introduces a novel approach for extracting and exploiting multiple forms of coarse-grain parallelism from sequential applications written in C. We utilise profiling information to overcome the limitations of static data and control-flow analysis, enabling more aggressive parallelisation. Profiling is performed using an instrumentation scheme operating at the Intermediate Representation (IR) level of the compiler. In contrast to existing approaches that depend on low-level binary tools and debugging information, IR profiling provides precise and direct correlation of profiling information back to the IR structures of the compiler. Additionally, our approach is orthogonal to existing automatic parallelisation approaches, so additional fine-grain parallelism may be exploited. We demonstrate the applicability and versatility of the proposed methodology using two studies that target different forms of parallelism. First, we focus on the exploitation of loop-level parallelism, which is abundant in many scientific and embedded applications.
We evaluate our parallelisation strategy against the NAS and SPEC FP benchmarks on two different multi-core platforms (a shared-memory Intel Xeon SMP and a heterogeneous distributed-memory IBM Cell blade). Empirical evaluation shows that our approach not only yields significant improvements when compared with state-of-the-art parallelising compilers, but comes close to and sometimes exceeds the performance of manually parallelised codes. On average, our methodology achieves 96% of the performance of the hand-tuned parallel benchmarks on the Intel Xeon platform, and a significant speedup on the Cell platform. The second study addresses the problem of partially sequential loops, typically found in implementations of multimedia codecs. We develop a more powerful whole-program representation based on the Program Dependence Graph (PDG) that supports profiling, partitioning and code generation for pipeline parallelism. In addition, we demonstrate how this enhances conventional pipeline parallelisation by incorporating support for multi-level loops and pipeline stage replication in a uniform and automatic way. Experimental results using a set of complex multimedia and stream processing benchmarks confirm the effectiveness of the proposed methodology, which yields speedups of up to 4.7 on an eight-core Intel Xeon machine.
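
The pipeline parallelism mentioned in the second study can be illustrated with a small hand-written sketch. It is only an assumption about what such a pipeline looks like at run time, not output of the thesis's compiler: each stage runs in its own thread and passes items to the next stage through a queue. The names `stage`, `run_pipeline` and `SENTINEL` are hypothetical, and stage replication is not shown.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the input stream


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Run funcs as consecutive pipeline stages, one thread per stage."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(SENTINEL)
    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Two stages: the second can work on item n while the first handles item n+1.
results = run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2])
```

Because every stage is a single thread reading from a FIFO queue, the results come out in input order. Note that CPython threads do not run pure-Python code in parallel, so this sketch shows the structure of a pipeline rather than a speedup.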

    Software caching techniques and hardware optimizations for on-chip local memories

    Although caches remain the most viable L1 memories in processors, on-chip local memories have received considerable attention lately. Local memories are an interesting design option due to their many benefits: smaller area, reduced energy consumption, and fast, constant access time. These benefits are especially interesting for the design of modern multicore processors, since power and latency are important constraints in computer architecture today. Also, local memories do not generate coherency traffic, which is important for the scalability of multicore systems. Unfortunately, local memories have not yet been well accepted in modern processors, mainly due to their poor programmability. Systems with on-chip local memories do not have hardware support for transparent data transfers between local and global memories, and thus the difficulty of programming is one of the main impediments to the broad acceptance of those systems. This thesis addresses software and hardware optimizations regarding the programmability and the usage of on-chip local memories in the context of both single-core and multicore systems. The software optimizations are related to software caching techniques. A software cache is a robust approach to providing the user with a transparent view of the memory architecture, but it can suffer from poor performance. In this thesis, we start by optimizing the traditional software cache, proposing a hierarchical, hybrid software-cache architecture. Afterwards, we develop a few optimizations in order to speed up our hybrid software cache as much as possible. As a result of the software optimizations, our hybrid software cache performs 4 to 10 times faster than a traditional software cache on a set of NAS parallel benchmarks. We do not stop with software caching.
We also cover other aspects of architectures with on-chip local memories, such as the quality of the generated code and its correspondence with the quality of buffer management in local memories, in order to improve the performance of these architectures. We pursue this research until we reach the limit of what software can achieve, and then propose optimizations at the hardware level. Two hardware proposals are presented in this thesis. One relaxes the alignment constraints imposed in architectures with on-chip local memories; the other accelerates the management of local memories by providing hardware support for the majority of actions performed in our software cache.
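
To make the idea of a software cache concrete, here is a minimal sketch of a traditional direct-mapped software cache. It is not the hierarchical hybrid design proposed in the thesis. The class name `SoftwareCache`, the line and cache sizes, and the use of a Python list as "global memory" are all assumptions made for illustration; on real hardware the line fill would be a DMA transfer into the local memory.

```python
class SoftwareCache:
    """Direct-mapped, read-only software cache over a list acting as global memory."""

    def __init__(self, global_memory, num_lines=4, line_size=4):
        self.mem = global_memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines   # which memory line each slot holds
        self.lines = [None] * num_lines  # local copies of the cached lines
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        line_no = addr // self.line_size
        slot = line_no % self.num_lines  # direct mapping: one possible slot
        if self.tags[slot] != line_no:
            # Miss: copy the whole line from global memory into the slot.
            self.misses += 1
            start = line_no * self.line_size
            self.lines[slot] = self.mem[start:start + self.line_size]
            self.tags[slot] = line_no
        else:
            self.hits += 1
        return self.lines[slot][addr % self.line_size]


memory = list(range(100, 164))            # 64 words of "global memory"
cache = SoftwareCache(memory)
values = [cache.read(i) for i in range(8)]  # touches two lines of four words
```

Every access pays for the tag check in software, even on a hit. That per-access overhead is the kind of cost that makes a plain software cache slow.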

    Design and implementation of an array language for computational science on a heterogeneous multicore architecture

    The packing of multiple processor cores onto a single chip has become a mainstream solution to fundamental physical issues relating to the microscopic scales employed in the manufacture of semiconductor components. Multicore architectures provide lower clock speeds per core, while aggregate floating-point capability continues to increase. Heterogeneous multicore chips, such as the Cell Broadband Engine (CBE) and modern graphics chips, also address the related issue of an increasing mismatch between high processor speeds and high latency to main memory. Such chips tackle this memory wall by providing addressable caches, increased bandwidth to main memory, and fast thread context switching. An associated cost is often reduced functionality of the individual accelerator cores and increased complexity in their programming. This dissertation investigates the application of a programming language that supports the first-class use of arrays, and is capable of automatically parallelising array expressions, to the heterogeneous multicore domain of the CBE, as found in the Sony PlayStation 3 (PS3). The language is a pre-existing and well-documented proper subset of Fortran, known as the ‘F’ programming language. A bespoke compiler, referred to as E, is developed to support this aim, and written in the Haskell programming language. The output of the compiler is in an extended C++ dialect known as Offload C++, which targets the PS3. A significant feature of this language is its use of multiple, statically typed address spaces. By focusing on generic, polymorphic interfaces for both the generated and the hand-constructed code, a number of interesting design patterns relating to memory locality are introduced. A suite of medium-sized (100-700 lines), real-world benchmark programs is used to evaluate the performance, correctness, and scalability of the compiler technology. Absolute speedup values well in excess of one are observed for all of the programs.
The work ultimately demonstrates that an array language can significantly reduce the effort expended to utilise a parallel heterogeneous multicore architecture, while retaining high performance. A substantial, related advantage in using standard ‘F’ is that any Fortran compiler can create debuggable, and competitively performing serial programs

    Optimisation multi-niveau d'une application de traitement d'images sur machines parallèles

    This thesis aims to define a design methodology for high-performance applications on future embedded processors. These architectures require efficient use of their different levels of parallelism (fine-grain, coarse-grain) and good handling of inter-processor communication and memory accesses. To study this methodology, we used a target processor representative of this type of emerging architecture, the Cell BE processor. We also chose a low-level image processing application, the Harris point-of-interest detector, which is representative of typical low-level image processing applications that are regular, compute-intensive and highly parallel. We studied several parallelisation schemes for this application and established different optimisation techniques by adapting the software to the specific SIMD units of the Cell processor. Efficient use of memory additionally requires both good exploitation of transfers and an optimal arrangement of data in memory. We also developed a library named CELL MPI that simplifies and automates communication and synchronisation across the processing elements, using a simplified and implicit programming interface. This work allowed us to develop a methodology that simplifies the design of an optimised parallel algorithm on the Cell processor. We designed a parallel programming tool named SKELL BE, which is based on algorithmic skeletons. This programming model provides an original solution: a code generator based on metaprogramming. Using SKELL BE, we can automatically obtain very high-performance applications that use the Cell architecture efficiently, as shown by a comparison, on a set of test programs, with several other tools dedicated to this processor.

    Parallelisierung von Algorithmen zur Nutzung auf Architekturen mit Teilwortparallelität

    Technological progress allows the implementation of increasingly complex processor architectures on a single chip. One trend of recent years is the implementation of more and more execution units on one chip. This poses new challenges for mapping algorithms onto such architectures, because all execution units should be used efficiently during the execution of the algorithm. The focus of this dissertation is the exploitation of the parallelism of processor arrays with subword parallelism. Such architectures allow parallel execution on several levels. Therefore an algorithm mapping strategy was developed with a particular focus on subword parallelism. This mapping strategy is based on the methods of processor array design. Processor arrays are regularly arranged processor elements that communicate only with their neighboring elements. Data input and output are handled by the processor elements on the border of the array. Each processor element can have several functional units, which execute the computational operations of the algorithm. Subword parallelism denotes the ability to split the data path of a functional unit into several narrower data paths for the parallel processing of data with small word width.
The developed mapping strategy is subdivided into two steps, "Preprocessing" and "Multi-Level Modified Copartitioning" (MMC for short), where MMC also names the method applied in that step. The preprocessing step transforms the algorithm in such a way that the transformed algorithm can be mapped quickly and efficiently onto the target architecture. For this purpose an optimization problem was formulated that determines the parameters of the algorithm transformation step by step. Multi-Level Modified Copartitioning is used to adapt the algorithm to the target architecture step by step. Furthermore, the mapping methodology allows the exploitation of the local registers in the processor elements and the adaptation of the algorithm to the memory architecture to which the processor array is connected. The first level of the MMC transforms an algorithm with operations on single data items into an algorithm with subword-parallel operations. With the second copartitioning level, the algorithm is adapted to the local registers in the processor elements and to the processor array. Further copartitioning levels can be used to adapt the algorithm to the memory architecture.
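
Subword parallelism as described above can be imitated in software. The following sketch is an illustration under stated assumptions, not part of the dissertation's method: it treats a 32-bit word as four independent 8-bit lanes and adds two such words with wrap-around per lane, without letting carries cross lane boundaries. The lane width, the masks, and the helper names `pack`, `unpack` and `swar_add` are chosen here for the example.

```python
LANES = 4
LANE_BITS = 8
HIGH = 0x80808080  # the top bit of each 8-bit lane
LOW = 0x7F7F7F7F   # the lower seven bits of each 8-bit lane


def pack(values):
    """Pack four 8-bit values into one 32-bit word, lane 0 in the low byte."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (i * LANE_BITS)) & 0xFF for i in range(LANES)]


def swar_add(x, y):
    """Add four 8-bit lanes at once, each lane wrapping modulo 256."""
    # Adding only the low seven bits cannot carry out of a lane.
    low_sum = (x & LOW) + (y & LOW)
    # The top bit of each lane is the XOR of both top bits and the carry
    # that the seven-bit sum left in that position.
    return low_sum ^ ((x ^ y) & HIGH)


a = pack([1, 200, 255, 16])
b = pack([2, 100, 1, 240])
lanes = unpack(swar_add(a, b))  # [3, 44, 0, 0]
```

Hardware with subword parallelism performs the same lane-wise operation in a single instruction; the masking above is only needed because an ordinary integer addition would let carries spill into the neighbouring lane.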