64 research outputs found

    Efficient parallel processing with optical interconnections

    Get PDF
    With the advances in VLSI technology, it is now possible to build chips which can each contain thousands of processors. The efficiency of such chips in executing parallel algorithms heavily depends on the interconnection topology of the processors. It is not possible to build a fully interconnected network of processors with constant fan-in/fan-out using electrical interconnections. Free space optics is a remedy to this limitation. Qualities exclusive to the optical medium are its ability to be directed for propagation in free space and the property that optical channels can cross in space without any interference. In this thesis, we present an electro-optical interconnected architecture named Optical Reconfigurable Mesh (ORM). It is based on an existing optical model of computation. There are two layers in the architecture. The processing layer is a reconfigurable mesh and the deflecting layer contains optical devices to deflect light beams. ORM provides three types of communication mechanisms. The first is for arbitrary planar connections among sets of locally connected processors using the reconfigurable mesh. The second is for arbitrary connections among N of the processors using the electrical buses on the processing layer and N2 fixed passive deflecting units on the deflection layer. The third is for arbitrary connections among any of the N2 processors using the N2 mechanically reconfigurable deflectors in the deflection layer. The third type of communication mechanisms is significantly slower than the other two. Therefore, it is desirable to avoid reconfiguring this type of communication during the execution of the algorithms. Instead, the optical reconfiguration can be done before the execution of each algorithm begins. Determining a right configuration that would be suitable for the entire configuration of a task execution is studied in this thesis. The basic data movements for each of the mechanisms are studied. Finally, to show the power of ORM, we use all three types of communication mechanisms in the first O(logN) time algorithm for finding the convex hulls of all figures in an N x N binary image presented in this thesis

    On implementing dynamically reconfigurable architectures

    Get PDF
    Dynamically reconfigurable architectures have the ability to change their structure at each step of a computation. This dissertation studies various aspects of implementing dynamic reconfiguration, ranging from hardware building blocks and low-level architectures to modeling issues and high-level algorithm design. First we derive conditions under which classes of communication sets can be optimally scheduled on the circuit-switched tree (CST). Then we present a method to configure the CST to perform in constant time all communications scheduled for a step. This results in a constant time implementation of a step of a segmentable bus, a fundamental dynamically reconfigurable structure. We introduce a new bus delay measure (bends-cost) and define the bends-cost LR-Mesh; the LR-Mesh is a widely used reconfigurable model. Unlike the (idealized) LR-Mesh, which ignores bus delay, the bends-cost LR-Mesh uses the number of bends in a bus to estimate its delay. We present an implementation for which the bends-cost is an accurate estimate of the actual delay. We present algorithms to simulate various LR-Mesh configuration classes on the bends-cost LR-Mesh. For semimonotonic configurations, a Θ(N)*Θ(N) bends-cost LR-Mesh with bus delay at most D can simulate a step of the idealized N*N LR-Mesh in O((log N/(log D-log Δ))2) time (where Δ is the delay of an N-element segmentable bus), while employing about the same number of processors. For some special cases this time reduces to O(log N/(log D-log Δ)). If D=Nε, for an arbitrarily small constant ε \u3e 0, then the running times of bends-cost LR-Mesh algorithms are within a constant of their idealized counterparts. We also prove that with a polynomial blowup in the number of processors and D=Nε, the bends-cost LR-Mesh can simulate any step of an idealized LR-Mesh in constant time, thereby establishing that these models have the same power. We present an implementation (in VHDL) of the Enhanced Self Reconfigurable Gate Array (E-SRGA) architecture and perform a cost-benefit study for different dynamic reconfiguration features. This study shows our approach to be feasible

    Generic low power reconfigurable distributed arithmetic processor

    Get PDF
    Higher performance, lower cost, increasingly minimizing integrated circuit components, and higher packaging density of chips are ongoing goals of the microelectronic and computer industry. As these goals are being achieved, however, power consumption and flexibility are increasingly becoming bottlenecks that need to be addressed with the new technology in Very Large-Scale Integrated (VLSI) design. For modern systems, more energy is required to support the powerful computational capability which accords with the increasing requirements, and these requirements cause the change of standards not only in audio and video broadcasting but also in communication such as wireless connection and network protocols. Powerful flexibility and low consumption are repellent, but their combination in one system is the ultimate goal of designers. A generic domain-specific low-power reconfigurable processor for the distributed arithmetic algorithm is presented in this dissertation. This domain reconfigurable processor features high efficiency in terms of area, power and delay, which approaches the performance of an ASIC design, while retaining the flexibility of programmable platforms. The architecture not only supports typical distributed arithmetic algorithms which can be found in most still picture compression standards and video conferencing standards, but also offers implementation ability for other distributed arithmetic algorithms found in digital signal processing, telecommunication protocols and automatic control. In this processor, a simple reconfigurable low power control unit is implemented with good performance in area, power and timing. The generic characteristic of the architecture makes it applicable for any small and medium size finite state machines which can be used as control units to implement complex system behaviour and can be found in almost all engineering disciplines. Furthermore, to map target applications efficiently onto the proposed architecture, a new algorithm is introduced for searching for the best common sharing terms set and it keeps the area and power consumption of the implementation at low level. The software implementation of this algorithm is presented, which can be used not only for the proposed architecture in this dissertation but also for all the implementations with adder-based distributed arithmetic algorithms. In addition, some low power design techniques are applied in the architecture, such as unsymmetrical design style including unsymmetrical interconnection arranging, unsymmetrical PTBs selection and unsymmetrical mapping basic computing units. All these design techniques achieve extraordinary power consumption saving. It is believed that they can be extended to more low power designs and architectures. The processor presented in this dissertation can be used to implement complex, high performance distributed arithmetic algorithms for communication and image processing applications with low cost in area and power compared with the traditional methods

    Decoupling User Interface Design Using Libraries of Reusable Components

    Get PDF
    The integration of electronic and mechanical hardware, software and interaction design presents a challenging design space for researchers developing physical user interfaces and interactive artifacts. Currently in the academic research community, physical user interfaces and interactive artifacts are predominantly designed and prototyped either as one-off instances from the ground up, or using functionally rich hardware toolkits and prototyping systems. During this prototyping phase, undertaking an integral design of the interface or interactive artifact’s electronic hardware is frequently constraining due to the tight couplings between the different design realms and the typical need for iterations as the design matures. Several current toolkit designs have consequently embraced component-sharing and component-swapping modular designs with a view to extending flexibility and improving researcher freedom by disentangling and softening the cause-effect couplings. Encouraged by early successes of these toolkits, this research work strives to further enhance these freedoms by pursuing an alternative style and dimension of hardware modularity. Another motivation is our goal to facilitate the design and development of certain classes of interfaces and interactive artifacts for which current electronic design approaches are argued to be restrictively constraining (e.g., relating to scale and complexity). Unfortunately, this goal of a new platform architecture is met with conceptual and technical challenges on the embedded system networking front. In response, this research investigates and extends a growing field of multi-module distributed embedded systems. We identify and characterize a sub-class of these systems, calling them embedded aggregates. We then outline and develop a framework for realizing the embedded aggregate class of systems. Toward this end, this thesis examines several architectures, topologies and communication protocols, making the case for and substantial steps toward the development of a suite of networking protocols and control algorithms to support embedded aggregates. We define a set of protocols, mechanisms and communication packets that collectively form the underlying framework for the aggregates. Following the aggregates design, we develop blades and tiles to support user interface researchers

    Running stream-like programs on heterogeneous multi-core systems

    Get PDF
    All major semiconductor companies are now shipping multi-cores. Phones, PCs, laptops, and mobile internet devices will all require software that can make effective use of these cores. Writing high-performance parallel software is difficult, time-consuming and error prone, increasing both time-to-market and cost. Software outlives hardware; it typically takes longer to develop new software than hardware, and legacy software tends to survive for a long time, during which the number of cores per system will increase. Development and maintenance productivity will be improved if parallelism and technical details are managed by the machine, while the programmer reasons about the application as a whole. Parallel software should be written using domain-specific high-level languages or extensions. These languages reveal implicit parallelism, which would be obscured by a sequential language such as C. When memory allocation and program control are managed by the compiler, the program's structure and data layout can be safely and reliably modified by high-level compiler transformations. One important application domain contains so-called stream programs, which are structured as independent kernels interacting only through one-way channels, called streams. Stream programming is not applicable to all programs, but it arises naturally in audio and video encode and decode, 3D graphics, and digital signal processing. This representation enables high-level transformations, including kernel unrolling and kernel fusion. This thesis develops new compiler and run-time techniques for stream programming. The first part of the thesis is concerned with a statically scheduled stream compiler. It introduces a new static partitioning algorithm, which determines which kernels should be fused, in order to balance the loads on the processors and interconnects. A good partitioning algorithm is crucial if the compiler is to produce efficient code. The algorithm also takes account of downstream compiler passes---specifically software pipelining and buffer allocation---and it models the compiler's ability to fuse kernels. The latter is important because the compiler may not be able to fuse arbitrary collections of kernels. This thesis also introduces a static queue sizing algorithm. This algorithm is important when memory is distributed, especially when local stores are small. The algorithm takes account of latencies and variations in computation time, and is constrained by the sizes of the local memories. The second part of this thesis is concerned with dynamic scheduling of stream programs. First, it investigates the performance of known online, non-preemptive, non-clairvoyant dynamic schedulers. Second, it proposes two dynamic schedulers for stream programs. The first is specifically for one-dimensional stream programs. The second is more general: it does not need to be told the stream graph, but it has slightly larger overhead. This thesis also introduces some support tools related to stream programming. StarssCheck is a debugging tool, based on Valgrind, for the StarSs task-parallel programming language. It generates a warning whenever the program's behaviour contradicts a pragma annotation. Such behaviour could otherwise lead to exceptions or race conditions. StreamIt to OmpSs is a tool to convert a streaming program in the StreamIt language into a dynamically scheduled task based program using StarSs.Totes les empreses de semiconductors produeixen actualment multi-cores. Mòbils,PCs, portàtils, i dispositius mòbils d’Internet necessitaran programari quefaci servir eficientment aquests cores. Escriure programari paral·lel d’altrendiment és difícil, laboriós i propens a errors, incrementant tant el tempsde llançament al mercat com el cost. El programari té una vida més llarga queel maquinari; típicament pren més temps desenvolupar nou programi que noumaquinari, i el programari ja existent pot perdurar molt temps, durant el qualel nombre de cores dels sistemes incrementarà. La productivitat dedesenvolupament i manteniment millorarà si el paral·lelisme i els detallstècnics són gestionats per la màquina, mentre el programador raona sobre elconjunt de l’aplicació.El programari paral·lel hauria de ser escrit en llenguatges específics deldomini. Aquests llenguatges extrauen paral·lelisme implícit, el qual és ocultatper un llenguatge seqüencial com C. Quan l’assignació de memòria i lesestructures de control són gestionades pel compilador, l’estructura iorganització de dades del programi poden ser modificades de manera segura ifiable per les transformacions d’alt nivell del compilador.Un dels dominis de l’aplicació importants és el que consta dels programes destream; aquest programes són estructurats com a nuclis independents queinteractuen només a través de canals d’un sol sentit, anomenats streams. Laprogramació de streams no és aplicable a tots els programes, però sorgeix deforma natural en la codificació i descodificació d’àudio i vídeo, gràfics 3D, iprocessament de senyals digitals. Aquesta representació permet transformacionsd’alt nivell, fins i tot descomposició i fusió de nucli.Aquesta tesi desenvolupa noves tècniques de compilació i sistemes en tempsd’execució per a programació de streams. La primera part d’aquesta tesi esfocalitza amb un compilador de streams de planificació estàtica. Presenta unnou algorisme de partició estàtica, que determina quins nuclis han de serfusionats, per tal d’equilibrar la càrrega en els processadors i en lesinterconnexions. Un bon algorisme de particionat és fonamental per tal de queel compilador produeixi codi eficient. L’algorisme també té en compte elspassos de compilació subseqüents---específicament software pipelining il’arranjament de buffers---i modela la capacitat del compilador per fusionarnuclis. Aquesta tesi també presenta un algorisme estàtic de redimensionament de cues.Aquest algorisme és important quan la memòria és distribuïda, especialment quanles memòries locals són petites. L’algorisme té en compte latències ivariacions en els temps de càlcul, i considera el límit imposat per la mida deles memòries locals.La segona part d’aquesta tesi es centralitza en la planificació dinàmica deprogrames de streams. En primer lloc, investiga el rendiment dels planificadorsdinàmics online, non-preemptive i non-clairvoyant. En segon lloc, proposa dosplanificadors dinàmics per programes de stream. El primer és específicament pera programes de streams unidimensionals. El segon és més general: no necessitael graf de streams, però els overheads són una mica més grans.Aquesta tesi també presenta un conjunt d’eines de suport relacionades amb laprogramació de streams. StarssCheck és una eina de depuració, que és basa enValgrind, per StarSs, un llenguatge de programació paral·lela basat en tasques.Aquesta eina genera un avís cada vegada que el comportament del programa estàen contradicció amb una anotació pragma. Aquest comportament d’una altra manerapodria causar excepcions o situacions de competició. StreamIt to OmpSs és unaeina per convertir un programa de streams codificat en el llenguatge StreamIt aun programa de tasques en StarSs planificat de forma dinàmica.Postprint (published version

    Multiple Biolgical Sequence Alignment: Scoring Functions, Algorithms, and Evaluations

    Get PDF
    Aligning multiple biological sequences such as protein sequences or DNA/RNA sequences is a fundamental task in bioinformatics and sequence analysis. These alignments may contain invaluable information that scientists need to predict the sequences\u27 structures, determine the evolutionary relationships between them, or discover drug-like compounds that can bind to the sequences. Unfortunately, multiple sequence alignment (MSA) is NP-Complete. In addition, the lack of a reliable scoring method makes it very hard to align the sequences reliably and to evaluate the alignment outcomes. In this dissertation, we have designed a new scoring method for use in multiple sequence alignment. Our scoring method encapsulates stereo-chemical properties of sequence residues and their substitution probabilities into a tree-structure scoring scheme. This new technique provides a reliable scoring scheme with low computational complexity. In addition to the new scoring scheme, we have designed an overlapping sequence clustering algorithm to use in our new three multiple sequence alignment algorithms. One of our alignment algorithms uses a dynamic weighted guidance tree to perform multiple sequence alignment in progressive fashion. The use of dynamic weighted tree allows errors in the early alignment stages to be corrected in the subsequence stages. Other two algorithms utilize sequence knowledge-bases and sequence consistency to produce biological meaningful sequence alignments. To improve the speed of the multiple sequence alignment, we have developed a parallel algorithm that can be deployed on reconfigurable computer models. Analytically, our parallel algorithm is the fastest progressive multiple sequence alignment algorithm

    Driving the Network-on-Chip Revolution to Remove the Interconnect Bottleneck in Nanoscale Multi-Processor Systems-on-Chip

    Get PDF
    The sustained demand for faster, more powerful chips has been met by the availability of chip manufacturing processes allowing for the integration of increasing numbers of computation units onto a single die. The resulting outcome, especially in the embedded domain, has often been called SYSTEM-ON-CHIP (SoC) or MULTI-PROCESSOR SYSTEM-ON-CHIP (MP-SoC). MPSoC design brings to the foreground a large number of challenges, one of the most prominent of which is the design of the chip interconnection. With a number of on-chip blocks presently ranging in the tens, and quickly approaching the hundreds, the novel issue of how to best provide on-chip communication resources is clearly felt. NETWORKS-ON-CHIPS (NoCs) are the most comprehensive and scalable answer to this design concern. By bringing large-scale networking concepts to the on-chip domain, they guarantee a structured answer to present and future communication requirements. The point-to-point connection and packet switching paradigms they involve are also of great help in minimizing wiring overhead and physical routing issues. However, as with any technology of recent inception, NoC design is still an evolving discipline. Several main areas of interest require deep investigation for NoCs to become viable solutions: • The design of the NoC architecture needs to strike the best tradeoff among performance, features and the tight area and power constraints of the onchip domain. • Simulation and verification infrastructure must be put in place to explore, validate and optimize the NoC performance. • NoCs offer a huge design space, thanks to their extreme customizability in terms of topology and architectural parameters. Design tools are needed to prune this space and pick the best solutions. • Even more so given their global, distributed nature, it is essential to evaluate the physical implementation of NoCs to evaluate their suitability for next-generation designs and their area and power costs. This dissertation performs a design space exploration of network-on-chip architectures, in order to point-out the trade-offs associated with the design of each individual network building blocks and with the design of network topology overall. The design space exploration is preceded by a comparative analysis of state-of-the-art interconnect fabrics with themselves and with early networkon- chip prototypes. The ultimate objective is to point out the key advantages that NoC realizations provide with respect to state-of-the-art communication infrastructures and to point out the challenges that lie ahead in order to make this new interconnect technology come true. Among these latter, technologyrelated challenges are emerging that call for dedicated design techniques at all levels of the design hierarchy. In particular, leakage power dissipation, containment of process variations and of their effects. The achievement of the above objectives was enabled by means of a NoC simulation environment for cycleaccurate modelling and simulation and by means of a back-end facility for the study of NoC physical implementation effects. Overall, all the results provided by this work have been validated on actual silicon layout

    Segurança de computadores por meio de autenticação intrínseca de hardware

    Get PDF
    Orientadores: Guido Costa Souza de Araújo, Mario Lúcio Côrtes e Diego de Freitas AranhaTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Neste trabalho apresentamos Computer Security by Hardware-Intrinsic Authentication (CSHIA), uma arquitetura de computadores segura para sistemas embarcados que tem como objetivo prover autenticidade e integridade para código e dados. Este trabalho está divido em três fases: Projeto da Arquitetura, sua Implementação, e sua Avaliação de Segurança. Durante a fase de projeto, determinamos como integridade e autenticidade seriam garantidas através do uso de Funções Fisicamente Não Clonáveis (PUFs) e propusemos um algoritmo de extração de chaves criptográficas de memórias cache de processadores. Durante a implementação, flexibilizamos o projeto da arquitetura para fornecer diferentes possibilidades de configurações sem comprometimento da segurança. Então, avaliamos seu desempenho levando em consideração o incremento em área de chip, aumento de consumo de energia e memória adicional para diferentes configurações. Por fim, analisamos a segurança de PUFs e desenvolvemos um novo ataque de canal lateral que circunvê a propriedade de unicidade de PUFs por meio de seus elementos de construçãoAbstract: This work presents Computer Security by Hardware-Intrinsic Authentication (CSHIA), a secure computer architecture for embedded systems that aims at providing authenticity and integrity for code and data. The work encompassed three phases: Design, Implementation, and Security Evaluation. In design, we laid out the basic ideas behind CSHIA, namely, how integrity and authenticity are employed through the use of Physical Unclonable Functions (PUFs), and we proposed an algorithm to extract cryptographic keys from the intrinsic memories of processors. In implementation, we made CSHIA¿s design more flexible, allowing different configurations without compromising security. Then, we evaluated CSHIA¿s performance and overheads, such as area, energy, and memory, for multiple configurations. Finally, we evaluated security of PUFs, which led us to develop a new side-channel-based attack that enabled us to circumvent PUFs¿ uniqueness property through their architectural elementsDoutoradoCiência da ComputaçãoDoutor em Ciência da Computação2015/06829-2; 2016/25532-3147614/2014-7FAPESPCNP

    Network control for a multi-user transputer-based system.

    Get PDF
    A dissertation submitted to the Faculty of Engineering, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science in EngineeringThe MC2/64 system is a configureable multi-user transputer- based system which was designed using a modular approach. The MC2/64 consists of MC2 Clusters which are connected using a modified Clos network. The MC2 Clusters were designed and realised as completely configurable modules using and extending an algorithm based on Eulerian cycles through a requested graph. This dissertation discusses the configuration algorithm and the extensions made to the algorithm for the MC2 Clusters. The total MC2/64 system is not completely configurable as a MC2 Cluster releases only a limited number of links for inter-cluster connections. This dissertation analyses the configurability of MC2/64, but also presents algorithms which enhance the usability of the system from the user's point of view. The design and the implementation of the network control software are also submitted as topics in this dissertation. The network control software must allow multiple users to use the system, but without them influencing each other's transputer domains. This dissertation therefore seeks to give an overview of network control problems and the solutions implemented in current MC2/64 systems. The results of the research done for this dissertation will hopefully aid in the design of future MC2 systems which will provide South Africa with much needed, low cost, high performance computing power.Andrew Chakane 201

    High performance reconfigurable architectures for biological sequence alignment

    Get PDF
    Bioinformatics and computational biology (BCB) is a rapidly developing multidisciplinary field which encompasses a wide range of domains, including genomic sequence alignments. It is a fundamental tool in molecular biology in searching for homology between sequences. Sequence alignments are currently gaining close attention due to their great impact on the quality aspects of life such as facilitating early disease diagnosis, identifying the characteristics of a newly discovered sequence, and drug engineering. With the vast growth of genomic data, searching for a sequence homology over huge databases (often measured in gigabytes) is unable to produce results within a realistic time, hence the need for acceleration. Since the exponential increase of biological databases as a result of the human genome project (HGP), supercomputers and other parallel architectures such as the special purpose Very Large Scale Integration (VLSI) chip, Graphic Processing Unit (GPUs) and Field Programmable Gate Arrays (FPGAs) have become popular acceleration platforms. Nevertheless, there are always trade-off between area, speed, power, cost, development time and reusability when selecting an acceleration platform. FPGAs generally offer more flexibility, higher performance and lower overheads. However, they suffer from a relatively low level programming model as compared with off-the-shelf microprocessors such as standard microprocessors and GPUs. Due to the aforementioned limitations, the need has arisen for optimized FPGA core implementations which are crucial for this technology to become viable in high performance computing (HPC). This research proposes the use of state-of-the-art reprogrammable system-on-chip technology on FPGAs to accelerate three widely-used sequence alignment algorithms; the Smith-Waterman with affine gap penalty algorithm, the profile hidden Markov model (HMM) algorithm and the Basic Local Alignment Search Tool (BLAST) algorithm. The three novel aspects of this research are firstly that the algorithms are designed and implemented in hardware, with each core achieving the highest performance compared to the state-of-the-art. Secondly, an efficient scheduling strategy based on the double buffering technique is adopted into the hardware architectures. Here, when the alignment matrix computation task is overlapped with the PE configuration in a folded systolic array, the overall throughput of the core is significantly increased. This is due to the bound PE configuration time and the parallel PE configuration approach irrespective of the number of PEs in a systolic array. In addition, the use of only two configuration elements in the PE optimizes hardware resources and enables the scalability of PE systolic arrays without relying on restricted onboard memory resources. Finally, a new performance metric is devised, which facilitates the effective comparison of design performance between different FPGA devices and families. The normalized performance indicator (speed-up per area per process technology) takes out advantages of the area and lithography technology of any FPGA resulting in fairer comparisons. The cores have been designed using Verilog HDL and prototyped on the Alpha Data ADM-XRC-5LX card with the Virtex-5 XC5VLX110-3FF1153 FPGA. The implementation results show that the proposed architectures achieved giga cell updates per second (GCUPS) performances of 26.8, 29.5 and 24.2 respectively for the acceleration of the Smith-Waterman with affine gap penalty algorithm, the profile HMM algorithm and the BLAST algorithm. In terms of speed-up improvements, comparisons were made on performance of the designed cores against their corresponding software and the reported FPGA implementations. In the case of comparison with equivalent software execution, acceleration of the optimal alignment algorithm in hardware yielded an average speed-up of 269x as compared to the SSEARCH 35 software. For the profile HMM-based sequence alignment, the designed core achieved speed-up of 103x and 8.3x against the HMMER 2.0 and the latest version of HMMER (version 3.0) respectively. On the other hand, the implementation of the gapped BLAST with the two-hit method in hardware achieved a greater than tenfold speed-up compared to the latest NCBI BLAST software. In terms of comparison against other reported FPGA implementations, the proposed normalized performance indicator was used to evaluate the designed architectures fairly. The results showed that the first architecture achieved more than 50 percent improvement, while acceleration of the profile HMM sequence alignment in hardware gained a normalized speed-up of 1.34. In the case of the gapped BLAST with the two-hit method, the designed core achieved 11x speed-up after taking out advantages of the Virtex-5 FPGA. In addition, further analysis was conducted in terms of cost and power performances; it was noted that, the core achieved 0.46 MCUPS per dollar spent and 958.1 MCUPS per watt. This shows that FPGAs can be an attractive platform for high performance computation with advantages of smaller area footprint as well as represent economic ‘green’ solution compared to the other acceleration platforms. Higher throughput can be achieved by redeploying the cores on newer, bigger and faster FPGAs with minimal design effort
    • …
    corecore