    Algebraic symmetries of generic $(m+1)$-dimensional periodic Costas arrays

    In this work we present two generators for the group of symmetries of the generic $(m+1)$-dimensional periodic Costas arrays over elementary abelian $(\mathbb{Z}_p)^m$ groups: one that is defined by multiplication on $m$ dimensions and the other by shear (addition) on $m$ dimensions. Through exhaustive search we observe that these two generators characterize the group of symmetries for the examples we were able to compute. Following the results, we conjecture that these generators characterize the group of symmetries of the generic $(m+1)$-dimensional periodic Costas arrays over elementary abelian $(\mathbb{Z}_p)^m$ groups.
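    For orientation, the defining property can be illustrated in the classical two-dimensional setting. The sketch below is not the generalized $(m+1)$-dimensional definition used in the work above: it checks the ordinary Costas property (all displacement vectors between pairs of dots are distinct) on a permutation, and builds a test case with the well-known Welch construction. The function names `is_costas` and `welch` are hypothetical, and the choice of $p = 7$ with primitive root $3$ is only an illustration.

```python
from itertools import combinations


def is_costas(perm):
    """Check the classical 2D Costas property for a permutation.

    perm[i] is the row of the dot in column i. The array is Costas if
    every displacement vector between two dots occurs at most once.
    """
    seen = set()
    for (i, pi), (j, pj) in combinations(enumerate(perm), 2):
        vec = (j - i, pj - pi)  # j > i, so each unordered pair is counted once
        if vec in seen:
            return False
        seen.add(vec)
    return True


def welch(p, g):
    """Welch construction: f(i) = g^i mod p for i = 0 .. p-2.

    Assumes p is prime and g is a primitive root modulo p; neither
    assumption is verified here.
    """
    return [pow(g, i, p) for i in range(p - 1)]


example = welch(7, 3)
print(example, is_costas(example))
print(is_costas([0, 1, 2, 3]))  # the identity permutation repeats the vector (1, 1)
```

    An exhaustive search such as the one mentioned in the abstract would apply a check of this kind (in its multidimensional periodic form) to every candidate image of an array under a proposed symmetry.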

    Parallelization of dynamic programming recurrences in computational biology

    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are 15x to 130x faster than a modern dual-core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 Intel Core 2 Duo processors running at 3 GHz. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. 
Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms.
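    To make the recurrence concrete, here is a minimal software sketch of the Nussinov base-pair maximization recurrence. It is the textbook formulation, not the hardware design described above; the function name `nussinov`, the `min_loop` parameter, and the inclusion of G-U wobble pairs are assumptions made for illustration. The table is filled by increasing span, so all cells of the same span are independent of one another, which is the kind of parallelism an array architecture can exploit.

```python
def nussinov(seq, min_loop=0):
    """Maximum number of nested base pairs in an RNA sequence.

    Textbook Nussinov recurrence. min_loop is an assumed parameter:
    a pair (i, j) is only allowed when j - i > min_loop.
    """
    # Watson-Crick pairs plus G-U wobble pairs (an assumption of this sketch).
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    # Fill by increasing span; cells with the same span do not depend on each other.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])  # leave i or j unpaired
            if (seq[i], seq[j]) in pairs and j - i > min_loop:
                inner = dp[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)  # pair i with j
            for k in range(i + 1, j):  # split into two independent subproblems
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]


print(nussinov("GGGAAACCC"))
```

    The innermost loop over `k` makes the overall cost cubic in the sequence length, which is why this kernel is a natural target for acceleration.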

    Hardware Acceleration of Electronic Design Automation Algorithms

    With the advances in very large scale integration (VLSI) technology, hardware is going parallel. Software, which was traditionally designed to execute on single core microprocessors, now faces the tough challenge of taking advantage of this parallelism, made available by the scaling of hardware. The work presented in this dissertation studies the acceleration of electronic design automation (EDA) software on several hardware platforms such as custom integrated circuits (ICs), field programmable gate arrays (FPGAs) and graphics processors. This dissertation concentrates on a subset of EDA algorithms which are heavily used in the VLSI design flow, and also have varying degrees of inherent parallelism in them. In particular, Boolean satisfiability, Monte Carlo based statistical static timing analysis, circuit simulation, fault simulation and fault table generation are explored. The architectural and performance tradeoffs of implementing the above applications on these alternative platforms (in comparison to their implementation on a single core microprocessor) are studied. In addition, this dissertation also presents an automated approach to accelerate uniprocessor code using a graphics processing unit (GPU). The key idea is to partition the software application into kernels in an automated fashion, such that multiple instances of these kernels, when executed in parallel on the GPU, can maximally benefit from the GPU's hardware resources. The work presented in this dissertation demonstrates that several EDA algorithms can be successfully rearchitected to maximally harness their performance on alternative platforms such as custom designed ICs, FPGAs and graphics processors, and obtain speedups up to 800X. The approaches in this dissertation collectively aim to contribute towards enabling the computer aided design (CAD) community to accelerate EDA algorithms on arbitrary hardware platforms.

    Towards Closing the Programmability-Efficiency Gap using Software-Defined Hardware

    The past decade has seen the breakdown of two important trends in the computing industry: Moore’s law, an observation that the number of transistors in a chip roughly doubles every eighteen months, and Dennard scaling, that enabled the use of these transistors within a constant power budget. This has caused a surge in domain-specific accelerators, i.e., specialized hardware that delivers significantly better energy efficiency than general-purpose processors, such as CPUs. While the performance and efficiency of such accelerators are highly desirable, the fast pace of algorithmic innovation and non-recurring engineering costs have deterred their widespread use, since they are only programmable across a narrow set of applications. This has engendered a programmability-efficiency gap across contemporary platforms. A practical solution that can close this gap is thus lucrative and is likely to have broad impact in both academic research and industry. This dissertation proposes such a solution with a reconfigurable Software-Defined Hardware (SDH) system that morphs parts of the hardware on-the-fly to tailor to the requirements of each application phase. This system is designed to deliver near-accelerator-level efficiency across a broad set of applications, while retaining CPU-like programmability. The dissertation first presents a fixed-function solution to accelerate sparse matrix multiplication, which forms the basis of many applications in graph analytics and scientific computing. The solution consists of a tiled hardware architecture, co-designed with the outer product algorithm for Sparse Matrix-Matrix multiplication (SpMM), that uses on-chip memory reconfiguration to accelerate each phase of the algorithm. 
A proof-of-concept is then presented in the form of a prototyped 40 nm Complementary Metal-Oxide Semiconductor (CMOS) chip that demonstrates energy efficiency and performance per die area improvements of 12.6x and 17.1x over a high-end CPU, and serves as a stepping stone towards a full SDH system. The next piece of the dissertation enhances the proposed hardware with reconfigurability of the dataflow and resource sharing modes, in order to extend acceleration support to a set of common parallelizable workloads. This reconfigurability lends the system the ability to cater to discrete data access and compute patterns, such as workloads with extensive data sharing and reuse, workloads with limited reuse and streaming access patterns, among others. Moreover, this system incorporates commercial cores and a prototyped software stack for CPU-level programmability. The proposed system is evaluated on a diverse set of compute-bound and memory-bound kernels that compose applications in the domains of graph analytics, machine learning, image and language processing. The evaluation shows average performance and energy-efficiency gains of 5.0x and 18.4x over the CPU. The final part of the dissertation proposes a runtime control framework that uses low-cost monitoring of hardware performance counters to predict the next best configuration and reconfigure the hardware, upon detecting a change in phase or nature of data within the application. In comparison to prior work, this contribution targets multicore CGRAs, uses low-overhead decision tree based predictive models, and incorporates reconfiguration cost-awareness into its policies. Compared to the best-average static (non-reconfiguring) configuration, the dynamically reconfigurable system achieves a 1.6x improvement in performance-per-Watt in the Energy-Efficient mode of operation, or the same performance with 23% lower energy in the Power-Performance mode, for SpMM across a suite of real-world inputs. 
The proposed reconfiguration mechanism itself outperforms the state-of-the-art approach for dynamic runtime control by up to 2.9x in terms of energy efficiency.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/169859/1/subh_1.pd
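    As a software illustration of the outer product formulation of SpMM mentioned in this abstract, the sketch below computes $C = AB$ as a sum of outer products of column $k$ of $A$ with row $k$ of $B$. The split into a multiply step and a merge step, the function name `spmm_outer`, and the dictionary-based storage are assumptions of this sketch; the abstract does not specify the phases or data layout of the actual hardware.

```python
from collections import defaultdict


def spmm_outer(a_cols, b_rows):
    """Sparse C = A @ B via outer products (illustrative sketch).

    a_cols[k] lists the nonzeros of column k of A as (row, value) pairs.
    b_rows[k] lists the nonzeros of row k of B as (col, value) pairs.
    Returns a dict mapping (row, col) to the accumulated value of C.
    """
    # Multiply step: each shared index k contributes one outer product,
    # producing partial products with no dependence between different k.
    partial = []
    for k in a_cols.keys() & b_rows.keys():
        for i, a in a_cols[k]:
            for j, b in b_rows[k]:
                partial.append((i, j, a * b))

    # Merge step: accumulate partial products that land on the same (i, j).
    c = defaultdict(float)
    for i, j, v in partial:
        c[(i, j)] += v
    return dict(c)


# A = [[1, 0], [2, 3]] stored by column, B = [[0, 4], [5, 0]] stored by row.
a_cols = {0: [(0, 1.0), (1, 2.0)], 1: [(1, 3.0)]}
b_rows = {0: [(1, 4.0)], 1: [(0, 5.0)]}
print(spmm_outer(a_cols, b_rows))
```

    Only nonzero entries are ever multiplied in this formulation, at the cost of a separate accumulation pass over the partial products.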

    Kiel Declarative Programming Days 2013

    This report contains the papers presented at the Kiel Declarative Programming Days 2013, held in Kiel (Germany) during September 11-13, 2013. The Kiel Declarative Programming Days 2013 unified the following events:
    * 20th International Conference on Applications of Declarative Programming and Knowledge Management (INAP 2013)
    * 22nd International Workshop on Functional and (Constraint) Logic Programming (WFLP 2013)
    * 27th Workshop on Logic Programming (WLP 2013)
    All these events are centered around declarative programming, an advanced paradigm for the modeling and solving of complex problems. These specification and implementation methods have attracted increasing attention over the last decades, e.g., in the domains of databases and natural language processing, for modeling and processing combinatorial problems, and for high-level programming of complex systems, in particular knowledge-based systems.

    Efficient local search for Pseudo Boolean Optimization

    Algorithms and the Foundations of Software technology

    Simulated Annealing

    The book contains 15 chapters presenting recent contributions of top researchers working with Simulated Annealing (SA). Although it represents a small sample of the research activity on SA, the book will certainly serve as a valuable tool for researchers interested in getting involved in this multidisciplinary field. In fact, one of its salient features is that it is highly multidisciplinary in terms of application areas, since it assembles experts from the fields of Biology, Telecommunications, Geology, Electronics and Medicine.

    Efficient Methods for Finding Optimal Convolutional Self-Doubly Orthogonal Codes

    In recent years, the rise of ultrabooks and mobile devices has been accompanied by an ever increasing need for reliable high-bandwidth wireless communications. To mitigate or eliminate the errors that are invariably introduced due to noise and interference in the communication channels, it is important to develop efficient error-correcting coding schemes. Indeed, these codes may be used to preserve the error performance while allowing the data-rate of digital communications to be increased and the transmission power at lower signal-to-noise ratios to be reduced, thereby improving the overall power efficiency of these devices. In this manuscript-based thesis, we present an efficient search algorithm for finding optimal/short-span Convolutional Self-Doubly Orthogonal (CDO) codes and Simplified Convolutional Self-Doubly Orthogonal (S-CDO) codes. 
These error-correcting codes are employed in an iterative error-control coding scheme that differs from the classical Turbo code procedure, as it does not require any interleaver, either at the encoding or at the decoding stages. However, its iterative threshold decoding procedure requires that these systematic convolutional codes satisfy some “double orthogonality properties”, beyond those of the well-known orthogonal codes used in the usual non-iterative threshold decoding. In order to build high-performance, low-latency codecs with these codes, it is important to minimize the constraint length, also called “span”, for a given number J of generator connections. Although finding CDO/S-CDO codes is not difficult, determining the optimal/short-span codes for a given order J is computationally very challenging. The direct construction of optimal or shortest-span CDO and S-CDO codes has so far eluded analysis, and the search for these codes is believed to be an NP-complete problem. The thesis presents a total of three articles: two articles that were published in IEEE Transactions on Communications, and one article that was submitted for publication to IEEE Transactions on Parallel and Distributed Systems. In these articles, we describe a novel efficient and parallel implicitly-exhaustive search algorithm for finding rate R=1/2 systematic optimal/short-span CDO and S-CDO codes. The high-performance search algorithm is still exhaustive in nature, yet it provides a speedup larger than 16300 (CDO, J=7) and 6300 (S-CDO, J=8) over the reference implicitly-exhaustive search algorithm, and larger than 2000 (CDO, J=17) over the fastest known CDO validation function used in high-performance pseudo-random search algorithms.