
    A similarity criterion for sequential programs using truth-preserving partial functions

    The execution of sequential programs allows them to be represented as mathematical functions formed by composing statements that follow one after another. Each statement is itself a partial function, which allows only inputs satisfying a particular Boolean condition to carry the execution forward; hence the composition of such functions (the result of executing the statements in sequence) strengthens the set of valid input state variables for which the program completes its execution and halts successfully. With this in mind, this paper studies a particular class of partial functions, which preserve the truth of two given Boolean conditions whenever state variables satisfying one are mapped through such functions into a domain of state variables satisfying the other. The existence of such maps allows us to study isomorphism between different programs, based not only on their structural characteristics (e.g. the kind of programming constructs used and the overall input-output transformation), but also on the nature of the computation performed on seemingly different inputs. Consequently, we can relate programs that perform a given type of computation, like a loop counting down indefinitely, without regard to the input sets they work on individually or the set of statements each program contains. Comment: Submitted as term paper in 201
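    To make the guarded-composition idea concrete, here is a minimal sketch (my own illustration, not code from the paper) in which each statement is a partial function defined only on states satisfying its Boolean precondition, and sequential composition is defined only where every step is defined:

```python
# A minimal sketch, not from the paper: statements as partial functions
# guarded by Boolean preconditions, composed sequentially.

def guarded(pre, f):
    """Return a partial function defined only on states satisfying `pre`."""
    def step(state):
        if not pre(state):
            return None  # undefined: execution cannot proceed from this state
        return f(state)
    return step

def compose(*steps):
    """Sequential composition; defined only where every step is defined."""
    def run(state):
        for step in steps:
            state = step(state)
            if state is None:
                return None
        return state
    return run

# Example: two statements whose composition strengthens the valid input set.
s1 = guarded(lambda s: s["x"] > 0,      lambda s: {**s, "x": s["x"] - 1})
s2 = guarded(lambda s: s["x"] % 2 == 0, lambda s: {**s, "y": s["x"] // 2})
prog = compose(s1, s2)

print(prog({"x": 3}))   # {'x': 2, 'y': 1} -- both guards hold along the way
print(prog({"x": 0}))   # None -- rejected by the first guard
```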

    ISA Mapper: A Compute and Hardware Agnostic Deep Learning Compiler

    Domain specific accelerators present new challenges and opportunities for code generation onto novel instruction sets, communication fabrics, and memory architectures. In this paper we introduce an intermediate representation (IR) which enables both deep learning computational kernels and hardware capabilities to be described in the same IR. We then formulate and apply instruction mapping to determine the possible ways a computation can be performed on a hardware system. Next, our scheduler chooses a specific mapping and determines the data movement and computation order. In order to manage the large search space of mappings and schedules, we developed a flexible framework that allows heuristics, cost models, and potentially machine learning to facilitate this search problem. With this system, we demonstrate the automated extraction of matrix multiplication kernels out of recent deep learning kernels such as depthwise-separable convolution. In addition, we demonstrate two to five times better performance on DeepBench-sized GEMMs and GRU RNN execution when compared to state-of-the-art (SOTA) implementations on new hardware, and up to 85% of the performance of SOTA implementations on existing hardware.
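    A toy sketch of the mapping-plus-cost-model idea described above (the instruction names, costs, and search strategy are hypothetical illustrations, not the paper's IR or scheduler): enumerate the ways each kernel operation can be realized by an available hardware instruction, then let a simple cost model pick the cheapest combination.

```python
# Illustrative sketch only: map kernel ops onto candidate hardware
# instructions and choose a mapping with a toy cost model.

from itertools import product

# Each kernel op can be realized by one of several "instructions" with a cost.
HW_INSTRUCTIONS = {
    "matmul":   [("gemm_tile_8x8", 1.0), ("gemm_tile_16x16", 0.6)],
    "add_bias": [("vector_add", 0.1)],
    "relu":     [("vector_max0", 0.1), ("scalar_loop_max0", 0.5)],
}

def enumerate_mappings(kernel_ops):
    """All ways to realize each op with an available hardware instruction."""
    choices = [HW_INSTRUCTIONS[op] for op in kernel_ops]
    for combo in product(*choices):
        yield list(combo)

def schedule_cost(mapping):
    """Toy cost model: sum of per-instruction costs."""
    return sum(cost for _, cost in mapping)

kernel = ["matmul", "add_bias", "relu"]   # e.g. one layer of a network
best = min(enumerate_mappings(kernel), key=schedule_cost)
print([name for name, _ in best], schedule_cost(best))
```

    In the real system the search space is far larger, which is why the abstract describes pluggable heuristics and learned cost models rather than exhaustive enumeration.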

    Vsep-New Heuristic and Exact Algorithms for Graph Automorphism Group Computation

    One exact and two heuristic algorithms for determining the generators, orbits and order of the graph automorphism group are presented. A basic tool of these algorithms is the well-known individualization and refinement procedure. A search tree is used in the algorithms - each node of the tree is a partition. All nonequivalent discrete partitions derived from the selected vertices are stored in a coded form. A new strategy is used in the exact algorithm: if during its execution some of the searched or intermediate variables obtain a wrong value, the algorithm continues from a new starting point, losing some of the results determined so far. The algorithms have been tested on one of the known sets of benchmark graphs and show lower running times for some graph families. The heuristic versions of the algorithms are based on determining some number of discrete partitions derived from each vertex in the selected cell of the initial partition and comparing them for an automorphism - their search trees are reduced. The heuristic algorithms are almost exact and are many times faster than the exact one. The experimental tests show that the worst-case running time of the exact algorithm is exponential, but it is polynomial for the heuristic algorithms. Several cell selectors are used, some of them new. We also use a cell-selector chooser for choosing the optimal cell selector for the graph at hand. The proposed heuristic algorithms use two main heuristic procedures that generate two different forests of search trees. Comment: 47 pages; 1. Entirely revised 2. Algorithms analysis removed 3. New algorithm versions added, one version removed 4. Changed algorithm COMP - cases CS2/CS4 are solved in a new way
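    The refinement half of the individualization-and-refinement procedure can be illustrated with plain colour refinement, which repeatedly splits cells until the partition is equitable. The sketch below is my own simplified illustration of that refinement step, not the algorithms from the paper:

```python
# A minimal sketch of 1-dimensional colour refinement, the kind of
# refinement step that individualization-and-refinement builds on.

def colour_refinement(adj):
    """adj: dict vertex -> set of neighbours. Returns a stable colouring in
    which equally coloured vertices have equal multisets of neighbour
    colours (an equitable partition)."""
    colour = {v: 0 for v in adj}                      # initial unit partition
    while True:
        # signature = (own colour, sorted multiset of neighbour colours)
        sig = {v: (colour[v], tuple(sorted(colour[u] for u in adj[v])))
               for v in adj}
        classes = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new_colour = {v: classes[sig[v]] for v in adj}
        if new_colour == colour:                      # partition is stable
            return colour
        colour = new_colour

# 6-cycle: every vertex ends up in the same cell (the graph is vertex-transitive),
# so individualization of a vertex would be needed to make progress.
cycle6 = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
print(colour_refinement(cycle6))
```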

    DNNVM: End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators

    The convolutional neural network (CNN) has become a state-of-the-art method for several artificial intelligence domains in recent years. The increasingly complex CNN models are both computation-bound and I/O-bound. FPGA-based accelerators driven by a custom instruction set architecture (ISA) achieve a balance between generality and efficiency, but much is left to be optimized on them. We propose the full-stack compiler DNNVM, which integrates optimizers for graphs, loops and data layouts with an assembler, a runtime supporter and a validation environment. DNNVM works in the context of deep learning frameworks and transforms CNN models into a directed acyclic graph, XGraph. Based on XGraph, we transform the optimization challenges for both data layout and pipelining into graph-level problems. DNNVM enumerates all potentially profitable fusion opportunities with a heuristic subgraph-isomorphism algorithm to leverage pipeline and data-layout optimizations, and searches for the best choice of execution strategies for the whole computation graph. On the Xilinx ZU2 @330 MHz and ZU9 @330 MHz, we achieve performance equivalent to the state of the art on our benchmarks with naïve implementations without optimizations, and the throughput is further improved by up to 1.26x by leveraging heterogeneous optimizations in DNNVM. Finally, with the ZU9 @330 MHz, we achieve state-of-the-art performance for VGG and ResNet50: a throughput of 2.82 TOPs/s and an energy efficiency of 123.7 GOPs/s/W for VGG, and 1.38 TOPs/s for ResNet50. Additionally, we achieve 1.41 TOPs/s for GoogleNet. Comment: 18 pages, 9 figures, 5 tables
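    To illustrate the fusion-enumeration idea on an operator DAG (this is my own toy example; the graph, patterns, and cost model are hypothetical and much simpler than DNNVM's XGraph machinery):

```python
# Illustrative sketch: enumerate simple fusion opportunities in an operator
# DAG and rank them with a toy profitability model.

# DAG as op -> list of consumer ops
graph = {
    "conv1": ["relu1"],
    "relu1": ["conv2"],
    "conv2": ["relu2"],
    "relu2": [],
}
op_kind = {"conv1": "conv", "relu1": "relu", "conv2": "conv", "relu2": "relu"}

FUSABLE = {("conv", "relu")}        # a fusion pattern the "hardware" supports

def fusion_candidates(graph):
    """Producer/consumer pairs matching a supported fusion pattern."""
    for op, consumers in graph.items():
        for c in consumers:
            if (op_kind[op], op_kind[c]) in FUSABLE:
                yield (op, c)

def profit(pair):
    """Toy model: fusing saves one intermediate tensor write plus one read."""
    return 2.0

plan = sorted(fusion_candidates(graph), key=profit, reverse=True)
print("fusion plan:", plan)         # [('conv1', 'relu1'), ('conv2', 'relu2')]
```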

    Automatic Library Version Identification, an Exploration of Techniques

    This paper is the result of a two-month research internship on the topic of library version identification. Ideas and techniques from the literature on binary comparison and fingerprinting are outlined and applied to the problem of (version) identification of shared libraries and of libraries within statically linked binary executables. Six comparison techniques are chosen and implemented in an open-source tool which in turn uses the open-source radare2 framework for signature generation. The effectiveness of the techniques is empirically analyzed by comparing both artificial and real sample files against a reference dataset of multiple versions of dozens of libraries. The results show that, of these techniques, readable-string-based techniques perform best, and that one of them correctly identifies multiple libraries contained in a stripped, statically linked executable file. Comment: 9 pages, short technical report
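    A rough sketch of the best-performing idea reported above, readable-string matching (the extraction and similarity measure here are simplified illustrations, not the tool's actual implementation):

```python
# Identify a library version by comparing its printable strings against
# reference string sets, using Jaccard similarity.

import re

def readable_strings(blob: bytes, min_len: int = 6) -> set:
    """Printable ASCII runs, as the `strings` utility would report them."""
    return {m.group().decode() for m in
            re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, blob)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def identify(binary: bytes, reference: dict) -> str:
    """reference: {"libname-version": set_of_strings, ...}; best match wins."""
    strings = readable_strings(binary)
    return max(reference, key=lambda name: jaccard(strings, reference[name]))

# Tiny worked example with synthetic "binaries".
ref = {
    "zlib-1.2.11": {"inflate 1.2.11 Copyright", "incorrect header check"},
    "zlib-1.2.8":  {"inflate 1.2.8 Copyright", "incorrect header check"},
}
sample = b"\x00\x01inflate 1.2.11 Copyright\x00incorrect header check\x00"
print(identify(sample, ref))   # zlib-1.2.11
```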

    Design space exploration tools for the ByoRISC configurable processor family

    In this paper, the ByoRISC (Build your own RISC) configurable application-specific instruction-set processor (ASIP) family is presented. ByoRISCs, as vendor-independent cores, provide extensive architectural parameters over a baseline processor, which can be customized by application-specific hardware extensions (ASHEs). Such extensions realize multi-input multi-output (MIMO) custom instructions with local state and load/store access to the data memory. ByoRISCs incorporate a true multi-port register file, zero-overhead custom instruction decoding, and scalable data forwarding mechanisms. Given these design decisions, ByoRISCs provide a unique combination of features that allow their use as architectural testbeds and the seamless and rapid development of new high-performance ASIPs. The performance characteristics of ByoRISCs have been evaluated for both ASIC and FPGA implementations, and they are shown to provide a viable solution in FPGA-based system-on-a-chip design. A case study of an image processing pipeline is also presented to highlight the process of utilizing a ByoRISC custom processor. A peak performance speedup of up to 8.5× can be observed, whereas an average performance speedup of 4.4× on Xilinx Virtex-4 targets is achieved. In addition, ByoRISC outperforms an experimental VLIW architecture named VEX even in its 16-wide configuration for a number of data-intensive application kernels. Comment: 12 pages, 14 figures, 7 tables. Unpublished paper on ByoRISC, an extensible RISC with MIMO CIs that can outperform most mid-range VLIWs. Unfortunately Prof. Jorg Henkel destroyed the potential of this submission by using immoral tactics (neglecting his conflict of interest, changing reviewers accepting the paper, and requesting impossible additions for the average lifetime of an Earthling).
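    As a back-of-the-envelope illustration of where such speedups come from (my own Amdahl-style estimate, not numbers or a method from the paper): a MIMO custom instruction collapses the covered portion of the baseline schedule into a few cycles, and the speedup is bounded by how much of the kernel it covers.

```python
# Amdahl-style estimate of custom-instruction speedup (illustrative only).

def asip_speedup(total_cycles, covered_cycles, ci_cycles):
    """covered_cycles of the baseline are replaced by ci_cycles spent in the
    custom instruction; the rest of the kernel is unchanged."""
    new_total = (total_cycles - covered_cycles) + ci_cycles
    return total_cycles / new_total

# e.g. a custom instruction that replaces 80% of a 1000-cycle kernel and
# performs the same work in 50 cycles:
print(round(asip_speedup(1000, 800, 50), 2))   # 4.0
```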

    Deep Learning Based Cryptographic Primitive Classification

    Cryptovirological augmentations present an immediate, incomparable threat. Over the last decade, the substantial proliferation of crypto-ransomware has had widespread consequences for consumers and organisations alike. Established preventive measures perform well; however, the problem has not ceased. Reverse engineering potentially malicious software is a cumbersome task due to platform eccentricities and obfuscated transmutation mechanisms, hence requiring smarter, more efficient detection strategies. This manuscript presents a novel approach for the classification of cryptographic primitives in compiled binary executables using deep learning. The model, a deep convolutional neural network (DCNN), is configured to learn from variable-length control-flow diagnostics output by a dynamic trace. To rival the size and variability of contemporary data collections, and hence feed the model's learning, a methodology for the procedural generation of synthetic cryptographic binaries is defined, utilising core primitives from OpenSSL with multivariate obfuscation to draw from a vastly scalable distribution. The library, CryptoKnight, rendered an algorithmic pool of AES, RC4, Blowfish, MD5 and RSA to synthesise combinable variants, which are automatically fed into its core model. Converging at 91% accuracy, CryptoKnight is able to classify the sample algorithms with minimal loss. Comment: 9 pages, 6 figures
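    A small sketch in the spirit of the model described above: a 1-D convolutional classifier over variable-length trace features. The layer sizes, feature count, and the use of PyTorch are my assumptions for illustration, not the paper's architecture.

```python
# Illustrative 1-D CNN over variable-length traces; adaptive pooling lets
# traces of different lengths share one classification head.

import torch
import torch.nn as nn

class TraceClassifier(nn.Module):
    def __init__(self, n_features: int, n_classes: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),          # handles variable-length traces
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):                     # x: (batch, n_features, length)
        return self.head(self.conv(x).squeeze(-1))

# 5 classes, e.g. AES / RC4 / Blowfish / MD5 / RSA as in the abstract.
model = TraceClassifier(n_features=8, n_classes=5)
logits = model(torch.randn(2, 8, 300))        # two traces of length 300
print(logits.shape)                           # torch.Size([2, 5])
```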

    The Power of Distributed Verifiers in Interactive Proofs

    We explore the power of interactive proofs with a distributed verifier. In this setting, the verifier consists of n nodes and a graph G that defines their communication pattern. The prover is a single entity that communicates with all nodes by short messages. The goal is to verify that the graph G belongs to some language in a small number of rounds and with a small communication bound, i.e., proof size. This interactive model was introduced by Kol, Oshman and Saxena (PODC 2018) as a generalization of non-interactive distributed proofs. They demonstrated the power of interaction in this setting by constructing protocols for problems such as Graph Symmetry and Graph Non-Isomorphism -- both of which require proofs of Ω(n^2) bits without interaction. In this work, we provide a new general framework for distributed interactive proofs that allows one to translate standard interactive protocols into ones where the verifier is distributed with short proof size. We show the following: * Every (centralized) computation that can be performed in time O(n) can be translated into a three-round distributed interactive protocol with O(log n) proof size. This implies that many graph problems for sparse graphs have succinct proofs. * Every (centralized) computation implemented either in small space or by a uniform NC circuit can be translated into a distributed protocol with O(1) rounds and O(log n) bits of proof size for the low-space case, and polylog(n) rounds and proof size for NC. * For Graph Non-Isomorphism, there is a 4-round protocol with O(log n) proof size, improving upon the O(n log n) proof size of Kol et al. * For many problems we show how to reduce the proof size below the seemingly natural barrier of log n: we obtain 5-round protocols with proof size O(log log n) for a family of problems.
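    For a sense of scale (my own back-of-the-envelope illustration, ignoring asymptotic constants), the proof sizes quoted above compare as follows for a graph on n = 2^20 nodes:

```latex
% Back-of-the-envelope comparison (constants ignored) for $n = 2^{20} \approx 10^6$:
\[
  \Omega(n^2) \approx 2^{40}\ \text{bits (non-interactive)}, \qquad
  O(\log n) \approx 20\ \text{bits (3 rounds)}, \qquad
  O(\log\log n) \approx 5\ \text{bits (5 rounds)}.
\]
```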

    IFC Inside: Retrofitting Languages with Dynamic Information Flow Control (Extended Version)

    Many important security problems in JavaScript, such as browser extension security, untrusted JavaScript libraries and safe integration of mutually distrustful websites (mash-ups), may be effectively addressed using an efficient implementation of information flow control (IFC). Unfortunately, existing fine-grained approaches to JavaScript IFC require modifications to the language semantics and its engine, a non-goal for browser applications. In this work, we take the ideas of coarse-grained dynamic IFC and provide the theoretical foundation for a language-based approach that can be applied to any programming language for which external effects can be controlled. We then apply this formalism to server- and client-side JavaScript, show how it generalizes to the C programming language, and connect it to the Haskell LIO system. Our methodology offers design principles for the construction of information flow control systems where isolation can easily be achieved, as well as compositional proofs for optimized concrete implementations of these systems, obtained by relating them to their isolated variants. Comment: Extended version of POST'15 paper; 31 pages
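    A toy sketch of coarse-grained dynamic IFC in the LIO style referenced above: a floating "current label" is raised when labelled data is read and checked when output is produced. The two-point lattice and the API names are simplified illustrations, not the paper's calculus.

```python
# Coarse-grained dynamic IFC monitor: reads raise the current label,
# writes are checked against the output channel's label.

LATTICE = {"public": 0, "secret": 1}

class IFC:
    def __init__(self):
        self.current = "public"            # floating label of the computation

    def read(self, value, label):
        """Reading raises the current label to its join with the data label."""
        if LATTICE[label] > LATTICE[self.current]:
            self.current = label
        return value

    def write(self, value, channel_label):
        """Writing is allowed only if the current label flows to the channel."""
        if LATTICE[self.current] > LATTICE[channel_label]:
            raise PermissionError("information flow violation")
        print(f"[{channel_label}] {value}")

ctx = IFC()
x = ctx.read(42, "secret")                 # current label is now "secret"
ctx.write("hello", "secret")               # fine: secret -> secret
try:
    ctx.write(x, "public")                 # secret data to a public channel
except PermissionError as e:
    print("blocked:", e)
```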

    Implementing a distributed λ-calculus interpreter

    This paper describes how one can implement a distributed λ-calculus interpreter from scratch. First, we describe how to implement a monadic parser; then the Krivine machine is introduced for the interpretation part, and the actor model is used for distribution. In this work we do not provide a general solution for parallelism, but consider particular patterns which can always be parallelized. As a result, a basic, extensible implementation of a call-by-name distributed machine is introduced and a prototype is presented. We achieved an improvement in computation speed in some cases, but an efficient distributed version was not achieved; the problems are discussed in the evaluation section. This work provides a foundation for further research: completing the implementation, it is possible to add concurrency for non-determinism, improve the interpreter using call-by-need semantics, or study optimal auto-parallelization to generalize what can be done efficiently in parallel. Comment: 8 pages, 4 tables, 1 figure, proceedings of the AINA-2018 workshop
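    A minimal call-by-name Krivine machine sketch (my own illustration; the paper's interpreter also includes the monadic parser and the actor-based distribution layer, which are omitted here). Terms use de Bruijn indices, and evaluation stops at weak head normal form:

```python
# Krivine machine: a term, an environment of closures, and an argument stack.
# Terms: ("var", n) | ("lam", body) | ("app", fun, arg)

def krivine(term):
    env, stack = [], []                       # closures are (term, env) pairs
    while True:
        kind = term[0]
        if kind == "app":                     # push the argument as a closure
            stack.append((term[2], env))
            term = term[1]
        elif kind == "lam":
            if not stack:                     # weak head normal form reached
                return term, env
            env = [stack.pop()] + env         # bind the top of the stack
            term = term[1]
        else:                                 # ("var", n): enter the closure
            term, env = env[term[1]]

# (\x. x) (\y. y)  reduces to  \y. y
ident = ("lam", ("var", 0))
print(krivine(("app", ident, ident)))         # (('lam', ('var', 0)), [])
```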