3,693 research outputs found
A similarity criterion for sequential programs using truth-preserving partial functions
The execution of sequential programs allows them to be represented using
mathematical functions formed by the composition of statements following one
after the other. Each such statement is in itself a partial function, which
allows only inputs satisfying a particular Boolean condition to carry forward
the execution and hence, the composition of such functions (as a result of
sequential execution of the statements) strengthens the valid set of input
state variables for the program to complete its execution and halt successfully.
With this thought in mind, this paper tries to study a particular class of
partial functions, which tend to preserve the truth of two given Boolean
conditions whenever the state variables satisfying one are mapped through such
functions into a domain of state variables satisfying the other. The existence
of such maps allows us to study isomorphism between different programs, based
not only on their structural characteristics (e.g. the kind of programming
constructs used and the overall input-output transformation), but also the
nature of computation performed on seemingly different inputs. Consequently, we
can now relate programs which perform a given type of computation, like a loop
counting down indefinitely, without caring about the input sets they work on
individually or the set of statements each program contains.
Comment: Submitted as term paper in 201
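The guarded-composition idea can be sketched in Python (all names below are illustrative, not from the paper): each statement is a partial function defined only where its guard holds, sequential composition strengthens the valid input set, and a map preserves the truth of conditions P and Q when every input satisfying P lands on an output satisfying Q.

```python
# A statement as a partial function: defined only where its guard holds.
def guarded(pre, f):
    def g(x):
        if not pre(x):
            raise ValueError("input outside the function's domain")
        return f(x)
    return g

def compose(f, g):
    # Sequential execution: run f, then g; defined only where both are.
    return lambda x: g(f(x))

# Two Boolean conditions on the state.
P = lambda x: x > 0          # condition before the map
Q = lambda x: x >= 0         # condition after the map

# A truth-preserving map: whenever P(x) holds, Q of the image holds.
half = guarded(P, lambda x: x // 2)

step1 = guarded(lambda x: x > 10, lambda x: x - 10)
step2 = guarded(lambda x: x % 2 == 0, lambda x: x // 2)
prog = compose(step1, step2)

# Composition strengthens the valid inputs: the program halts
# successfully only if x > 10 AND (x - 10) is even.
assert prog(14) == 2
assert all(Q(half(x)) for x in range(1, 100))  # truth preservation on a sample
```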
ISA Mapper: A Compute and Hardware Agnostic Deep Learning Compiler
Domain specific accelerators present new challenges and opportunities for
code generation onto novel instruction sets, communication fabrics, and memory
architectures.
In this paper we introduce an intermediate representation (IR) which enables
both deep learning computational kernels and hardware capabilities to be
described in the same IR. We then formulate and apply instruction mapping to
determine the possible ways a computation can be performed on a hardware
system. Next, our scheduler chooses a specific mapping and determines the data
movement and computation order. In order to manage the large search space of
mappings and schedules, we developed a flexible framework that allows
heuristics, cost models, and potentially machine learning to facilitate this
search problem.
With this system, we demonstrate the automated extraction of matrix
multiplication kernels out of recent deep learning kernels such as
depthwise-separable convolution. In addition, we demonstrate two to five times
better performance on DeepBench sized GEMMs and GRU RNN execution when compared
to state-of-the-art (SOTA) implementations on new hardware and up to 85% of the
performance for SOTA implementations on existing hardware.
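A toy illustration of the instruction-mapping step (the ops, ISA patterns, and cost model below are invented; the paper's IR and search framework are far richer): enumerate every way a kernel's op sequence can be tiled with hardware instruction patterns, then let a trivial cost model choose among them.

```python
# Hypothetical sketch: a kernel is a list of ops; each hardware
# "instruction" covers a pattern of ops at some cost. Instruction
# mapping enumerates coverings; the scheduler picks the cheapest.
kernel = ["load", "mul", "add", "store"]          # toy computation
isa = {                                           # invented ISA
    "ld":  (["load"], 1),
    "mac": (["mul", "add"], 1),                   # fused multiply-add
    "mul": (["mul"], 1),
    "add": (["add"], 1),
    "st":  (["store"], 1),
}

def mappings(ops):
    # All ways to tile the op sequence with ISA patterns.
    if not ops:
        yield []
        return
    for name, (pattern, cost) in isa.items():
        if ops[:len(pattern)] == pattern:
            for rest in mappings(ops[len(pattern):]):
                yield [(name, cost)] + rest

best = min(mappings(kernel), key=lambda m: sum(c for _, c in m))
print([name for name, _ in best])   # ['ld', 'mac', 'st']
```

Real cost models would account for data movement and parallelism, which is where the paper's heuristics and learned models come in.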
Vsep-New Heuristic and Exact Algorithms for Graph Automorphism Group Computation
One exact and two heuristic algorithms for determining the generators, orbits
and order of the graph automorphism group are presented. A basic tool of these
algorithms is the well-known individualization and refinement procedure. A
search tree is used in the algorithms - each node of the tree is a partition.
All nonequivalent discrete partitions derived from the selected vertices are
stored in a coded form. A new strategy is used in the exact algorithm: if
during its execution some of the searched or intermediate variables obtain a
wrong value then the algorithm continues from a new start point losing some of
the results determined so far. The algorithms have been tested on one of the
known sets of benchmark graphs and show lower running times for some graph families.
The heuristic versions of the algorithms are based on determining some number
of discrete partitions derived from each vertex in the selected cell of the
initial partition and comparing them for an automorphism - their search trees
are reduced. The heuristic algorithms are almost exact and are many times
faster than the exact one. The experimental tests show that the worst-case
running time of the exact algorithm is exponential, but it is polynomial for the
heuristic algorithms. Several cell selectors are used. Some of them are new. We
also use a chooser of cell selector for choosing the optimal cell selector for
the manipulated graph. The proposed heuristic algorithms use two main heuristic
procedures that generate two different forests of search trees.
Comment: 47 pages; 1. Entirely revised 2. Algorithms analysis removed 3. New
algorithm versions added, one version removed 4. Changed algorithm COMP -
cases CS2/CS4 are solved in a new wa
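For scale, a brute-force baseline for the same problem (this is the bare definition applied directly, not the paper's individualization-and-refinement procedure): an automorphism is a vertex permutation that maps the edge set onto itself.

```python
# Brute-force automorphism enumeration; only viable for tiny graphs,
# which is exactly why refinement-based algorithms like Vsep exist.
from itertools import permutations

def automorphisms(n, edges):
    E = {frozenset(e) for e in edges}
    autos = []
    for p in permutations(range(n)):
        if all(frozenset((p[u], p[v])) in E for u, v in edges):
            autos.append(p)
    return autos

# A 4-cycle 0-1-2-3-0: its automorphism group is dihedral, of order 8.
autos = automorphisms(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
print(len(autos))   # 8
```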
DNNVM : End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators
The convolutional neural network (CNN) has become a state-of-the-art method
for several artificial intelligence domains in recent years. The increasingly
complex CNN models are both computation-bound and I/O-bound. FPGA-based
accelerators driven by custom instruction set architecture (ISA) achieve a
balance between generality and efficiency, but much about them remains to be
optimized. We propose the full-stack compiler DNNVM, which is an integration of
optimizers for graphs, loops and data layouts, and an assembler, a runtime
supporter and a validation environment. The DNNVM works in the context of deep
learning frameworks and transforms CNN models into the directed acyclic graph:
XGraph. Based on XGraph, we transform the optimization challenges for both the
data layout and pipeline into graph-level problems. DNNVM enumerates all
potentially profitable fusion opportunities by a heuristic subgraph isomorphism
algorithm to leverage pipeline and data layout optimizations, and searches for
the best choice of execution strategies of the whole computing graph. On the
Xilinx ZU2 @330 MHz and ZU9 @330 MHz, we achieve performance equivalent to the
state of the art on our benchmarks even with naïve implementations without
optimizations,
and the throughput is further improved up to 1.26x by leveraging heterogeneous
optimizations in DNNVM. Finally, with ZU9 @330 MHz, we achieve state-of-the-art
performance for VGG and ResNet50. We achieve a throughput of 2.82 TOPs/s and an
energy efficiency of 123.7 GOPs/s/W for VGG. Additionally, we achieve 1.38
TOPs/s for ResNet50 and 1.41 TOPs/s for GoogleNet.
Comment: 18 pages, 9 figures, 5 table
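The fusion pass can be caricatured in a few lines (the graph, the conv/relu pattern, and the single-consumer rule below are illustrative assumptions, not DNNVM's actual algorithm): scan a DAG for producer/consumer pairs matching a fusable pattern and merge them into one node.

```python
# Hypothetical sketch of graph-level fusion on a toy XGraph-like DAG.
graph = {                        # node -> (op, list of input nodes)
    "a": ("conv", []),
    "b": ("relu", ["a"]),
    "c": ("conv", ["b"]),
    "d": ("add",  ["b", "c"]),
}

def consumers(graph):
    out = {k: [] for k in graph}
    for node, (_, inputs) in graph.items():
        for i in inputs:
            out[i].append(node)
    return out

def fuse(graph, pattern=("conv", "relu")):
    use = consumers(graph)
    fused = dict(graph)
    for node, (op, inputs) in graph.items():
        # Fuse only when the producer's sole consumer is this node.
        if op == pattern[1] and len(inputs) == 1:
            src = inputs[0]
            if graph[src][0] == pattern[0] and use[src] == [node]:
                fused[node] = ("+".join(pattern), graph[src][1])
                del fused[src]
    return fused

f = fuse(graph)
print(sorted(f))        # ['b', 'c', 'd']  ('a' was folded into 'b')
```

Enumerating all profitable fusions over many patterns is what turns this into the subgraph-isomorphism search the paper describes.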
Automatic Library Version Identification, an Exploration of Techniques
This paper is the result of a two month research internship on the topic of
library version identification. In this paper, ideas and techniques from
literature in the area of binary comparison and fingerprinting are outlined and
applied to the problem of (version) identification of shared libraries and of
libraries within statically linked binary executables. Six comparison
techniques are chosen and implemented in an open-source tool which in turn
makes use of the open-source radare2 framework for signature generation. The
effectiveness of the techniques is empirically analyzed by comparing both
artificial and real sample files against a reference dataset of multiple
versions of dozens of libraries. The results show that out of these techniques,
readable-string-based techniques perform the best and that one of these
techniques correctly identifies multiple libraries contained in a stripped
statically linked executable file.
Comment: 9 pages, short technical repor
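A minimal sketch of the best-performing idea, readable-string fingerprints (the string sets below are given directly and invented; the paper extracts them with radare2): rank the reference versions by Jaccard similarity of their string sets against the unknown binary.

```python
# Toy string-based version identification via Jaccard similarity.
def jaccard(a, b):
    return len(a & b) / len(a | b)

reference = {                                 # version -> extracted strings
    "libfoo-1.0": {"libfoo 1.0", "init", "parse", "free_ctx"},
    "libfoo-1.1": {"libfoo 1.1", "init", "parse", "free_ctx", "reinit"},
}
unknown = {"libfoo 1.1", "init", "parse", "free_ctx", "reinit", "main"}

best = max(reference, key=lambda v: jaccard(reference[v], unknown))
print(best)   # libfoo-1.1
```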
Design space exploration tools for the ByoRISC configurable processor family
In this paper, the ByoRISC (Build your own RISC) configurable
application-specific instruction-set processor (ASIP) family is presented.
ByoRISCs, as vendor-independent cores, provide extensive architectural
parameters over a baseline processor, which can be customized by
application-specific hardware extensions (ASHEs). Such extensions realize
multi-input multi-output (MIMO) custom instructions with local state and
load/store accesses to the data memory. ByoRISCs incorporate a true multi-port
register file, zero-overhead custom instruction decoding, and scalable data
forwarding mechanisms. Given these design decisions, ByoRISCs provide a unique
combination of features that allow their use as architectural testbeds and the
seamless and rapid development of new high-performance ASIPs.
The performance characteristics of ByoRISCs, implemented as
vendor-independent cores, have been evaluated for both ASIC and FPGA
implementations, and it is proved that they provide a viable solution in
FPGA-based system-on-a-chip design. A case study of an image processing
pipeline is also presented to highlight the process of utilizing a ByoRISC
custom processor. A peak performance speedup of up to 8.5x can be
observed, whereas an average performance speedup of 4.4x on Xilinx
Virtex-4 targets is achieved. In addition, ByoRISC outperforms an experimental
VLIW architecture named VEX even in its 16-wide configuration for a number of
data-intensive application kernels.
Comment: 12 pages, 14 figures, 7 tables. Unpublished paper on ByoRISC, an
extensible RISC with MIMO CIs that can outperform most mid-range VLIWs.
Unfortunately Prof. Jorg Henkel destroyed the potential of this submission by
using immoral tactics (neglecting his conflict of interest, changing
reviewers accepting the paper, and requesting impossible additions for the
average lifetime of an Earthlin
Deep Learning Based Cryptographic Primitive Classification
Cryptovirological augmentations present an immediate, incomparable threat.
Over the last decade, the substantial proliferation of crypto-ransomware has
had widespread consequences for consumers and organisations alike. Established
preventive measures perform well; however, the problem has not ceased. Reverse
engineering potentially malicious software is a cumbersome task due to platform
eccentricities and obfuscated transmutation mechanisms, hence requiring
smarter, more efficient detection strategies. The following manuscript presents
a novel approach for the classification of cryptographic primitives in compiled
binary executables using deep learning. The model blueprint, a DCNN, is
fittingly configured to learn from variable-length control flow diagnostics
output from a dynamic trace. To rival the size and variability of contemporary
data compendiums, hence feeding the model cognition, a methodology for the
procedural generation of synthetic cryptographic binaries is defined, utilising
core primitives from OpenSSL with multivariate obfuscation, to draw a vastly
scalable distribution. The library, CryptoKnight, rendered an algorithmic pool
of AES, RC4, Blowfish, MD5 and RSA to synthesize combinable variants, which are
automatically fed into its core model. Converging at 91% accuracy, CryptoKnight
is successfully able to classify the sample algorithms with minimal loss.
Comment: 9 Pages, 6 Figure
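The data-shaping step can be sketched as follows (the traces, features, and the nearest-centroid stand-in for the DCNN are all invented for illustration): variable-length trace features are padded or truncated to a fixed width before classification.

```python
# Hypothetical sketch: fix the width of variable-length trace features,
# then classify with a trivial nearest-centroid model.
def fix_width(trace, width=8, pad=0):
    return (trace + [pad] * width)[:width]

# Toy labelled traces (opcode-frequency-like features, invented).
train = {
    "AES": [[4, 4, 4, 4, 1], [4, 5, 4, 4]],
    "RC4": [[1, 1, 9, 1], [1, 2, 9, 1, 1]],
}
centroids = {
    label: [sum(col) / len(col) for col in zip(*map(fix_width, traces))]
    for label, traces in train.items()
}

def classify(trace):
    v = fix_width(trace)
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(v, c))
    return min(centroids, key=lambda lbl: dist(centroids[lbl]))

print(classify([4, 4, 5, 4]))   # AES
```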
The Power of Distributed Verifiers in Interactive Proofs
We explore the power of interactive proofs with a distributed verifier. In
this setting, the verifier consists of n nodes and a graph G that defines
their communication pattern. The prover is a single entity that communicates
with all nodes by short messages. The goal is to verify that the graph G
belongs to some language in a small number of rounds, and with small
communication bound, i.e., the proof size.
This interactive model was introduced by Kol, Oshman and Saxena (PODC 2018)
as a generalization of non-interactive distributed proofs. They demonstrated
the power of interaction in this setting by constructing protocols for problems
such as Graph Symmetry and Graph Non-Isomorphism -- both of which require proofs
of Ω(n^2) bits without interaction.
In this work, we provide a new general framework for distributed interactive
proofs that allows one to translate standard interactive protocols to ones
where the verifier is distributed with short proof size. We show the following:
* Every (centralized) computation that can be performed in time O(n) can be
translated into a three-round distributed interactive protocol with O(log n)
proof size. This implies that many graph problems for sparse graphs have
succinct proofs.
* Every (centralized) computation implemented by either a small space or by a
uniform NC circuit can be translated into a distributed protocol with O(1)
rounds and O(log n) bits proof size for the low-space case, and polylog(n)
many rounds and proof size for NC.
* We show that for Graph Non-Isomorphism, there is a 4-round protocol with
O(log n) proof size, improving upon the O(n log n) proof size of Kol et
al.
* For many problems we show how to reduce proof size below the naturally
seeming barrier of log n. We get 5-round protocols with O(log log n) proof size
for a family of problems.
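For intuition, the classic centralized private-coin protocol for Graph Non-Isomorphism (the kind of protocol the paper's framework turns into a distributed one) fits in a few lines; the brute-force canonical form below is only viable for tiny graphs.

```python
# Toy round of the classic GNI interactive proof: the verifier sends a
# random relabelling of a randomly chosen graph; an all-powerful prover
# names its origin. Non-isomorphic inputs let the prover always win.
import random
from itertools import permutations

def canon(n, edges):
    # Brute-force canonical form: minimum over all relabellings.
    return min(
        tuple(sorted(tuple(sorted((p[u], p[v]))) for u, v in edges))
        for p in permutations(range(n))
    )

def shuffle_graph(n, edges):
    p = list(range(n)); random.shuffle(p)
    return [(p[u], p[v]) for u, v in edges]

def round_of_proof(n, g0, g1):
    i = random.randrange(2)                   # verifier's secret coin
    challenge = shuffle_graph(n, (g0, g1)[i])
    guess = 0 if canon(n, challenge) == canon(n, g0) else 1   # prover
    return guess == i                         # verifier accepts?

path = [(0, 1), (1, 2), (2, 3)]               # non-isomorphic to a star
star = [(0, 1), (0, 2), (0, 3)]
assert all(round_of_proof(4, path, star) for _ in range(20))
```

If the two graphs were isomorphic, the prover could do no better than guessing, so repeated rounds expose a cheating prover.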
IFC Inside: Retrofitting Languages with Dynamic Information Flow Control (Extended Version)
Many important security problems in JavaScript, such as browser extension
security, untrusted JavaScript libraries and safe integration of mutually
distrustful websites (mash-ups), may be effectively addressed using an
efficient implementation of information flow control (IFC). Unfortunately,
existing fine-grained approaches to JavaScript IFC require modifications to the
language semantics and its engine, a non-goal for browser applications. In this
work, we take the ideas of coarse-grained dynamic IFC and provide the
theoretical foundation for a language-based approach that can be applied to any
programming language for which external effects can be controlled. We then
apply this formalism to server- and client-side JavaScript, show how it
generalizes to the C programming language, and connect it to the Haskell LIO
system. Our methodology offers design principles for the construction of
information flow control systems when isolation can easily be achieved, as well
as compositional proofs for optimized concrete implementations of these
systems, by relating them to their isolated variants.
Comment: Extended version of POST'15 paper; 31 page
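The coarse-grained discipline can be sketched with a two-point lattice (this mirrors the spirit of LIO rather than the paper's JavaScript system; all names are illustrative): a single floating label covers the whole computation, reading secret data raises it, and output is permitted only to sinks at or above it.

```python
# Minimal coarse-grained dynamic IFC monitor with one floating label.
LABELS = {"public": 0, "secret": 1}

class Monitor:
    def __init__(self):
        self.current = "public"          # label of the whole computation

    def unlabel(self, label, value):
        # Coarse-grained: taint the entire computation, not each value.
        if LABELS[label] > LABELS[self.current]:
            self.current = label
        return value

    def output(self, sink_label, value):
        if LABELS[self.current] > LABELS[sink_label]:
            raise PermissionError("write would leak %s data" % self.current)
        return value

m = Monitor()
m.output("public", "ok before any secret read")
x = m.unlabel("secret", 42)
m.output("secret", x)                    # allowed: sink is secret
try:
    m.output("public", x)                # blocked: computation is tainted
except PermissionError as e:
    print(e)                             # write would leak secret data
```

The coarse granularity is what makes the approach retrofittable: the monitor needs to control only external effects, not every value in the language.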
Implementing a distributed λ-calculus interpreter
This paper describes how one can implement a distributed λ-calculus
interpreter from scratch. First, we describe how to implement a monadic
parser; then the Krivine machine is introduced for the interpretation part,
and for distribution the actor model is used. In this work we are not
providing a general solution for parallelism, but we consider particular
patterns which can always be parallelized. As a result, a basic extensible
implementation of a call-by-name distributed machine is introduced and a
prototype is presented. We achieved a computation speed improvement in some
cases, but an efficient distributed version was not achieved; the problems are
discussed in the evaluation section. This work provides a foundation for
further research: completing the implementation, it is possible to add
concurrency for non-determinism, improve the interpreter using call-by-need
semantics, or study optimal auto-parallelization to generalize what can be
done efficiently in parallel.
Comment: 8 pages, 4 tables, 1 figure, proceeding AINA-2018 workshop
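The interpretation core can be sketched as a single-node Krivine machine over de Bruijn terms (the actor-based distribution is omitted here, and the term representation is our own choice, not the paper's):

```python
# Call-by-name Krivine machine on de Bruijn terms.
# Terms: ("var", n) | ("lam", body) | ("app", f, arg)
def krivine(term):
    env, stack = [], []                      # closures are (term, env)
    while True:
        kind = term[0]
        if kind == "app":                    # push the argument closure
            stack.append((term[2], env))
            term = term[1]
        elif kind == "lam":
            if not stack:                    # weak head normal form
                return term, env
            term = term[1]
            env = [stack.pop()] + env        # bind unevaluated argument
        else:                                # variable: enter its closure
            term, env = env[term[1]]

# (λx. λy. x) A B  reduces to  A   (here A = λz. z, B = λz. z z)
K = ("lam", ("lam", ("var", 1)))
A = ("lam", ("var", 0))
B = ("lam", ("app", ("var", 0), ("var", 0)))
result, _ = krivine(("app", ("app", K, A), B))
assert result == A
```

Because arguments are pushed as unevaluated closures, B is never reduced, which is exactly the call-by-name behaviour the abstract refers to.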