8,287 research outputs found
A scalable multi-core architecture with heterogeneous memory structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs)
Neuromorphic computing systems comprise networks of neurons that use
asynchronous events for both computation and communication. This type of
representation offers several advantages in terms of bandwidth and power
consumption in neuromorphic electronic systems. However, managing the traffic
of asynchronous events in large scale systems is a daunting task, both in terms
of circuit complexity and memory requirements. Here we present a novel routing
methodology that employs both hierarchical and mesh routing strategies and
combines heterogeneous memory structures for minimizing both memory
requirements and latency, while maximizing programming flexibility to support a
wide range of event-based neural network architectures, through parameter
configuration. We validated the proposed scheme in a prototype multi-core
neuromorphic processor chip that employs hybrid analog/digital circuits for
emulating synapse and neuron dynamics together with asynchronous digital
circuits for managing the address-event traffic. We present a theoretical
analysis of the proposed connectivity scheme, describe the methods and circuits
used to implement such scheme, and characterize the prototype chip. Finally, we
demonstrate the use of the neuromorphic processor with a convolutional neural
network for the real-time classification of visual symbols being flashed to a
dynamic vision sensor (DVS) at high speed.Comment: 17 pages, 14 figure
Using quantum key distribution for cryptographic purposes: a survey
The appealing feature of quantum key distribution (QKD), from a cryptographic
viewpoint, is the ability to prove the information-theoretic security (ITS) of
the established keys. As a key establishment primitive, QKD however does not
provide a standalone security service in its own: the secret keys established
by QKD are in general then used by a subsequent cryptographic applications for
which the requirements, the context of use and the security properties can
vary. It is therefore important, in the perspective of integrating QKD in
security infrastructures, to analyze how QKD can be combined with other
cryptographic primitives. The purpose of this survey article, which is mostly
centered on European research results, is to contribute to such an analysis. We
first review and compare the properties of the existing key establishment
techniques, QKD being one of them. We then study more specifically two generic
scenarios related to the practical use of QKD in cryptographic infrastructures:
1) using QKD as a key renewal technique for a symmetric cipher over a
point-to-point link; 2) using QKD in a network containing many users with the
objective of offering any-to-any key establishment service. We discuss the
constraints as well as the potential interest of using QKD in these contexts.
We finally give an overview of challenges relative to the development of QKD
technology that also constitute potential avenues for cryptographic research.Comment: Revised version of the SECOQC White Paper. Published in the special
issue on QKD of TCS, Theoretical Computer Science (2014), pp. 62-8
Vector processing-aware advanced clock-gating techniques for low-power fused multiply-add
The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a retailoring for the mobile market that they are entering now. Floating-point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and “real-world” application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together, using “real-world” benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA 321253 and is supported in part by the European Union (FEDER funds) under contract TTIN2015-65316-P.
The work of I. Ratkovic was supported by a FPU research grant from the Spanish MECD.Peer ReviewedPostprint (author's final draft
Scalable Persistent Storage for Erlang
The many core revolution makes scalability a key property. The RELEASE project aims to improve the scalability of Erlang on emergent commodity architectures with 100,000 cores. Such architectures require scalable and available persistent storage on up to 100 hosts. We enumerate the requirements for scalable and available persistent storage, and evaluate four popular Erlang DBMSs against these requirements. This analysis shows that Mnesia and CouchDB are not suitable persistent storage at our target scale, but Dynamo-like NoSQL DataBase Management Systems (DBMSs) such as Cassandra and Riak potentially are. We investigate the current scalability limits of the Riak 1.1.1 NoSQL DBMS in practice on a 100-node cluster. We establish for the first time scientifically the scalability limit of Riak as 60 nodes on the Kalkyl cluster, thereby confirming developer folklore. We show that resources like memory, disk, and network do not limit the scalability of Riak. By instrumenting Erlang/OTP and Riak libraries we identify a specific Riak functionality that limits scalability. We outline how later releases of Riak are refactored to eliminate the scalability bottlenecks. We conclude that Dynamo-style NoSQL DBMSs provide scalable and available persistent storage for Erlang in general, and for our RELEASE target architecture in particular
Deep supervised learning using local errors
Error backpropagation is a highly effective mechanism for learning
high-quality hierarchical features in deep networks. Updating the features or
weights in one layer, however, requires waiting for the propagation of error
signals from higher layers. Learning using delayed and non-local errors makes
it hard to reconcile backpropagation with the learning mechanisms observed in
biological neural networks as it requires the neurons to maintain a memory of
the input long enough until the higher-layer errors arrive. In this paper, we
propose an alternative learning mechanism where errors are generated locally in
each layer using fixed, random auxiliary classifiers. Lower layers could thus
be trained independently of higher layers and training could either proceed
layer by layer, or simultaneously in all layers using local error information.
We address biological plausibility concerns such as weight symmetry
requirements and show that the proposed learning mechanism based on fixed,
broad, and random tuning of each neuron to the classification categories
outperforms the biologically-motivated feedback alignment learning technique on
the MNIST, CIFAR10, and SVHN datasets, approaching the performance of standard
backpropagation. Our approach highlights a potential biological mechanism for
the supervised, or task-dependent, learning of feature hierarchies. In
addition, we show that it is well suited for learning deep networks in custom
hardware where it can drastically reduce memory traffic and data communication
overheads
Check-Agnosia based Post-Processor for Message-Passing Decoding of Quantum LDPC Codes
The inherent degeneracy of quantum low-density parity-check codes poses a
challenge to their decoding, as it significantly degrades the error-correction
performance of classical message-passing decoders. To improve their
performance, a post-processing algorithm is usually employed. To narrow the gap
between algorithmic solutions and hardware limitations, we introduce a new
post-processing algorithm with a hardware-friendly orientation, providing error
correction performance competitive to the state-of-the-art techniques. The
proposed post-processing, referred to as check-agnosia, is inspired by
stabilizer-inactivation, while considerably reducing the required hardware
resources, and providing enough flexibility to allow different message-passing
schedules and hardware architectures. We carry out a detailed analysis for a
set of Pareto architectures with different tradeoffs between latency and power
consumption, derived from the results of implemented designs on an FPGA board.
We show that latency values close to one microsecond can be obtained on the
FPGA board, and provide evidence that much lower latency values can be obtained
for ASIC implementations. In the process, we also demonstrate the practical
implications of the recently introduced t-covering layers and random-order
layered scheduling
Energy-Efficient FPGA-Based Parallel Quasi-Stochastic Computing
The high performance of FPGA (Field Programmable Gate Array) in image processing applications is justified by its flexible reconfigurability, its inherent parallel nature and the availability of a large amount of internal memories. Lately, the Stochastic Computing (SC) paradigm has been found to be significantly advantageous in certain application domains including image processing because of its lower hardware complexity and power consumption. However, its viability is deemed to be limited due to its serial bitstream processing and excessive run-time requirement for convergence. To address these issues, a novel approach is proposed in this work where an energy-efficient implementation of SC is accomplished by introducing fast-converging Quasi-Stochastic Number Generators (QSNGs) and parallel stochastic bitstream processing, which are well suited to leverage FPGA\u27s reconfigurability and abundant internal memory resources. The proposed approach has been tested on the Virtex-4 FPGA, and results have been compared with the serial and parallel implementations of conventional stochastic computation using the well-known SC edge detection and multiplication circuits. Results prove that by using this approach, execution time, as well as the power consumption are decreased by a factor of 3.5 and 4.5 for the edge detection circuit and multiplication circuit, respectively
- …