A Modern Primer on Processing in Memory
Modern computing systems are overwhelmingly designed to move data to
computation. This design choice goes directly against at least three key trends
in computing that cause performance, scalability and energy bottlenecks: (1)
data access is a key bottleneck as many important applications are increasingly
data-intensive, and memory bandwidth and energy do not scale well, (2) energy
consumption is a key limiter in almost all computing platforms, especially
server and mobile systems, (3) data movement, especially off-chip to on-chip,
is very expensive in terms of bandwidth, energy and latency, much more so than
computation. These trends are felt especially severely in the data-intensive
server and energy-constrained mobile systems of today. At the same time,
conventional memory technology is facing many technology scaling challenges in
terms of reliability, energy, and performance. As a result, memory system
architects are open to organizing memory in different ways and making it more
intelligent, at the expense of higher cost. The emergence of 3D-stacked memory
plus logic, the adoption of error correcting codes inside the latest DRAM
chips, proliferation of different main memory standards and chips, specialized
for different purposes (e.g., graphics, low-power, high bandwidth, low
latency), and the necessity of designing new solutions to serious reliability
and security issues, such as the RowHammer phenomenon, are evidence of this
trend. This chapter discusses recent research that aims to practically enable
computation close to data, an approach we call processing-in-memory (PIM). PIM
places computation mechanisms in or near where the data is stored (i.e., inside
the memory chips, in the logic layer of 3D-stacked memory, or in the memory
controllers), so that data movement between the computation units and memory is
reduced or eliminated.
Comment: arXiv admin note: substantial text overlap with arXiv:1903.0398
Co-designing reliability and performance for datacenter memory
Memory is one of the key components that affect the reliability and performance of datacenter servers. Memory in today’s servers is organized and shared in several ways to provide the most performant and efficient access to data. Examples include cache hierarchies in multi-core chips to reduce access latency, non-uniform memory access (NUMA) in multi-socket servers to improve scalability, and disaggregation to increase memory capacity. In all these organizations, hardware coherence protocols are used to maintain memory consistency of this shared memory and implicitly move data to the requesting cores.
This thesis aims to provide fault tolerance against newer models of failure in the organization of memory in datacenter servers. While designing for improved reliability, this thesis explores solutions that can also enhance the performance of applications. The solutions build on modern coherence protocols to achieve these properties.
First, we observe that DRAM memory system failure rates have increased, demanding stronger forms of memory reliability. To combat this, the thesis proposes Dvé, a hardware-driven replication mechanism where data blocks are replicated across two different memory controllers in a cache-coherent NUMA system. Data blocks are accompanied by a code with strong error detection capabilities so that when an error is detected, correction is performed using the replica. Dvé’s organization offers two independent points of access to data, which enables: (a) strong error correction that can recover from a range of faults affecting any of the components in the memory and (b) higher performance by providing another, nearer point of memory access. Dvé’s coherent replication keeps the replicas in sync for reliability and also provides coherent access to read replicas during fault-free operation for improved performance. Dvé can flexibly provide these benefits on demand at runtime.
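The detect-then-correct-from-replica idea described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not Dvé's hardware design: the `ReplicatedMemory` class, its method names, and the use of CRC32 as the detection code are all hypothetical stand-ins chosen for illustration.

```python
import zlib


class ReplicatedMemory:
    """Toy model of two memory controllers that each hold a copy of every block."""

    def __init__(self):
        # controllers[i] maps address -> (data, detection_code)
        self.controllers = [{}, {}]

    def write(self, addr, data):
        # Keep both replicas in sync on every write.
        record = (data, zlib.crc32(data))
        for store in self.controllers:
            store[addr] = record

    def read(self, addr, nearest=0):
        # Try the nearer controller first, then fall back to the replica.
        for idx in (nearest, 1 - nearest):
            data, code = self.controllers[idx][addr]
            if zlib.crc32(data) == code:
                # Detection passed: rewrite the other copy so that a
                # corrupt replica is repaired from the good one.
                self.controllers[1 - idx][addr] = (data, code)
                return data
        raise IOError(f"both replicas of block {addr} failed the check")

    def inject_fault(self, ctrl, addr, bad_data):
        # Overwrite the data but keep the old code, so the check will fail.
        _, code = self.controllers[ctrl][addr]
        self.controllers[ctrl][addr] = (bad_data, code)


mem = ReplicatedMemory()
mem.write(0x40, b"payload")
mem.inject_fault(0, 0x40, b"pa?load")   # corrupt the nearer copy
recovered = mem.read(0x40, nearest=0)   # served from the replica instead
```

In this model the `nearest` argument stands in for the performance benefit (reading from whichever copy is closer), while the fallback path stands in for the reliability benefit.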
Next, we observe that the coherence protocol itself needs to be hardened against failures. Memory in datacenter servers is being disaggregated from the compute servers into dedicated memory servers, driven by standards like CXL. CXL specifies the coherence protocol semantics for compute servers to access and cache data from a shared region in the disaggregated memory. However, the CXL specification lacks the level of fault tolerance needed to operate at an inter-server scale within the datacenter. Compute servers in the datacenter can fail or become unresponsive, so it is important that the coherence protocol remain available in the presence of such failures.
The thesis proposes Āpta, a CXL-based, shared disaggregated memory system for keeping the cached data consistent without compromising availability in the face of compute server failures. Āpta architects a high-performance fault-tolerant object-granular memory server that significantly improves performance for stateless function-as-a-service (FaaS) datacenter applications
AI/ML Algorithms and Applications in VLSI Design and Technology
An evident challenge ahead for the integrated circuit (IC) industry in the
nanometer regime is the investigation and development of methods that can
reduce the design complexity ensuing from growing process variations and
curtail the turnaround time of chip manufacturing. Conventional methodologies
employed for such tasks are largely manual; thus, time-consuming and
resource-intensive. In contrast, the unique learning strategies of artificial
intelligence (AI) provide numerous exciting automated approaches for handling
complex and data-intensive tasks in very-large-scale integration (VLSI) design
and testing. Employing AI and machine learning (ML) algorithms in VLSI design
and manufacturing reduces the time and effort for understanding and processing
the data within and across different abstraction levels via automated learning
algorithms. This, in turn, improves the IC yield and reduces the manufacturing
turnaround time. This paper thoroughly reviews the AI/ML automated approaches
introduced in the past towards VLSI design and manufacturing. Moreover, we
discuss the scope of AI/ML applications in the future at various abstraction
levels to revolutionize the field of VLSI design, aiming for high-speed, highly
intelligent, and efficient implementations
On the application of neural networks to symbol systems.
While for many years two alternative approaches to building intelligent systems, symbolic
AI and neural networks, have each demonstrated specific advantages and also revealed
specific weaknesses, in recent years a number of researchers have sought methods of combining
the two into a unified methodology which embodies the benefits of each while attenuating the
disadvantages.
This work sets out to identify the key ideas from each discipline and combine them
into an architecture which would be practically scalable for very large network applications.
The architecture is based on a relational database structure and forms the environment for an
investigation into the necessary properties of a symbol encoding which will permit the single-presentation
learning of patterns and associations, the development of categories and features
leading to robust generalisation and the seamless integration of a range of memory persistencies
from short to long term.
It is argued that if, as proposed by many proponents of symbolic AI, the symbol encoding
must be causally related to its syntactic meaning, then it must also be mutable as the network
learns and grows, adapting to the growing complexity of the relationships in which it is
instantiated. Furthermore, it is argued that in order to create an efficient and coherent memory
structure, the symbolic encoding itself must have an underlying structure which is not accessible
symbolically; this structure would provide the framework permitting structurally sensitive processes
to act upon symbols without explicit reference to their content. Such a structure must dictate
how new symbols are created during normal operation.
The network implementation proposed is based on K-from-N codes, which are shown
to possess a number of desirable qualities and are well matched to the requirements of the symbol
encoding. Several networks are developed and analysed to exploit these codes, based around
a recurrent version of the non-holographic associative memory of Willshaw et al. The simplest
network is shown to have properties similar to those of a Hopfield network, but the storage capacity
is shown to be greater, though at a cost of lower signal-to-noise ratio.
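To make the terms concrete, here is a minimal sketch of a binary (Willshaw-style) associative memory storing K-from-N codes, that is, patterns with exactly K active units out of N. It is an illustration under assumptions rather than the thesis's network: the names `k_from_n_code` and `WillshawMemory` are hypothetical, the memory is auto-associative, and recall is a single step that keeps the K units with the largest input sums.

```python
import random


def k_from_n_code(n, k, rng):
    """Draw a random code with exactly k active units out of n."""
    return frozenset(rng.sample(range(n), k))


class WillshawMemory:
    """Binary weight matrix; a weight is set when two units are co-active."""

    def __init__(self, n):
        self.n = n
        self.w = [[0] * n for _ in range(n)]  # w[output][input]

    def store(self, pattern):
        # Clipped Hebbian learning: weights only ever switch from 0 to 1.
        for i in pattern:
            for j in pattern:
                self.w[i][j] = 1

    def recall(self, cue, k):
        # Sum the weights from the active cue units into every output unit,
        # then keep the k units with the largest sums (ties go to lower index).
        sums = [sum(self.w[i][j] for j in cue) for i in range(self.n)]
        winners = sorted(range(self.n), key=lambda i: (-sums[i], i))[:k]
        return frozenset(winners)


rng = random.Random(0)
memory = WillshawMemory(n=16)
pattern = k_from_n_code(16, 4, rng)
memory.store(pattern)
cue = frozenset(sorted(pattern)[:3])     # drop one active unit from the code
completed = memory.recall(cue, k=4)      # the missing unit is filled back in
```

Feeding the output of `recall` back in as the next cue would give a simple recurrent variant; as more overlapping patterns are stored, spurious units start to tie with the correct ones, which is one way to see the capacity versus signal-to-noise trade-off the abstract mentions.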
Subsequent network additions break each K-from-N pattern into L subsets, each using
D-from-N coding, creating cyclic patterns of period L. This step increases the capacity still further
but at a cost of lower signal-to-noise ratio. The use of the network in associating pairs of
input patterns with any given output pattern, an architectural requirement, is verified.
The use of complex synaptic junctions is investigated as a means to increase storage
capacity, to address the stability-plasticity dilemma and to implement the hierarchical aspects
of the symbol encoding defined in the architecture. A wide range of options is developed which
allow a number of key global parameters to be traded off. One scheme is analysed and simulated.
A final section examines some of the elements that need to be added to our current understanding
of neural network-based reasoning systems to make general purpose intelligent systems
possible. It is argued that the sections of this work represent pieces of the whole in this
regard and that their integration will provide a sound basis for making such systems a reality
Computational Neural Learning Formalisms for Perceptual Manipulation: Singularity Interaction Dynamics Model.
This dissertation addresses a fundamental problem in computational AI: developing a class of massively parallel, neural algorithms for learning robustly, and in real time, complex nonlinear transformations from representative exemplars. Provision of such a capability is at the core of many real-life problems in robotics, signal processing and control. The concepts of terminal attractors in dynamical systems theory and adjoint operators in nonlinear sensitivity theory are exploited to provide a firm mathematical foundation for learning such mappings with dynamical neural networks, while achieving a dramatic reduction in the overall computational costs. Further, we derive an efficient methodology for handling a multiplicity of application-specific constraints at run time that avoids additional retraining or disturbing the synaptic structure of the learned network. The scalability of proposed theoretical models to large-scale embodiments in neural hardware is analyzed. Neurodynamical parameters, e.g., decay constants, response gains, etc., are systematically analyzed to understand their implications for network scalability, convergence, throughput and fault tolerance, during both concurrent simulations and implementation in concurrently asynchronous VLSI, optical and opto-electronic hardware. Dynamical diagnostics, e.g., Lyapunov exponents, are used to formally characterize the widely observed dynamical instability in neural networks as emergent computational chaos. Using contracting operators and nonconstructive theorems from fixed point theory, we rigorously derive necessary and sufficient conditions for eliminating all oscillatory and chaotic behavior in additive-type networks. Extensive benchmarking experiments are conducted with arbitrarily large neural networks (over 100 million interconnects) to verify the methodological robustness of our network conditioning formalisms.
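As a rough illustration of how a Lyapunov exponent separates stable from chaotic dynamics, the sketch below estimates it for the one-dimensional logistic map. The logistic map is a swapped-in textbook system, not the additive-type neural networks of the dissertation, and the function name and default parameters are assumptions made for this example.

```python
import math


def logistic_lyapunov(r, x0=0.2, burn_in=1000, steps=10000):
    """Estimate the Lyapunov exponent of the map x -> r * x * (1 - x)."""
    x = x0
    # Discard the transient so the estimate reflects long-run behavior.
    for _ in range(burn_in):
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        # |f'(x)| measures how fast nearby trajectories separate at x.
        slope = abs(r * (1.0 - 2.0 * x))
        total += math.log(max(slope, 1e-300))  # guard against log(0)
        x = r * x * (1.0 - x)
    return total / steps


stable = logistic_lyapunov(2.5)    # trajectory settles onto a fixed point
chaotic = logistic_lyapunov(4.0)   # trajectory never settles
```

A negative estimate (as expected for `r = 2.5`) means perturbations shrink over time, while a positive estimate (as expected for `r = 4.0`) means they grow exponentially, which is the usual working signature of chaos.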
Finally, we provide insight for exploiting our proposed repertoire of neural learning formalisms in addressing a fundamental problem in robotics: manipulation controller design for robots operating in unpredictable environments. Using some recent results in task analysis and dynamic modeling, we develop the Perceptual Manipulation Architecture. The architecture, conceptualized within a perceptual framework, is shown to be well beyond state-of-the-art model-directed robotics. For a stronger physical interpretation of its implications, our discussions are embedded in the context of a novel systems' concept for automated space operations