3,613 research outputs found
Near-Memory Address Translation
Memory and logic integration on the same chip is becoming increasingly cost
effective, creating the opportunity to offload data-intensive functionality to
processing units placed inside memory chips. The introduction of memory-side
processing units (MPUs) into conventional systems faces virtual memory as the
first big showstopper: without efficient hardware support for address
translation MPUs have highly limited applicability. Unfortunately, conventional
translation mechanisms fall short of providing fast translations as
contemporary memories exceed the reach of TLBs, making expensive page walks
common.
In this paper, we are the first to show that the historically important
flexibility to map any virtual page to any page frame is unnecessary in today's
servers. We find that while limiting the associativity of the
virtual-to-physical mapping incurs no penalty, it can break the
translate-then-fetch serialization if combined with careful data placement in
the MPU's memory, allowing for translation and data fetch to proceed
independently and in parallel. We propose the Distributed Inverted Page Table
(DIPTA), a near-memory structure in which the smallest memory partition keeps
the translation information for its data share, ensuring that the translation
completes together with the data fetch. DIPTA completely eliminates the
performance overhead of translation, achieving speedups of up to 3.81x and
2.13x over conventional translation using 4KB and 1GB pages respectively.Comment: 15 pages, 9 figure
Brief Announcement: Persistent Software Combining
We study the performance power of software combining in designing recoverable algorithms and data structures. We present two recoverable synchronization protocols, one blocking and another wait-free, which illustrate how to use software combining to achieve both low persistence and synchronization cost. Our experiments show that these protocols outperform by far state-of-the-art recoverable universal constructions and transactional memory systems. We built recoverable queues and stacks, based on these protocols, that exhibit much better performance than previous such implementations
Coupled Depth Learning
In this paper we propose a method for estimating depth from a single image
using a coarse to fine approach. We argue that modeling the fine depth details
is easier after a coarse depth map has been computed. We express a global
(coarse) depth map of an image as a linear combination of a depth basis learned
from training examples. The depth basis captures spatial and statistical
regularities and reduces the problem of global depth estimation to the task of
predicting the input-specific coefficients in the linear combination. This is
formulated as a regression problem from a holistic representation of the image.
Crucially, the depth basis and the regression function are {\bf coupled} and
jointly optimized by our learning scheme. We demonstrate that this results in a
significant improvement in accuracy compared to direct regression of depth
pixel values or approaches learning the depth basis disjointly from the
regression function. The global depth estimate is then used as a guidance by a
local refinement method that introduces depth details that were not captured at
the global level. Experiments on the NYUv2 and KITTI datasets show that our
method outperforms the existing state-of-the-art at a considerably lower
computational cost for both training and testing.Comment: 10 pages, 3 Figures, 4 Tables with quantitative evaluation
Atomic-SDN: Is Synchronous Flooding the Solution to Software-Defined Networking in IoT?
The adoption of Software Defined Networking (SDN) within traditional networks
has provided operators the ability to manage diverse resources and easily
reconfigure networks as requirements change. Recent research has extended this
concept to IEEE 802.15.4 low-power wireless networks, which form a key
component of the Internet of Things (IoT). However, the multiple traffic
patterns necessary for SDN control makes it difficult to apply this approach to
these highly challenging environments. This paper presents Atomic-SDN, a highly
reliable and low-latency solution for SDN in low-power wireless. Atomic-SDN
introduces a novel Synchronous Flooding (SF) architecture capable of
dynamically configuring SF protocols to satisfy complex SDN control
requirements, and draws from the authors' previous experiences in the IEEE EWSN
Dependability Competition: where SF solutions have consistently outperformed
other entries. Using this approach, Atomic-SDN presents considerable
performance gains over other SDN implementations for low-power IoT networks. We
evaluate Atomic-SDN through simulation and experimentation, and show how
utilizing SF techniques provides latency and reliability guarantees to SDN
control operations as the local mesh scales. We compare Atomic-SDN against
other SDN implementations based on the IEEE 802.15.4 network stack, and
establish that Atomic-SDN improves SDN control by orders-of-magnitude across
latency, reliability, and energy-efficiency metrics
A New System Architecture for Heterogeneous Compute Units
The ongoing trend to more heterogeneous systems forces us to rethink the design of systems. In this work, I study a new system design that considers heterogeneous compute units (general-purpose cores with different instruction sets, DSPs, FPGAs, fixed-function accelerators, etc.) from the beginning instead of as an afterthought. The goal is to treat all compute units (CUs) as first-class citizens, enabling (1) isolation and secure communication between all types of CUs, (2) a direct interaction of all CUs, removing the conventional CPU from the critical path, and (3) access to operating system (OS) services such as file systems and network stacks for all CUs.
To study this system design, I am using a hardware/software co-design based on two key ideas: 1) introduce a new hardware component next to each CU used by the OS as the CUs' common interface and 2) let the OS kernel control applications remotely from a different CU. The hardware component is called data transfer unit (DTU) and offers the minimal set of features to reach the stated goals: secure message passing and memory access. The OS is called M³ and runs its kernel on a dedicated CU and runs the OS services and applications on the remaining CUs. The kernel is responsible for establishing DTU-based communication channels between services and applications. After a channel has been set up, services and applications communicate directly without involving
the kernel. This approach allows to support arbitrary CUs as aforementioned first-class citizens, ranging from fixed-function accelerators to complex general-purpose cores
- …