GLB: Lifeline-based Global Load Balancing library in X10
We present GLB, a programming model and an associated implementation that can
handle a wide range of irregular parallel programming problems running over
large-scale distributed systems. GLB is applicable both to problems that are
easily load-balanced via static scheduling and to problems that are hard to
statically load balance. GLB hides the intricate synchronizations (e.g.,
inter-node communication, initialization and startup, load balancing,
termination and result collection) from the users. GLB internally uses a
version of the lifeline graph based work-stealing algorithm proposed by
Saraswat et al. Users of GLB are simply required to write several pieces of
sequential code that comply with the GLB interface. GLB then schedules and
orchestrates the parallel execution of the code correctly and efficiently at
scale. We have applied GLB to two representative benchmarks: Betweenness
Centrality (BC) and Unbalanced Tree Search (UTS). Among them, BC can be
statically load-balanced whereas UTS cannot. In either case, GLB scales well,
achieving nearly linear speedup on different computer architectures (Power,
Blue Gene/Q, and K) on up to 16K cores.
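The idea of writing only sequential pieces and letting a scheduler move work around can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `TaskBag` class, its `process`/`split`/`merge` methods, and the `run` driver are hypothetical names, not the GLB interface, and the driver is a single-process simulation that steals from ring neighbours rather than the lifeline-graph algorithm the abstract refers to.

```python
from collections import deque


class TaskBag:
    """Hypothetical work container: the user writes only sequential code."""

    def __init__(self, items=()):
        self.items = deque(items)
        self.count = 0  # number of tasks this bag has processed

    def process(self, n):
        """Run up to n tasks sequentially; return True if work remains."""
        done = 0
        while self.items and done < n:
            depth = self.items.pop()
            # Toy irregular workload: a node at depth d spawns d children.
            self.items.extend([depth - 1] * depth)
            self.count += 1
            done += 1
        return bool(self.items)

    def split(self):
        """Hand half of the pending tasks to a thief, or None if too few."""
        if len(self.items) < 2:
            return None
        half = len(self.items) // 2
        return TaskBag(self.items.popleft() for _ in range(half))

    def merge(self, other):
        """Absorb tasks received from another bag."""
        self.items.extend(other.items)


def run(workers, root_depth, chunk=4):
    """Simulate the workers in one process; idle workers steal from neighbours."""
    bags = [TaskBag() for _ in range(workers)]
    bags[0].items.append(root_depth)
    while any(bag.items for bag in bags):
        for i, bag in enumerate(bags):
            if not bag.items:
                for j in range(1, workers):
                    loot = bags[(i + j) % workers].split()
                    if loot is not None:
                        bag.merge(loot)
                        break
            bag.process(chunk)
    return [bag.count for bag in bags]
```

Calling `run(4, 5)` returns the per-worker task counts. Their sum is the same for any number of workers, since stealing only changes who processes a task, not how many tasks exist.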
Synaptic nanomodules underlie the organization and plasticity of spine synapses.
Experience results in long-lasting changes in dendritic spine size, yet how the molecular architecture of the synapse responds to plasticity remains poorly understood. Here a combined approach of multicolor stimulated emission depletion microscopy (STED) and confocal imaging in rat and mouse demonstrates that structural plasticity is linked to the addition of unitary synaptic nanomodules to spines. Spine synapses in vivo and in vitro contain discrete and aligned subdiffraction modules of pre- and postsynaptic proteins whose number scales linearly with spine size. Live-cell time-lapse super-resolution imaging reveals that NMDA receptor-dependent increases in spine size are accompanied both by enhanced mobility of pre- and postsynaptic modules that remain aligned with each other and by a coordinated increase in the number of nanomodules. These findings suggest a simplified model for experience-dependent structural plasticity relying on an unexpectedly modular nanomolecular architecture of synaptic proteins.
8x8 Reconfigurable quantum photonic processor based on silicon nitride waveguides
The development of large-scale optical quantum information processing
circuits builds on the stability and reconfigurability enabled by integrated
photonics. We demonstrate a reconfigurable 8x8 integrated linear optical
network based on silicon nitride waveguides for quantum information processing.
Our processor implements a novel optical architecture enabling any arbitrary
linear transformation and constitutes the largest programmable circuit reported
so far on this platform. We validate a variety of photonic quantum information
processing primitives, in the form of Hong-Ou-Mandel interference, bosonic
coalescence/anticoalescence and high-dimensional single-photon quantum gates.
We achieve fidelities that clearly demonstrate the promising future for
large-scale photonic quantum information processing using low-loss silicon
nitride.
Comment: Added supplementary materials, extended introduction, new figures,
results unchanged
Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs
Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses into larger remote access operations. A straightforward implementation of the inspector-executor transformation results in excessive instrumentation that hinders performance. This paper addresses this issue and introduces various techniques that aim at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs) [S. Aarseth, Gravitational N-Body Simulations: Tools and Algorithms, Cambridge Monographs on Mathematical Physics, Cambridge University Press, 2003.], the inlining of data locality checks and the usage of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that were propagated through the loop body. A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase the performance of UPC programs by up to 1.8x over their hand-optimized UPC counterparts for applications with regular accesses and by up to 6.3x for applications with irregular accesses.
Peer Reviewed. Postprint (author's final draft)
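The inspector-executor idea mentioned in this abstract can be sketched in a few lines. This is a simplified illustration in Python rather than UPC, under stated assumptions: the shared array is modelled as fixed-size blocks, and `inspect`, `execute`, `gather`, and the `fetch_block` callback are hypothetical names, not part of the UPC runtime or of the paper's compiler transformation.

```python
def inspect(indices, block_size):
    """Inspector: walk the access pattern and record which blocks are needed."""
    return sorted({i // block_size for i in indices})


def execute(indices, blocks, fetch_block, block_size):
    """Executor: one bulk fetch per needed block, then only local reads."""
    local = {b: fetch_block(b) for b in blocks}
    return [local[i // block_size][i % block_size] for i in indices]


def gather(indices, fetch_block, block_size):
    """Run the inspector, then the executor, for one irregular access loop."""
    return execute(indices, inspect(indices, block_size), fetch_block, block_size)


# Stand-in for a shared array split into 4 blocks of 8 elements (an assumption).
BLOCK = 8
shared = list(range(100, 132))
fetches = []  # log of coarse-grained "remote" transfers


def fetch_block(b):
    """Pretend remote transfer of one whole block."""
    fetches.append(b)
    return shared[b * BLOCK:(b + 1) * BLOCK]


values = gather([3, 17, 5, 18, 2, 30], fetch_block, BLOCK)
```

Six fine-grained accesses are served by three block transfers here. The instrumentation cost the abstract discusses corresponds to the extra pass in `inspect`, which is what the paper's optimizations aim to reduce.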
Deep Attributes Driven Multi-Camera Person Re-identification
The visual appearance of a person is easily affected by many factors like
pose variations, viewpoint changes and camera parameter differences. This makes
person Re-Identification (ReID) among multiple cameras a very challenging task.
This work is motivated to learn mid-level human attributes which are robust to
such visual appearance variations. We propose a semi-supervised attribute
learning framework that progressively boosts the accuracy of attributes using
only a limited amount of labeled data. Specifically, this framework involves a
three-stage training. A deep Convolutional Neural Network (dCNN) is first
trained on an independent dataset labeled with attributes. Then it is
fine-tuned on another dataset only labeled with person IDs using our defined
triplet loss. Finally, the updated dCNN predicts attribute labels for the
target dataset, which is combined with the independent dataset for the final
round of fine-tuning. The predicted attributes, namely deep attributes,
exhibit superior generalization ability across different datasets. By directly
using the deep attributes with simple Cosine distance, we have obtained
surprisingly good accuracy on four person ReID datasets. Experiments also show
that a simple metric learning module further boosts our method, making it
significantly outperform many recent works.
Comment: Person Re-identification; 17 pages; 5 figures; In IEEE ECCV 201
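Matching by cosine distance over attribute vectors, as the abstract describes, can be illustrated with a minimal sketch. The attribute vectors below are made-up toy values, and `cosine_distance` and `rank_gallery` are hypothetical helper names; the sketch assumes non-zero vectors and says nothing about how the attributes themselves are learned.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery IDs ordered from most to least similar to the query."""
    return sorted(gallery, key=lambda pid: cosine_distance(query, gallery[pid]))


# Toy attribute vectors (illustrative values only).
gallery = {
    "person_a": [1.0, 0.0, 1.0],
    "person_b": [0.0, 1.0, 0.0],
    "person_c": [1.0, 1.0, 1.0],
}
ranking = rank_gallery([1.0, 0.0, 1.0], gallery)
```

Because cosine distance ignores vector magnitude, only the relative pattern of attribute activations affects the ranking.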
A Low Cost Two-Tier Architecture Model For High Availability Clusters Application Load Balancing
This article proposes the design and implementation of a low-cost two-tier
architecture model for high availability clusters, combined with load-balancing
and shared storage technology, to achieve the desired scale of a three-tier
architecture for application load balancing (e.g., web servers). The research work
proposes a design that physically omits Network File System (NFS) server nodes
and implements NFS server functionality within the cluster nodes, through Red
Hat Cluster Suite (RHCS) with High Availability (HA) proxy load balancing
technologies. The proposed architecture keeps the implementation low-cost in
terms of investment in hardware and computing solutions. The system is intended
to provide steady service even when system components such as the network,
storage, or applications fail unexpectedly.
Comment: Load balancing, high availability cluster, web server cluster
FastDepth: Fast Monocular Depth Estimation on Embedded Systems
Depth sensing is a critical function for robotic tasks such as localization,
mapping and obstacle detection. There has been a significant and growing
interest in depth estimation from a single RGB image, due to the relatively low
cost and size of monocular cameras. However, state-of-the-art single-view depth
estimation algorithms are based on fairly complex deep neural networks that are
too slow for real-time inference on an embedded platform, for instance, mounted
on a micro aerial vehicle. In this paper, we address the problem of fast depth
estimation on embedded systems. We propose an efficient and lightweight
encoder-decoder network architecture and apply network pruning to further
reduce computational complexity and latency. In particular, we focus on the
design of a low-latency decoder. Our methodology demonstrates that it is
possible to achieve accuracy similar to prior work on depth estimation, but at
inference speeds that are an order of magnitude faster. Our proposed network,
FastDepth, runs at 178 fps on an NVIDIA Jetson TX2 GPU and at 27 fps when using
only the TX2 CPU, with active power consumption under 10 W. FastDepth achieves
close to state-of-the-art accuracy on the NYU Depth v2 dataset. To the best of
the authors' knowledge, this paper demonstrates real-time monocular depth
estimation using a deep neural network with the lowest latency and highest
throughput on an embedded platform that can be carried by a micro aerial
vehicle.
Comment: Accepted for presentation at ICRA 2019. 8 pages, 6 figures, 7 tables