Kinetic AGN Feedback Effects on Cluster Cool Cores Simulated using SPH
We implement novel numerical models of AGN feedback in the SPH code GADGET-3, in which the energy from a supermassive black hole (BH) is coupled to the surrounding gas in kinetic form. Gas particles lying inside a bi-conical volume around the BH receive a one-time velocity increment of 10,000 km/s. We perform hydrodynamical simulations of an isolated cluster (total mass 10^14 h^-1 M_sun), which is initially evolved to form a dense cool core with central T < 10^6 K. A BH residing at the cluster center then ejects energy. The feedback-driven fast wind shocks against the slower-moving gas, thermalizing the imparted kinetic energy. Bipolar bubble-like outflows form and propagate radially outward to distances of a few hundred kpc. The radial profiles of the median gas properties are influenced by BH feedback in the inner regions (r < 20-50 kpc). BH kinetic feedback with a large feedback efficiency depletes the inner cool gas and reduces the hot gas content, such that the initial cool core of the cluster is heated up within 1.9 Gyr: the core median temperature rises above 10^7 K and the central entropy profile flattens. Our implementation of BH thermal feedback (using the same efficiency as the kinetic case), within the star-formation model, cannot produce this heating, and the cool core remains. Including cold gas accretion in the simulations naturally produces an AGN duty cycle with a periodicity of 100 Myr.
Comment: 22 pages, 11 figures, version accepted for publication in MNRAS, references and minor revisions added
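The bi-conical kick described above can be sketched in a few lines of NumPy; the array layout, opening angle, and function name below are illustrative assumptions for clarity, not the paper's actual GADGET-3 implementation.

```python
import numpy as np

def apply_biconical_kick(pos, vel, bh_pos, axis,
                         half_angle_deg=30.0, v_kick=10000.0):
    """Impart a one-time radial velocity increment (km/s) to particles
    lying inside a bi-cone around the BH.  The opening angle and array
    layout are illustrative, not the paper's GADGET-3 values."""
    r = pos - bh_pos                            # BH-to-particle vectors
    r_hat = r / np.linalg.norm(r, axis=1)[:, None]
    axis = axis / np.linalg.norm(axis)
    # |cos(theta)| test selects BOTH lobes of the bi-cone
    in_cone = np.abs(r_hat @ axis) >= np.cos(np.radians(half_angle_deg))
    vel = vel.copy()
    vel[in_cone] += v_kick * r_hat[in_cone]     # kick is radially outward
    return vel, in_cone
```

The thermalization happens downstream of this step, when the kicked wind shocks against the ambient gas in the hydrodynamical solver.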
Energy Efficient Load Latency Tolerance: Single-Thread Performance for the Multi-Core Era
Around 2003, newly activated power constraints caused single-thread performance growth to slow dramatically. The multi-core era was born with an emphasis on explicitly parallel software. Continuing to grow single-thread performance is still important in the multi-core context, but it must be done in an energy efficient way.
One significant impediment to performance growth in both out-of-order and in-order processors is the long latency of last-level cache misses. Prior work introduced the idea of load latency tolerance---the ability to dynamically remove miss-dependent instructions from critical execution structures, continue execution under the miss, and re-execute miss-dependent instructions after the miss returns. However, previously proposed designs were unable to improve performance in an energy-efficient way---they introduced too many new large, complex structures and re-executed too many instructions.
This dissertation describes a new load latency tolerant design that is energy-efficient and applicable to both in-order and out-of-order cores. Key novel features include the formulation of slice re-execution as an alternative use of multi-threading support, efficient schemes for register and memory state management, and new pruning mechanisms for drastically reducing load latency tolerance's dynamic execution overheads.
Area analysis shows that energy-efficient load latency tolerance increases the footprint of an out-of-order core by only a few percent, while cycle-level simulation shows that it significantly improves the performance of memory-bound programs. Energy-efficient load latency tolerance is more energy-efficient than---and synergistic with---existing performance techniques such as dynamic voltage and frequency scaling (DVFS).
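A minimal sketch of the slice idea, under the illustrative assumption that instructions are (destination, sources) register tuples: instructions transitively dependent on the miss are drained into a slice buffer for later re-execution, while independent instructions proceed under the miss. This is a toy model of the partitioning step, not the dissertation's hardware design.

```python
def split_miss_slice(instructions, miss_dest_regs):
    """Partition a program-ordered instruction stream into miss-independent
    instructions (executed under the miss) and a miss-dependent slice
    (re-executed after the miss returns)."""
    poisoned = set(miss_dest_regs)   # registers whose values depend on the miss
    independent, slice_buf = [], []
    for dest, srcs in instructions:
        if poisoned & set(srcs):
            poisoned.add(dest)       # dependence propagates to the destination
            slice_buf.append((dest, srcs))
        else:
            poisoned.discard(dest)   # dest overwritten by an independent result
            independent.append((dest, srcs))
    return independent, slice_buf
```

Note how a register overwritten by an independent instruction leaves the poisoned set; tracking this precisely is one way re-execution overhead is kept small.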
A multi-viewpoint feature-based re-identification system driven by skeleton keypoints
Thanks to the increasing popularity of 3D sensors, robotic vision has seen major improvements across a wide range of applications and systems in recent years. Besides the many benefits, this migration has caused incompatibilities with systems that cannot rely on range sensors, such as intelligent video surveillance systems, since the two kinds of sensor data lead to different representations of people and objects. This work goes in the direction of bridging that gap, presenting a novel re-identification system that takes advantage of multiple video flows to enhance the performance of a skeletal tracking algorithm, which in turn drives the re-identification. A new, geometry-based method is introduced for joining the detections provided by the skeletal tracker across multiple video flows; it can handle many people in the scene while coping with the errors the skeletal tracker introduces in each view. The method is highly general and can be applied to any body pose estimation algorithm. The system was tested on a public dataset for video surveillance applications, demonstrating the improvements achieved by the multi-viewpoint approach in the accuracy of both body pose estimation and re-identification. The proposed approach was also compared with a skeletal tracking system working on 3D data; the comparison confirmed the good performance of the multi-viewpoint approach. This suggests that the lack of the rich information provided by 3D sensors can be compensated by the availability of more than one viewpoint.
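One simple way to join per-view detections geometrically is greedy association by 3D proximity in a common world frame. The sketch below illustrates that idea only; the data layout, threshold, and names are assumptions, not the paper's actual method.

```python
def join_detections(views, max_dist=0.5):
    """Greedily associate per-view person detections by 3D proximity.
    `views` is a list (one entry per camera) of lists of (x, y, z) skeleton
    centroids already expressed in a common world frame; each returned
    group collects the detections believed to be the same person."""
    groups = []
    for detections in views:
        for p in detections:
            for g in groups:
                # compare against the group's running centroid
                cx = sum(q[0] for q in g) / len(g)
                cy = sum(q[1] for q in g) / len(g)
                cz = sum(q[2] for q in g) / len(g)
                d = ((p[0]-cx)**2 + (p[1]-cy)**2 + (p[2]-cz)**2) ** 0.5
                if d <= max_dist:
                    g.append(p)
                    break
            else:
                groups.append([p])   # no nearby group: a new person
    return groups
```

Averaging over the members of a group is also what lets multiple viewpoints smooth out the per-view errors of the skeletal tracker.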
Scalable hardware memory disambiguation
This dissertation deals with one of the long-standing problems in Computer Architecture
– the problem of memory disambiguation. Microprocessors typically reorder
memory instructions during execution to improve concurrency. Such microprocessors
use hardware memory structures for memory disambiguation, known as Load-Store
Queues (LSQs), to ensure that memory instruction dependences are satisfied
even when the memory instructions execute out-of-order. A typical LSQ implementation
(circa 2006) holds all in-flight memory instructions in a physically centralized
LSQ and performs a fully associative search on all buffered instructions to ensure
that memory dependences are satisfied. These LSQ implementations do not scale
because they use large, fully associative structures, which are known to be slow and
power hungry. The increasing trend towards distributed microarchitectures further
exacerbates these problems. As on-chip wire delays increase and high-performance
processors become necessarily distributed, centralized structures such as the LSQ
can limit scalability.
This dissertation describes techniques to create scalable LSQs in both centralized
and distributed microarchitectures. The problems and solutions described
in this thesis are motivated and validated by real system designs. The dissertation
starts with a description of the partitioned primary memory system of the TRIPS
processor, of which the LSQ is an important component, and then through a series
of optimizations describes how the power, area, and centralization problems
of the LSQ can be solved with minor performance losses (if any) even for large
numbers of in-flight memory instructions. The four solutions described in this dissertation
— partitioning, filtering, late binding and efficient overflow management —
enable power-, area-efficient, distributed and scalable LSQs, which in turn enable
aggressive large-window processors capable of simultaneously executing thousands
of instructions.
To mitigate the power problem, we replaced the power-hungry, fully associative
search with a power-efficient hash table lookup using a simple address-based
Bloom filter. Bloom filters are probabilistic data structures used for testing set
membership and can be used to quickly check if an instruction with the same data
address is likely to be found in the LSQ without performing the associative search.
Bloom filters typically eliminate more than 80% of the associative searches and they
are highly effective because in most programs, it is uncommon for loads and stores
to have the same data address and be in execution simultaneously.
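The filtering step above can be sketched as follows; the bit-array size and hash choices are illustrative, and a real LSQ filter would also need a way to clear state as instructions retire (e.g., a counting variant or periodic flushes), which is omitted here.

```python
class AddressBloomFilter:
    """Address-based Bloom filter in front of the LSQ's associative search.
    A negative answer is exact (no instruction with this address is
    buffered), so the expensive search is skipped; a positive answer may
    be a false positive, in which case the search simply proceeds."""

    def __init__(self, n_bits=1024, n_hashes=2):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = [0] * n_bits

    def _indices(self, addr):
        # deterministic per-hash bit indices derived from the data address
        for i in range(self.n_hashes):
            yield hash((addr, i)) % self.n_bits

    def insert(self, addr):
        for ix in self._indices(addr):
            self.bits[ix] = 1

    def may_contain(self, addr):
        return all(self.bits[ix] for ix in self._indices(addr))
```

Because same-address loads and stores rarely coexist in flight, most lookups miss in the filter and the associative search is avoided.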
To rectify the area problem, we observe that only a small fraction
of all memory instructions are dependent, that only such dependent instructions
need to be buffered in the LSQ, and that these instructions need to be in the LSQ
only for certain parts of the pipelined execution. We propose two mechanisms to
exploit these observations. The first mechanism, area filtering, is a hardware mechanism
that couples Bloom filters and dependence predictors to dynamically identify
and buffer only those instructions which are likely to be dependent. The second
mechanism, late binding, reduces the occupancy and hence size of the LSQ. Both of
these optimizations allow the number of LSQ slots to be reduced by up to one-half
compared to a traditional organization without any performance degradation.
Finally, we describe a new decentralized LSQ design for handling LSQ structural
hazards in distributed microarchitectures. Decentralization of LSQs, and to
a large extent distributed microarchitectures with memory speculation, has proved
to be impractical because of the high performance penalties associated with the
mechanisms for dealing with hazards. To solve this problem, we applied classic
flow-control techniques from interconnection networks for handling resource conflicts.
The first method, memory-side buffering, buffers the overflowing instructions
in a separate buffer near the LSQs. The second scheme, execution-side NACKing,
sends the overflowing instruction back to the issue window from which it is later
re-issued. The third scheme, network buffering, uses the buffers in the interconnection
network between the execution units and memory to hold instructions when the
LSQ is full, and uses virtual channel flow control to avoid deadlocks. The network
buffering scheme is the most robust of all the overflow schemes and shows less than
1% performance degradation due to overflows for a subset of SPEC CPU 2000 and
EEMBC benchmarks on a cycle-accurate simulator that closely models the TRIPS
processor.
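The execution-side NACKing scheme can be illustrated with a toy cycle-by-cycle model; the capacities and issue/retire rates below are arbitrary illustrative parameters, not TRIPS values.

```python
from collections import deque

def run_with_nacking(instructions, lsq_capacity=2, issues_per_cycle=2,
                     retires_per_cycle=1):
    """Toy model of execution-side NACKing: a memory instruction arriving
    at a full LSQ is sent back to the issue window and re-issued later."""
    issue = deque(instructions)
    lsq = deque()
    completed, nacks, cycle = [], 0, 0
    while issue or lsq:
        cycle += 1
        # retire the oldest LSQ entries, freeing slots
        for _ in range(min(retires_per_cycle, len(lsq))):
            completed.append(lsq.popleft())
        # try to issue; a full LSQ triggers a NACK
        for _ in range(min(issues_per_cycle, len(issue))):
            inst = issue.popleft()
            if len(lsq) < lsq_capacity:
                lsq.append(inst)
            else:
                nacks += 1
                issue.append(inst)   # NACK: return to the issue window
    return completed, nacks, cycle
```

The network-buffering scheme differs in that the rejected instruction waits in interconnect buffers rather than returning to the issue window, which is why it shows the smallest overflow penalty.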
The techniques proposed in this dissertation are independent and architecture-neutral,
and their cumulative benefits result in LSQs that can be partitioned at a
fine granularity and have low design complexity. Each of these partitions selectively
buffers only memory instructions with true dependences and can be closely coupled
with the execution units thus minimizing power, area, and latency. Such LSQ
designs with near-ideal characteristics are well suited for microarchitectures with
thousands of instructions in-flight and may enable even more aggressive microarchitectures
in the future.
A High-Speed Range-Matching TCAM for Storage-Efficient Packet Classification
A critical issue in the use of TCAMs for packet
classification is how to efficiently represent rules with ranges,
known as range matching. A range-matching ternary content
addressable memory (RM-TCAM) including a highly functional
range-matching cell (RMC) is presented in this paper. By offering
various range operators, the RM-TCAM can reduce the storage
expansion ratio from 4.21 to 1.01 compared with conventional
TCAMs, under real-world packet classification rule sets, which
results in reduced power consumption and die area. A new pre-discharging
match-line scheme is used to realize high-speed searching
in a dynamic match-line structure. An additional charge-recycling
driver further reduces the power consumption of search lines.
Simulation results for a 256 × 64-bit range-matching TCAM implemented
in 0.13-μm CMOS technology show a 1.99-ns search time with an energy
efficiency of 1.26 fJ/bit/search. While a TCAM using a range-encoding
approach requires an additional SRAM or DRAM, the RM-TCAM improves
storage efficiency without any extra components and also reduces the die area.
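The storage expansion the RM-TCAM avoids comes from the fact that a conventional TCAM must break each range into ternary prefixes. The sketch below shows the standard range-to-prefix expansion; for example, the range [1, 14] over 4-bit fields needs 6 TCAM entries instead of 1.

```python
def range_to_prefixes(lo, hi, width):
    """Expand the integer range [lo, hi] into the minimal set of ternary
    prefixes ('0'/'1'/'*' strings) a conventional TCAM needs to match it;
    this per-range blow-up is the source of the ~4x storage expansion."""
    prefixes = []
    while lo <= hi:
        size = (lo & -lo) or (1 << width)   # largest aligned block at lo
        while lo + size - 1 > hi:           # shrink until it fits the range
            size >>= 1
        bits = width - (size.bit_length() - 1)   # fixed (non-'*') bit count
        if bits:
            prefixes.append(format(lo >> (width - bits), f'0{bits}b')
                            + '*' * (width - bits))
        else:
            prefixes.append('*' * width)
        lo += size
    return prefixes
```

A range-matching cell stores the range bounds directly, so each rule costs a single entry regardless of how many prefixes the range would expand into.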
Theory and Implementation of RF-Input Outphasing Power Amplification
Conventional outphasing power amplifier systems require both a radio frequency (RF) carrier input and a separate baseband input to synthesize a modulated RF output. This work presents an RF-input/RF-output outphasing power amplifier that directly amplifies a modulated RF input, eliminating the need for the multiple costly IQ modulators and baseband signal component separation of previous outphasing systems. An RF signal decomposition network directly synthesizes the phase- and amplitude-modulated signals used to drive the branch power amplifiers (PAs). With this approach, a modulated RF signal including zero-crossings can be applied to the single RF input port of the outphasing RF amplifier system. The proposed technique is demonstrated at 2.14 GHz in a four-way lossless outphasing amplifier with a transmission-line power combiner. The RF decomposition network is implemented using a transmission-line resistance compression network with nonlinear loads designed to provide the necessary amplitude and phase decomposition. The resulting proof-of-concept outphasing power amplifier has a peak CW output power of 93 W, a peak drain efficiency of 70%, and performance on par with a previously demonstrated outphasing and power combining system requiring four IQ modulators and a digital signal component separator.
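For intuition about what the decomposition network computes, here is the classic two-way outphasing (LINC) math at baseband: an amplitude-modulated sample is split into two constant-envelope components that sum back to it. The paper's system is four-way and performs the decomposition in the RF domain, so this is a simplified illustration, not its implementation.

```python
import cmath
import math

def outphase(sample, a_max=1.0):
    """Two-way outphasing decomposition: return constant-envelope
    components s1, s2 with |s1| = |s2| = a_max/2 and s1 + s2 == sample."""
    a = abs(sample)
    if a > a_max:
        raise ValueError("sample exceeds the maximum output amplitude")
    phi = cmath.phase(sample)
    theta = math.acos(a / a_max)            # outphasing angle
    s1 = (a_max / 2) * cmath.exp(1j * (phi + theta))
    s2 = (a_max / 2) * cmath.exp(1j * (phi - theta))
    return s1, s2
```

Because each branch sees a constant-envelope drive, the branch PAs can run saturated at high efficiency while amplitude modulation reappears only after combining.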
Fast Multi-frame Stereo Scene Flow with Motion Segmentation
We propose a new multi-frame method for efficiently computing scene flow
(dense depth and optical flow) and camera ego-motion for a dynamic scene
observed from a moving stereo camera rig. Our technique also segments out
moving objects from the rigid scene. In our method, we first estimate the
disparity map and the 6-DOF camera motion using stereo matching and visual
odometry. We then identify regions inconsistent with the estimated camera
motion and compute per-pixel optical flow only at these regions. This flow
proposal is fused with the camera motion-based flow proposal using fusion moves
to obtain the final optical flow and motion segmentation. This unified
framework benefits all four tasks - stereo, optical flow, visual odometry and
motion segmentation leading to overall higher accuracy and efficiency. Our
method is currently ranked third on the KITTI 2015 scene flow benchmark.
Furthermore, our CPU implementation runs in 2-3 seconds per frame which is 1-3
orders of magnitude faster than the top six methods. We also report a thorough
evaluation on challenging Sintel sequences with fast camera and object motion,
where our method consistently outperforms OSF [Menze and Geiger, 2015], which
is currently ranked second on the KITTI benchmark.

Comment: 15 pages. To appear at the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2017). Our results were submitted to the KITTI 2015 Stereo
Scene Flow Benchmark in November 201
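The step of identifying regions inconsistent with the estimated camera motion can be sketched as a per-pixel residual test between the observed flow and the rigid flow predicted from depth and ego-motion; the threshold value here is an illustrative assumption, not the paper's setting.

```python
import numpy as np

def rigid_inconsistency_mask(flow_obs, flow_rigid, tau=3.0):
    """Flag pixels (H x W x 2 flow fields) whose observed optical flow
    disagrees with the flow predicted from the disparity map and camera
    ego-motion; only these regions need a per-pixel flow estimate."""
    residual = np.linalg.norm(flow_obs - flow_rigid, axis=-1)
    return residual > tau                   # True = likely moving object
```

Restricting the expensive per-pixel flow computation to this mask is what allows the method's large speedup over whole-image scene flow estimation.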
Model-Based Robot Control and Multiprocessor Implementation
Model-based control of robot manipulators has been gaining momentum in recent years. Unfortunately, there are very few experimental validations to accompany simulation results, and as such the majority of conclusions drawn lack the credibility associated with a real control implementation.