Speeding-up the decision making of a learning agent using an ion trap quantum processor
We report a proof-of-principle experimental demonstration of the quantum
speed-up for learning agents utilizing a small-scale quantum information
processor based on radiofrequency-driven trapped ions. The decision-making
process of a quantum learning agent within the projective simulation paradigm
for machine learning is implemented in a system of two qubits. The latter are
realized using hyperfine states of two frequency-addressed atomic ions exposed
to a static magnetic field gradient. We show that the deliberation time of this
quantum learning agent is quadratically improved with respect to comparable
classical learning agents. The performance of this quantum-enhanced learning
agent highlights the potential of scalable quantum processors taking advantage
of machine learning.
Comment: 21 pages, 7 figures, 2 tables. Author names now spelled correctly;
sections rearranged; changes in the wording of the manuscript
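The quadratic improvement in deliberation time can be illustrated with a back-of-envelope comparison (a sketch only: the success probability eps and the constants are invented for illustration, not taken from the experiment). A classical agent that repeatedly samples its internal dynamics until it hits a rare target action needs about 1/eps attempts on average, while amplitude amplification reaches the target in on the order of 1/sqrt(eps) iterations:

```python
import math

# Illustrative only: eps is a made-up probability that one classical
# deliberation step yields the desired action.
eps = 0.01

# Classical agent: attempts until first success are geometrically
# distributed, so the expected number of deliberation steps is 1/eps.
classical_steps = 1 / eps

# Quantum agent: amplitude amplification rotates the state toward the
# target, reaching it after roughly (pi/4)/sqrt(eps) iterations.
quantum_steps = (math.pi / 4) / math.sqrt(eps)

print(classical_steps, quantum_steps)  # 100 vs ~7.85
```

The gap widens as the target action becomes rarer, which is why the speed-up matters for hard decision problems.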
Metascheduling of HPC Jobs in Day-Ahead Electricity Markets
High performance grid computing is a key enabler of large scale collaborative
computational science. With the promise of exascale computing, high performance
grid systems are expected to incur electricity bills that grow super-linearly
over time. In order to achieve cost effectiveness in these systems, it is
essential for the scheduling algorithms to exploit electricity price
variations, both in space and time, that are prevalent in the dynamic
electricity price markets. In this paper, we present a metascheduling algorithm
to optimize the placement of jobs in a compute grid which consumes electricity
from the day-ahead wholesale market. We formulate the scheduling problem as a
Minimum Cost Maximum Flow problem and leverage queue waiting time and
electricity price predictions to accurately estimate the cost of job execution
at a system. Using trace based simulation with real and synthetic workload
traces, and real electricity price data sets, we demonstrate our approach on
two currently operational grids, XSEDE and NorduGrid. Our experimental setup
collectively constitutes more than 433K processors spread across 58 compute
systems in 17 geographically distributed locations. Experiments show that our
approach simultaneously optimizes the total electricity cost and the average
response time of the grid, without being unfair to users of the local batch
systems.
Comment: Appears in IEEE Transactions on Parallel and Distributed Systems
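The Minimum Cost Maximum Flow formulation can be sketched on a toy instance (site names, prices, and the unit-capacity modeling below are invented for illustration; the authors' actual network also folds queue-wait and runtime predictions into the edge costs). Each job is a unit of flow routed from a source, through a job node, to a compute-site node whose edge cost approximates the electricity cost of running there, and finally to a sink whose incoming capacities encode free slots:

```python
import math
from collections import defaultdict

def add_edge(g, u, v, cap, cost):
    g[u][v] = [cap, cost]
    g[v][u] = [0, -cost]      # residual edge for the flow algorithm

def min_cost_max_flow(g, s, t):
    """Successive shortest paths with Bellman-Ford on the residual graph."""
    flow, total_cost = defaultdict(int), 0.0
    while True:
        dist = {n: math.inf for n in g}
        dist[s], prev = 0.0, {}
        for _ in range(len(g)):              # Bellman-Ford relaxation
            updated = False
            for u in g:
                if dist[u] == math.inf:
                    continue
                for v, (cap, cost) in g[u].items():
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v], prev[v], updated = dist[u] + cost, u, True
            if not updated:
                break
        if dist[t] == math.inf:              # no augmenting path left
            break
        path, v = [], t
        while v != s:
            path.append((prev[v], v))
            v = prev[v]
        aug = min(g[u][v][0] for u, v in path)
        for u, v in path:
            g[u][v][0] -= aug
            g[v][u][0] += aug
            flow[(u, v)] += aug
            total_cost += aug * g[u][v][1]
    return flow, total_cost

# Hypothetical toy instance: 3 jobs, 2 sites with day-ahead prices.
g = defaultdict(dict)
jobs = ["job1", "job2", "job3"]
prices = {"siteA": 30.0, "siteB": 45.0}      # illustrative $/MWh-like costs
slots = {"siteA": 2, "siteB": 2}             # free slots per site
for j in jobs:
    add_edge(g, "src", j, 1, 0.0)
    for site, price in prices.items():
        add_edge(g, j, site, 1, price)
for site, cap in slots.items():
    add_edge(g, site, "sink", cap, 0.0)

flow, cost = min_cost_max_flow(g, "src", "sink")
placement = {j: s for j in jobs for s in prices
             if flow[(j, s)] - flow[(s, j)] > 0}
print(placement, cost)
```

The min-cost solution fills the cheaper site's slots first (two jobs on siteA, one on siteB, total cost 105), mirroring how the metascheduler steers load toward low-price locations without exceeding local capacity.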
Limits on Fundamental Limits to Computation
An indispensable part of our lives, computing has also become essential to
industries and governments. Steady improvements in computer hardware have been
supported by periodic doubling of transistor densities in integrated circuits
over the last fifty years. Such Moore scaling now requires increasingly heroic
efforts, stimulating research in alternative hardware and stirring controversy.
To help evaluate emerging technologies and enrich our understanding of
integrated-circuit scaling, we review fundamental limits to computation: in
manufacturing, energy, physical space, design and verification effort, and
algorithms. To outline what is achievable in principle and in practice, we
recall how some limits were circumvented and compare loose and tight limits. We
also point out that engineering difficulties encountered by emerging
technologies may indicate yet-unknown limits.
Comment: 15 pages, 4 figures, 1 table
Approximate Inference for Constructing Astronomical Catalogs from Images
We present a new, fully generative model for constructing astronomical
catalogs from optical telescope image sets. Each pixel intensity is treated as
a random variable with parameters that depend on the latent properties of stars
and galaxies. These latent properties are themselves modeled as random. We
compare two procedures for posterior inference. One procedure is based on
Markov chain Monte Carlo (MCMC) while the other is based on variational
inference (VI). The MCMC procedure excels at quantifying uncertainty, while the
VI procedure is 1000 times faster. On a supercomputer, the VI procedure
efficiently uses 665,000 CPU cores to construct an astronomical catalog from 50
terabytes of images in 14.6 minutes, demonstrating the scaling characteristics
necessary to construct catalogs for upcoming astronomical surveys.
Comment: accepted to the Annals of Applied Statistics
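The trade-off between the two inference styles can be illustrated on a toy one-parameter analogue (illustrative only: the paper's model has per-source latent properties and per-pixel likelihoods; here a single latent brightness generates Poisson counts, chosen because conjugacy gives a closed-form posterior to check the sampler against):

```python
import math
import random

random.seed(0)

a, b = 2.0, 1.0                  # Gamma(a, b) prior on the latent brightness
data = [4, 6, 5, 7, 4]           # toy observed pixel counts

# Conjugacy: the exact posterior is Gamma(a + sum(data), b + n),
# so the posterior mean is available in closed form.
exact_mean = (a + sum(data)) / (b + len(data))

def log_post(lam):
    """Unnormalized log posterior: Gamma prior plus Poisson likelihood."""
    if lam <= 0:
        return -math.inf
    lp = (a - 1.0) * math.log(lam) - b * lam
    lp += sum(x * math.log(lam) - lam for x in data)
    return lp

# Random-walk Metropolis MCMC: slower than a closed-form/variational
# answer, but it characterizes the full posterior, not just a point.
lam, samples = exact_mean, []
for _ in range(20000):
    prop = lam + random.gauss(0.0, 0.5)
    if math.log(random.random()) < log_post(prop) - log_post(lam):
        lam = prop
    samples.append(lam)
mcmc_mean = sum(samples[5000:]) / len(samples[5000:])
```

After burn-in the MCMC estimate agrees with the exact mean to within Monte Carlo error; the cost of the 20,000 likelihood evaluations, multiplied across billions of pixels, is what motivates the much faster variational alternative in the paper.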
A Fast Potential and Self-Gravity Solver for Non-Axisymmetric Disks
Disk self-gravity could play an important role in the dynamical evolution of
the interaction between disks and embedded protoplanets. We have developed a fast
and accurate solver to calculate the disk potential and disk self-gravity
forces for disk systems on a uniform polar grid. Our method follows closely the
method given by Chan et al. (2006), in which an FFT in the azimuthal direction
is performed and a direct integral approach in the frequency domain in the
radial direction is implemented on a uniform polar grid. This method can be
very effective for disks with vertical structures that depend only on the disk
radius, achieving the same computational efficiency as for zero-thickness
disks. We describe how to parallelize the solver efficiently on distributed
parallel computers. We propose a mode-cutoff procedure to reduce the parallel
communication cost and achieve nearly linear scalability for a large number of
processors. For comparison, we have also developed a particle-based fast
tree-code to calculate the self-gravity of the disk system with vertical
structure. The numerical results show that our direct integral method is at
least two orders of magnitude faster than our optimized tree-code approach.
Comment: 8 figures, accepted to ApJ
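The core idea of the solver can be sketched on a toy grid (grid sizes, the density profile, and the softened kernel below are illustrative; the paper follows Chan et al. 2006). Because the Green's-function kernel depends on azimuth only through the angle difference, an FFT in the azimuthal direction turns the phi-convolution into an independent product per mode, leaving a direct O(nr^2) integral over radius for each mode:

```python
import numpy as np

G = 1.0
nr, nphi = 24, 32
r = np.linspace(1.0, 2.0, nr)
phi = np.linspace(0.0, 2 * np.pi, nphi, endpoint=False)
dr, dphi = r[1] - r[0], phi[1] - phi[0]

# toy surface density: radial Gaussian ring with an m=2 perturbation
sigma = np.exp(-((r[:, None] - 1.5) / 0.2) ** 2) \
        * (1 + 0.1 * np.cos(2 * phi)[None, :])
S = sigma * r[:, None] * dr * dphi          # mass elements sigma * r dr dphi

# softened kernel K(r_i, r_k, dphi); softening avoids the self-singularity
eps = 0.5 * dr
K = 1.0 / np.sqrt(r[:, None, None] ** 2 + r[None, :, None] ** 2
                  - 2.0 * r[:, None, None] * r[None, :, None]
                  * np.cos(phi)[None, None, :] + eps ** 2)

# FFT in azimuth: circular convolution in phi becomes a per-mode product,
# then a direct sum over radius is done for each azimuthal mode m.
Khat = np.fft.fft(K, axis=2)                # shape (nr, nr, nphi)
Shat = np.fft.fft(S, axis=1)                # shape (nr, nphi)
Phihat = -G * np.einsum('ikm,km->im', Khat, Shat)
Phi = np.fft.ifft(Phihat, axis=1).real      # potential on the polar grid
```

The cost drops from O(nr^2 nphi^2) for the naive double sum to O(nr^2 nphi) plus FFTs, and the per-mode structure is also what makes the mode-cutoff parallelization natural: each processor can handle a subset of azimuthal modes independently.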
Engineer the Channel and Adapt to it: Enabling Wireless Intra-Chip Communication
Ubiquitous multicore processors nowadays rely on an integrated
packet-switched network for cores to exchange and share data. The performance
of these intra-chip networks is a key determinant of the processor speed and,
at high core counts, becomes an important bottleneck due to scalability issues.
To address this, several works propose the use of mm-wave wireless
interconnects for intra-chip communication and demonstrate that, thanks to
their low-latency broadcast and system-level flexibility, this new paradigm
could break the scalability barriers of current multicore architectures.
However, these same works assume 10+ Gb/s speeds and efficiencies close to 1
pJ/bit without a proper understanding of the wireless intra-chip channel. This
paper first demonstrates that such assumptions do not hold in the context of
commercial chips by evaluating losses and dispersion in them. Then, we leverage
the system's monolithic nature to engineer the channel, that is, to optimize
its frequency response by carefully choosing the chip package dimensions.
Finally, we exploit the static nature of the channel to adapt to it, pushing
efficiency-speed limits with simple tweaks at the physical layer. Our methods
reduce the path loss and delay spread of a simulated commercial chip by 47 dB
and 7.3x, respectively, enabling intra-chip wireless communication at over 10
Gb/s and only 3.1 dB away from the dispersion-free case.
Comment: 12 pages, 10 figures. IEEE Transactions on Communications Journal, 202
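The delay-spread figure refers to the RMS width of the channel's power-delay profile. A minimal computation on an invented profile (tap delays and powers below are illustrative, not from the paper's simulations) shows how the metric is obtained:

```python
import math

# Hypothetical power-delay profile: (delay in ns, linear power).
# A 7.3x reduction in this RMS spread means far less inter-symbol
# interference at a given symbol rate.
taps = [(0.0, 1.0), (0.3, 0.5), (0.8, 0.2), (1.5, 0.05)]

P = sum(p for _, p in taps)
mean_tau = sum(p * t for t, p in taps) / P        # mean excess delay
rms_spread = math.sqrt(sum(p * t * t for t, p in taps) / P - mean_tau ** 2)
print(round(rms_spread, 3))                        # ~0.339 ns
```

A smaller RMS delay spread widens the usable coherence bandwidth, which is what lets the engineered channel sustain multi-Gb/s signaling with simple physical-layer tweaks.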
Vector coprocessor sharing techniques for multicores: performance and energy gains
Vector Processors (VPs) created the breakthroughs needed for the emergence of computational science many years ago. All commercial computing architectures on the market today contain some form of vector or SIMD processing.
Many high-performance and embedded applications, often dealing with streams of data, cannot efficiently utilize dedicated vector processors for various reasons: a limited percentage of sustained vector code due to substantial flow control; inherently small parallelism or the frequent involvement of operating system tasks; varying vector length across applications or within a single application; and data dependencies within short sequences of instructions, a problem further exacerbated without loop unrolling or other compiler optimizations. Additionally, existing rigid SIMD architectures cannot efficiently tolerate dynamic application environments with many cores that may require runtime adjustment of assigned vector resources in order to operate at desired energy/performance levels.
To simultaneously alleviate these drawbacks of rigid lane-based VP architectures, while also releasing on-chip real estate for other important design choices, the first part of this research proposes three architectural contexts for the implementation of a shared vector coprocessor in multicore processors. Sharing an expensive resource among multiple cores increases the efficiency of the functional units and the overall system throughput. The second part of the dissertation concerns the evaluation and characterization of the three proposed shared vector architectures from the performance and power perspectives on an FPGA (Field-Programmable Gate Array) prototype. The third part of this work introduces performance and power estimation models based on observations deduced from the experimental results. The results show the opportunity to adaptively adjust the number of vector lanes assigned to individual cores or processing threads in order to minimize various energy-performance metrics on modern vector-capable multicore processors that run applications with dynamic workloads. Therefore, the fourth part of this research focuses on the development of a fine-to-coarse grain power management technique and a relevant adaptive hardware/software infrastructure which dynamically adjusts the assigned VP resources (number of vector lanes) in order to minimize the energy consumption for applications with dynamic workloads. In order to remove the inherent limitations imposed by FPGA technologies, the fifth part of this work consists of implementing an ASIC (Application Specific Integrated Circuit) version of the shared VP towards precise performance-energy studies involving high-performance vector processing in multicore environments.