243 research outputs found
A Scalable High-Bandwidth Architecture for Lossless Compression on FPGAs
Data compression techniques have been the subject of intense study over the past several decades due to exponential increases in the quantity of data stored and transmitted by computer systems. Compression algorithms are traditionally forced to make tradeoffs between throughput and compression quality (the ratio of original file size to compressed file size). FPGAs represent a compelling substrate for streaming applications such as data compression thanks to their capacity for deep pipelines and custom caching solutions. Unfortunately, data hazards in compression algorithms such as LZ77 inhibit the creation of deep pipelines without sacrificing some amount of compression quality. In this work we detail a scalable fully pipelined FPGA accelerator that performs LZ77 compression and static Huffman encoding at rates up to 5.6 GB/s. Furthermore, we explore tradeoffs between compression quality and FPGA area that allow the same throughput at a fraction of the logic utilization in exchange for moderate reductions in compression quality. Compared to recent FPGA compression studies, our emphasis on scalability gives our accelerator a 3.0x advantage in resource utilization at equivalent throughput and compression ratio
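To make the data hazard mentioned in this abstract concrete, here is a minimal software sketch of greedy LZ77 matching. It is not the paper's hardware design: the function names and the `window`, `min_match`, and `max_match` defaults are assumptions chosen for illustration. The point to notice is that the position of the next match search depends on the length of the match just found, which is the kind of serial dependency that makes deep pipelining difficult.

```python
def lz77_compress(data: bytes, window: int = 4096,
                  min_match: int = 3, max_match: int = 258):
    """Greedy LZ77: emit literal byte values or (distance, length) back-references.

    All parameter defaults are illustrative assumptions, not values from the paper.
    """
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the sliding window of previously seen bytes for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_match and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append((best_dist, best_len))
            i += best_len  # the next search position depends on this match length
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def lz77_decompress(tokens) -> bytes:
    """Rebuild the original bytes from literals and back-references."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            dist, length = token
            for _ in range(length):
                out.append(out[-dist])  # byte-by-byte copy allows overlapping matches
        else:
            out.append(token)
    return bytes(out)
```

For example, `lz77_compress(b"abcabcabcabc")` emits three literals followed by a single back-reference covering the remaining nine bytes, and `lz77_decompress` restores the input.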
High performance dense linear algebra on a spatially distributed processor
As technology trends have limited the performance scaling of conventional processors, industry and academic research has turned to parallel architectures on a single chip, including distributed uniprocessors and multicore chips. This paper examines how to extend the archetypal operation of dense linear algebra, matrix multiply, to an emerging class of uniprocessor architectures characterized by a large number of independent functional units, register banks, and cache banks connected by a 2-D on-chip network. We extend the well-known algorithm for matrix multiplication by Goto to this spatially distributed class of uniprocessor and describe the optimizations of the innermost kernel, a systolic-like algorithm running on a general-purpose uniprocessor. The resulting implementation yields the first demonstration of high performance in an application executing on the TRIPS processor hardware, a next-generation distributed processor core. We show that such processors are indeed capable of substantial improvements in single-threaded performance provided their spatial topography is taken into account
Shared Microexponents: A Little Shifting Goes a Long Way
This paper introduces Block Data Representations (BDR), a framework for
exploring and evaluating a wide spectrum of narrow-precision formats for deep
learning. It enables comparison of popular quantization standards, and through
BDR, new formats based on shared microexponents (MX) are identified, which
outperform other state-of-the-art quantization approaches, including
narrow-precision floating-point and block floating-point. MX utilizes multiple
levels of quantization scaling with ultra-fine scaling factors based on shared
microexponents in the hardware. The effectiveness of MX is demonstrated on
real-world models including large-scale generative pretraining and inferencing,
and production-scale recommendation systems
A multi-decade record of high quality fCO2 data in version 3 of the Surface Ocean CO2 Atlas (SOCAT)
The Surface Ocean CO2 Atlas (SOCAT) is a synthesis of quality-controlled fCO2 (fugacity of carbon dioxide) values for the global surface oceans and coastal seas with regular updates. Version 3 of SOCAT has 14.7 million fCO2 values from 3646 data sets covering the years 1957 to 2014. This latest version has an additional 4.6 million fCO2 values relative to version 2 and extends the record from 2011 to 2014. Version 3 also significantly increases the data availability for 2005 to 2013. SOCAT has an average of approximately 1.2 million surface water fCO2 values per year for the years 2006 to 2012. Quality and documentation of the data have improved. A new feature is the data set quality control (QC) flag of E for data from alternative sensors and platforms. The accuracy of surface water fCO2 has been defined for all data set QC flags. Automated range checking has been carried out for all data sets during their upload into SOCAT. The upgrade of the interactive Data Set Viewer (previously known as the Cruise Data Viewer) allows better interrogation of the SOCAT data collection and rapid creation of high-quality figures for scientific presentations. Automated data upload has been launched for version 4 and will enable more frequent SOCAT releases in the future. High-profile scientific applications of SOCAT include quantification of the ocean sink for atmospheric carbon dioxide and its long-term variation, detection of ocean acidification, as well as evaluation of coupled-climate and ocean-only biogeochemical models. Users of SOCAT data products are urged to acknowledge the contribution of data providers, as stated in the SOCAT Fair Data Use Statement. This ESSD (Earth System Science Data) "living data" publication documents the methods and data sets used for the assembly of this new version of the SOCAT data collection and compares these with those used for earlier versions of the data collection (Pfeil et al., 2013; Sabine et al., 2013; Bakker et al., 2014).
Individual data set files, included in the synthesis product, can be downloaded here: doi:10.1594/PANGAEA.849770. The gridded products are available here: doi:10.3334/CDIAC/OTG.SOCAT_V3_GRID
Microscaling Data Formats for Deep Learning
Narrow bit-width data formats are key to reducing the computational and
storage costs of modern deep learning applications. This paper evaluates
Microscaling (MX) data formats that combine a per-block scaling factor with
narrow floating-point and integer types for individual elements. MX formats
balance the competing needs of hardware efficiency, model accuracy, and user
friction. Empirical results on over two dozen benchmarks demonstrate
practicality of MX data formats as a drop-in replacement for baseline FP32 for
AI inference and training with low user friction. We also show the first
instance of training generative language models at sub-8-bit weights,
activations, and gradients with minimal accuracy loss and no modifications to
the training recipe
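The idea of pairing a per-block scaling factor with narrow integer elements can be sketched in a few lines of Python. This is a simplified illustration, not the MX format definition from the paper: the function names, the block size of 32, the 8-bit signed elements, and the restriction to power-of-two scales are all assumptions made for the example.

```python
import math


def quantize_block(values, elem_bits=8):
    """Quantize one block to signed integers that share a power-of-two scale.

    Returns (exponent, ints), where each value is approximated by int * 2**exponent.
    The element width is an illustrative assumption.
    """
    qmax = 2 ** (elem_bits - 1) - 1          # 127 for 8-bit elements
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    # Smallest exponent whose scale keeps the largest magnitude within qmax.
    exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exp
    # Clamp to guard against floating-point edge cases in the exponent choice.
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return exp, ints


def dequantize_block(exp, ints):
    """Recover approximate real values from a shared exponent and integers."""
    scale = 2.0 ** exp
    return [i * scale for i in ints]


def quantize(values, block_size=32, elem_bits=8):
    """Split a flat list into blocks and quantize each with its own scale."""
    return [quantize_block(values[i:i + block_size], elem_bits)
            for i in range(0, len(values), block_size)]
```

Because each block carries its own scale, a block of small values keeps fine resolution even when another block in the same tensor contains large outliers. Storing the scale as an exponent means rescaling needs only a shift rather than a multiply.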
Development of Gaze Following Abilities in Wolves (Canis Lupus)
The ability to coordinate with others' head and eye orientation to look in the same direction is considered a key step towards an understanding of others' mental states like attention and intention. Here, we investigated the ontogeny and habituation patterns of gaze following into distant space and behind barriers in nine hand-raised wolves. We found that these wolves could use conspecific as well as human gaze cues even in the barrier task, which is thought to be more cognitively advanced than gazing into distant space. Moreover, while gaze following into distant space was already present at the age of 14 weeks and subjects did not habituate to repeated cues, gazing around a barrier developed considerably later and animals quickly habituated, supporting the hypothesis that different cognitive mechanisms may underlie the two gaze following modalities. More importantly, this study demonstrated that following another individual's gaze around a barrier is not restricted to primates and corvids but is also present in canines, with remarkable between-group similarities in the ontogeny of this behaviour. This sheds new light on the evolutionary origins of and selective pressures on gaze following abilities as well as on the sensitivity of domestic dogs towards human communicative cues
A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services
Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we designed and built a composable, reconfigurable hardware fabric based on field-programmable gate arrays (FPGAs). Each server in the fabric contains one FPGA, and all FPGAs within a 48-server rack are interconnected over a low-latency, high-bandwidth network. We describe a medium-scale deployment of this fabric on a bed of 1632 servers, and measure its effectiveness in accelerating the ranking component of the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by 95% at a desirable latency distribution or reduces tail latency by 29% at a fixed throughput. In other words, the reconfigurable fabric enables the same throughput using only half the number of servers
Suppressing quantum errors by scaling a surface code logical qubit
Practical quantum computing will require error rates that are well below what
is achievable with physical qubits. Quantum error correction offers a path to
algorithmically-relevant error rates by encoding logical qubits within many
physical qubits, where increasing the number of physical qubits enhances
protection against physical errors. However, introducing more qubits also
increases the number of error sources, so the density of errors must be
sufficiently low in order for logical performance to improve with increasing
code size. Here, we report the measurement of logical qubit performance scaling
across multiple code sizes, and demonstrate that our system of superconducting
qubits has sufficient performance to overcome the additional errors from
increasing qubit number. We find our distance-5 surface code logical qubit
modestly outperforms an ensemble of distance-3 logical qubits on average, both
in terms of logical error probability over 25 cycles and logical error per
cycle ( compared to ). To investigate
damaging, low-probability error sources, we run a distance-25 repetition code
and observe a logical error per round floor set by a single
high-energy event ( when excluding this event). We are able
to accurately model our experiment, and from this model we can extract error
budgets that highlight the biggest challenges for future systems. These results
mark the first experimental demonstration where quantum error correction begins
to improve performance with increasing qubit number, illuminating the path to
reaching the logical error rates required for computation.
Comment: Main text: 6 pages, 4 figures. v2: Update author list, references, Fig. S12, Table I