243 research outputs found

Dynamic Vectorization in the E2 Dynamic Multicore System

Author: Burger Doug
Putnam Andrew
Smith Aaron
Publication date: 01/06/2010

A Scalable High-Bandwidth Architecture for Lossless Compression on FPGAs

Author: Doug Burger
Jeremy Fowers
Joo-Young Kim
Scott Hauck
Publication date: 24/04/2020

Data compression techniques have been the subject of intense study over the past several decades due to exponential increases in the quantity of data stored and transmitted by computer systems. Compression algorithms are traditionally forced to make tradeoffs between throughput and compression quality (the ratio of original file size to compressed file size). FPGAs represent a compelling substrate for streaming applications such as data compression thanks to their capacity for deep pipelines and custom caching solutions. Unfortunately, data hazards in compression algorithms such as LZ77 inhibit the creation of deep pipelines without sacrificing some amount of compression quality. In this work we detail a scalable fully pipelined FPGA accelerator that performs LZ77 compression and static Huffman encoding at rates up to 5.6 GB/s. Furthermore, we explore tradeoffs between compression quality and FPGA area that allow the same throughput at a fraction of the logic utilization in exchange for moderate reductions in compression quality. Compared to recent FPGA compression studies, our emphasis on scalability gives our accelerator a 3.0x advantage in resource utilization at equivalent throughput and compression ratio.

CiteSeerX
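
The data hazard mentioned in the abstract above can be seen in a minimal software sketch of LZ77: each match search reads bytes that earlier steps have only just consumed. This is a plain greedy illustration, not the pipelined FPGA design from the paper; the function names, the 32-byte window, and the 3-byte minimum match are assumptions chosen for the example.

```python
def lz77_compress(data: bytes, window: int = 32, min_match: int = 3):
    """Greedy LZ77: emit ("lit", byte) or ("match", distance, length) tokens."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the sliding window of already-processed bytes for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("match", best_dist, best_len))
            i += best_len  # the next search position depends on this match length
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


def lz77_decompress(tokens) -> bytes:
    """Rebuild the original bytes from the token stream."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            # Copy byte by byte so that overlapping matches work.
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)
```

For example, `lz77_compress(b"abcabcabcabc")` yields three literals followed by one overlapping match of distance 3 and length 9. The position of each search depends on the length of the previous match, which is the kind of serial dependency a deep hardware pipeline has to work around.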

High performance dense linear algebra on a spatially distributed processor

Author: Behnam Robatmili
Doug Burger
Jeff Diamond
Kazushige Goto
Robert van de Geijn
Stephen W. Keckler
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2008

As technology trends have limited the performance scaling of conventional processors, industry and academic research has turned to parallel architectures on a single chip, including distributed uniprocessors and multicore chips. This paper examines how to extend the archetypal operation of dense linear algebra, matrix multiply, to an emerging class of uniprocessor architectures characterized by a large number of independent functional units, register banks, and cache banks connected by a 2-D on-chip network. We extend the well-known algorithm for matrix multiplication by Goto to this spatially distributed class of uniprocessor and describe the optimizations of the innermost kernel, a systolic-like algorithm running on a general purpose uniprocessor. The resulting implementation yields the first demonstration of high performance in an application executing on the TRIPS processor hardware, a next-generation distributed processor core. We show that such processors are indeed capable of substantial improvements in single-threaded performance provided their spatial topography is taken into account.

CiteSeerX

Crossref
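
The abstract above builds on a blocked formulation of matrix multiply. The following is a minimal sketch of loop blocking only, assuming a hypothetical `matmul_blocked` helper and an arbitrary block size; it omits the packing and architecture-specific inner kernel that the Goto algorithm and the TRIPS implementation rely on.

```python
def matmul_blocked(a, b, block=2):
    """Multiply an n-by-k matrix a by a k-by-m matrix b, one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer loops walk over tiles so each tile's operands stay close together.
    for i0 in range(0, n, block):
        for p0 in range(0, k, block):
            for j0 in range(0, m, block):
                # Inner kernel: accumulate one tile's contribution into c.
                for i in range(i0, min(i0 + block, n)):
                    for p in range(p0, min(p0 + block, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + block, m)):
                            c[i][j] += a_ip * b[p][j]
    return c
```

The result is identical to the unblocked triple loop; only the order of the updates changes, which is what lets an implementation match tile sizes to register and cache banks.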

Shared Microexponents: A Little Shifting Goes a Long Way

Author: Burger Doug
Chung Eric
Deng Zhaoxia
Elango Venmugil
Golub Maximilian
Hall Mathew
Klar Jasmine
Kolhe Gaurav
L'Heureux Renee
Melnick Levi
Melts Dimitry
Mesmakhosroshahi Maral
More Ankit
Naghshineh Sam
Naumov Maxim
Park Jongsoo
Perry Matt
Rouhani Bita
Shafipour Rasoul
Shao Lei
Varatkar Girish
Zhao Ritchie
Publication date: 15/02/2023

This paper introduces Block Data Representations (BDR), a framework for exploring and evaluating a wide spectrum of narrow-precision formats for deep learning. It enables comparison of popular quantization standards, and through BDR, new formats based on shared microexponents (MX) are identified, which outperform other state-of-the-art quantization approaches, including narrow-precision floating-point and block floating-point. MX utilizes multiple levels of quantization scaling with ultra-fine scaling factors based on shared microexponents in the hardware. The effectiveness of MX is demonstrated on real-world models including large-scale generative pretraining and inferencing, and production-scale recommendation systems

arXiv.org e-Print Archive

A multi-decade record of high quality fCO2 data in version 3 of the Surface Ocean CO2 Atlas (SOCAT)

Author: Alin Simone
Bakker Dorothee C. E.
Balestrini Carlos
Barbero Leticia
Bates Nicholas R.
Bianchi Alejandro A.
Bonou Frédéric
Boutin Jacqueline
Bozec Yann
Burger Eugene F.
Cai Wei-Jun
Castle Robert D.
Chen Liqi
Chierici Melissa
Cosca Cathy
Currie Kim
Evans Wiley
Featherstone Charles
Feely Richard A.
Fransson Agneta
Goyet Catherine
Greenwood Naomi
Gregor Luke
Hankin Steven
Harasawa Sumiko
Hardman-Mountford Nick J.
Harlay Jérôme
Hauck Judith
Hoppema Mario
Humphreys Matthew P.
Hunt Christopher W.
Huss Betty
Ibánhez J. Severino P.
Johannessen Truls
Jones Stephen D.
Keeling Ralph
Kitidis Vassilis
Kozyr Alex
Krasakopoulou Evangelia
Kuwata Akira
Körtzinger Arne
Landa Camilla S.
Landschützer Peter
Lauvset Siv K.
Lefèvre Nathalie
Manke Ansley
Mathis Jeremy T.
Merlivat Liliane
Metzl Nicolas
Millero Frank J.
Monaco Claire Lo
Monteiro Pedro M. S.
Munro David R.
Murata Akihiko
Nakaoka Shin-ichiro
Newberger Timothy
Nojiri Yukihiro
O'Brien Kevin M.
Olsen Are
Omar Abdirahman M.
Ono Tsuneo
Paterson Kristina
Pearce David
Pfeil Benjamin
Pierrot Denis
Robbins Lisa L.
Saito Shu
Salisbury Joe
Schlitzer Reiner
Schneider Bernd
Schuster Ute
Schweitzer Roland
Sieger Rainer
Skjelvan Ingunn
Smith Karl
Steinhoff Tobias
Sullivan Kevin F.
Sutherland Stewart C.
Sutton Adrienne J.
Sweeney Colm
Tadokoro Kazuaki
Takahashi Taro
Telszewski Maciej
Tilbrook Bronte
Tuma Matthias
Van Heuven Steven M. A. C.
Vandemark Doug
Wada Chisato
Wanninkhof Rik
Ward Brian
Watson Andrew J.
Xu Suqing
Publication venue: 'Copernicus GmbH'
Publication date: 01/01/2016

The Surface Ocean CO2 Atlas (SOCAT) is a synthesis of quality-controlled fCO2 (fugacity of carbon dioxide) values for the global surface oceans and coastal seas with regular updates. Version 3 of SOCAT has 14.7 million fCO2 values from 3646 data sets covering the years 1957 to 2014. This latest version has an additional 4.6 million fCO2 values relative to version 2 and extends the record from 2011 to 2014. Version 3 also significantly increases the data availability for 2005 to 2013. SOCAT has an average of approximately 1.2 million surface water fCO2 values per year for the years 2006 to 2012. Quality and documentation of the data have improved. A new feature is the data set quality control (QC) flag of E for data from alternative sensors and platforms. The accuracy of surface water fCO2 has been defined for all data set QC flags. Automated range checking has been carried out for all data sets during their upload into SOCAT. The upgrade of the interactive Data Set Viewer (previously known as the Cruise Data Viewer) allows better interrogation of the SOCAT data collection and rapid creation of high-quality figures for scientific presentations. Automated data upload has been launched for version 4 and will enable more frequent SOCAT releases in the future. High-profile scientific applications of SOCAT include quantification of the ocean sink for atmospheric carbon dioxide and its long-term variation, detection of ocean acidification, as well as evaluation of coupled-climate and ocean-only biogeochemical models. Users of SOCAT data products are urged to acknowledge the contribution of data providers, as stated in the SOCAT Fair Data Use Statement. This ESSD (Earth System Science Data) “living data” publication documents the methods and data sets used for the assembly of this new version of the SOCAT data collection and compares these with those used for earlier versions of the data collection (Pfeil et al., 2013; Sabine et al., 2013; Bakker et al., 2014).
Individual data set files, included in the synthesis product, can be downloaded here: doi:10.1594/PANGAEA.849770. The gridded products are available here: doi:10.3334/CDIAC/OTG.SOCAT_V3_GRID

Microscaling Data Formats for Deep Learning

Narrow bit-width data formats are key to reducing the computational and storage costs of modern deep learning applications. This paper evaluates Microscaling (MX) data formats that combine a per-block scaling factor with narrow floating-point and integer types for individual elements. MX formats balance the competing needs of hardware efficiency, model accuracy, and user friction. Empirical results on over two dozen benchmarks demonstrate practicality of MX data formats as a drop-in replacement for baseline FP32 for AI inference and training with low user friction. We also show the first instance of training generative language models at sub-8-bit weights, activations, and gradients with minimal accuracy loss and no modifications to the training recipe

arXiv.org e-Print Archive
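
The idea of pairing a per-block scaling factor with narrow element types, as described in the Microscaling abstract above, can be sketched as follows. This is an illustrative toy and not the MX format definition: the power-of-two shared scale, the signed 8-bit integer elements, the block size of 32, and all function names are assumptions made for the example.

```python
import math


def quantize_block(values, elem_bits=8):
    """Quantize one block to signed integers that share a power-of-two scale.

    Returns (shared_exponent, integer_elements).
    """
    qmax = 2 ** (elem_bits - 1) - 1  # largest magnitude an element can hold
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # Smallest exponent whose scale keeps the largest value within qmax.
    exp = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exp
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return exp, ints


def dequantize_block(exp, ints):
    """Recover approximate real values from a quantized block."""
    scale = 2.0 ** exp
    return [i * scale for i in ints]


def quantize(values, block_size=32, elem_bits=8):
    """Split a flat list into blocks and quantize each one independently."""
    return [
        quantize_block(values[start:start + block_size], elem_bits)
        for start in range(0, len(values), block_size)
    ]
```

Calling `quantize_block([0.5, -1.0, 0.25, 2.0])` returns `(-5, [16, -32, 8, 64])`: one shared exponent for the block plus one small integer per element. Because the scale is chosen per block, a single large value only reduces precision for its own block rather than for the whole tensor.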

Development of Gaze Following Abilities in Wolves (Canis Lupus)

Author: A Miklosi
A Wilkinson
B Agnetta
B Hare
C Schloegl
CL Russell
D Mech
D Povinelli
DJ Povinelli
Doug Wylie
F Amici
Friederike Range
G Butterworth
J Bräuer
J Burger
J Burkart
J Call
J Kaminski
J Triesch
JC Gomez
JM Burkart
M Gacsi
M Tomasello
MAR Udell
MC Loretto
NJ Emery
R Brooks
RR Hampton
S Baron-Cohen
S Okamoto
S Siegel
SV Shepherd
T Bugnyar
WT Fitch
Zsófia Virányi
Publication venue: Public Library of Science
Publication date: 01/01/2011

The ability to coordinate with others' head and eye orientation to look in the same direction is considered a key step towards an understanding of others' mental states like attention and intention. Here, we investigated the ontogeny and habituation patterns of gaze following into distant space and behind barriers in nine hand-raised wolves. We found that these wolves could use conspecific as well as human gaze cues even in the barrier task, which is thought to be more cognitively advanced than gazing into distant space. Moreover, while gaze following into distant space was already present at the age of 14 weeks and subjects did not habituate to repeated cues, gazing around a barrier developed considerably later and animals quickly habituated, supporting the hypothesis that different cognitive mechanisms may underlie the two gaze following modalities. More importantly, this study demonstrated that following another individual's gaze around a barrier is not restricted to primates and corvids but is also present in canines, with remarkable between-group similarities in the ontogeny of this behaviour. This sheds new light on the evolutionary origins of and selective pressures on gaze following abilities as well as on the sensitivity of domestic dogs towards human communicative cues.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Permanent Hosting, Archiving and Indexing of Digital Resources and Assets

A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

Author: Burger Doug
Caulfield Adrian M.
Chiou Derek
Chung Eric S.
Constantinides Kypros
Demme John
Esmaeilzadeh Hadi
Fowers Jeremy
Gopal Gopi Prashanth
Gray Jan
Haselman Michael
Hauck Scott
Heil Stephen
Hormati Amir
Kim Joo-Young
Lanka Sitaram
Larus James
Peterson Eric
Pope Simon
Putnam Andrew
Smith Aaron
Thong Jason
Xiao Phillip Yi
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 24/01/2017

Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we designed and built a composable, reconfigurable hardware fabric based on field programmable gate arrays (FPGA). Each server in the fabric contains one FPGA, and all FPGAs within a 48-server rack are interconnected over a low-latency, high-bandwidth network. We describe a medium-scale deployment of this fabric on a bed of 1632 servers, and measure its effectiveness in accelerating the ranking component of the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by 95% at a desirable latency distribution or reduces tail latency by 29% at a fixed throughput. In other words, the reconfigurable fabric enables the same throughput using only half the number of servers

Infoscience - École polytechnique fédérale de Lausanne
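
As a rough check of the closing claim in the abstract above, a 95% gain in per-server throughput implies about half as many servers for a fixed total load. This assumes the load divides evenly across servers; the symbols $N_{\text{base}}$ and $N_{\text{fabric}}$ are introduced here for illustration and do not appear in the source.

```latex
\frac{N_{\text{fabric}}}{N_{\text{base}}}
  = \frac{1}{1 + 0.95}
  = \frac{1}{1.95}
  \approx 0.51
```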

Suppressing quantum errors by scaling a surface code logical qubit

Author: Acharya Rajeev
Aleiner Igor
Allen Richard
Andersen Trond I.
Ansmann Markus
Arute Frank
Arya Kunal
Asfaw Abraham
Atalaya Juan
Babbush Ryan
Bacon Dave
Barba Alexander Del Toro
Bardin Joseph C.
Basso Joao
Bengtsson Andreas
Boixo Sergio
Bortoli Gina
Bourassa Alexandre
Bovaird Jenna
Brill Leon
Broughton Michael
Buckley Bob B.
Buell David A.
Burger Tim
Burgos Leslie Flores
Burkett Brian
Bushnell Nicholas
Chen Yu
Chen Zijun
Chiaro Ben
Cogan Josh
Collins Roberto
Conner Paul
Costa Bernardo Meurer
Courtney William
Crook Alexander L.
Curtin Ben
Dau Alejandro Grajales
Debroy Dripto M.
Demura Sean
Dunsworth Andrew
Eppens Daniel
Erickson Catherine
Faoro Lara
Farhi Edward
Fatemi Reza
Forati Ebrahim
Fowler Austin G.
Foxen Brooks
Giang William
Gidney Craig
Gilboa Dar
Giustina Marissa
Gross Jonathan A.
Habegger Steve
Hamilton Michael C.
Harrigan Matthew P.
Harrington Sean D.
Heidweiller Catherine Vollgraff
Higgott Oscar
Hilton Jeremy
Hoffmann Markus
Hong Sabrina
Huang Trent
Huff Ashley
Huggins William J.
Ioffe Lev B.
Isakov Sergei V.
Iveland Justin
Jeffrey Evan
Jiang Zhang
Jones Cody
Juhas Pavol
Kafri Dvir
Kechedzhi Kostyantyn
Kelly Julian
Khattar Tanuj
Khezri Mostafa
Kieferová Mária
Kim Seon
Kitaev Alexei
Klimov Paul V.
Klots Andrey R.
Korotkov Alexander N.
Kostritsa Fedor
Kreikebaum John Mark
Landhuis David
Laptev Pavel
Lau Kim-Ming
Laws Lily
Lee Joonho
Lee Kenny
Lester Brian J.
Lill Alexander
Liu Wayne
Locharla Aditya
Lucero Erik
Malone Fionn D.
Marshall Jeffrey
Martin Orion
McClean Jarrod R.
Mccourt Trevor
McEwen Matt
Megrant Anthony
Mi Xiao
Miao Kevin C.
Mohseni Masoud
Montazeri Shirin
Morvan Alexis
Mount Emily
Mruczkiewicz Wojciech
Naaman Ofer
Neeley Matthew
Neill Charles
Nersisyan Ani
Neven Hartmut
Newman Michael
Ng Jiun How
Nguyen Anthony
Nguyen Murray
Niu Murphy Yuezhen
O'Brien Thomas E.
Opremcak Alex
Petukhov Andre
Platt John
Potter Rebecca
Pryadko Leonid P.
Quintana Chris
Roushan Pedram
Rubin Nicholas C.
Saei Negar
Sank Daniel
Sankaragomathi Kannan
Satzinger Kevin J.
Schurkus Henry F.
Schuster Christopher
Shearn Michael J.
Shorter Aaron
Shvarts Vladimir
Skruzny Jindra
Smelyanskiy Vadim
Smith W. Clarke
Sterling George
Strain Doug
Szalay Marco
Torres Alfredo
Vidal Guifre
Villalonga Benjamin
White Theodore
Xing Cheng
Yao Z. Jamie
Yeh Ping
Yoo Juhwan
Young Grayson
Zalcman Adam
Zhang Yaxing
Zhu Ningfeng
Publication date: 20/07/2022

Practical quantum computing will require error rates that are well below what is achievable with physical qubits. Quantum error correction offers a path to algorithmically-relevant error rates by encoding logical qubits within many physical qubits, where increasing the number of physical qubits enhances protection against physical errors. However, introducing more qubits also increases the number of error sources, so the density of errors must be sufficiently low in order for logical performance to improve with increasing code size. Here, we report the measurement of logical qubit performance scaling across multiple code sizes, and demonstrate that our system of superconducting qubits has sufficient performance to overcome the additional errors from increasing qubit number. We find our distance-5 surface code logical qubit modestly outperforms an ensemble of distance-3 logical qubits on average, both in terms of logical error probability over 25 cycles and logical error per cycle ($2.914\%\pm 0.016\%$ compared to $3.028\%\pm 0.023\%$). To investigate damaging, low-probability error sources, we run a distance-25 repetition code and observe a $1.7\times10^{-6}$ logical error per round floor set by a single high-energy event ($1.6\times10^{-7}$ when excluding this event). We are able to accurately model our experiment, and from this model we can extract error budgets that highlight the biggest challenges for future systems. These results mark the first experimental demonstration where quantum error correction begins to improve performance with increasing qubit number, illuminating the path to reaching the logical error rates required for computation.

Comment: Main text: 6 pages, 4 figures. v2: Update author list, references, Fig. S12, Table I

arXiv.org e-Print Archive
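
The abstract above notes that adding qubits only helps when the physical error density is low enough. A classical toy model shows the same qualitative behaviour: a majority-vote repetition code with independent bit flips. This is not the surface code or the noise model of the experiment; the function name and the independent-flip assumption are illustrative only.

```python
from math import comb


def logical_error_rate(distance: int, p: float) -> float:
    """Probability that majority vote fails for an odd-length repetition code
    when each bit flips independently with probability p."""
    if distance % 2 == 0:
        raise ValueError("distance must be odd")
    threshold = distance // 2 + 1  # number of flips that corrupts the majority
    return sum(
        comb(distance, k) * p**k * (1 - p) ** (distance - k)
        for k in range(threshold, distance + 1)
    )
```

With `p = 0.01`, distance 5 gives a lower logical error rate than distance 3, whereas with `p = 0.6` the larger code is worse. In this toy model the crossover sits at `p = 0.5`; the corresponding threshold for a real surface code is much lower and depends on the hardware and decoder.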