

# Standardizing Microprocessor and GPU Radiation Test Approaches

Edward J. Wyrwas edward.j.wyrwas@nasa.gov 301-286-5213 SSAI, Inc / NASA GSFC NEPP

#### This work was sponsored by:

NASA Electronic Parts and Packaging (NEPP) Program, NASA Office of Safety & Mission Assurance



# **Acronyms**

- Body of Knowledge (BOK)
- Complementary metal oxide semiconductor (CMOS)
- Commercial off-the-shelf (COTS)
- Device under test (DUT)
- Electrical, Electronic and Electromechanical (EEE)
- Energy (E)
- Error rate (λ)
- Field programmable gate array (FPGA)
- Fin Field-effect transistor (FinFET)
- Graphics Processing Unit (GPU)
- Joint Electron Device Engineering Council (JEDEC)
- Society of Automotive Engineers (SAE)
- Linear energy transfer (LET)
- Key Process Indicators (KPI)
- Mean time to failure (MTTF)
- Multi-Bit Upset (MBU)

- National Aeronautics and Space Administration (NASA)
- NASA Electronic Parts and Packaging (NEPP)
- Package-in-package (PIP)
- Package-on-package (POP)
- Single-Bit Upset (SBU)
- Single Event Effect (SEE)
- Single Event Functional Interrupt (SEFI)
- Single Event Upset (SEU)
- Single Event Upset Cross-Section (σ<sub>SEU</sub>)
- Single Instruction Multiple Data (SIMD)
- System on Chip (SOC)
- System on Module (SOM)
- Technical Operation Report (TOR)



### **Outline**

- NASA Electronic Parts and Packaging (NEPP) Program
- High Performance Processing Components
- Application Environments and "Key Process Indicators"
- Design of Experiment
- Test Vectors
- Thoughts and Conclusions






### **NEPP Mission Statement**

Provide NASA's leadership for developing and maintaining guidance for the screening, qualification, test, and reliable use of Electrical, Electronic, and Electromechanical (EEE) parts by NASA, in collaboration with other government agencies and industry.





# Radiation Assurance Requires Synchronous Integration

This is why radiation engineers tend to answer with "it depends..."



- Considerations summarized in these elements allow designers to effectively choose parts for their best performance in a given architecture
- Comprehension requires a complete synchronous picture of how technologies are to be used effectively
- Emphasizing one of these elements without understanding the others can compromise the integrity and performance of the parts and mission success

Adapted from NASA Technical Report TM-2018-220074



# **Design Characterization & Qualification Trade Space**





# **System-Level Assurance**

(Space Users Perspective)

- Always faced with conflicting demands between
  - "Just Make It Work," and
  - "Just Make It Cheap"
- Many system-level mitigation strategies pre-date the space age (e.g., communications, fault-tolerant computing, etc.)
- Tiered approach to validation of mission / product requirements



R. Ladbury, IEEE NSREC Short Course, July 2007.



### **NEPP – Processors**

#### State of the Art COTS Processors

- Sub-32 nm CMOS, FinFETs, etc.
- Samsung, Intel

#### "Space" FPGAs

- Microsemi RTG4
- Xilinx MPSOC+
- ESA Brave (future)
- "Trusted" FPGA (future)

#### Graphics Processor Units (GPUs)

- Intel, AMD, Nvidia
- Enabling data processing

#### **COTS FPGAs**

- Xilinx Kintex+
- Mitigation evaluation
- TBD: Microsemi PolarFire

#### Radiation Hardened Processor Evaluation

- BAE
- Vorago (microcontrollers)

Best Practices and Guidelines

#### Partnering

- Processors: Navy Crane, BAE/NRO
- FPGAs: AF SMC, SNL, LANL, BYU, ...
- Microsemi, Xilinx, Synopsys
- Cubic Aerospace






# **Modern Components**

- We can no longer rely on FPGAs being as immune or resilient as their predecessors. They are complex mixed-signal systems with compute engines.
- We may still use some legacy parts with well-known reliability and radiation tolerances, but we also test leading-edge computational components:
  - Microprocessors (e.g., x86, x64, ARM, Power Arch.)
  - GPUs (e.g., Nvidia, AMD, Qualcomm)
  - Memories (e.g., 3D XPoint, PCM, DDR3, DDR4, Flash)



# **Technology**

- Computational device families are converging
- Using high-level languages, applications are accelerated by running the sequential part of their workload on the CPU – which is optimized for single-threaded performance – and accelerating parallel processing on embedded engines or coprocessor devices
- Key computation pieces of mission applications can be computed using coprocessors and edge devices
  - Sensor and science instrument input
  - Object tracking and obstacle identification
  - Algorithm convergence (e.g., neural network, simulations)
  - Image processing
  - Data compression algorithms and encryption



### FPGA vs GPU vs CPU

- **FPGA:** hardware; complete system; low-power
- **GPU:** software (bare-metal); accelerator is useless alone, but ON when necessary; it depends (low-power/operation)
- **CPU:** software (+ OS); complete system; +/- low-power

Floating-point operations (neural-net, image, radar)
High amount of data to analyze
High efficiency/high bandwidth applications



### **DDR** Interface

- Often found in PIP, POP and Stacked Die processors
- Multi-bit error correction features can be employed
- Cell disturbance via Rowhammer has manifested in DDR3 & DDR4 due to feature scaling
- Typical software model:
  - 1. Flight computers boot from ROM, but tend to run from RAM
  - 2. RAM permits larger data sets to be processed concurrently
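As a rough illustration of how read-back data from a RAM pattern test might be binned, the sketch below compares each word against the pattern that was written and counts flipped bits. The checkerboard pattern, the 32-bit word width, and the function names are assumptions made for this example. Counting flips per word is also a simplification, since a true MBU attribution requires knowing that the flips came from a single event.

```python
from collections import Counter


def count_bit_flips(expected: int, observed: int) -> int:
    """Return the number of bit positions that differ between two words."""
    return bin(expected ^ observed).count("1")


def classify_readback(expected_words, observed_words):
    """Bin each word as clean, SBU (one flipped bit) or MBU (two or more)."""
    tally = Counter()
    for expected, observed in zip(expected_words, observed_words):
        flips = count_bit_flips(expected, observed)
        if flips == 0:
            tally["clean"] += 1
        elif flips == 1:
            tally["SBU"] += 1
        else:
            tally["MBU"] += 1
    return tally


if __name__ == "__main__":
    # Hypothetical checkerboard pattern written to four 32-bit words.
    pattern = [0xAAAAAAAA, 0x55555555, 0xAAAAAAAA, 0x55555555]
    # Hypothetical read-back with one single-bit and one double-bit upset.
    readback = [0xAAAAAAAA, 0x55555554, 0xAAAAAAA9, 0x55555555]
    print(dict(classify_readback(pattern, readback)))
```

If the memory has error correction enabled, the same tally could be taken both before and after correction to see how many upsets the correction masks.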







### **COTS Devices**





**Nvidia GTX 1050 GPU** 



**Smart Phones** 



**Intel Skylake Processor** 





### **Evaluation Timeline**








# **Aerospace Applications**



ARAMIS: parallel processing to reduce the number of boards on aircraft

### **COROT2:** first **GPU-powered** satellite





**GPU compresses images to send** 





## **Potential GPU Space Uses**

**GPUs can unlock A.I. for space exploration** 





GPU-aided landing

- High amount of data to be processed in real time
- Radar/communication in general
- Image compression/analysis
- Process data in space





# **Semi-Autonomous Planetary Exploration**



https://solarsystem.nasa.gov/news/472/10-things-mars-helicopter/



# **Energetic Particle Sources**

- High-energy particles impact Earth's atmosphere and create air showers that generate a variety of particles (e.g., neutrons) that reach ground level; fluxes are anisotropic
- Depends on latitude/longitude, atmospheric depth, and solar activity
- Process contamination in wafer fab materials
- Trace elements in packaging and in metallic (e.g., Pb) bumps
- <sup>232</sup>Th and <sup>238</sup>U are relatively abundant in terrestrial materials used in electronics processing and active enough to be a radiation effects concern





# **Linear Energy Transfer (LET)**

- LET characterizes the energy deposited by charged particles as they pass through a material
- LET is the average energy loss per unit path length (stopping power)
- Mass density is used to normalize LET to the target material

$$LET = \frac{1}{\rho} \frac{dE}{dx} \qquad \left( \frac{MeV \cdot cm^2}{mg} \right)$$

where $\rho$ is the density of the target material and the quantity in parentheses gives the units.
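To make the LET definition concrete, the sketch below converts an LET value in MeV·cm²/mg into the charge deposited per micron of track in silicon. It multiplies by the target density to recover dE/dx, then divides by the mean energy needed to create one electron-hole pair. The silicon density (2.33 g/cm³) and the 3.6 eV pair-creation energy are commonly quoted values that are assumed here, and the function name is illustrative.

```python
SI_DENSITY_MG_PER_CM3 = 2330.0   # assumed silicon density, 2.33 g/cm^3
EV_PER_PAIR_SI = 3.6             # assumed mean energy per electron-hole pair in Si
ELEMENTARY_CHARGE_C = 1.602e-19  # coulombs per electron


def let_to_charge_fc_per_um(let_mev_cm2_per_mg: float) -> float:
    """Convert LET (MeV*cm^2/mg) to deposited charge in silicon (fC/um)."""
    # Undo the density normalization: dE/dx = LET * rho, in MeV/cm.
    de_dx_mev_per_cm = let_mev_cm2_per_mg * SI_DENSITY_MG_PER_CM3
    # MeV -> eV (x 1e6) and cm -> um (/ 1e4).
    de_dx_ev_per_um = de_dx_mev_per_cm * 1e6 / 1e4
    # Each electron-hole pair costs EV_PER_PAIR_SI of deposited energy.
    pairs_per_um = de_dx_ev_per_um / EV_PER_PAIR_SI
    # Coulombs -> femtocoulombs (x 1e15).
    return pairs_per_um * ELEMENTARY_CHARGE_C * 1e15


if __name__ == "__main__":
    for let in (1.0, 20.0, 60.0):
        print(f"LET {let:5.1f} MeV*cm^2/mg -> {let_to_charge_fc_per_um(let):7.1f} fC/um")
```

Under these assumptions, an LET of 1 MeV·cm²/mg corresponds to roughly 10 fC of charge per micron of track in silicon.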





# **Characterizing SEUs**

SEU cross sections ($\sigma_{seu}$) characterize how many upsets will occur for a given ionizing particle exposure.

$$\sigma_{seu} = \frac{\#errors}{fluence}$$

#### Terminology:

- Flux (rate): particles/(sec·cm²)
- Fluence: particles/cm²
- σ<sub>seu</sub> is calculated at several LET values
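The cross-section formula can be sketched in a few lines. The example below computes the SEU cross section from an error count and a fluence, gives a Poisson upper bound for runs with zero observed errors, and converts a cross section into an error rate (λ) and MTTF for one assumed flux. The function names and the example numbers are illustrative. Multiplying by a single flux is a simplification, because an on-orbit rate estimate integrates the cross section over the LET spectrum of the environment.

```python
import math


def seu_cross_section(errors: int, fluence_per_cm2: float) -> float:
    """Cross section in cm^2: observed errors divided by fluence."""
    if fluence_per_cm2 <= 0:
        raise ValueError("fluence must be positive")
    return errors / fluence_per_cm2


def zero_error_upper_bound(fluence_per_cm2: float, confidence: float = 0.95) -> float:
    """Upper bound on the cross section when no errors were observed.

    With zero counts, Poisson statistics give a mean below -ln(1 - confidence).
    """
    return -math.log(1.0 - confidence) / fluence_per_cm2


def error_rate(sigma_cm2: float, flux_per_cm2_s: float) -> float:
    """Error rate (lambda) in errors per second for a single assumed flux."""
    return sigma_cm2 * flux_per_cm2_s


def mttf_seconds(rate_per_s: float) -> float:
    """Mean time to failure for a constant error rate."""
    return math.inf if rate_per_s == 0 else 1.0 / rate_per_s


if __name__ == "__main__":
    sigma = seu_cross_section(errors=25, fluence_per_cm2=1e10)
    rate = error_rate(sigma, flux_per_cm2_s=1e3)
    print(f"sigma = {sigma:.2e} cm^2")
    print(f"rate  = {rate:.2e} errors/s, MTTF = {mttf_seconds(rate):.2e} s")
    print(f"zero-error bound at 1e10/cm^2 = {zero_error_upper_bound(1e10):.2e} cm^2")
```

Repeating the calculation at each beam LET yields the points of a cross-section versus LET curve.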






## **Types of Radiation Tests**

#### 2-photon Laser Testing

- Identify any extra-sensitive areas in the active area so that the characteristics of any events can later be evaluated against those results

#### **Heavy-ion testing**

- Determine effects of different levels of Linear Energy Transfer (LET)
- Modern process technologies may be susceptible to destructive SEE
  - Heavy-ion testing is the only way to fully evaluate destructive SEE
  - If effects are apparent below 20 MeV·cm²/mg, proton testing may be necessary

#### **Proton testing**

- Will evaluate SEE-induced transients, SEFIs, and accessible device power-states
- 200 MeV protons generate secondary nuclear products with short range
- Not capable of reliably causing destructive events

#### **Total Ionizing Dose (TID)**

- In an accelerated environment, characterizes parametric variations and the long-term radiation effects on the device and determines whether dose-rate sensitivity exists



### **Pros and Cons**

#### 2-photon Laser Testing

- Requires decapsulation/delidding, thinning and polishing
- Hard to draw direct comparisons to on-orbit performance
- Very focused energy deposition & visual localization of events

#### **Heavy-ion testing**

- Requires decapsulation/delidding and sometimes thinning
- High LET, good range, good coverage
- Best comparison to on-orbit performance

#### **Proton testing**

- Broad beam or collimated ~1-2 cm<sup>2</sup>
- Limited LET, dependent on secondary interactions
- High accumulated dose, activation of test setup
- Simplest test preparation, may need to remove heat sink
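The "high accumulated dose" point can be estimated with a short calculation. The sketch below converts a proton fluence into ionizing dose in silicon using dose = LET × fluence, with the unit conversion 1 MeV/mg = 1.602e-5 rad. The default LET of about 3.6e-3 MeV·cm²/mg for 200 MeV protons in silicon is an approximate value assumed for this example and should be replaced with a tabulated stopping power for the actual beam energy.

```python
RAD_PER_MEV_PER_MG = 1.602e-5  # 1 MeV/mg = 1.602e-7 J/kg = 1.602e-5 rad


def proton_dose_rad(fluence_per_cm2: float,
                    let_mev_cm2_per_mg: float = 3.6e-3) -> float:
    """Ionizing dose in rad(Si) accumulated from a proton fluence.

    The default LET is an assumed, approximate value for 200 MeV protons.
    """
    return RAD_PER_MEV_PER_MG * let_mev_cm2_per_mg * fluence_per_cm2


if __name__ == "__main__":
    for fluence in (1e10, 1e11, 1e12):
        dose_krad = proton_dose_rad(fluence) / 1e3
        print(f"{fluence:.0e} protons/cm^2 -> {dose_krad:6.2f} krad(Si)")
```

Under these assumptions, dose scales linearly with fluence, so each tenfold increase in proton fluence costs ten times the dose on the DUT and on anything else in the beam.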

#### **Total Ionizing Dose (TID)**

- Hard to isolate a single component
- Often requires large sample size, potentially lot and wafer dependent





# Coverage of Secondaries versus Heavy Ions

Figure panels: coverage of secondaries from 1E10, 1E11, and 1E12 200 MeV protons/cm<sup>2</sup>, compared with coverage from 1E7 ions/cm<sup>2</sup>.

Raymond L. Ladbury at the Single Event Effects (SEE) Symposium - Military and Aerospace Programmable Logic Devices (MAPLD) Workshop, La Jolla, CA, May 22-25, 2017



## **Test Setup**

- Things to consider in the test environment
  - Physical location of payload and results
  - Data paths upstream/downstream
  - Control of electrical sources
  - Temperature control in a vacuum or air
- Things to consider in the device under test (DUT)
  - Is the die accessible?
  - What functional blocks are accessible?
  - Which functions are independent of each other?
  - Does it have proprietary or open software?
  - Operating system daemons



# **Design Requirement: Cooling**

#### All processors that are >1W need a cooling solution

- Low power devices can use a passive cooler
- Mid to High power devices need active cooling (e.g., fan, water block)





# Designing a Standardized Cooler





### **Test Environment**

#### Beam line

- DUT testing zone where collateral damage can occur (e.g., from secondary neutrons)
- Shielding for everything non-DUT

#### Operator Area

- Cables, node control and extenders
- Signal integrity at a distance > 50 ft
- "Everything that was done in a lab, in front of you on a bench, now must be done from a distance..."



# **System Mounting**





tripods, stands & open cases



# **Laser or X-ray Environment**





# **Heavy-Ion Environment**

Heavy Ion Setup at The Cyclotron Institute at Texas A&M University





### **Proton Environment**

Ted Wilcox (NASA GSFC) proton setup at Massachusetts General Hospital, Francis Burr Proton Facility





# **Neutron Test Setup**

Paolo Rech (UFRGS, LANL) neutron setup at ChipIR Beamline





# **Signal Integrity at a Distance**





#### **NEPP Standard Test Bench**





#### **Health Status**

- Which nodes in the system are accessible?
- Can we make a system-level watchdog to monitor heartbeats from:
  - Network
  - Busses
  - Peripherals (input and output)
  - Electrical states
- Can we monitor at a "good enough" resolution to identify latch-up events?
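One way to picture the system-level watchdog asked about above is a small tracker that records when each monitored source last reported and flags the ones that have gone quiet. This is a minimal sketch and not the NEPP test bench implementation. The class name, the source names and the timeout are assumptions chosen for illustration.

```python
import time


class HeartbeatWatchdog:
    """Track the last heartbeat from each source and report stale ones."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock      # injectable so the logic can be tested offline
        self._last_seen = {}

    def register(self, source: str) -> None:
        """Start monitoring a source, counting from now."""
        self._last_seen[source] = self._clock()

    def beat(self, source: str) -> None:
        """Record a heartbeat from a registered source."""
        if source not in self._last_seen:
            raise KeyError(f"unregistered source: {source}")
        self._last_seen[source] = self._clock()

    def stale_sources(self):
        """Return the sorted names of sources silent for longer than the timeout."""
        now = self._clock()
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    watchdog = HeartbeatWatchdog(timeout_s=0.2)
    for name in ("network", "bus", "peripheral", "power"):
        watchdog.register(name)
    time.sleep(0.3)
    watchdog.beat("network")  # only the network link is still responding
    print("stale:", watchdog.stale_sources())
```

Running the watchdog on the operator-side computer rather than on the DUT keeps the monitor itself out of the beam.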



# Sampling using I2C hardware





# Sampling using Software





## Sampling the Power Mains





### Sampling at the Point of Load





### Plan A B C D








### **Application Focused Payloads**

- Simulations
  - SDK Sample code
- Bit streams
  - Sensors or camera
  - Offline video feed
- Computational loading
  - LINPACK
- Neural networks
  - Landsat image classification
- Display Buffer Output
  - RGBYWB Colors
  - Texture and Ray Tracing (FurMark)
- Encryption
  - SHA-256
- Benchmarks
- Easy Math
- Performance Corner tests
  - High/Low voltages
  - High/Low temperatures
  - Current limited
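As an illustration of the encryption payload type, the sketch below hashes a fixed buffer repeatedly with SHA-256 and compares every digest against a golden value computed ahead of time. The buffer contents, the iteration count and the function names are assumptions for this example. The `hash_fn` parameter exists only so that a corrupted result can be simulated when trying the code on a bench.

```python
import hashlib


def make_golden(data: bytes) -> str:
    """Compute the reference digest, ideally offline and away from the beam."""
    return hashlib.sha256(data).hexdigest()


def run_payload(data: bytes, golden: str, iterations: int, hash_fn=None):
    """Hash the buffer repeatedly and return (iteration, digest) mismatches."""
    if hash_fn is None:
        hash_fn = lambda buf: hashlib.sha256(buf).hexdigest()
    mismatches = []
    for i in range(iterations):
        digest = hash_fn(data)
        if digest != golden:
            # A mismatch with no crash is a candidate silent data corruption.
            mismatches.append((i, digest))
    return mismatches


if __name__ == "__main__":
    buffer = bytes(range(256)) * 4096  # hypothetical 1 MiB test vector
    golden = make_golden(buffer)
    errors = run_payload(buffer, golden, iterations=20)
    print(f"{len(errors)} mismatching digests out of 20 runs")
```

Because the expected digest does not depend on the device generation, the same payload and golden value can be reused across DUTs.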



### **Software: Design Requirements**

- Does it need its own operating system?
- Instead of relying on compiler optimizations for each device generation, can we create payloads whose results are normalized across generations?
  - Can we run the same code on the previous generation and next generation of the device?



#### **Test Flow & Fault Tree**



These bins classify the error signatures that we expect the device under test to fall into.



## Silent Data Corruption vs Crash

#### **Soft Errors in:**

- data cache
- register files
- logic gates (ALU)
- scheduler

Silent Data Corruption: output is corrupted (can you tell?)

#### **Soft Errors in:**

- instruction cache
- scheduler / dispatcher
- bus controller

Crashes and Detected Unrecoverable Errors
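The distinction between silent data corruption and crashes can be expressed as a small classifier applied to each payload run. This is a sketch under assumed inputs: whether the run completed, whether the system flagged an error, and whether the output matches a golden reference. The bin names follow the terms used here, while the function and argument names are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "output matches the golden reference"
    SDC = "silent data corruption"
    DUE = "detected unrecoverable error"
    CRASH = "crash or hang"


def classify_run(output, golden, error_flagged: bool = False,
                 completed: bool = True) -> Outcome:
    """Assign one payload run to an error-signature bin."""
    if not completed:
        # The payload never returned: crash, hang, or forced power cycle.
        return Outcome.CRASH
    if error_flagged:
        # The system noticed the problem itself (e.g., an uncorrectable ECC report).
        return Outcome.DUE
    if output != golden:
        # Nothing complained, but the answer is wrong.
        return Outcome.SDC
    return Outcome.CORRECT


if __name__ == "__main__":
    golden = [1, 2, 3, 4]
    print(classify_run([1, 2, 3, 4], golden).name)
    print(classify_run([1, 2, 7, 4], golden).name)
    print(classify_run(None, golden, completed=False).name)
```

A run binned as correct may still have experienced an upset that was masked, which is why the golden comparison alone cannot count every event.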

# What is Critical? Sometimes it's Fuzzy

Consider object detection. If there is a fault but at least 50% of the object is detected, should we consider it an error?



How can errors be classified?

Precision error: an object that is not present is detected (unnecessary vehicle stops)

Recall error during inference: an object is missed (potentially resulting in a collision)

A standardized test bench must go beyond basic electrical behavior (e.g., cache upsets) and extend to the application layer (see above).
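The object-detection question can be made concrete with a scoring sketch. It assumes that "at least 50% of the object is detected" is measured as intersection over union (IoU) between a detected box and a reference box, which is one common convention and not necessarily the criterion intended here. Boxes are `(x_min, y_min, x_max, y_max)` tuples, and the function names are illustrative.

```python
def area(box):
    """Area of an axis-aligned box given as (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over union of two boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    union = area(a) + area(b) - intersection
    return intersection / union if union > 0 else 0.0


def score_detections(truth, detections, threshold=0.5):
    """Return (precision, recall) for detections against reference boxes."""
    unmatched = list(truth)
    true_pos = false_pos = 0
    for det in detections:
        best = max(unmatched, key=lambda ref: iou(ref, det), default=None)
        if best is not None and iou(best, det) >= threshold:
            true_pos += 1
            unmatched.remove(best)   # each reference object matches once
        else:
            false_pos += 1           # detected something that is not there
    false_neg = len(unmatched)       # reference objects that were missed
    found = true_pos + false_pos
    present = true_pos + false_neg
    precision = true_pos / found if found else 1.0
    recall = true_pos / present if present else 1.0
    return precision, recall


if __name__ == "__main__":
    reference = [(0, 0, 10, 10), (20, 20, 30, 30)]
    irradiated = [(0, 0, 10, 8), (50, 50, 60, 60)]
    print(score_detections(reference, irradiated))
```

Comparing precision and recall from an irradiated run against an unirradiated baseline separates faults that cause false detections from faults that cause missed objects.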



#### What does a failure look like?







# Latch-up Signatures

#### Nvidia TX2 in 200 MeV protons
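Supply-current traces can be screened automatically for latch-up-like behavior. The sketch below flags spans where the sampled current stays above a multiple of its nominal value for several consecutive samples, so that brief spikes are ignored. The 1.5x threshold, the three-sample minimum and the example trace are assumptions for illustration and do not come from the TX2 data.

```python
def find_latchup_events(samples_a, nominal_a, threshold_ratio=1.5, min_samples=3):
    """Return (start_index, end_index) spans of sustained over-current."""
    limit = nominal_a * threshold_ratio
    events = []
    start = None
    for i, current in enumerate(samples_a):
        if current > limit:
            if start is None:
                start = i            # first sample above the limit
        else:
            if start is not None and i - start >= min_samples:
                events.append((start, i - 1))
            start = None             # short excursions are discarded
    # Handle a trace that ends while the current is still elevated.
    if start is not None and len(samples_a) - start >= min_samples:
        events.append((start, len(samples_a) - 1))
    return events


if __name__ == "__main__":
    # Hypothetical trace in amps: one brief spike, then a sustained step.
    trace = [1.0, 1.0, 2.0, 1.0, 1.0, 1.8, 1.9, 2.0, 2.1, 1.0]
    print(find_latchup_events(trace, nominal_a=1.0))
```

The minimum span length needs to be matched to the sampling rate, since a monitor that samples too slowly can miss short events entirely.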










#### **NEPP Standard Test Bench**








# **Rapid Test Preparation**



AMD e9173 GPU (Clockwise from top right)

- 1) As Received
- 2) Without fansink
- 3) Without Heatsink
- 4) Underside
- 5) Render of Adapter Plate
- 6) Toolpath settings











#### Test Preparation:

- Software payloads are created offline
- Conduction cooling system is modular and portable
- Adapter plates are designed and fabricated in 3-5 business days



#### **DDRx Test Readiness**



- Supply current/voltage monitoring connects to LabVIEW for recording
- Bias board
- SPD read and write using microcontroller
- JTAG tap connects to bus and allows PC software to exercise DDRx


To be presented by Edward Wyrwas at the 12th Space Computing Conference in Pasadena, CA. July 30 - August 1, 2019



### **Thoughts**

- Remember that space radiation, and modulation due to space weather, affects the terrestrial radiation environment – alpha particles are an additional environment
- Invoke a tiered approach for radiation effects assurance and maintain awareness that there are unknown unknowns due to rapid technology evolution
- Explore additional synergies within the community; we're grappling with the same challenges as more advanced technologies enter our systems (e.g., reliability, availability, supply chain, etc.)



### **Thoughts**

- NEPP and its partners have conducted proton, neutron and heavy ion testing on many devices
  - Have captured SEUs (SBU & MBU),
  - Have seen repeatable current spikes and latch-up behavior,
  - Predominantly have encountered system-based SEFIs
- Microprocessor and memory tests require a complex platform to arbitrate the test vectors, monitor the DUT (in multiple ways) and record data
  - None of these should require the DUT itself to reliably perform any other task outside of being exercised
- Every test is another learning experience and while improvements are always possible, preparation time may not be as abundant
- Prioritization during development is important



### **Conclusion**

- The NEPP microprocessor and GPU testing has been standardized:
  - rapid development of cooling system for each DUT form factor and packaging type
  - system implementation using modular COTS system and network components
  - public domain software that has been extensively tested by the community
  - payloads that can be easily updated to accommodate new DUTs while maintaining the ability to test older DUTs



#### References

F. Irom, "Guideline for ground radiation testing of microprocessors in the space radiation environment," Jet Propulsion Laboratory, Tech. Rep. 13, 2008.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.561.3337&rep=rep1&type=pdf

M. Berg, "Field programmable gate array (FPGA) single event effect (SEE) radiation testing," NASA/Goddard Space Flight Center, Tech. Rep., 2012.

https://nepp.nasa.gov/files/23779/FPGA\_Radiation\_Test\_Guidelines\_2012.pdf

S. M. Guertin, "A guideline for SEE testing of SOC," in Single Event Effects Symposium, San Diego, CA, May 2013. https://trs.jpl.nasa.gov/handle/2014/45991

H. Quinn, "Challenges in testing complex systems," IEEE Trans. Nucl. Sci., Vol. 61, No. 2, pp. 766-786, 2014. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6786369

E. Wyrwas, "Body of Knowledge for Graphics Processing Units (GPUs)," NASA/Goddard Space Flight Center, Tech. Rep., 2018. http://hdl.handle.net/2060/20180006915



### **Acknowledgements**

- This work has been sponsored by the NASA Electronic Parts and Packaging (NEPP) Program, NASA Office of Safety & Mission Assurance
- Thanks are given to the NASA Goddard Space Flight Center's Radiation Effects and Analysis Group (REAG) for their technical assistance and support.