


#### **Graphics Processor Units (GPUs)**

Edward J Wyrwas edward.j.wyrwas@nasa.gov 301-286-5213 Lentech, Inc. in support of NEPP

#### Acknowledgment:

This work was sponsored by: NASA Electronic Parts and Packaging (NEPP) Program



### Acronyms

| Acronym | Definition                               |
|---------|------------------------------------------|
| BOK     | Body of Knowledge (document)             |
| CUDA    | Compute Unified Device Architecture      |
| DUT     | Device Under Test                        |
| GPGPU   | General Purpose Graphics Processing Unit |
| GPU     | Graphics Processing Unit                 |
| MBU     | Multi-Bit Upset                          |
| MGH     | Massachusetts General Hospital           |
| NEPP    | NASA Electronic Parts and Packaging      |
| PTX     | Parallel Thread Execution                |
| RTOS    | Real-Time Operating System               |
| SBU     | Single-Bit Upset                         |
| SEE     | Single Event Effect                      |
| SEFI    | Single Event Functional Interrupt        |
| SEU     | Single Event Upset                       |
| SIMD    | Single Instruction Multiple Data         |
| SoC     | System on Chip                           |
| TID     | Total Ionizing Dose                      |



## Outline

- What the technology is (and isn't)
- Our tasks and their purpose
  - The setup around the test setup
  - Parametric considerations
  - Lessons learned
- Collaborations
  - Roadmap
  - Partners
  - Results to date
  - Plans
- Comments



## Technology

- Graphics Processing Units (GPUs) and General Purpose Graphics Processing Units (GPGPUs) are considered compute devices that behave like coprocessors
  - They take assignments from another device
  - They cannot load and execute code at boot by themselves
- Using high-level languages, GPU-accelerated applications run the sequential part of their workload on the CPU – which is optimized for single-threaded performance – while accelerating parallel processing on the GPU.



#### Purpose

- GPUs are best used for single instruction, multiple data (SIMD) parallelism
  - Perfect for breaking apart a large data set into smaller pieces and processing those pieces in parallel
- Key computation pieces of mission applications can be computed using this technique
  - Sensor and science instrument input
  - Object tracking and obstacle identification
  - Algorithm convergence (neural network)
  - Image processing
  - Data compression algorithms
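As a rough illustration of the "break apart and process in parallel" pattern above, the sketch below splits a data set into chunks and applies the same operation to every chunk. It is CPU-side Python rather than GPU code: the thread pool only stands in for GPU threads, and `split_into_chunks`, `scale_and_offset`, `process_in_parallel` and the chunk size are hypothetical names and values chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Break a data set into consecutive pieces of at most chunk_size items."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def scale_and_offset(chunk):
    """The single 'instruction' applied to every element (hypothetical kernel)."""
    return [2 * x + 1 for x in chunk]


def process_in_parallel(data, chunk_size=4, workers=4):
    """Apply the same kernel to every chunk concurrently, then reassemble in order."""
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the input chunks.
        results = list(pool.map(scale_and_offset, chunks))
    return [y for chunk in results for y in chunk]


if __name__ == "__main__":
    print(process_in_parallel(list(range(10))))
```

On a real GPU each element (not each chunk) would typically map to its own thread, but the structure is the same: one operation, many independent pieces of data.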



## **Device Selection**

- Unfortunately, GPUs come in multiple types, acting as either the primary processor (SoC) or a coprocessor (GPU)



#### Nvidia TX1 SoC







#### **Intel Skylake Processor**





## **Device Software**

- Does it need its own operating system?
  - E.g. Linux, Android, RTOS
- Can we just push code at it?
  - E.g. Assembly, PTX, C
- Payload normalization
  - Can we run the same code on the previous generation and next generation of the device?
  - Cannot with CUDA code; can with OpenCL

Notes: RTOS is a Real-Time Operating System; PTX is Parallel Thread Execution; CUDA is a parallel computing platform and application programming interface (API) model created by Nvidia.



## **Payloads**

- Visual Simulations
  - Sample code
  - Fuzzy Donut (i.e. Furmark)
- Sensor streams
  - Camera feed
  - Offline video feed
- Computational loading
  - Scientific computing models
- Easy Math
  - 0 + 0 … wait … should = 0
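The "easy math" idea can be sketched as a known-answer payload: run operations whose results are known in advance and log every mismatch. The version below is a plain-Python sketch under stated assumptions, not the payload used in these tests; `run_easy_math`, `VECTORS` and `flipped_bit_add` are hypothetical names, and the faulty adder only simulates an upset for bench checking.

```python
# Known-answer vectors: (a, b, expected sum). Values are illustrative only.
VECTORS = [(0, 0, 0), (1, 1, 2), (255, 1, 256)]


def run_easy_math(vectors, add=lambda a, b: a + b):
    """Run known-answer additions and return a list of mismatches.

    `add` is injectable so that a faulty adder can be simulated off-beam.
    Each mismatch is recorded as (a, b, expected, observed).
    """
    mismatches = []
    for a, b, expected in vectors:
        observed = add(a, b)
        if observed != expected:
            mismatches.append((a, b, expected, observed))
    return mismatches


def flipped_bit_add(a, b):
    """Simulated upset (assumption for illustration): bit 3 of the sum is flipped."""
    return (a + b) ^ 0b1000


if __name__ == "__main__":
    print(run_easy_math(VECTORS))                       # healthy adder
    print(run_easy_math(VECTORS, add=flipped_bit_add))  # simulated fault
```

Keeping the expected values outside the device under test means a wrong answer can be attributed to the computation rather than to a corrupted reference.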





### **Test Setup**

- Things to consider in the test environment
  - Operating system daemons
  - Location of payload and results
  - Data paths upstream/downstream
  - Control of electrical sources
  - Temperature control (i.e. heaters) in a vacuum
- Things to consider in the DUT
  - Is the die accessible?
  - What functional blocks are accessible?
  - Which functions are independent of each other?
  - Does it have proprietary or open software?



## **Test Environment**

- Beam line
  - DUT testing zone where collateral damage can happen
  - Shielding for everything non-DUT
- Operator Area
  - Cables, interconnects and extenders
  - Signal integrity at a distance
  - "Everything that was done in a lab, in front of you on a bench, now must be done from a distance..."



To be presented by Edward Wyrwas at the NASA Electronics Parts and Packaging (NEPP) Electronics Technology Workshop (ETW), Greenbelt, MD, June 26-29, 2017



## **Test Environment (Cont'd)**



**Tripod and mounting** 

**External power** 

**Power injection** 

#### Arrows and circle mark locations of the lead and acrylic block fortresses

Pictures are from Massachusetts General Hospital Francis Burr Proton Facility



## **Test Environment (Cont'd)**





## **DUT Health Status**

- Accessible nodes
  - Network
    - Heart beat by inbound ping
    - Heart beat by timestamp upload
  - Peripherals response
    - "Num lock"
  - Visual check
    - Remote
    - Local
    - Local with remote viewing
  - Electrical states
    - At the system
    - At the DUT
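A heartbeat check of the kind listed above can be reduced to comparing timestamps against an expected period. The sketch below assumes the DUT uploads a timestamp roughly once per period; the function names, the one-second period and the three-missed-beats threshold are all hypothetical choices, not values from the test campaign.

```python
def classify_heartbeat(last_seen_s, now_s, period_s=1.0, missed_allowed=3):
    """Return 'alive', 'late' or 'unresponsive' from the age of the last heartbeat."""
    age = now_s - last_seen_s
    if age <= period_s:
        return "alive"
    if age <= period_s * missed_allowed:
        return "late"
    return "unresponsive"


def find_gaps(timestamps_s, period_s=1.0, missed_allowed=3):
    """Return (start, end) pairs where consecutive heartbeats are too far apart."""
    limit = period_s * missed_allowed
    return [(a, b) for a, b in zip(timestamps_s, timestamps_s[1:]) if b - a > limit]


if __name__ == "__main__":
    print(classify_heartbeat(last_seen_s=100.0, now_s=104.0))
    print(find_gaps([0.0, 1.0, 2.0, 9.0, 10.0]))
```

Gaps found after the run can then be lined up against beam and power logs to see whether a silent period coincided with an event.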



## **Monitoring Data**






## **Monitoring Data (Cont'd)**

- Significant digits are important
- Resolution is needed for correlation
  - Faster sampling speed
  - Smaller units (µV or mV, not Volts)
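To see why resolution matters for correlation, the sketch below logs the same rail samples at different resolutions and reports which samples remain distinguishable from nominal. Readings are kept as integer microvolts to avoid floating-point rounding. The sample values, the 3.3 V nominal and the function names are made up for illustration and are not measurements from the tests.

```python
def quantize_uv(value_uv, resolution_uv):
    """Round an integer microvolt reading to the nearest step of the logger's resolution."""
    return ((value_uv + resolution_uv // 2) // resolution_uv) * resolution_uv


def visible_excursions(samples_uv, nominal_uv, resolution_uv):
    """Indices of samples whose logged value differs from the logged nominal value."""
    logged_nominal = quantize_uv(nominal_uv, resolution_uv)
    return [
        i for i, sample in enumerate(samples_uv)
        if quantize_uv(sample, resolution_uv) != logged_nominal
    ]


# Hypothetical 3.3 V rail samples in microvolts; index 2 carries a 4 mV excursion.
SAMPLES_UV = [3_300_000, 3_300_200, 3_304_000, 3_300_100]

if __name__ == "__main__":
    for resolution_uv in (100_000, 1_000, 100):  # 0.1 V, 1 mV, 100 uV
        print(resolution_uv, visible_excursions(SAMPLES_UV, 3_300_000, resolution_uv))
```

With 0.1 V resolution the excursion disappears entirely; at 1 mV it shows up as a single sample that can be matched against a timestamped event.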





## **Monitoring Data (Cont'd)**

- Even better (albeit a mock-up):

[Figure: mock-up screenshot of a "Continuous Power Supply Monitor Rev 1.0-rel" panel. It shows four channels, each with voltage, OVP, current limit, OCP and OCP delay settings plus output enable and latch controls, alongside per-channel strip charts, digital I/O indicators and a timestamped event log.]



## What does a failure look like?






## Failures (Cont'd)





## **Learning Experience**

- Every test is another learning experience
  - "Is the laser alignment jig in the beam path..."
  - Nuances with controllable nodes
    - DUT power switch
    - Remote power sources
    - DUT electrical isolation from test platform
    - Thermal paths
  - Improvements are always possible, but preparation time may not be as abundant
  - Prioritization during development is important
    - Software payload
    - Hardware monitoring
    - Remote troubleshooting capabilities



## **GPU Roadmap**

- Collaborative effort with NSWC Crane and others






#### **Partners**

- Navy Crane
  - Conducting testing on Nvidia 14nm GPUs
- Collaboration with partners is yielding a comprehensive test suite
  - L1 and L2 cache
  - Registers
  - Shared, Internal, Texture and Global memory
  - Control logic



## **Qualification Guidance**

#### Creation of GPU Body of Knowledge (BOK) document

- Technology
  - Silicon
  - Packaging
  - Heterogeneous constituents
- Reliability
  - Semiconductor mechanisms
  - Package issues
  - Scaling issues
- Failure categories and trends
- Software & Hardware sources

#### Future guidelines will be developed for this technology to include qualification and test methods



## **Results to Date**

- Developing software for cross platform use
  - Nvidia Tegra X (SoC): ARM with embedded Linux
  - Nvidia GPUs (GPU): x86 Windows and Linux
  - Intel Skylake Processor (IP block): x86 Linux
  - Qualcomm Adreno & Arm Mali GPUs (IP block): ARM Linux
- Proton test result ranges are dependent on physical target within DUT
  - Cross section (σ, cm<sup>2</sup>): 1x10<sup>-7</sup> to 9x10<sup>-9</sup>
  - Flux (p/cm<sup>2</sup>/sec): 1x10<sup>6</sup> to 7x10<sup>6</sup>
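For readers less familiar with the units above: the cross section is the number of observed events divided by the particle fluence, and fluence is flux integrated over the exposure time. The sketch below shows that arithmetic for a constant flux. The event count and exposure time are invented for illustration and are not results from these tests; the function names are hypothetical.

```python
def fluence(flux_p_cm2_s, exposure_s):
    """Fluence (p/cm^2) for a constant flux (p/cm^2/s) over an exposure time (s)."""
    return flux_p_cm2_s * exposure_s


def cross_section(events, fluence_p_cm2):
    """Event cross section (cm^2): observed events per unit fluence."""
    if fluence_p_cm2 <= 0:
        raise ValueError("fluence must be positive")
    return events / fluence_p_cm2


if __name__ == "__main__":
    # Illustrative numbers only: 10 events at 1e6 p/cm^2/s for 100 s.
    phi = fluence(1e6, 100)
    print(f"fluence = {phi:.1e} p/cm^2, sigma = {cross_section(10, phi):.1e} cm^2")
```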



## Plans (with Schedule)

- More proton testing on 14nm GPUs
  - Test OpenCL payloads
  - Test L1, L2, registers, shared memory & control logic
  - Record die temperature, 12V and 3.3V rail voltages and currents, system events (and observations)
- Two proton test sessions and significant in-lab work have permitted improvements to:
  - Thermal-electrical monitoring of the DUTs, though further improvements are necessary to achieve the desired resolution
  - Proving out which code libraries won't work for the type of testing we're conducting



## FY17-18: GPU Testing

**Description:**

- This is a task over all device topologies and processes
- The intent is to determine inherent radiation tolerance and sensitivities
- Identify challenges for future radiation hardening efforts
- Investigate new failure modes and effects
- Testing includes total dose, single event (proton) and reliability
- Test vehicles will include GPU devices from Nvidia and other vendors as available
  - Compare to previous generations
  - Investigate failure modes/compensation for increased power consumption

**FY17-18 Plans:**

- Continue development of universal test suite
- Probable test structures for SEE:
  - Nvidia (16, 14, 10nm)
  - AMD (14nm)
  - Intel (14nm)
- Tests:
  - Characterization pre, during and post-rad

**Schedule (Microelectronics T&E, FY17 to FY18; timeline bars not reproduced):**

- On-going discussions for test samples
- GPU Test Development
- SEE Testing
- Analysis and Comparison

**Deliverables:**

- Test reports and quarterly reports
- Expected submissions for publications

**NASA and Non-NASA Organizations/Procurements:**

- Source procurements: Proton (MGH), TID (GSFC)

#### PIs: GSFC/Lentech/Wyrwas




### Conclusion

- NEPP and its partners have conducted proton, neutron and heavy ion testing on several devices
  - Have captured SEUs (SBU & MBU)
  - Have seen traceable current spikes
  - But have predominantly encountered system-based SEFIs
- GPU testing requires a complex platform to arbitrate the test vectors, monitor the DUT (in multiple ways) and record data
  - None of these should require the DUT itself to reliably perform a task outside of being exercised
- Progress has been made in proving out multiple ways to simulate and enumerate activity on the DUT
  - Narrowing down on a universal test bench
  - End goal is to make test code platform independent
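One common way to separate the SBUs and MBUs mentioned in the conclusion is to XOR each word read back from the DUT with its expected pattern and count the flipped bits. The sketch below does this per word; treating "more than one flipped bit in a word" as an MBU is a simplifying assumption, and the pattern `0xAAAAAAAA` and all function names are hypothetical rather than taken from the test suite.

```python
def flipped_bits(expected, observed):
    """Number of bit positions in which the observed word differs from the expected word."""
    return bin(expected ^ observed).count("1")


def classify_upset(expected, observed):
    """Classify one word as 'none', 'SBU' (one bit flipped) or 'MBU' (several bits flipped)."""
    n = flipped_bits(expected, observed)
    if n == 0:
        return "none"
    return "SBU" if n == 1 else "MBU"


def tally(expected_words, observed_words):
    """Count classifications over a whole memory readback."""
    counts = {"none": 0, "SBU": 0, "MBU": 0}
    for expected, observed in zip(expected_words, observed_words):
        counts[classify_upset(expected, observed)] += 1
    return counts


if __name__ == "__main__":
    pattern = [0xAAAAAAAA] * 3                        # hypothetical test pattern
    readback = [0xAAAAAAAA, 0xAAAAAAAB, 0xAAAAAAA9]   # simulated readback
    print(tally(pattern, readback))
```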



## Acknowledgement

- Ken LaBel, NASA GSFC NEPP
- Martha O'Bryan, ASRC Space & Defense
- Carl Szabo, ASRC Space & Defense
- Steve Guertin, NASA JPL
- Adam Duncan, Navy Crane