

# HW-SW Fault Tolerance Design Techniques for Systems on Programmable Devices

PhD Candidate: Corrado De Sio

### **1. Introduction**

Hardware-accelerated solutions are mandatory to meet high-performance requirements, while fault modeling and analysis are essential to delivering fault-tolerant safety-critical systems.

Hardware faults are emulated by exploiting the reconfigurability of the device, acting at the bitstream level.

#### 4. Results

The proposed platform makes it possible to assess the resilience of the hardware implementation at the microarchitectural level, an assessment that software-based approaches cannot provide.



Conceptual hardware and software stack for hardware-accelerated AI applications

### **2. Goal**

This research proposes a methodology for evaluating the resilience of neural networks on programmable systems, combining hardware and software reliability analysis.

## **3. Proposed Methodology**

Heterogeneous systems-on-chip, which provide both hardware and software programmability, are proposed for evaluating the effects of hardware faults.

ZYNQ PLATFORM





Conceptual view of hardware fault emulation leading to misclassification

For the hardware accelerators analyzed, the open-routing fault model caused a severe degradation of confidence that often led to misclassification. The effect is markedly stronger than what software-level analysis reports, so relying on software-level results alone can overestimate the true reliability of the system.
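The difference between a degraded confidence and a misclassification can be sketched by comparing a faulty run against the fault-free (golden) one. The definitions used here are assumptions, since the poster does not state them: an error is any deviation of the output vector from the golden output, and a failure is a change of the top-1 class. The function names and the score vectors are hypothetical.

```python
def top1(scores):
    """Index of the highest score, i.e. the predicted class."""
    return max(range(len(scores)), key=lambda i: scores[i])


def compare_runs(golden, faulty):
    """Return (is_error, is_failure) for one fault-emulation run.

    Assumed definitions (not taken from the poster):
      error   -> the output vector differs from the golden one
      failure -> the predicted (top-1) class differs from the golden one
    A failure is therefore always also an error.
    """
    is_error = list(golden) != list(faulty)
    is_failure = top1(golden) != top1(faulty)
    return is_error, is_failure


# Illustrative score vectors for a three-class problem.
golden = [0.05, 0.90, 0.05]
degraded = [0.20, 0.60, 0.20]  # confidence drops, class 1 still wins
flipped = [0.55, 0.40, 0.05]   # class 0 now wins: misclassification

for name, run in [("golden", golden), ("degraded", degraded), ("flipped", flipped)]:
    print(name, compare_runs(golden, run))
```

Under these assumed definitions, the second run counts toward the error rate only, while the third counts toward both the error rate and the failure rate.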

**RELIABILITY EVALUATION OF AN ALEXNET CONVOLUTIONAL LAYER**

| Method       | Software Fault Injection | Software Fault Injection | Hybrid-based Fault Emulation (*FireNN*) | Hybrid-based Fault Emulation (*FireNN*) |
|--------------|--------------------------|--------------------------|-----------------------------------------|-----------------------------------------|
| Fault Model  | SEU in Weights           | SEU in Data              | SEU in Conf. Memory                     | Open Routing                            |
| Error Rate   | 40.57%                   | 46.16%                   | 11.05%                                  | 59.62%                                  |
| Failure Rate | 2.10%                    | 15.48%                   | 5.12%                                   | 40.07%                                  |
| Fail./Err.   | 5.18%                    | 33.53%                   | 46.33%                                  | 67.21%                                  |
| Timeouts     | 0%                       | 0%                       | 0.40%                                   | 2.78%                                   |

**RELIABILITY EVALUATION OF A RESNET-18 CONVOLUTIONAL LAYER**

| Method       | Hybrid-based Fault Emulation (*FireNN*) | Hybrid-based Fault Emulation (*FireNN*) |
|--------------|-----------------------------------------|-----------------------------------------|
| Fault Model  | SEU in Conf. Memory                     | Open Routing                            |
| Error Rate   | 12.93%                                  | 60.38%                                  |
| Failure Rate | 5.81%                                   | 42.17%                                  |
| Fail./Err.   | 44.93%                                  | 69.84%                                  |
| Timeouts     | 0.51%                                   | 2.86%                                   |
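The Fail./Err. row appears to be the failure rate divided by the error rate, that is, the share of observed errors that end in a failure. This reading is an assumption inferred from the reported numbers, not a definition given in the poster. A short check, using the rates copied from the tables and hypothetical names:

```python
# (error rate %, failure rate %, reported Fail./Err. %) copied from the tables.
reported = {
    "AlexNet, SEU in weights": (40.57, 2.10, 5.18),
    "AlexNet, SEU in data": (46.16, 15.48, 33.53),
    "AlexNet, SEU in conf. memory": (11.05, 5.12, 46.33),
    "AlexNet, open routing": (59.62, 40.07, 67.21),
    "ResNet-18, SEU in conf. memory": (12.93, 5.81, 44.93),
    "ResNet-18, open routing": (60.38, 42.17, 69.84),
}


def failures_per_error(error_rate, failure_rate):
    """Percentage of errors that become failures (assumed meaning of Fail./Err.)."""
    return 100.0 * failure_rate / error_rate


for name, (err, fail, table_value) in reported.items():
    ratio = failures_per_error(err, fail)
    print(f"{name}: computed {ratio:.2f}%, reported {table_value:.2f}%")
```

The computed ratios match the reported values up to rounding of the published percentages. Read this way, the row shows that an error caused by an open-routing fault is far more likely to become a failure than an error caused by an SEU in the weights.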

Architectural view of the proposed platform

Programmable hardware is used to emulate the accelerator's hardware architecture, meanwhile providing the mechanism for emulating hardware faults.
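As a minimal sketch of what a single-event upset in configuration memory amounts to at the data level, the snippet below inverts one bit in a byte buffer that stands in for configuration data. It is not the FireNN implementation: a real bitstream also carries headers, frame addressing, and integrity checks, all of which are ignored here. The function name and the two-byte buffer are hypothetical.

```python
def flip_bit(config, bit_index):
    """Return a copy of `config` with exactly one bit inverted (SEU model).

    `config` is a plain bytes object standing in for configuration data;
    bit 0 is the least significant bit of the first byte.
    """
    if not 0 <= bit_index < 8 * len(config):
        raise IndexError("bit index outside the configuration data")
    faulty = bytearray(config)
    byte_index, bit_in_byte = divmod(bit_index, 8)
    faulty[byte_index] ^= 1 << bit_in_byte
    return bytes(faulty)


# Toy two-byte "configuration" and a fault injected at bit 9.
golden = bytes([0b10100000, 0b00001111])
faulty = flip_bit(golden, 9)

print(format(golden[1], "08b"), "->", format(faulty[1], "08b"))

# Flipping the same bit again restores the original content, which is how
# an emulated fault could be removed before the next injection.
restored = flip_bit(faulty, 9)
```

In an emulation campaign of this kind, the modified configuration would be loaded onto the device, the workload executed, and the outputs compared with those of the fault-free run.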

#### **5. References**

1. B. Du, S. Azimi, C. De Sio, L. Bozzoli and L. Sterpone, "On the Reliability of Convolutional Neural Network Implementation on SRAM-based FPGA," *2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)*, 2019, pp. 1-6, doi: 10.1109/DFT.2019.8875362.
2. C. De Sio, S. Azimi and L. Sterpone, "An Emulation Platform for Evaluating the Reliability of Deep Neural Networks," *2020 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)*, 2020, pp. 1-4, doi: 10.1109/DFT50435.2020.9250872.
3. C. De Sio, S. Azimi and L. Sterpone, "FireNN: Neural Networks Reliability Evaluation on Hybrid Platforms," in *IEEE Transactions on Emerging Topics in Computing*, vol. 10, no. 2, pp. 549-563, 1 April-June 2022, doi: 10.1109/TETC.2022.3152668.