

## PhD in Computer and Control Engineering

XXIX cycle

# Reliability analysis of future digital systems

PhD Candidate:

### Vallero Alessandro

#### 1. Introduction / Context

As we move deeper into the era of nano-scale devices, reliability becomes a key challenge for the semiconductor industry. Failing to meet a reliability requirement may incur excessive re-design costs and may have severe consequences for the success of a product. Worst-case design with large margins to guarantee reliable operation has been employed for a long time. However, it is reaching a limit that makes it economically unsustainable due to its performance, area, and power costs. Current practice for estimating reliability is to rely either on time-consuming gate-level fault injection campaigns or on simplified models that guarantee shorter computation time but deliver very coarse-grained and conservative (i.e., pessimistic) reports of the system reliability.

One solution is to apply different protection mechanisms at different layers of the system, implementing what is nowadays called *cross-layer reliability* enhancement. Unfortunately, tools and models for this method are still at an early stage compared to other very mature design tools (e.g., performance and power optimization tools).

#### 2. Goal

This work proposes a novel system-level *cross-layer* reliability assessment framework:

- built on top of a component-based Bayesian model of the target system;
- modeling the interactions among the components of the system;
- delivering reliability estimations accurately and quickly;
- guiding system designers in selecting the best fault-tolerance mechanisms to reach the desired reliability requirements without overdesigning the final system.

#### 3. Method

The proposed Bayesian system reliability model is composed of a qualitative model, representing the architecture of the system, and a quantitative model, representing the reliability of each component and the relations among components [1].

The qualitative model reflects the system architecture, defined through a directed acyclic graph. The set of vertices is split into two subsets: components and parameters. Components are the blocks composing the system. The definition of a component changes depending on the architectural layer (technology, HW, SW). Components are associated with Bayesian nodes, i.e., their reliability is described by a set of random variables. Parameters are special vertices that are not a direct part of the Bayesian model. They represent implementation details of a component (e.g., operating temperature, workload, etc.) that our framework exploits to build the quantitative model of the system described later in this section. Arcs among component nodes define temporal or physical reliability relations among components, e.g., a failure state of a component may influence the state of another component. Finally, the arcs connecting parameter nodes to component nodes model relations between a component and its implementation parameters. Based on the system stack, the components of a system are split into four subsets or domains (Figure 1), each requiring different techniques to be characterized for reliability.
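The graph structure described above can be sketched as a small data structure. This is a minimal illustration, not the framework's actual implementation: the class `QualitativeModel`, its methods, and the vertex names are hypothetical, and the rule that arcs may only end in component nodes is our reading of the description above.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # Python 3.9+


@dataclass
class QualitativeModel:
    """Hypothetical container for the qualitative model (a DAG)."""
    components: set = field(default_factory=set)
    parameters: set = field(default_factory=set)
    # Maps each vertex to the set of its parent vertices.
    parents: dict = field(default_factory=dict)

    def add_component(self, name):
        self.components.add(name)
        self.parents.setdefault(name, set())

    def add_parameter(self, name):
        self.parameters.add(name)
        self.parents.setdefault(name, set())

    def add_arc(self, src, dst):
        # Assumption: arcs go component -> component or
        # parameter -> component, so the destination is a component.
        if src not in self.parents or dst not in self.parents:
            raise KeyError("unknown vertex")
        if dst not in self.components:
            raise ValueError("arcs may only end in a component node")
        self.parents[dst].add(src)

    def topological_order(self):
        # Raises graphlib.CycleError if the graph is not acyclic.
        return list(TopologicalSorter(self.parents).static_order())


# Illustrative vertices only; names are not taken from the framework.
model = QualitativeModel()
model.add_parameter("temperature")
model.add_component("l1_cache")
model.add_component("software")
model.add_arc("temperature", "l1_cache")
model.add_arc("l1_cache", "software")
order = model.topological_order()
```

A topological order lists every vertex after its parents, which is the order in which the conditional probability tables of the quantitative model can be evaluated.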



The quantitative model of the system defines the probability of occurrence of an error/fault in a component depending on the condition of the components it directly interacts with and on its implementation parameters. In a Bayesian model the quantitative model is a set of Conditional Probability Tables (Figure 1). Each node is associated with a set of states that identify potential error or error-free conditions of the node (e.g., a memory can be error free, or it can be affected by a single bit-flip, or by a double bit-flip). The set of states of a node depends on the node domain and on the specific characteristics of the node. For each state of a node, the table gives its probability for every combination of states of its parent nodes.
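As a sketch of how a Conditional Probability Table is used, the example below computes the marginal failure probability of one node with two parents by summing over all combinations of parent states. All node names, states, and probabilities are invented for illustration, and the two parents are assumed to be independent root nodes.

```python
from itertools import product

# Prior state probabilities of the two parent nodes (illustrative numbers).
memory = {"error_free": 0.90, "single_bit_flip": 0.08, "double_bit_flip": 0.02}
register_file = {"error_free": 0.95, "corrupted": 0.05}

# CPT of the child node: P(output = "failure" | memory state, register file state).
# One row per combination of parent states.
p_failure_given_parents = {
    ("error_free", "error_free"): 0.0,
    ("error_free", "corrupted"): 0.5,
    ("single_bit_flip", "error_free"): 0.3,
    ("single_bit_flip", "corrupted"): 0.6,
    ("double_bit_flip", "error_free"): 0.8,
    ("double_bit_flip", "corrupted"): 0.9,
}


def marginal_failure(parent_a, parent_b, cpt):
    """Sum P(a) * P(b) * P(failure | a, b) over all parent state pairs."""
    total = 0.0
    for state_a, state_b in product(parent_a, parent_b):
        total += parent_a[state_a] * parent_b[state_b] * cpt[(state_a, state_b)]
    return total


p_failure = marginal_failure(memory, register_file, p_failure_given_parents)
```

The table needs one row per combination of parent states, so its size grows with the product of the parents' state counts.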

Thanks to the properties of the proposed model, the weakest components of the system can be identified by applying Bayesian reasoning. This capability has been exploited to implement an automatic system optimization framework. An innovative multi-level extremal optimization algorithm iteratively estimates the impact of applying different fault-tolerance techniques on the reliability of the system. This makes it possible to optimize the system toward the best reliability while minimizing the impact of the introduced protection mechanisms on power, area and performance, thus avoiding overdesign.
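The idea of finding the weakest component through Bayesian reasoning and then protecting it can be sketched as follows. This is a plain greedy loop under a simplifying noisy-OR assumption (independent components, each with an error probability and a probability of propagating that error to the system output). It is not the multi-level extremal optimization algorithm of this work, and it ignores power, area and performance costs. All names and numbers are hypothetical.

```python
from math import prod


def failure_probability(components):
    """P(system failure) under the noisy-OR assumption."""
    return 1.0 - prod(1.0 - c["p_error"] * c["p_propagate"]
                      for c in components.values())


def posterior_error_given_failure(components, name):
    """Bayes' rule: P(component in error | system failure)."""
    others_ok = prod(1.0 - c["p_error"] * c["p_propagate"]
                     for n, c in components.items() if n != name)
    c = components[name]
    p_failure_given_error = 1.0 - (1.0 - c["p_propagate"]) * others_ok
    return p_failure_given_error * c["p_error"] / failure_probability(components)


def weakest_component(components):
    """Component most likely to be in error when the system fails."""
    return max(components,
               key=lambda n: posterior_error_given_failure(components, n))


def greedy_protect(components, target, reduction=0.1, max_steps=20):
    """Protect the weakest component until the target is met.

    `reduction` is a hypothetical factor by which a protection
    mechanism scales the error probability of a component.
    """
    components = {n: dict(c) for n, c in components.items()}  # work on a copy
    applied = []
    for _ in range(max_steps):
        if failure_probability(components) <= target:
            break
        name = weakest_component(components)
        components[name]["p_error"] *= reduction
        applied.append(name)
    return components, applied


# Illustrative system; values are not measurements.
system = {
    "l1_cache": {"p_error": 0.05, "p_propagate": 0.4},
    "register_file": {"p_error": 0.01, "p_propagate": 0.9},
    "load_store_queue": {"p_error": 0.02, "p_propagate": 0.2},
}
protected, applied = greedy_protect(system, target=0.02)
```

A cost-aware optimizer would weigh each candidate protection against its overhead before applying it, instead of always protecting the component with the highest posterior.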

#### 4. Results

To evaluate the capability of the proposed cross-layer reliability estimation framework, several systems based on MiBench applications were analyzed and the results compared with those provided by state-of-the-art micro-architectural fault injectors (uA FI). The reliability analysis targets several microprocessor hardware structures: L1/L2 cache, Register File and Load Store Queue. Results are reported in Figure 2.a, where reliability is expressed in terms of the Architectural Vulnerability Factor (AVF), that is, the probability that an error occurring in a hardware structure manifests as a fault at the output of the system.

Figure 2.a: AVF computed by uA FI and the proposed Bayesian reliability analyzer

Figure 2.b: Hours of simulation required by uA FI and the proposed Bayesian reliability analyzer

Figure 2.b shows the time required to perform the analysis, expressed in hours of simulation. The results demonstrate that our analysis is accurate and fast even for complex industrial applications (FMS and Tsunami), making it suitable for integration into commercial applications. Finally, the optimization process for a specific system is illustrated in Figure 3.a. Results are presented for a reliability-only constraint as well as in the presence of other design constraints (Figure 3.b).

#### 5. Conclusions

The goals of this work were successfully achieved. Future work will improve the accuracy of the reliability estimation and reduce the time required by the analysis.

#### 6. References

1. A. Vallero et al., "Cross-Layer System Reliability Assessment Framework for Hardware Faults", IEEE International Test Conference, 2016.
2. A. Vallero et al., "Cross-layer reliability evaluation, moving from the hardware architecture to the system level: A CLERECO EU project overview", Microprocessors and Microsystems, vol. 39, no. 8, pp. 1204-1214.



Figure 3.a: Reliability-only optimization for the FFT benchmark running on ARM® Cortex®-A15



Figure 3.b: By exploiting different cost functions, different trade-offs between reliability, timing, power and area can be achieved.