

MILITARY TECHNICAL COLLEGE

CAIRO - EGYPT

# MODELING AND DIAGNOSABILITY OF FAULTY UNITS IN DIGITAL COMPUTER NETWORKS

By

Dr. M. E. ELBABLY

Dept. of Telecommunications & Electronics, Faculty of Eng. & Technology - Helwan University Helwan - Cairo - Egypt.

### Abstract

With a field replaceable-unit (FRU) (which contains more electronics chips) it is becoming uneconomical to throw a faulty FRU (module or card or board in digital electronic networks) away without trying to repair it (or at least salvage parts). Testability must therefore also include diagnosability, i.e., the capability for locating faults at least down to smallest-repair-replaceable-unit (SRRU) level, in distinction to the FRU. Without diagnosis, repair will be impossible.

Many researchers have contributed in diagnostic modeling of digital systems, also the chracterisation of different diagnostic methods have been established. Most of these works have been concentrated on the identification of faulty units only in digital systems.

In this research, the necessary and sufficient conditions which are required to identify the faulty parts using different diagnostic models, whether these faults in the system units (main units, which are work in normal mode) or in the comparator units (work in test mode) will be presented. Also, an algorithm for distributed diagnosis in computer networks will be proposed.

#### INTRODUCTION

Several papers [1-6] have been concerned with system level fault diagnosis and have had as an objective the determination of conditions for diagnosability. Interest in this topic is motivated by the need for highly-available digital systems that can continue operation, perhaps at reduced capacity, when multiple hardware failures occurs. The approach to such systems via reconfiguration or standby sparing requires that the presence of malfunctioning elements be detected and their location determined to within a system module, i.e., the system must be diagnosable.

Of all the models that have been proposed for fault diagnosis in multiprocessor systems, the most extensively studied is the PMC (symmetrical) model which was introduced by

Preparata, Metze and Chien [1]. This model describe a class of systems which can be decomposed into that are capable of testing each other. Fault diagnosis with PMC model was studied initially for the special case where all the faulty units in the system fail, the source of a fault condition being permanent. Testing is complete to the extent that any fault-free unit which tests a faulty unit will always detect its failed state. Clearly, such permanent failures and complete testing must be considered to be an exception.

In this model the system is usually partitioned into replaceable subsystems (chips, or modules, or cards, or boards). These subsystems need not be identical, although they must be powerful enough to test, either individually or in combination, other subsystems. Each test involves the controlled application of stimuli and the observation of the corresponding responses. In case of digital networks instructions or micro-operations as in ESS (Electronic Switching System). or in the self diagnosable computers.

On the basis of the responses to the stimuli, the outcome of the test may be classified simply as "pass" or "fail" or detailed information about the nature of the failure may be retained for further analysis. In either case, the testing unit evaluates the tested unit as either fault-free, or faulty. Of course, this evaluation is meaningful only if the testing unit is itself fault free, otherwise the test outcome is unreliable. If each part of the system is tested in some sequence, the combined test outcomes can be exhibited naturally as a direct graph (digraph) with binary weights. In this graph each unit vi of the system will be a node of the graph, and the presence of a testing link eij signifies the fact that there is a test in which vi evaluatesj. The weight associated with  $eij=\{0,1\}$ ; wij is "0" if both vi and vj are fault-free; wij is "1" if vj is faulty. In the case that vi is faulty, the test outcome is unreliable and wij can be assumed to be wij=x, i.e., either of the values "0" or "1", regardless of the status of vj. Asymmetric and symmetric invalidation assumptions differ only when a faulty unit is involved in a test on another faulty unit. Asymmetric invalidation means that such a test always fails, whereas symmetric invalidation means that due to the faults, any test outcome is possible. asymmetrical assumption is clearly more conservative. The advantage of the asymmetric assumption is that for a given system it often leads to significantly higher diagnosability. Much research work has been done based on the above two assumptions [1,7], the most important of which was formulated by Russell and Kime [8]. Chwa and Hakimi [4] were the first to formulate a diagnostic model (comparison model) which considered as a modification of PMC (symmetrical) and asymmetrical model, in an attempt to have a more realistic representation of systems whose units have a rather complex structure. This model has become the most important model to use in system-level diagnosis, particularly with the great advancement in testability design today.

In comparison diagnostic model, the weight of the edge eij, denoted by w(vi,vj) or wij, is "0" if the status of both units vi and vj are fault-free, and is "1" if unit vi or vj or both vi

and vj are faulty according to the comparison (CH [4]) model. Test invalidations are not possible with this model. In this paper the necessary and sufficient conditions, for system level diagnosis will be investigated, using different diagnostic model. This investigation to identify the faulty units whether its among the main units (which works during normal operation mode) or among the comparator units (works only during test mode).

# Diagnosis and fault identification

A t-diagnosable system with faulty comparators can be explained as follows.

## Definition 1

A t-tc diagnosable system is a system in which the status of all the units can be determined provided that the number of faulty units is at most t (diagnosability) and the number of faulty comparators is at most tc (comparability). The above definition can be used to derive the characterization theorem for t-tc diagnosis.

#### Theorem 1

Let G=(V,E) be the undirected graph representing the comparison assignment for a system S made of n units; S is t-tc diagnosable if and only if S is t-diagnosable and  $\Gamma(vi) > tc+t$  for every i=j, vi,  $vj \notin V$ , (where  $\Gamma(vi)$  is the number of units compared with the unit vi).

#### Proof:

The condition is necessary. Assure that the faulty comparators to are connected between the unit vi and other to fault free units. The rest of units (F(vi)-tc) which are connected with vi are itself faulty. Then the status of vi cannot be identified and this will lead to misdiagnosis of vi if  $(vi) \ge t$  only.



In G a unit vi can be diagnosed appropriately if the majority of comparisons are correct, under this condition  $\{\Gamma'(vi)\}\} = t + tc.$ 

# Extension to the comparison model

The analysis presented in the preceding section is convenient to the PMC [1], BGM [3] and comparison models [9]. In this section the analysis is extended to the comparison model (KH2).

The KH2 model is characterized by no invalidation. Note that the diagnosability in the KH2 model is given by a larger number of units, i.e n>=t+2. This restricts the chracterisation of t-tc diagnosable system because comparison invalidation introduces an ambiguity in the comparison assignment. This reduces the number of simultaneously faulty units with respect to n. With other models (PMC, BGM, KH1) where n>=2t+1 there is no such a restriction with t-tc diagnosable system if  $\Gamma$  (vi)>=t+tc.The new chracterisation for the comparison or KH2 model in t-tc diagnosable system is given by the following theorem.

# Theorem 2

In comparison model a system is t-tc diagnosable if and only if  $n \ge t + tc + 2$ .

### Proof

This theorem can be proved by stating that in the previous diagnosability (t<=n-2) is not applicable under comparator invalidation due to faulty comparator between the faulty two fault-free units in a t-tc diagnosable system with n=t+2 can cause the only 0-weighted edge to be 1. This precludes every possibility of correct fault identification. The bound for diagnosability as in the KH1 model (n>=2t+1) applies by the same Using theorems 1 and 2

Using theorems 1 and 2, a new criterion has been found for diagnosis by comparison: by assuming failure of comparators the nature of any diagnostic model is equivalent.

The different relation between number of units in S and the diagnosability in the diagnostic models [10] will be unified here in an unique condition n>=t+tc+2. This condition is unique because invalidation is not only dependent on the diagnosability, but also on the comparability of the system.

# Diagnostic algorithm

#### Notations

There are two binary lists F(vi) and C(vi);

\* F(vi) is defined as the units list; the jth entry, fj is an element in F(vi) and is set to 1 if and only if all the others

entries are rest. It is set to 0 if and only if:

- # if vi is fault-free;
- # vi compared with itself;
  - # comparator unit between vi and vj is fault-free.

For the comparator list C(vi); the jth entry, Cj is an element in C(vi) and is set to "1" if and only if:

- # vi and vj'one of them is faulty;
- # vi, j are fault-free units and cij between vi and vj
  is faulty;
- # all the other entries are 1's.

It is set to "O" if and only if:

- # vi, vj and Cij all are fault-free units;
- # there is no compare unit (link) between vi and vj, Cij =dont care (x), and then change the x's to "0" in the

analysis procedures.

# Algorithm procedures

Multi-rounds will be performed to get final version of the unit list F, on which the faulty units can be identified. Also, the status of faulty comparator units:

- a) Select two arrays [S1], [S2] and initialize them to 0's.
- b) Select the first two comparison lists C(i), C(i+1) and examined both of them:-
  - # if both of them contains a group of "0's" and "1's" do ORing (logic OR) between the units lists F(i) and F(i+1),
  - then put the result in the IS11 array; # if one of C(i) (or C(i+1)) contains \"1's" in all positions take the result of unit list as F(i+1) (or F(i));
    - # if both lists C(i) and C(i+1) all their elements are "l's" the contents of resulting unit list will be "0's".
- c) Increment i=i+2, and repeat step (b) till the comparator lists vanish.

4

- d) Do ORing between between all the elements (column by column) of the array (S1) to get just one row at the final round.
- e) The faulty units can be identified from the final form of ISII which obtained from step (d). These faulty units can be indicated from the positions of the "1's".

  \*\* change the one's in the positions which are the same numbers of faulty units to zero's.
- f) The remaining one in any comparator lists mean the comparator unit between the unit indexed by the comparator list and the unit indexed by the position of that one is faulty.
- g) Increment j=j+1 and repeat step (f) till the rows in \ \sigma \sigma \ \text{vanish.}
- h) Stop.

# Conclusions

The research of self fault diagnosis began with the selection of convenient diagnostic models at system level. Depending on the existing construction of the comparison units (links) between the system units, the system (board, card, module) can be diagnosed, using the algorithm which has been proposed here. This mean the faulty units whether in the main system units or in the comparator units. i.e., the system will be diagnosable and comparable system.

#### References

- [1] F.P.Preparata. G.Metze. and R.T. Chien, "On the connection assignment problem of diagnosable systems". IEEE Trans. on
- Electronics Computers. Vol. EC-16, pp. 848-854, Dec.1967.
- [2] A.Avizient, G.C.Gilley, F.P. Mthur, A.A. Rennels, J.A.Rohr, and D.K. Rubin, "The STAR (self Testing And Repair) computer: an investigation of the theory and practice of fault-tolerant computer design", IEEE Trans. Computers, Vol. C-20, pp. 1312-
- 1321, Nov. 1971.
- [3] F. Barsi, F.Grandoni, and P. Maestrini, "A theory of diagnosability without repair", IEEE Trans. Comput., Vol. C-25, pp. 585-593. June 1976.
- [4] K. y. Chwa and S. L. Hakimi, "On fault identification in

diagnosable systems", IEEE Trans. Comput., Vol. C-30, pp. 414-422, June 1981.

- [5] A.T. Dahbura, K. K. Sabhani and L. L. King, "The comparison approach to multi processor fault diagnosis", proc. FTCS 15, pp. 260-265. Ann arbor, 1985.
- [6] A. K. Somani, V. K. AGarwal, and D. Avis, "A generalized theory for system level diagnosis", IEEE Trans. on Comput., Vol. C-36, No. 15, pp. 538-546, May 1987.
- [7] C. R. Kime, "An analysis model for digital system diagnosis", IEEE TRans. on Comput., Vol. C-19, pp. 1063-1073, Nov. 1970.
- [8] J. D. Russell and C. R. Kime, "On the diagnosability of digital systems", in Dig. Int. Symp. Fault Tolerant Computers, Palo Alto, CA, pp. 139-144, June 1973.
- [9] Kretzer. S. E. and S. L. Hakimi, " Adaptive fault identification in two diagnostic models", Proc.21st Annual

Alerton Conf. on Comm. Control and Comp., pp. 353-362, 1983.

[10] M.E. Elbably, "On the testability and diagnosability of digital systems". Ph.D thesis, Brunel University, The University of west London, Uxbridge- London-U.K, Feb. 1988.

\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*