TÜV SÜD has more than 20 years of experience in testing and certifying functional safety related systems and components. During this time, our employees have observed recurring issues arising from misunderstandings of key functional safety concepts. The list below highlights some of the most common errors everyone should be aware of.
Often manufacturers just calculate a PFD/PFH value for their system or subsystem and then claim a SIL for it. The PFH/PFD express the probability of a dangerous failure of a safety related system (or subsystem) per hour (PFH) or on demand (PFD). Both values address random hardware faults and are usually calculated using FMEDAs. Fulfilling a specific Safety Integrity Level (SIL) requires not only the control of random hardware failures but also the avoidance and control of systematic failures in hardware and software. The latter is expressed as Systematic Capability (SC, values from 1 to 4, corresponding to the four SIL values) and reflects the methods and techniques used during the development of the safety related system. Therefore, a SIL always consists of both: the PFD/PFH and a determination of the robustness of the development process, i.e. the SC.
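The point can be illustrated with a small sketch (not a certification tool): mapping a calculated PFH value (high-demand mode) to the SIL bands defined in IEC 61508-1, and showing that the resulting SIL claim is capped by the Systematic Capability. The numeric inputs are assumed values for illustration.

```python
# Illustrative sketch: the random-hardware part (PFH) alone never "gives"
# a SIL; the Systematic Capability (SC) of the development process must
# match as well. PFH bands per IEC 61508-1 (high-demand/continuous mode).

def sil_from_pfh(pfh_per_hour: float):
    """Return the SIL band containing the PFH value, or None if the
    PFH is too high even for SIL 1."""
    if pfh_per_hour >= 1e-5:
        return None          # worse than the SIL 1 band
    if pfh_per_hour >= 1e-6:
        return 1
    if pfh_per_hour >= 1e-7:
        return 2
    if pfh_per_hour >= 1e-8:
        return 3
    return 4                 # at or below the SIL 4 band

def claimed_sil(pfh_per_hour: float, systematic_capability: int):
    """A SIL claim is limited by BOTH the PFH band and the SC."""
    hw_sil = sil_from_pfh(pfh_per_hour)
    if hw_sil is None:
        return None
    return min(hw_sil, systematic_capability)

# A PFH in the SIL 3 band combined with only SC 2 supports at most SIL 2:
print(claimed_sil(5e-8, 2))  # -> 2
```

The `min()` at the end is the whole argument of this section in one line: a good FMEDA result cannot compensate for a development process that only reaches a lower Systematic Capability.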
Sometimes system integrators and plant manufacturers require their suppliers to deliver control systems for normal operation with a SIL, thinking that this will ensure a certain reliability of the control system and/or utility. But the aim of a safety function (which is performed by a safety related system) is to put the Equipment Under Control (EUC) into a safe state, not to increase availability.
The safe state of an EUC is a result of the hazard and risk analysis and depends on its different operational modes. In this context, we frequently observe misunderstandings about strategies and concepts regarding fail-safe scenarios, e.g. shutting down the EUC in case of a failure, and fail-operational scenarios, i.e. keeping the EUC in operation as far as possible.
Random hardware failures (and reliability) are calculated based on failure rates. In terms of functional safety, failure rates are split into safe and dangerous failures. Only dangerous failures (those preventing the safety function from performing as intended) are considered in the calculation of the PFD/PFH values. A SIL is therefore only a degree of reliability that the safety function will perform as intended when it is required to put the EUC into a safe state.
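As a worked sketch of this split, the simplified IEC 61508-6 formula for a single-channel (1oo1) low-demand function, PFD_avg ≈ λ_DU · T1 / 2, shows that only the dangerous *undetected* failure rate enters the result. All numeric inputs below are assumptions, not values from any real FMEDA.

```python
# Sketch: simplified average PFD for a 1oo1 low-demand safety function,
# PFD_avg ≈ lambda_DU * T1 / 2 (IEC 61508-6 simplified formula, ignoring
# repair times and dangerous *detected* failures). Safe failures do not
# degrade the PFD at all.

lambda_total = 2.0e-6   # total failure rate per hour (assumed FMEDA result)
safe_fraction = 0.5     # share of safe failures (assumption)
dc = 0.9                # diagnostic coverage of dangerous failures (assumption)

lambda_d  = lambda_total * (1 - safe_fraction)   # dangerous failures
lambda_du = lambda_d * (1 - dc)                  # dangerous *undetected* failures

t1 = 8760.0             # proof-test interval: one year, in hours
pfd_avg = lambda_du * t1 / 2

print(f"lambda_DU = {lambda_du:.2e}/h, PFD_avg = {pfd_avg:.2e}")
```

With these assumed numbers the result is PFD_avg ≈ 4.4e-4, i.e. within the SIL 3 PFD band (1e-4 to 1e-3) for the random-hardware part only; as discussed above, the systematic capability must support the claim as well.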
Watchdogs of microcontrollers (µCs) often just reset the controller but do not control any outputs in a direct, independent way. In case of a fault inside the µC, a deterministic behaviour of the output is required. It is not enough to use an internal watchdog, because it is not guaranteed that the output will go into the desired state: the internal watchdog is part of the defective microcontroller, so its correct operation cannot be guaranteed.
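A toy simulation (illustrative only, all names and timings invented) shows why the watchdog must be able to de-energize the safety output through a path independent of the µC: if the µC hangs and stops kicking, the output is still forced to the safe state.

```python
# Toy model of an *external* watchdog that gates the safety output
# through an independent hardware path. An internal watchdog that only
# resets the µC cannot provide this guarantee, because it sits inside
# the (possibly defective) µC itself.

class ExternalWatchdog:
    def __init__(self, timeout_ticks: int):
        self.timeout = timeout_ticks
        self.counter = 0
        self.output_enabled = True   # watchdog gates the output driver

    def kick(self):
        self.counter = 0             # serviced by the µC main loop

    def tick(self):
        self.counter += 1
        if self.counter > self.timeout:
            # Independent path: cut the output regardless of µC state.
            self.output_enabled = False

def run(mcu_alive_ticks: int, total_ticks: int, wd: ExternalWatchdog) -> bool:
    """Simulate a µC that hangs after `mcu_alive_ticks`; return the
    final state of the safety output (True = energized)."""
    for t in range(total_ticks):
        if t < mcu_alive_ticks:
            wd.kick()                # µC still running its main loop
        wd.tick()
    return wd.output_enabled

# Healthy µC: output stays energized.
print(run(100, 100, ExternalWatchdog(timeout_ticks=3)))   # True
# µC hangs at tick 10: the external watchdog forces the safe state.
print(run(10, 100, ExternalWatchdog(timeout_ticks=3)))    # False
```

The essential design point is that `output_enabled` is owned by the watchdog, not by the µC: the de-energized state is reached deterministically even when the µC no longer executes any code.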
We still observe attempts to claim systematic capability for existing software based on operational experience (Route 2S in IEC 61508). According to IEC TS 61508-3-1, all individual failures of the software need to be detected and reported during the observation period, and all combinations of input data, sequences of execution and timing relations must be documented. From a practical point of view, this approach is usually not feasible.
We frequently observe system designs in which non-safety parts/components/design elements are mixed with safety related ones. This is a standard approach to reduce complexity and system costs. However, evidence of freedom from interference is required when mixed safety integrity level design techniques are used. The analysis shall be performed at detailed hardware or software level (depending on the system), considering all possible failure modes of the non-safety parts and their impact on the safety related ones (e.g. overvoltage, wrong data, short circuits in non-safety parts, etc.).
In the development of safety critical software, the usage of automated tools is nowadays state of the art in terms of efficiency and safety. However, we have repeatedly seen such tools used incorrectly. The unit test specification shall be written primarily based on the software unit/module specification, not only on a white-box analysis of the source code. The goal is to test the desired functional behaviour of the software unit. Evidence of the required test coverage at code level, for which white-box analysis may be needed, shall be provided as well, but this is not the only goal of unit testing.
In modern complex safety relevant systems, the usage of smart sensors is gaining more and more importance. The functions of the sensors shall be provided with the required integrity, or freedom from interference with the overall safety function shall be justified. Some system designs are wrongly based on the assumption that the failure modes of complex smart sensors (e.g. with a logic processing unit, communication protocol, etc.) can be diagnosed and mitigated by some external safety unit within the required diagnostic coverage and failure detection/reaction time.
We see recurring issues related to missing protection/supervision of the power supply. The safety standards do not specify in detail how overvoltage/undervoltage measures shall be implemented, but suitable protection mechanisms shall be in place depending on the required SIL. The usage of a non-redundant overvoltage protection mechanism in an architecture requiring HFT > 1 is a typical recurring issue. Supply voltage monitoring (including the required reaction, e.g. switch-off and transition to the safe state) should be done by a separate component which is capable of a wider voltage range and is thus not affected by the over-/undervoltage condition. Even a short overvoltage might permanently damage parts of the microcontroller without this being obvious. Thus, a reset of the microcontroller after an overvoltage is not a suitable measure; after an undervoltage condition a reset can be acceptable, as long as the undervoltage is not persisting.
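The asymmetry between the two cases can be sketched as a small supervision policy, as it might be implemented in a separate supervisor component with a wider operating voltage range than the µC. All thresholds and names are assumptions for illustration.

```python
# Illustrative sketch of the supervision policy described above:
# overvoltage latches the safe state (a plain µC reset is NOT enough,
# since latent damage must be assumed), while after undervoltage a
# restart is acceptable once the supply is back in the valid window.

V_MIN, V_MAX = 4.5, 5.5   # assumed valid supply window in volts

class SupplySupervisor:
    def __init__(self):
        self.overvoltage_latched = False

    def allow_mcu_run(self, v_supply: float) -> bool:
        if v_supply > V_MAX:
            # Overvoltage may have caused latent damage: latch the
            # safe state until the device is inspected/replaced.
            self.overvoltage_latched = True
        if self.overvoltage_latched:
            return False
        # After undervoltage, a restart is acceptable once the supply
        # is back inside the valid window.
        return V_MIN <= v_supply <= V_MAX

sup = SupplySupervisor()
print(sup.allow_mcu_run(5.0))   # True  - normal operation
print(sup.allow_mcu_run(3.0))   # False - undervoltage, µC held in reset
print(sup.allow_mcu_run(5.0))   # True  - supply recovered, restart OK
print(sup.allow_mcu_run(6.5))   # False - overvoltage -> latched
print(sup.allow_mcu_run(5.0))   # False - stays latched despite good supply
```

The latch is the crucial element: the supervisor treats undervoltage as a recoverable condition but overvoltage as potentially destructive, exactly the distinction made in the paragraph above.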
The overtemperature detection and the respective reaction (enter safe state, switch-off, …) cannot be performed by a component (e.g. a microcontroller) which is itself operated outside its specified range during the overtemperature condition. A simple reset is not helpful here either: it has to be assumed that the overtemperature condition persists, or is already present during switch-on. Low temperature is sometimes not considered by safety standards (e.g. EN 50129), but the application conditions (user manual) should ensure that the device is only operated within the specified temperature ranges (for all components).
The abbreviation SIL (Safety Integrity Level) is used by several standards. In each standard, however, it relates to different process, architectural and technical requirements. Some standards use different abbreviations, such as ASIL (Automotive SIL). Others use the same abbreviation (SIL), with the risk of confusion and misunderstandings.
For example, the SAS level (German: “Sicherheitsanforderungsstufe”, i.e. safety requirement level) as used by the SIRF (Sicherheitsrichtlinie Fahrzeug; EBA) is not the same as the SIL (“Safety Integrity Level”) as used by EN 50128/EN 50129. SIRF (SAS) only uses part of the measures listed in Annex E of EN 50129, with fewer organizational and process measures and no defined limit for the hazard rate. SIL in the context of IEC 61508 is different from SIL in the context of EN 50129. Sometimes more complexity is added by defining SIL as Software Integrity Level instead of Safety Integrity Level (see e.g. EN 50657).
It is not possible to raise the SIL of a system by combining systems that have no safety evidence, or by using homogeneous redundancy. For example, it is not possible to reach SIL 4 by simply combining several SIL 2 systems/channels (or, likewise, to reach SIL 3 at system level with several SIL 2 control systems). One reason is that SIL 2 software processes do not prevent systematic software faults to the same degree (integrity) as required for SIL 4. IEC 61508 allows the systematic capability to be increased by 1 if two independent systems are used in a dual-channel structure. In case of homogeneous redundancy this is not possible. Other standards (e.g. EN 50129) do not provide this possibility.
Common cause failures are often forgotten in Fault Tree Analysis (FTA). Failures which can affect several parts of a system at the same time (e.g. loss of power supply) must be included as basic events in all relevant branches of the fault tree. Alternatively, common cause can be modelled by using a “beta factor” in the FTA tool.
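A short sketch with assumed numbers shows the effect of the beta factor: each channel's dangerous undetected failure rate is split into an independent part (1−β)·λ and a common cause part β·λ that appears as a single basic event affecting both channels, i.e. exactly the event that is often missing from the fault tree. The 1oo2 formula below is the simplified IEC 61508-6 style expression with repair terms omitted.

```python
# Modelling common cause in a redundant 1oo2 channel pair with a beta
# factor (all numeric inputs are assumptions for illustration).

lambda_du = 1.0e-6   # dangerous undetected rate per channel, per hour
beta = 0.05          # assumed beta factor (5 % common cause fraction)
t1 = 8760.0          # proof-test interval in hours

lam_ind = (1 - beta) * lambda_du   # independent failures, per channel
lam_ccf = beta * lambda_du         # common cause event, hits BOTH channels

# Simplified 1oo2 average PFD (repair/MTTR terms omitted):
pfd_independent = (lam_ind * t1) ** 2 / 3   # both channels fail independently
pfd_common      = lam_ccf * t1 / 2          # single common cause basic event
pfd_1oo2 = pfd_independent + pfd_common

print(f"independent contribution: {pfd_independent:.2e}")
print(f"common cause contribution: {pfd_common:.2e}")
print(f"total 1oo2 PFD:           {pfd_1oo2:.2e}")
```

With these assumed numbers the common cause branch dominates the result by roughly an order of magnitude, which is why omitting it makes the fault tree far too optimistic about the benefit of redundancy.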
The definition of SIL (Safety Integrity Level) applies to functions that were originally derived from some kind of risk analysis/classification. For a system/component it is common to hear sentences like “a SIL 2 controller”, “a SIL 3 brake system”, etc. Strictly speaking, the term SIL relates only to specific (safety) functions. Identifying which functions are safety functions is the goal of the risk analysis. A more correct formulation is that a “system is capable of implementing safety functions up to SIL x”. This means that the process, architectural and technical requirements of the standards are fulfilled by the system.
The borders of the system to be analyzed from a safety point of view, and the interfaces to other system parts, are often not clearly defined at the beginning of a project. For example, it is not clear which components are part of the system, which sensor/actuator variants shall be considered, etc. Starting a safety analysis without defining the scope is simply not feasible: it leads to several iterations each time the scope is changed.
There is no such thing as a separate SIL for the failure detection and monitoring blocks (diagnostics) belonging to a safety function classified to a certain SIL. Some standards allow the required integrity level to be reduced by 1, but from a functional safety point of view the diagnostics are part of the safety function. Integrity requirements apply to the diagnostics as well. Further decomposition analysis can allow the integrity level to be reduced, but usually not any safety requirements to be excluded.
Some integrity/diagnostic measures are performed at each startup of the system (e.g. RAM test, flash CRC check, output tests, etc.). These tests are very important, as they are e.g. used in the argumentation on diagnostic measures, or for the fault detection time (test interval) in the safety analysis. However, sometimes the operating conditions change (new project, new requirements, etc.): systems which used to be regularly powered down now stay powered up for longer periods. For example, modern rail systems are nowadays often continuously powered on, without a restart every morning. In such cases the intended diagnostic measures will not become effective. If required by the safety analysis, a regular restart shall therefore be explicitly specified in the safety manual, operator documentation, etc.