TÜV SÜD has more than 20 years of experience in testing and certifying functional safety related systems and components. During this time our employees have observed recurring issues arising from misunderstandings of some functional safety concepts. The list below highlights some of the most interesting errors everyone should be aware of.
PFD/PFH is the most dominant parameter in the discussion of SIL calculations, but it may dominate the dialogue too much. Both values address random hardware faults and are usually calculated with the use of FMEDAs. They both express the likelihood of a dangerous failure of a safety related system or sub-system. But they are not the only calculation involved in determining a SIL and cannot stand alone as the sole data point, because random hardware faults are not the only concern. It is also important to avoid and control systematic failures in hardware and software. Systematic capability reflects the methods and techniques used during the development of the safety related system. The rigor of these methods and techniques is critical in avoiding systematic failures - the type of failure that is inherent in the design. While the PFD/PFH quantifies the likelihood that random hardware failures defeat the safety function when a failure is dangerous, the systematic capability minimises the possibility that errors in the design compromise safety when it is needed most.
Sometimes system integrators and plant manufacturers require their suppliers to deliver control systems for normal operation with a SIL, thinking that this will ensure a certain reliability of the control system and/or utility. But the aim of a safety function (which is performed by a safety related system) is to put the Equipment Under Control (EUC) into a safe state, not to increase availability.
The safe state of an EUC is a result of the hazard and risk analysis and depends on its different operational modes. In this context, we frequently observe misunderstandings about strategies and concepts regarding fail safe scenarios (e.g. shutting down the EUC in case of a failure) and fail operational scenarios (e.g. keeping the EUC in operation as far as possible). In fact, using SIL as a measure of the reliability of the control system is flawed because of the way the calculations work.
Random hardware failures (and reliability) are calculated based on failure rates. In terms of functional safety, failure rates are split into safe and dangerous failures. Only dangerous failures (those preventing the safety function from performing as intended) are considered in the calculation of the PFD/PFH values. A SIL is therefore only a measure of how reliably the safety function will perform as intended when it is required to put the EUC into a safe state; it does not reflect the reliability of the control system.
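To illustrate why a SIL does not describe the reliability of the control system, the following minimal sketch uses the widely used simplified approximation PFDavg ≈ λ_DU · T / 2 for a single channel with proof test interval T. The failure rates, the proof test interval and all function names are illustrative assumptions, not values from a real FMEDA. The bands are the low demand mode ranges of IEC 61508, and falling into a band covers the random hardware part of a SIL claim only.

```python
# Low demand mode bands for the average PFD (IEC 61508): (SIL, lower, upper).
SIL_BANDS_LOW_DEMAND = [
    (4, 1e-5, 1e-4),
    (3, 1e-4, 1e-3),
    (2, 1e-3, 1e-2),
    (1, 1e-2, 1e-1),
]


def pfd_avg_1oo1(lambda_du_per_hour, proof_test_interval_hours):
    """Simplified single-channel approximation: PFDavg = lambda_DU * T / 2."""
    return lambda_du_per_hour * proof_test_interval_hours / 2.0


def sil_band_low_demand(pfd_avg):
    """Return the SIL band a PFDavg falls into, or None if outside all bands."""
    for sil, lower, upper in SIL_BANDS_LOW_DEMAND:
        if lower <= pfd_avg < upper:
            return sil
    return None


# Hypothetical device: many safe failures, few dangerous undetected ones.
lambda_safe = 5e-6          # safe failures per hour (assumed)
lambda_du = 2e-7            # dangerous undetected failures per hour (assumed)
proof_test_interval = 8760  # one year in hours (assumed)

pfd = pfd_avg_1oo1(lambda_du, proof_test_interval)
band = sil_band_low_demand(pfd)

# The overall failure rate, which drives availability, includes safe failures.
mtbf_hours = 1.0 / (lambda_safe + lambda_du)

print(f"PFDavg = {pfd:.2e}, band = SIL {band}, MTBF = {mtbf_hours:.0f} h")
```

In this example the safe failure rate never enters the PFD calculation. The device lands in the SIL 3 band for random hardware failures, yet it would still fail (mostly to the safe side) roughly once every 22 years of operation per unit, which is what a plant operator interested in availability would actually experience.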
Internal watchdogs cannot be relied on alone to detect a fault and provide a signal to indicate that there is a fault. The watchdog is part of the possibly defective microcontroller; therefore it is not possible to rely on the watchdog to provide the diagnostic coverage. In addition, it is important to point out that many watchdogs do not provide an output but only reset the controller.
IEC 61508 includes the possibility of using a proven in use approach for both hardware (Route 2H) and software (Route 2S). For software, individual failures need to be detected and reported during the observation period, and all combinations of input data, sequences of execution and timing relations must be documented. Each developer needs to weigh the pros and cons of evaluating the systematic capability by applying the techniques and measures against the proven in use approach, but most often proven in use for software (Route 2S) is not a practical approach.
Non-safety parts/components/design elements are often mixed with safety systems. This is a standard approach that can reduce complexity and system costs. However, to do this, you must provide evidence of freedom from interference when design techniques of mixed safety integrity levels are used. The analysis includes a detailed hardware or software failure mode analysis that covers the non-safety parts and their effect on the safety system.
In the development of safety critical software, the use of automated tools is nowadays state of the art in terms of efficiency and safety. However, we have repeatedly seen these tools misused. First, the unit test specification shall be written based on the software unit/module specification and not only by white-box analysis of the source code. The goal is to test the functional behaviour of the software unit and determine whether that behaviour corresponds to the safety specification. White-box analysis is still a useful tool at this point, but on its own it does not provide evidence of the test coverage required at the code level.
In modern complex safety relevant systems, the use of smart sensors is becoming more and more prevalent. The sensors must either have a SIL that corresponds to the risk level, or the designer is required to provide proof of freedom from interference with the overall safety function. Some system designs are wrongly based on the assumption that the failure modes of complex smart sensors (i.e. with logic processing unit, communication protocol, etc.) can be diagnosed and mitigated by some external safety unit within the required diagnostic coverage and failure detection/reaction time.
Power supplies are typically overlooked. We see recurrent issues related to missing protection/supervision of the power supply. The safety standards do not specify in detail how overvoltage/undervoltage measures shall be implemented, but, depending on the SIL, suitable mechanisms for protection shall be in place. One of the most common oversights is not implementing redundancy on the overvoltage protection when HFT >1 is required. Another common oversight is the lack of a separate monitoring component: supply voltage monitoring (including the required reaction, such as switch-off and entering the safe state) should be done by a separate component which tolerates a wider voltage range and is thus not itself affected by the over/undervoltage condition. Even a short overvoltage may permanently damage parts of the microcontroller without the damage being obvious. Thus, a reset of the microcontroller after an overvoltage is not a suitable measure; after an undervoltage condition, however, a reset can be acceptable if the undervoltage does not persist.
The overtemperature detection and the respective reaction (enter safe state, switch-off, …) cannot be done by a component (e.g. a microcontroller) which is itself operated outside its specified temperature range. A simple reset is also not sufficient, as it has to be assumed that the overtemperature condition persists or is already present during switch-on. Low temperature is sometimes not considered by safety standards (e.g. EN 50129), but the application conditions (as stated in the user manual) should make sure that the device is only operated within the specified temperature range (for all components).
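The different reactions to overvoltage and undervoltage can be sketched as a small decision model. This is only an illustration of the policy described above, under assumed thresholds; in a real design this logic would live in a separate supervisor component with a wider voltage range, not in software on the supervised microcontroller. The class name, the reaction names and the voltage limits are hypothetical.

```python
from enum import Enum


class Reaction(Enum):
    RUN = "run"
    HOLD_IN_RESET = "hold_in_reset"            # released once the supply recovers
    LATCHED_SAFE_STATE = "latched_safe_state"  # never released automatically


class SupplySupervisor:
    """Models the reaction policy of an external supply voltage monitor."""

    def __init__(self, undervoltage_limit, overvoltage_limit):
        self.undervoltage_limit = undervoltage_limit
        self.overvoltage_limit = overvoltage_limit
        self.overvoltage_seen = False

    def evaluate(self, supply_voltage):
        # Any overvoltage may have damaged the controller unnoticed,
        # so the safe state is latched and a reset is not accepted.
        if supply_voltage > self.overvoltage_limit:
            self.overvoltage_seen = True
        if self.overvoltage_seen:
            return Reaction.LATCHED_SAFE_STATE
        # Undervoltage: keep the controller in reset while it persists.
        if supply_voltage < self.undervoltage_limit:
            return Reaction.HOLD_IN_RESET
        return Reaction.RUN


supervisor = SupplySupervisor(undervoltage_limit=4.5, overvoltage_limit=5.5)
for volts in (5.0, 4.2, 5.0, 6.1, 5.0):
    print(volts, supervisor.evaluate(volts).value)
```

The last sample shows the asymmetry: after the undervoltage dip the model returns to normal operation, while after the overvoltage event it stays in the safe state even though the supply is back in range.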
The abbreviation SIL (Safety Integrity Level) is used by several standards. In each standard, however, it relates to different process, architectural and technical requirements. Some of the standards use different abbreviations like ASIL (Automotive SIL). Others use the same abbreviation (SIL), with the risk of confusion and misunderstandings.
For example, the SAS (German: "Sicherheitsanforderungsstufe", for Safety Integrity Level) as used by SIRF (Sicherheitsrichtlinie Fahrzeug; EBA) is not the same as the SIL ("Safety Integrity Level") as used by EN 50128 / EN 50129. SIRF (SAS) uses only part of the measures listed in appendix E of EN 50129, fewer organizational and process measures, and no defined limit for the hazard rate. SIL in the context of IEC 61508 is different from SIL in the context of EN 50129. Sometimes more complexity is added by defining SIL as Software Integrity Level instead of Safety Integrity Level (see e.g. EN 50657).
It is not possible to raise the SIL of a system by simply combining SIL systems. For example, it is not possible to increase the SIL to SIL4 by simply combining several SIL2 systems/channels, or to reach SIL3 on system level with several SIL2 control systems. For example, SIL2 software processes do not prevent systematic software faults in the way required for SIL4. The IEC 61508 standard does allow for an increase in the systematic capability by 1 if two independent systems are used in a dual channel structure. In the case of homogeneous redundancy this is not possible. Other standards (e.g. EN 50129) do not provide this possibility.
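The rule for combining channels can be written down as a small helper. This is a deliberately simplified sketch of the rule as stated above, not a reproduction of the normative text: the function name and its parameters are hypothetical, and taking the lower of the two capabilities as the starting point is an assumption made for the example.

```python
def combined_systematic_capability(sc_a, sc_b, independent, diverse):
    """Systematic capability claimable for a dual channel structure.

    Assumption for this sketch: the weaker channel sets the baseline, and
    at most one step is gained, only for independent AND diverse channels.
    """
    baseline = min(sc_a, sc_b)
    if independent and diverse:
        return min(baseline + 1, 4)  # never more than one step, capped at 4
    return baseline  # homogeneous redundancy gains nothing


# Two independent, diverse SC 2 channels: one step up.
print(combined_systematic_capability(2, 2, independent=True, diverse=True))
# Two identical SC 2 channels (homogeneous redundancy): no gain.
print(combined_systematic_capability(2, 2, independent=True, diverse=False))
```

However many SC 2 channels are stacked, this rule never yields SC 4, which matches the statement that SIL4 cannot be reached by combining SIL2 systems.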
Common cause is often forgotten in Fault Tree Analysis (FTA). Failures which can affect several parts of a system at the same time (e.g. loss of power supply) must be included as basic events in all relevant branches of the fault tree. An alternative approach is to model common cause by using a "beta factor" in the FTA tool.
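The effect of forgetting common cause can be shown with a minimal beta factor sketch for two redundant channels, where the top event requires both channels to fail. The channel failure probability and the beta value are illustrative assumptions, and the function names are hypothetical; a real FTA tool applies the same idea to the full tree.

```python
def top_event_without_common_cause(p_channel):
    """AND gate over two channels assumed to be fully independent."""
    return p_channel * p_channel


def top_event_with_beta_factor(p_channel, beta):
    """Split each channel failure into an independent and a common part.

    The fraction beta of the channel failure probability is treated as one
    shared basic event that fails both channels at once.
    """
    p_independent = (1.0 - beta) * p_channel
    p_both_independent = p_independent * p_independent
    p_common_cause = beta * p_channel
    # OR gate over "both fail independently" and "common cause event".
    return 1.0 - (1.0 - p_both_independent) * (1.0 - p_common_cause)


p = 1e-3     # assumed failure probability of one channel
beta = 0.1   # assumed common cause fraction

naive = top_event_without_common_cause(p)
with_ccf = top_event_with_beta_factor(p, beta)
print(f"without common cause: {naive:.3e}")
print(f"with beta factor:     {with_ccf:.3e}")
```

With these assumed numbers the common cause term dominates the result by about two orders of magnitude, so a fault tree that omits it is strongly optimistic.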
The definition of SIL (safety integrity level) applies to functions that were originally derived by some kind of risk analysis/classification. For a system/component it is common to hear phrases like "a SIL 2 controller", "a SIL3 brake system", etc. Strictly speaking, a SIL relates only to specific (safety) functions, and identifying which functions are safety functions is the goal of the risk analysis. A more correct wording is that a "system is capable of implementing safety functions up to SILx". This means that the process, architectural and technical requirements of the standards are fulfilled by the system.
It is important to clearly define the borders of the system. The borders of the safety system to be analysed, and the interfaces to other system parts, are often not clearly defined at the beginning of a project. It is then not clear which components are part of the system, which sensor/actuator variants shall be considered, etc. Starting a safety analysis without defining the scope will cause the analysis to undergo a needless iteration each time the scope is changed.
There is no such thing as a separate SIL for the failure detection and monitoring blocks (diagnostics) that belong to a safety function classified to a certain SIL. Some standards allow the required integrity level to be reduced by 1, but from a functional safety point of view the diagnostics are part of the safety function, and integrity requirements apply to them as well. Further decomposition analysis can allow the integrity level to be reduced, but usually not the exclusion of any safety requirements.
Some integrity/diagnostic measures are performed at each startup of the system (e.g. RAM test, Flash CRC check, output tests, etc.). These tests are very important, as they are used, for example, in the argumentation on diagnostic measures or for the fault detection time (test interval) in the safety analysis. However, operating conditions can change (new project, new requirements, etc.). Systems which used to be powered down regularly now stay powered up for longer periods. For example, modern rail systems are nowadays often continuously powered on, without a restart every morning. In such cases the intended diagnostic measures will not become effective. If the safety analysis depends on it, a regular restart shall therefore be explicitly specified in the safety manual, operator documentation, etc.
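One way to make the assumed test interval visible at runtime is to compare the uptime against the interval used in the safety analysis. The following is a minimal sketch of that check only; the class name, the 24 hour interval and the reaction to an overdue restart are assumptions for illustration and would have to come from the actual safety analysis and safety manual.

```python
from dataclasses import dataclass


@dataclass
class StartupTestPolicy:
    """Holds the test interval that the safety analysis assumes."""

    assumed_test_interval_hours: float

    def restart_overdue(self, uptime_hours):
        # Startup tests (RAM test, Flash CRC, output tests) only run at
        # power-up, so uptime is the time since they were last executed.
        return uptime_hours > self.assumed_test_interval_hours


policy = StartupTestPolicy(assumed_test_interval_hours=24.0)  # assumed value

for uptime in (8.0, 24.0, 72.0):
    status = "restart overdue" if policy.restart_overdue(uptime) else "ok"
    print(f"uptime {uptime:5.1f} h: {status}")
```

A system that was designed for a daily restart and now runs for 72 hours is flagged here, which is the situation in which the startup diagnostics no longer support the fault detection time claimed in the analysis.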