After teaching the ISA EC 52 course, Advanced Design and SIL Verification, I am reminded how little many practitioners know about SIL verification calculations. This ignorance is often compounded by the use of quick-and-easy software tools that produce improper calculations, either because the users of the software input information that is not reflective of their actual operation, or because the software doesn't ask for the proper information (it is not capable of properly processing it). The problem extends to the ISA TR84.00.02 technical report, which provides calculations loaded with assumptions that are not necessarily true in practice.
To shed some light on the topic, let me explain all of the parts of a SIL verification equation for a 1oo1 voting arrangement and compare them against what is typically modeled. First, a list of all of the components in verbal terms:
- Unreliability due to dangerous undetected failures
- Unavailability due to dangerous detected failures
- Unreliability due to common cause failures
- Unreliability due to never detected failures
- Unavailability due to online testing
I'm not going to rehash the difference between unreliability and unavailability, as I've discussed it in another blog post. Suffice it to say that different equations are used: L * T / 2 for unreliability and L * MTTR for unavailability, where L is the failure rate, T is the test interval, and MTTR is the mean time to repair (or, more precisely, the mean time spent in a dangerous state).
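To make those two building blocks concrete, here is a minimal sketch in Python. The function names and units are my own choices for illustration; nothing here comes from the TR.

```python
# Minimal sketch of the two building blocks used throughout this post.
# Failure rates are in failures per hour; time intervals are in hours.

def unreliability(failure_rate: float, test_interval: float) -> float:
    """Average PFD contribution of failures that stay hidden
    until a periodic test: L * T / 2."""
    return failure_rate * test_interval / 2.0

def unavailability(failure_rate: float, mttr: float) -> float:
    """Fraction of time spent in a failed-dangerous state while
    awaiting repair: L * MTTR."""
    return failure_rate * mttr
```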
Although you see five terms listed above, if you go to the ISA TR84.00.02 technical report, only one term is shown: L(DU) * T / 2. This is item 1 in the list above, the unreliability due to dangerous undetected failures. Why are all of the rest of the terms ignored? Because a lot of assumptions have been made about how you will use the device in question, and these assumptions may not be correct.
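In code, the TR's lone term is a one-liner. The failure rate and test interval below are illustrative values I picked for the example, not data from the TR:

```python
# The lone 1oo1 term in TR84.00.02: PFDavg = L(DU) * T / 2.
lambda_du = 2.0e-6    # dangerous undetected failure rate, per hour (assumed)
ti = 8760.0           # manual proof test interval: one year, in hours
pfd_du = lambda_du * ti / 2.0
print(f"{pfd_du:.2e}")    # 8.76e-03, i.e. in the SIL 2 band
```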
The unavailability due to dangerous detected failures is essentially the fraction of time when a SIS component is unavailable to perform its action because it has failed in a way that is diagnosed, but the device has not yet been repaired. The equation for this term is L(DD) * (MTTR + TI(A)/2), where L(DD) is the dangerous detected failure rate, MTTR is the mean time to repair, and TI(A) is the “automatic” test interval, i.e., the interval at which the diagnostic tests occur. This term is frequently dropped from the PFD calculation because it is not relevant if a detected failure results in a vote to trip. If, for instance, a failure in a pressure sensor is detected by diagnostics, and this failure propagates into an immediate automatic shutdown of the plant, it is essentially converted into a safe failure on the spot. In this paradigm, the SIS component is never unavailable (as the result of a dangerous detected failure) because a shutdown of the plant immediately follows the failure. But if the system is configured so that a diagnosed failure results in an alarm, and the process continues to operate in the presence of the failed component, then the unavailability of the system must be accounted for by adding in the dangerous detected failures term. This fact is often overlooked by practitioners.
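A sketch of this term, with the configuration decision made explicit (the variable names and example values are mine):

```python
def pfd_dd(lambda_dd: float, mttr: float, ti_a: float) -> float:
    """Unavailability due to dangerous detected failures:
    L(DD) * (MTTR + TI(A) / 2). Include this only when a diagnosed
    failure raises an alarm and the process keeps running; if a
    detected failure results in a vote to trip, the term drops out."""
    return lambda_dd * (mttr + ti_a / 2.0)

# Assumed for illustration: diagnostics run every 60 s, 24 h to repair.
print(pfd_dd(lambda_dd=1.0e-6, mttr=24.0, ti_a=60.0 / 3600.0))  # ~2.4e-05
```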
The unreliability due to common cause failures term is fairly commonly used by practitioners, but there are still some who erroneously do not include it because they feel that common cause failures are not credible in a well-designed system. History has shown that this mindset is ill-advised and dangerous. The common cause term is not relevant, though, to a 1oo1 voting arrangement, because there is only one component.
Unreliability due to never detected failures is another term that is commonly ignored, and perhaps should not be. Ignoring this term assumes that each proof test will identify 100% of dangerous failures. While this should be the objective of test plans and test plan development, it is optimistic in practice. In reality, some devices (such as level sensors that are directly inserted into vessels) cannot be tested in situ, requiring them to be removed and tested in an instrument shop. This type of test can fail to identify vessel-related or process connection failures. Another good example is high-pressure-drop shutoff valves. Testing usually occurs when the plant is offline and the valve is not facing the stress of the pressure drop. A valve can stroke fine while there is no pressure drop, yet be unable to stroke while actually in service. Unless you can justify 100% manual proof test coverage, the unreliability due to never detected failures should be included. It is L(DN) * Life / 2, where L(DN) is the failure rate of never detected failures, and Life is either the useful life of the component or the amount of time between major overhauls or rebuilds.
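A sketch of this term, using a never detected failure rate and component life that I have assumed for illustration. Note how, over a long life, even a modest never detected rate can dominate the PFD budget:

```python
def pfd_dn(lambda_dn: float, life: float) -> float:
    """Unreliability due to never detected failures: L(DN) * Life / 2,
    where Life is the useful life of the component or the time
    between major overhauls or rebuilds."""
    return lambda_dn * life / 2.0

# Assumed for illustration: 5e-7/h never-detected rate, 10-year life.
print(pfd_dn(lambda_dn=5.0e-7, life=10 * 8760.0))  # ~2.2e-02
```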
Finally, the unavailability due to online testing term is often ignored on the assumption that all testing of a component occurs while the plant is offline for a turnaround. If this is not the case, and the component must be tested while the plant is online, then it is unavailable to perform its shutdown action while it is in bypass for the testing to occur. The term is a simple unavailability, TD / TI(M), where TD is the duration of the test and TI(M) is the manual proof test interval. It is important to note that TD is the total test duration during a manual proof test interval; if multiple online tests occur during a single manual proof test interval, their durations all need to be added up to arrive at the total TD.
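In code, with an assumed testing schedule to show the summing of test durations:

```python
def pfd_test_bypass(total_test_duration: float, ti_m: float) -> float:
    """Unavailability while bypassed for online testing: TD / TI(M).
    TD is the total bypass time across one manual proof test interval,
    so sum every online test that occurs within that interval."""
    return total_test_duration / ti_m

# Assumed for illustration: four 2-hour online tests per one-year interval.
print(pfd_test_bypass(total_test_duration=4 * 2.0, ti_m=8760.0))  # ~9.1e-04
```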
As you can see, the calculation of PFD for SIL verification may include multiple terms that are commonly ignored. Deciding whether or not these terms are relevant requires knowledge of how the system is configured, operated, and tested. A SIL verification software tool does not fill this information in for you, and it often makes assumptions about how you operate your plant that may not be true. As a result, you need to be vigilant in your calculations to make sure that they are accurate.
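Pulling the pieces together, a 1oo1 calculation that reflects the configuration decisions discussed above might look like the sketch below. It is my own illustration of the terms in this post, not a replacement for a proper SIL verification analysis:

```python
def pfd_1oo1(lambda_du, lambda_dd, lambda_dn,
             ti_m, ti_a, mttr, life, total_test_duration,
             detected_failures_trip=False):
    """Sum of the applicable 1oo1 terms discussed in this post.
    The common cause term is omitted because, with a single
    component, there is nothing for a failure to be common to."""
    pfd = lambda_du * ti_m / 2.0                   # dangerous undetected
    pfd += lambda_dn * life / 2.0                  # never detected
    pfd += total_test_duration / ti_m              # online test bypass
    if not detected_failures_trip:                 # alarm-only configuration
        pfd += lambda_dd * (mttr + ti_a / 2.0)     # dangerous detected
    return pfd
```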