A defect database that tracks errors in processes (or even products) – a database that is so sophisticated that it actually tracks where the defect occurred in addition to the type of defect – can provide powerful information. It can be quite helpful in scoping and prioritizing potential improvement opportunities. But is the data trustworthy? Is the defect database providing the correct information?

Understanding the defect, choosing and assigning the appropriate code, identifying and recording where the defect occurred, and perhaps assigning a severity level are all steps that must occur in a defect database measurement system.

Figure 1: Process Map

As with any measurement system, the precision and accuracy of the database must be understood before using (or at least while using) the information to make decisions. At first glance, it might seem that the obvious place to start is with an attribute agreement analysis (or attribute gage R&R). That might not be such a great idea, however.

Because executing an attribute agreement analysis can be time-consuming, expensive and generally inconvenient for everyone involved (the analysis is simple compared with the execution), it is best to take a moment to understand what should be done and why.

A Great Idea That Is Tough to Implement

First, the analyst should firmly establish that the data are, in fact, attribute data. It is reasonable to assume that the assignment of a code – that is, classifying the defect into a category – is a decision that characterizes the defect with an attribute. Either a defect is properly assigned a category or it is not. Likewise, the correct source location is either assigned to the defect or it is not. These are “yes” or “no” and “correct assignment” or “incorrect assignment” answers. That part is pretty simple.

Once it is firmly established that the defect database is an attribute measurement system, the next step is to explore the notions of precision and accuracy as they relate to the situation. First of all, it helps to understand that precision and accuracy are terms that are borrowed from the world of continuous (or variable) gages. For example, it is desirable that the speedometer in a car accurately reads the correct speed across a range of speeds (e.g. 25 mph, 40 mph, 55 mph, and 70 mph) regardless of who is reading it. The absence of bias across a range of values over time can generally be called accuracy (bias can be thought of as being wrong on average). The ability of different people to interpret the same value from the gage several times and agree with each other is called precision (and precision issues may stem from a problem with the gage, not necessarily the people using it).

A defect database, however, is not a continuous gage. The values assigned are correct or they are not; there is no (or there shouldn’t be any) gray area. If the codes, locations and severities are defined effectively, then there is just one correct attribute for each of those categories for any given defect. 

Unlike a continuous gage, which can be accurate (on average) but not precise, any lack of precision in an attribute measurement system will necessarily create accuracy problems, too. If the defect coder is unclear or indecisive about how a defect should be coded, then multiple defects of the same type will have different codes assigned, rendering the database inaccurate. In fact, for an attribute measurement system, imprecision is an important contributor to inaccuracy. 

The precision of any measurement system is analyzed by segmenting it into two core components: repeatability (the ability of a given assessor to assign the same value or attribute multiple times under the same conditions) and reproducibility (the ability of multiple assessors to agree among themselves for a given set of circumstances). For an attribute measurement system, problems with either repeatability or reproducibility will necessarily create accuracy problems. Furthermore, if overall accuracy, repeatability and reproducibility are known, then bias can also be discovered in situations where the choices are consistently wrong.

Challenges in Performing an Attribute Agreement Analysis

An attribute agreement analysis is designed to simultaneously evaluate the impact of repeatability and reproducibility on accuracy. It allows the analyst to examine the responses from multiple reviewers as they look at several scenarios multiple times. It produces statistics that evaluate the ability of the appraisers to agree with themselves (repeatability), with each other (reproducibility), and with a known master or correct value (overall accuracy) for each characteristic – over and over again.
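As a rough illustration of the statistics such an analysis produces, the sketch below computes simple percent-agreement figures from a small set of hypothetical ratings (two appraisers, each coding six scenarios twice, with a known standard). A full attribute agreement study, such as the output of most statistical packages, would also report kappa statistics and confidence intervals, which are omitted here for brevity.

```python
# Minimal percent-agreement sketch for an attribute agreement study.
# The ratings below are hypothetical: two appraisers each code six
# defect scenarios twice; "standard" holds the known correct codes.

standard = ["A", "B", "A", "C", "B", "A"]

ratings = {
    "Appraiser 1": (["A", "B", "A", "C", "B", "A"],   # trial 1
                    ["A", "B", "C", "C", "B", "A"]),  # trial 2
    "Appraiser 2": (["A", "B", "A", "C", "A", "A"],
                    ["A", "B", "A", "C", "A", "A"]),
}

def pct(matches, total):
    return 100.0 * matches / total

n = len(standard)
for name, (trial1, trial2) in ratings.items():
    # Repeatability: the appraiser agrees with him- or herself across trials.
    within = sum(t1 == t2 for t1, t2 in zip(trial1, trial2))
    # Accuracy: both of the appraiser's trials agree with the known standard.
    vs_std = sum(t1 == t2 == s for t1, t2, s in zip(trial1, trial2, standard))
    print(f"{name}: repeatability {pct(within, n):.0f}%, "
          f"vs. standard {pct(vs_std, n):.0f}%")

# Reproducibility: all appraisers agree with one another on every trial.
all_trials = [trial for pair in ratings.values() for trial in pair]
between = sum(len(set(codes)) == 1 for codes in zip(*all_trials))
print(f"between appraisers: {pct(between, n):.0f}%")
```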

Analytically, this technique is a wonderful idea. But in practice the technique can be difficult to execute in a meaningful way. First, there is always the issue of sample size. For attribute data, relatively large samples are needed to be able to calculate percentages with reasonably small confidence intervals. If an assessor examines 50 different defect scenarios – twice – and the match rate is 96 percent (48 of 50 chances to match), then the 95 percent confidence interval ranges from 86.29 percent to 99.51 percent. That’s a pretty large margin of error, especially given the challenge of selecting the scenarios, reviewing them thoroughly to ensure the proper master value is assigned, and then convincing the appraiser to actually do the work – twice. If the number of scenarios is increased to 100, the 95 percent confidence interval for a 96 percent match rate narrows to a range of 90.1 percent to 98.9 percent (Figure 2).

Figure 2: Attribute Agreement Analysis

This example uses an assessment of repeatability to illustrate the idea, but it applies to reproducibility as well. The point here is that a lot of samples are needed to detect differences in an attribute agreement analysis, and doubling the number of samples from 50 to 100 does not make the test a whole lot more sensitive. Of course, the difference that needs to be detected depends on the situation and the level of risk the analyst is willing to bear in the decision, but the reality is that with 50 scenarios, an analyst will be hard pressed to claim a statistical difference in the repeatability of two appraisers with match rates of 96 percent and 86 percent. With 100 scenarios, the analyst will barely be able to detect a difference between 96 percent and 88 percent.
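The confidence intervals quoted above can be reproduced with an exact (Clopper-Pearson) binomial interval. The following is a minimal sketch, assuming Python with SciPy is available; the 48-of-50 and 96-of-100 cases match the ranges cited.

```python
# Exact (Clopper-Pearson) 95% confidence intervals for a match rate,
# used here to check the 48/50 and 96/100 figures quoted above.
from scipy.stats import beta

def exact_ci(matches, n, conf=0.95):
    """Clopper-Pearson interval for an observed rate of matches out of n."""
    alpha = 1.0 - conf
    lower = beta.ppf(alpha / 2, matches, n - matches + 1) if matches > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, matches + 1, n - matches) if matches < n else 1.0
    return lower, upper

for matches, n in [(48, 50), (96, 100)]:
    lo, hi = exact_ci(matches, n)
    print(f"{matches}/{n} matches ({matches/n:.0%}): 95% CI {lo:.2%} to {hi:.2%}")
# 48/50  -> roughly 86.3% to 99.5%
# 96/100 -> roughly 90.1% to 98.9%
```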

In addition to the sample size issue, the logistics of ensuring that appraisers do not remember the original attribute they assigned to a scenario when they see it for the second time can also be a challenge. This can be mitigated somewhat by increasing the sample size and, better yet, waiting a while (perhaps one to two weeks) before giving the appraisers the set of scenarios a second time. Randomizing the run order from one review to the next can also help. Appraisers also tend to perform differently when they know they are being examined, so the fact that they know it is a test may itself bias the results. Concealing the purpose of the exercise might help, but that is next to impossible to achieve, and it borders on the unethical. And aside from being marginally effective at best, these countermeasures add complexity and time to an already challenging study.

Finally – and this is an additional source of complexity inherent to defect database measurement systems – the number of different codes or locations that can be assigned may be unwieldy. Finding scenarios that allow the examination of repeatability and reproducibility for every possible condition can be overwhelming. If the database has, say, 10 different defect codes that could be assigned, the analyst should select scenarios carefully to provide adequate representation of the different codes or locations. And realistically, a choice of 10 categories for defect type is on the low end of what defect databases commonly allow.

The Right Way to Perform an Attribute Agreement Analysis

Despite these difficulties, performing an attribute agreement analysis on defect databases is not a waste of time. In fact, it is (or can be) a tremendously informative, valuable and necessary exercise. The attribute agreement analysis just needs to be applied judiciously and with a certain level of focus. 

Repeatability and reproducibility are components of accuracy in an attribute measurement system analysis, and it is wise to first determine whether or not there is an accuracy issue at all. That means that before an analyst designs an attribute agreement analysis and selects the appropriate scenarios, he or she should strongly consider performing an audit of the database to determine whether or not past events have been coded properly. 

Assuming the accuracy rate (or the most likely failure modes) of the defect database is unknown, it is wise to audit 100 percent of the database for a reasonable frame of recent history. What is reasonable? That really depends, but to be safe, at least 100 samples over a recent, representative time period should be examined. The definition of reasonable should consider how the database information is intended to be used: to prioritize projects, investigate root cause or assess performance. One hundred samples for an audit is a good place to start because it gives the analyst a rough idea of the overall accuracy of the database. 

For example, if the calculated accuracy rate with 100 samples is 70 percent, then the margin of error is about +/- 9 percent. At 80 percent, the margin is about +/- 8 percent, and at 90 percent the margin is +/- 6 percent. Of course, more samples can always be collected to audit if more precision is needed, but the reality is that if the database is less than 90 percent accurate, the analyst probably wants to understand why. 
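The margins quoted above are consistent with the usual normal-approximation margin of error for a proportion, roughly 1.96 standard errors. A minimal Python sketch:

```python
# Approximate margin of error for an audited accuracy rate p based on
# n audited records (normal approximation to the binomial).
import math

def margin_of_error(p, n, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

for p in (0.70, 0.80, 0.90):
    print(f"accuracy {p:.0%}, 100 audited records: +/- {margin_of_error(p, 100):.1%}")
# roughly +/- 9%, 8% and 6%, as quoted above
```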

That is when the attribute agreement assessment should be applied, and the detailed results of the audit should provide a good set of information to help understand how best to design the assessment.

Using a Team to Identify Defects

Often what is being assessed is too complex to rely on the effectiveness of one person alone. Examples include contracts, engineering drawings with specifications and bills of materials, and software code. One solution is a team-based approach or an inspection/review meeting in which identifying defects is the primary focus. Several people working together can often reach a single assessment that is better than what any one of them could have produced alone. This is one way of mitigating the repeatability and reproducibility problems that are the most difficult to control.

The audit should help to identify which specific individuals and codes are the greatest sources of trouble, and the attribute agreement assessment should help identify the relative contribution of repeatability and reproducibility issues for those specific codes (and individuals). In addition, many defect databases have accuracy problems in recording where a defect was created, because the location where the defect was found is recorded instead of where it was created. Where the defect is found does not help much in identifying root causes, so the accuracy of the location assignment should also be an element of the audit.

If the audit is planned and designed effectively, it may reveal enough information about the root causes of the accuracy issues to justify a decision not to use the attribute agreement analysis at all. In cases where the audit does not provide sufficient information, the attribute agreement analysis will enable a more detailed examination that will inform how to deploy training and mistake-proofing changes to the measurement system. 

For example, if repeatability is the primary problem, then the assessors are confused or indecisive about certain criteria. If reproducibility is the issue, then assessors have strong opinions about certain conditions, but those opinions differ. Clearly, if the problems are exhibited by several assessors, then the issues are systemic or process related. If the problems relate to just a few assessors, then the issues could simply require a little personal attention. In either case, the training or job aids could either be tailored to specific individuals or to all assessors, depending on how many assessors are guilty of inaccurate assignment of attributes. 

The attribute agreement analysis can be an excellent tool to reveal the sources of inaccuracies in a defect database, but it should be employed with great care, consideration and minimal complexity if it is used at all. This is best achieved by first auditing the database and then using the results of that audit to construct a focused and optimized analysis of repeatability and reproducibility.
