Making Sense of Attribute Gage R&R Calculations

Measurement error is unavoidable. There will always be some measurement variation that is due to the measurement system itself.

Most problematic measurement system issues come from measuring attribute data in terms that rely on human judgment such as good/bad, pass/fail, etc. This is because it is very difficult for all testers to apply the same operational definition of what is “good” and what is “bad.”

However, such measurement systems are found throughout industry. One example is quality control inspectors using a high-powered microscope to determine whether a pair of contact lenses is defect-free. Hence, it is important to quantify how well such measurement systems are working.

The tool used for this kind of analysis is called attribute gage R&R. The R&R stands for repeatability and reproducibility. Repeatability means that the same operator, measuring the same thing, using the same gage, should get the same reading every time. Reproducibility means that different operators, measuring the same thing, using the same gage, should get the same reading every time.

Attribute gage R&R reveals two important findings – percentage of repeatability and percentage of reproducibility. Ideally, both percentages should be 100 percent, but generally, the rule of thumb is anything above 90 percent is quite adequate.

Obtaining these percentages can be done using simple mathematics, and there is really no need for sophisticated software. Nevertheless, Minitab has a module called Attribute Agreement Analysis (in Minitab 13, it was called Attribute Gage R&R) that does the same and much more, and this makes analysts’ lives easier.

Having said that, it is important for analysts to understand what the statistical software is doing to make good sense of the report. In this article, the steps are reproduced using spreadsheet software with a case study as an example.

Steps to Calculate Gage R&R

Step 1: Select 20 to 30 test samples that represent the full range of variation encountered in actual production runs. Practically speaking, if only “clearly good” parts and “clearly bad” parts are chosen, the ability of the measurement system to accurately categorize the ones in between will not be tested. For maximum confidence, a 50-50 mix of good/bad parts is recommended. A 30:70 ratio is acceptable.

Step 2: Have a master appraiser categorize each test sample into its true attribute category.
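
Once the master appraiser's ratings are available, the good/bad mix from Step 1 can be checked quickly. The snippet below is only a sketch with made-up ratings; the function name `minority_share` and the 0.30 threshold (taken from the 30:70 guideline above) are illustrative assumptions, not part of the original spreadsheet.

```python
from collections import Counter

def minority_share(master_ratings):
    """Fraction of samples in the less common category (0.0 if only one category)."""
    counts = Counter(master_ratings)
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / len(master_ratings)

# Hypothetical master ratings: "P" = pass, "F" = fail, one character per sample.
master = "P" * 12 + "F" * 8

share = minority_share(master)
mix_ok = share >= 0.30  # assumed reading of the 30:70 guideline
print(f"Minority share: {share:.0%}, acceptable mix: {mix_ok}")
```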

Figure 1: Master Appraiser Categorizations

Step 3: Select two to three inspectors and have them categorize each test sample without knowing what the master appraiser has rated them.

Step 4: Place the test samples in a new random order and have the inspectors repeat their assessments.
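
The re-ordering in Step 4 can be done by hand, but a short script makes the order reproducible. This is a minimal sketch: the sample labels 1 to 20 and the helper name `randomized_run_order` are hypothetical.

```python
import random

def randomized_run_order(sample_ids, seed=None):
    """Return a shuffled copy of the sample IDs for the repeat assessment."""
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    order = list(sample_ids)
    rng.shuffle(order)
    return order

sample_ids = list(range(1, 21))  # hypothetical: 20 samples labelled 1..20
second_trial_order = randomized_run_order(sample_ids, seed=42)
print(second_trial_order)
```

Recording the seed (or the printed order) alongside the results lets the second-trial ratings be matched back to the original sample IDs.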

Figure 2: Test Samples

Step 5: For each inspector, count the number of times his or her two readings agree. Divide this number by the total number of samples inspected to obtain the percentage of agreement. This is the individual repeatability of that inspector (Minitab calls this “Within Appraiser”).

To obtain the overall repeatability, average the individual repeatability percentages of all inspectors. In this case study, the overall repeatability is 95.56 percent, which means that if the measurements are repeated on the same set of items, there is a 95.56 percent chance of getting the same results. That is not bad, but not perfect.
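
The Step 5 arithmetic can be sketched in a few lines of Python. The ratings below are hypothetical (10 samples, three inspectors) and are not the case-study data, so the percentages differ from those quoted in the article.

```python
def individual_repeatability(trial1, trial2):
    """Percentage of samples on which one inspector's two ratings agree."""
    matches = sum(first == second for first, second in zip(trial1, trial2))
    return 100.0 * matches / len(trial1)

# Hypothetical data: "P" = pass, "F" = fail, one character per sample.
# Each inspector has a (first trial, second trial) pair of ratings.
ratings = {
    "Operator 1": ("PPFPFPPFPF", "PPFPFPPFPP"),
    "Operator 2": ("PPFPFPPFPF", "PPFPFPPFPF"),
    "Operator 3": ("PPFPFPFFPF", "PPFPFPFFPF"),
}

individual = {
    name: individual_repeatability(t1, t2) for name, (t1, t2) in ratings.items()
}
# Overall repeatability is the plain average of the individual percentages.
overall = sum(individual.values()) / len(individual)

for name, pct in individual.items():
    print(f"{name}: {pct:.2f}%")
print(f"Overall repeatability: {overall:.2f}%")
```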

Figure 3: Individual Repeatability

In this case, the individual repeatability of Operator 1 is only 90 percent. This means that Operator 1 is only consistent with himself 90 percent of the time. He needs retraining.

Step 6: Count the number of times each inspector’s two assessments agree with each other and also with the standard produced by the master appraiser in Step 2. Divide this count by the total number of samples inspected to obtain a percentage.

Figure 4: Individual Effectiveness

This percentage is called the individual effectiveness (Minitab calls this “Each Appraiser vs. Standard”). In this case, Operator 1 is in agreement with the standard only 80 percent of the time. He needs retraining.
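
As a rough illustration of Step 6, the sketch below counts a sample as a match only when both of an inspector's ratings equal the master appraiser's rating. The data are hypothetical, not the case-study figures.

```python
def individual_effectiveness(trial1, trial2, standard):
    """Percentage of samples where both ratings agree with each other and the standard."""
    matches = sum(a == b == s for a, b, s in zip(trial1, trial2, standard))
    return 100.0 * matches / len(standard)

# Hypothetical data: "P" = pass, "F" = fail, one character per sample.
standard = "PPFPFPPFPF"  # master appraiser's ratings
ratings = {
    "Operator 1": ("PPFPFPPFPF", "PPFPFPPFPP"),
    "Operator 2": ("PPFPFPPFPF", "PPFPFPPFPF"),
    "Operator 3": ("PPFPFPFFPF", "PPFPFPFFPF"),
}

effectiveness = {
    name: individual_effectiveness(t1, t2, standard)
    for name, (t1, t2) in ratings.items()
}
for name, pct in effectiveness.items():
    print(f"{name}: {pct:.2f}%")
```

In this made-up data set, Operator 3 is perfectly repeatable but still disagrees with the standard on one sample, which is why effectiveness is tracked separately from repeatability.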

Step 7: Compute the percentage of times all the inspectors’ assessments agree for the first and second measurement for each sample item.

Figure 5: Reproducibility of the Measurement System

This percentage is the reproducibility of the measurement system (Minitab calls this “Between Appraiser”). All three inspectors agree with each other only 83.3 percent of the time. They may not all be using exactly the same operational definition for pass/fail all the time, or they may have a very slight difference in interpretation of what constitutes a pass and a failure.
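
The Step 7 calculation can be sketched as follows: a sample counts toward reproducibility only if every rating from every inspector, across both trials, is identical. The ratings are hypothetical and do not reproduce the 83.3 percent from the case study.

```python
def reproducibility(ratings):
    """Percentage of samples on which all inspectors agree across both trials."""
    # Flatten to one rating sequence per (inspector, trial).
    trials = [trial for pair in ratings.values() for trial in pair]
    n_samples = len(trials[0])
    # zip(*trials) yields, for each sample, the tuple of all ratings it received.
    agreed = sum(len(set(per_sample)) == 1 for per_sample in zip(*trials))
    return 100.0 * agreed / n_samples

# Hypothetical data: "P" = pass, "F" = fail, one character per sample.
ratings = {
    "Operator 1": ("PPFPFPPFPF", "PPFPFPPFPP"),
    "Operator 2": ("PPFPFPPFPF", "PPFPFPPFPF"),
    "Operator 3": ("PPFPFPFFPF", "PPFPFPFFPF"),
}

between_appraisers = reproducibility(ratings)
print(f"Reproducibility: {between_appraisers:.1f}%")
```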

Step 8: Compute the percentage of the time all the inspectors’ assessments agree with each other and with the standard.

Figure 6: Overall Effectiveness of the Measurement System

This percentage gives the overall effectiveness of the measurement system (Minitab calls this “All Appraiser vs. Standard”). It is the percentage of the time that all inspectors agree and their agreement matches the standard.
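
A minimal sketch of the Step 8 calculation, again on hypothetical ratings: a sample counts only if every rating from every inspector equals the master appraiser's rating.

```python
def overall_effectiveness(ratings, standard):
    """Percentage of samples where every rating from every inspector matches the standard."""
    trials = [trial for pair in ratings.values() for trial in pair]
    matches = sum(
        all(rating == std for rating in per_sample)
        for std, per_sample in zip(standard, zip(*trials))
    )
    return 100.0 * matches / len(standard)

# Hypothetical data: "P" = pass, "F" = fail, one character per sample.
standard = "PPFPFPPFPF"  # master appraiser's ratings
ratings = {
    "Operator 1": ("PPPPFPPFPF", "PPPPFPPFPP"),
    "Operator 2": ("PPPPFPPFPF", "PPPPFPPFPF"),
    "Operator 3": ("PPPPFPFFPF", "PPPPFPFFPF"),
}

all_vs_standard = overall_effectiveness(ratings, standard)
print(f"Overall effectiveness: {all_vs_standard:.1f}%")
```

In this made-up data set, all inspectors pass the third sample although the standard says fail, so that sample would count toward reproducibility but not toward overall effectiveness.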

Minitab produces many more statistics in the output of the attribute agreement analysis, but for most cases and uses, the analysis outlined in this article should suffice.

So What If the Gage R&R Is Not Good?

The key in all measurement systems is having a clear test method and clear criteria for what to accept and what to reject. The steps are as follows:

  1. Identify what is to be measured.
  2. Select the measurement instrument.
  3. Develop the test method and criteria for pass or fail.
  4. Test the test method and criteria (the operational definition) with some test samples (perform a gage R&R study).
  5. Confirm that the gage R&R in the study is close to 100 percent.
  6. Document the test method and criteria.
  7. Train all inspectors on the test method and criteria.
  8. Pilot run the new test method and criteria and perform periodic gage R&Rs to check if the measurement system is good.
  9. Launch the new test method and criteria.

Leave a Comment


José Luis Dorbecker


I am sending you this article so that you can see, in a very simple way, how a repeatability and reproducibility analysis is carried out. I think this idea will be very good for quantifying people's ability to carry out good investigations. For my part, I will be using it to improve the CAPA metric.

Brian Stewart

Where does the 20-30 samples requirement come from? Is there some statistical significance of this number?

Alex T

20-30 samples is an MSA, 4th Ed. requirement. The sample is not the part; it is the data sample.
The more samples, the smaller the bias correction factor d2*.


Specifically, is there a minimum requirement of Parts/Trials needed for Attribute R&Rs? Just curious. I work in automotive and they do 3 appraisers with 50 parts for 3 trials. Seems a little overkill. Can this be minimized?


As far as I'm aware, there is no minimum in MSA 4th Ed. The only thing a higher part count affects is the confidence intervals of the % agreement within appraisers and the % agreement versus the standard, which in turn can affect the evaluation of your null hypothesis. With 20 parts, the difference between the upper and lower confidence limits can be around 40%; with 50 parts it is about 20-25%.
For example, suppose you run a test with 3 appraisers and 20 parts, and the % agreement versus the standard is 70% (±20%) for A, 65% (±20%) for B and 55% (±20%) for C. Then your null hypothesis stands: all appraisers’ % agreement values are within each other’s confidence intervals, so they are “similar” in reliability. If you do the same test with 50 parts, the results might look like this: 70% (±10%) for A, 65% (±10%) for B and 55% (±10%) for C. Now the null hypothesis no longer holds, as appraisers A and C show a significant difference in reliability based on the confidence intervals.
That's my take on it.
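
The interval widths quoted in this comment can be roughly checked with a short calculation. The sketch below uses the Wilson score interval at an assumed 95 percent level (z = 1.96) and an assumed 70 percent agreement; statistical packages may use a different interval method, so treat the numbers as approximate.

```python
import math

def wilson_interval(matches, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 is roughly 95 percent)."""
    p = matches / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Assumed scenario: 70 percent agreement observed with 20 parts and with 50 parts.
widths = {}
for matches, n in [(14, 20), (35, 50)]:
    low, high = wilson_interval(matches, n)
    widths[n] = high - low
    print(f"n={n}: {low:.1%} to {high:.1%} (width {widths[n]:.1%})")
```

Under these assumptions the interval is a little under 40 percentage points wide for 20 parts and about 25 points wide for 50 parts, which is in line with the figures given above.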


One more thing to add: the MSA manual doesn't mention that the validity and accuracy of the Cohen’s Kappa metric are still widely debated among statisticians, despite it being one of the most widely used rater reliability metrics. Make sure you don't evaluate any rater or process solely on that metric. Always include the % agreement stats, false positives, etc.
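
For readers unfamiliar with the metric, Cohen's kappa for two sets of ratings is (observed agreement - chance agreement) / (1 - chance agreement). The sketch below compares one appraiser with the standard using hypothetical ratings; it is an illustration of the formula, not a reproduction of any package's output.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions, per category.
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n) for c in set(rater_a) | set(rater_b)
    )
    # Note: undefined (division by zero) if expected agreement is exactly 1.
    return (observed - expected) / (1 - expected)

# Hypothetical data: "P" = pass, "F" = fail, one character per sample.
standard = "PPFPFPPFPF"
appraiser = "PPFPFPFFPP"

kappa = cohens_kappa(appraiser, standard)
agreement = sum(a == s for a, s in zip(appraiser, standard)) / len(standard)
print(f"% agreement: {agreement:.0%}, kappa: {kappa:.3f}")
```

Reporting the plain % agreement next to kappa, as the comment suggests, shows how much of the raw agreement kappa attributes to chance.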


If you have to run a Kappa test, can you run it on multiple operators, multiple attempts and multiple parameters?


I found this post incredibly helpful – Thank you

Eddie Thomas

Hi guys, great post, although I think I might have spotted a couple of errors.

– Figure 4 has the same number of matched/inspected but a different %age value…
– The formula in figure 5 refers to column J which appears to be hidden from the image.

Hope this helps,

Navin G. Pithadia

very good study, understandable method-good

Rio Mauhay

In the formula, you entered E,J,O columns. But I cannot see the value or the formula.

Rahul Thakur

Wonderful & informative. Can you please send me the complete excel file of this article ?




Can anybody explain the procedure for a kappa study,
and why kappa is necessary?


Dear all,
Please give me advice about the time needed to do a GR&R.
How long should we give the operator for each GR&R run?

Michael Clayton

Costs matter. Do you really know the value of less uncertainty to the product cost or reliability?
Does the product have later-stage-of-completion screening test that would protect customers from maverick parts missed by uncertain gaging at earlier step?
Asking cost questions at product engineering level, rather than giving in to local step interruption costs, might be first step in planning a gage study. Local supervision has goals that may stop any studies so having bigger view (and support) of benefits of certainty are important. When asked to do a gage study, ask what the cost of sampling (and thus time operator is part of the study) might be.

BUT if the gage study has a mandated sample size, as many ISO-required studies seem to suggest, then you are only dealing with small time variation of operator performance rather than number of parts sampled or repeats made. If you keep track of operator-to-operator TIME as well as bias or GRR% details, the discussion that follows can help with follow-up studies.

Michael Clayton

All statistical methods are VERY sensitive to sample size to the point where MOST hypothesis tests will reject the null if the sample size is large. That has nothing to do with the ECONOMIC difference between alternative data categories. Attribute data, however, requires much larger samples than continuous data to get to that “always rejects” level. In both cases the economic domain issues dominate the decision processes. That said, if sample size is very limited and no significant differences are seen, one should still go back to the economic domain experts and find out what cost of greater sampling would be justified in order to be more certain that one is not missing a key factor, if for no other reason than to challenge the factor definition, granularity, etc.

Just my opinion due to lower cost of getting data in modern times.

However, many experts argue that too much data will miss opportunities. Would someone please comment on that argument?


I would like to know, please, whether the method used for the attribute R&R test can be considered a reliable source.

What is your academic basis (and literature) for this test?

Avinoam Moses

Dear Sir or Madam

As far as I know, a GR&R with variable measures needs 6 to 10 parts, with 3 appraisers measuring these parts 3 times each, in order to build assembly point (reproducibility).

As I understand from this article, for attribute data 2 appraisers are enough.
Is there any formula that says the minimum number of parts is X for 3 appraisers and Y for 2 appraisers (where Y > X)?
Is it also acceptable to use 2 appraisers for variable measures?

Please advise



Neeraj Vashishath

Pls send me the excel sheet

