On Measuring Agreement

posted Aug 25, 2013, 6:43 PM by Admin PEMRF   [ updated Aug 25, 2013, 9:58 PM ]

Kappa is naughty and should be sent to bed early

A current project that the foundation is supporting involves measuring inter-rater agreement. Pediatric emergency medicine tends to look no further than kappa (denoted by the Greek letter κ). This is a pity. Cohen’s κ was originally designed as a single summary statistic to describe chance-adjusted agreement on categorical ratings. Cohen’s κ appears easy to interpret: its range is -1 to +1, implying perfect disagreement and perfect agreement respectively. Descriptive terms such as ‘moderate’ and ‘poor’ agreement have been published to further ease interpretation. Physician researchers in particular seem to like it. A quick jaunt through the EM journals’ inter-rater studies shows it to be nearly as popular as vicodin. The use of a single measure that adjusts for chance agreement is seductive. So challenging its use is likely to be received, well, like a prescription for ibuprofen. But there is a wrinkle with κ.
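To make the arithmetic concrete, here is a minimal Python sketch of Cohen’s κ for two raters; the ratings are invented purely for illustration:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-adjusted agreement between two raters on categorical labels."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: proportion of items both raters labelled identically.
    p_o = sum(x == y for x, y in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement: product of each rater's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Invented example: 4/6 observed agreement, balanced marginals.
rater_a = ["yes", "yes", "no", "no", "yes", "no"]
rater_b = ["yes", "no", "no", "no", "yes", "yes"]
kappa = cohens_kappa(rater_a, rater_b)  # → 1/3, i.e. about 0.33
```

With both raters saying "yes" half the time, chance agreement is 0.5, so the observed 67% agreement shrinks to a κ of only 0.33 — the chance adjustment doing its job.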

A disadvantage of the κ statistic is that it yields lower values the further the prevalence of the outcome being studied deviates from 0.5. Scott’s π (subsequently extended by Fleiss) suffers the same limitation. This so-called paradox of κ, in which very high observed agreement is accompanied by a dismal κ statistic, is well known to statisticians. Consequently Cohen’s κ and Scott’s π should be avoided when one of the categories being rated is much more or less common than another. An alternative, the agreement coefficient (AC1), has been proposed by Gwet. The AC1 is more stable than κ across prevalence, although it may give slightly lower estimates than κ when the prevalence of a classification approaches 0.5. The AC1 is relatively new and as yet does not appear widely in the medical literature despite recommendations to use it. To run the AC1 you can use the SAS macro, buy the Excel implementation ($45 or so), or use this function in R (don't be scared by the strange url). The AC1 does not seem to be implemented in Stata, although it would not be difficult to do. The AC1 may not be the panacea it appears either: one of its underlying premises is that chance agreement occurs only if at least one of the raters on occasion rates some individuals randomly.
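The paradox is easiest to see side by side. The sketch below (invented data, two raters, 100 cases with a rare "pos" category) computes κ and Gwet’s AC1 for the same table; κ uses the product of the raters’ marginals for chance agreement, AC1 uses the average marginal proportions:

```python
from collections import Counter

def agreement_stats(a, b):
    """Return (observed agreement, Cohen's kappa, Gwet's AC1) for two raters."""
    n = len(a)
    cats = sorted(set(a) | set(b))
    fa, fb = Counter(a), Counter(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Cohen's chance agreement: product of the two raters' marginals.
    pe_k = sum(fa[c] * fb[c] for c in cats) / (n * n)
    kappa = (p_o - pe_k) / (1 - pe_k)
    # Gwet's chance agreement: built from average marginal proportions pi_c.
    pi = {c: (fa[c] + fb[c]) / (2 * n) for c in cats}
    pe_g = sum(p * (1 - p) for p in pi.values()) / (len(cats) - 1)
    ac1 = (p_o - pe_g) / (1 - pe_g)
    return p_o, kappa, ac1

# Skewed prevalence: 90 joint "neg", 5 discordant each way, 0 joint "pos".
rater_a = ["neg"] * 90 + ["pos"] * 5 + ["neg"] * 5
rater_b = ["neg"] * 95 + ["pos"] * 5
p_o, kappa, ac1 = agreement_stats(rater_a, rater_b)
# p_o = 0.90, yet kappa is slightly negative (about -0.05),
# while AC1 stays high (about 0.89).
```

Ninety percent raw agreement, a negative κ, and an AC1 of 0.89 — the paradox in three numbers.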

For ordinal scales a weighted kappa has been proposed. The penalty for disagreement is weighted according to the number of categories by which the raters disagree. The results depend both on the weighting scheme chosen by the analyst and on the relative prevalence of the categories. With quadratic weights, weighted κ is equivalent to the intraclass correlation, but of course the dirty little secret is that you *could* set the weights to anything you want. Scott’s π and Gwet’s AC1 can also be weighted. When weighted, Gwet’s AC1 is often referred to as AC2, and the same weighting caveats apply.


A different approach is to regard ordinal categories as bins on an underlying continuous scale. Polychoric correlation estimates the correlation between raters as if they were rating on that continuous scale. Polychoric correlation is, at least in principle, insensitive to the number of categories and can even be used where raters use different numbers of categories. The correlation coefficient, ranging from -1 to +1, is interpreted in the usual manner. A disadvantage of polychoric correlation is its distributional assumption: the latent continuous variables are taken to be bivariate normal. Some recognize polychoric correlation as a special case of latent trait modeling, which allows that assumption to be relaxed. It is easy to conceive of situations where an assumption of normality is unlikely to hold.
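Full polychoric estimation maximizes a bivariate-normal likelihood and is best left to a stats package, but for the simplest case — a 2×2 table, where polychoric reduces to the tetrachoric correlation — the classical cosine-pi approximation fits in a few lines. The table counts below are invented for illustration:

```python
import math

def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation of a 2x2 table:

                      rater B: yes   no
        rater A: yes         a      b
                 no          c      d

    a and d are concordant counts, b and c discordant.
    """
    if b == 0 or c == 0:
        return 1.0  # degenerate table: no discordance observed
    return math.cos(math.pi / (1 + math.sqrt((a * d) / (b * c))))

# 80/100 concordant pairs → a moderately strong latent correlation.
r = tetrachoric_cos_pi(40, 10, 10, 40)  # → cos(pi/5), about 0.81
# A table with no association beyond chance → correlation near 0.
r0 = tetrachoric_cos_pi(25, 25, 25, 25)
```

This is only an approximation to the maximum-likelihood estimate — and, per the caveat above, it inherits the bivariate-normality assumption — but it conveys the idea of recovering a latent continuous correlation from binned ratings.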

Another coefficient of agreement, “A”, proposed by van der Eijk, was specifically designed for ordinal scales with a relatively small number of categories dealing with abstract concepts. This measure “A” is insensitive to the standard deviation of the ratings. “A”, however, contemplates large numbers of raters rating a small number of subjects (such as voters rating political parties). This seems less applicable to clinical PEM, but it might be useful when asking lots of physicians to rate a few management strategies, or patients/parents to rate a few hospitals. “A” has been implemented in Stata, which seems to be the most widely used stats program in the specialty.