Similarity Indices — rand • catsim

The Rand index, rand_index, computes the agreement between two different clusterings or partitions of the same set of objects. The inputs to the function should be binary or categorical and of the same length.
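
As a rough illustration of what the index measures, the pair-counting definition from Rand (1971) can be sketched in base R. The helper name rand_sketch is hypothetical and not part of catsim, and the package's own implementation may differ; the sketch only counts the pairs of objects on which the two partitions agree.

```r
# Hypothetical helper, not part of catsim: Rand index by direct pair counting.
rand_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  # Enumerate every unordered pair of positions (i, j) with i < j.
  pairs <- utils::combn(length(x), 2)
  same_x <- x[pairs[1, ]] == x[pairs[2, ]]  # is the pair grouped together in x?
  same_y <- y[pairs[1, ]] == y[pairs[2, ]]  # is the pair grouped together in y?
  # Proportion of pairs that both partitions treat the same way.
  mean(same_x == same_y)
}

x <- rep(0:5, 5)
y <- c(rep(0:5, 4), rep(0, 6))
rand_sketch(x, y)
#> [1] 0.8735632
```

Enumerating all pairs is quadratic in the number of objects, so this form is meant for understanding the definition rather than for large inputs.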

The adjusted Rand index, adj_rand, is a version of the Rand index corrected for the probability of chance agreement between clusterings. Because the denominator of the adjusted Rand index can be zero, a small constant is added to both the numerator and the denominator to keep the computation stable.
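
The chance correction can be sketched from the contingency table of the two partitions, following the Hubert and Arabie (1985) form of the index. The helper adj_rand_sketch and its eps argument are hypothetical: eps stands in for the small stabilizing constant, whose actual value in catsim is not stated here, and it defaults to 0 so that the plain formula is visible.

```r
# Hypothetical helper, not part of catsim: adjusted Rand index from a contingency table.
# `eps` is an assumed placeholder for the small stabilizing constant.
adj_rand_sketch <- function(x, y, eps = 0) {
  tab <- unclass(table(x, y))
  n_pairs <- choose(length(x), 2)
  sum_cells <- sum(choose(tab, 2))           # pairs grouped together in both partitions
  sum_rows <- sum(choose(rowSums(tab), 2))   # pairs grouped together in x
  sum_cols <- sum(choose(colSums(tab), 2))   # pairs grouped together in y
  expected <- sum_rows * sum_cols / n_pairs  # value of sum_cells expected by chance
  max_index <- (sum_rows + sum_cols) / 2     # largest attainable value of sum_cells
  (sum_cells - expected + eps) / (max_index - expected + eps)
}

x <- rep(0:5, 5)
y <- c(rep(0:5, 4), rep(0, 6))
adj_rand_sketch(x, y)
#> [1] 0.5188537
```

With eps = 0 the ratio is undefined when max_index equals expected (for example, when both partitions put every object in a single group), which is the situation the stabilizing constant guards against.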

Cohen's kappa, cohen_kappa, is an inter-rater agreement metric for two raters which corrects for the probability of chance agreement. Note that this measure differs from the Rand indices and mutual information: those consider how similarly the points are grouped, while Cohen's kappa considers how often the raters agreed on individual points.
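
The distinction can be seen in a small base R sketch of the textbook kappa formula from Cohen (1960), (p_obs - p_exp) / (1 - p_exp). The helper cohen_kappa_sketch is hypothetical and not the package's code. Relabeling the groups leaves the grouping of points unchanged but removes all point-wise agreement, so kappa drops below zero.

```r
# Hypothetical helper, not part of catsim: Cohen's kappa for two label vectors.
cohen_kappa_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  lev <- sort(union(x, y))
  # Marginal label frequencies for each rater over a shared set of levels.
  px <- as.numeric(table(factor(x, levels = lev))) / length(x)
  py <- as.numeric(table(factor(y, levels = lev))) / length(y)
  p_obs <- mean(x == y)  # observed point-wise agreement
  p_exp <- sum(px * py)  # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

x <- rep(0:5, 5)
y <- c(rep(0:5, 4), rep(0, 6))
cohen_kappa_sketch(x, y)
#> [1] 0.8

# Same grouping of points, but every label shifted by one: no point-wise agreement.
cohen_kappa_sketch(x, (x + 1) %% 6)
#> [1] -0.2
```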

Like the Rand index, the mutual information measures the agreement between two different clusterings or partitions of the same set of objects. If \(H(X)\) is the entropy of a probability distribution \(X\), then the mutual information of two distributions is \(I(X;Y) = H(X) + H(Y) - H(X,Y)\). The normalized mutual information, normalized_mi, is defined here as \(2I(X;Y)/(H(X)+H(Y))\), but is set to 0 if both \(H(X)\) and \(H(Y)\) are 0.
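
The definitions above translate fairly directly into base R using empirical frequencies from the contingency table. The helpers entropy_sketch and normalized_mi_sketch are hypothetical names and only a sketch of the stated formulas, not the package's implementation.

```r
# Hypothetical helpers, not part of catsim.
# Entropy (natural log) of the empirical distribution given by a vector or table of counts.
entropy_sketch <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

normalized_mi_sketch <- function(x, y) {
  tab <- unclass(table(x, y))
  hx <- entropy_sketch(rowSums(tab))  # H(X)
  hy <- entropy_sketch(colSums(tab))  # H(Y)
  hxy <- entropy_sketch(tab)          # H(X,Y)
  # Convention from the text: return 0 when both entropies are 0.
  if (hx == 0 && hy == 0) return(0)
  mi <- hx + hy - hxy                 # I(X;Y)
  2 * mi / (hx + hy)
}

x <- rep(0:5, 5)
y <- c(rep(0:5, 4), rep(0, 6))
normalized_mi_sketch(x, y)
#> [1] 0.7382948
```

The base of the logarithm cancels in the ratio, so the normalized value does not depend on whether entropy is measured in nats or bits.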

The adjusted mutual information, adjusted_mi, is a correction of the mutual information to account for the probability of chance agreement in a manner similar to the adjusted Rand index or Cohen's kappa.

Usage

rand_index(x, y, na.rm = FALSE)

adj_rand(x, y, na.rm = FALSE)

cohen_kappa(x, y, na.rm = FALSE)

normalized_mi(x, y, na.rm = FALSE)

adjusted_mi(x, y, na.rm = FALSE)

Arguments

x, y: a numeric or factor vector or array
na.rm: whether to remove NA values. Defaults to FALSE. If TRUE, pairwise deletion is performed.
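
A minimal sketch of what pairwise deletion usually means for two vectors, assuming that a position is dropped from both x and y whenever either value is NA (the package's internal handling is not shown here and may differ in detail):

```r
# Assumed behavior, shown with base R only: keep positions where both values are present.
x <- c(1, 2, NA, 2, 1)
y <- c(1, NA, 2, 2, 1)
keep <- !is.na(x) & !is.na(y)  # same result as complete.cases(x, y)
x[keep]
#> [1] 1 2 1
y[keep]
#> [1] 1 2 1
```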

Value

The similarity index, which is between 0 and 1 for most of the measures. The adjusted Rand index and Cohen's kappa can be negative, but are bounded above by 1.

References

W. M. Rand (1971). "Objective criteria for the evaluation of clustering methods". Journal of the American Statistical Association. 66 (336): 846–850. doi:10.2307/2284239

Lawrence Hubert and Phipps Arabie (1985). "Comparing partitions". Journal of Classification. 2 (1): 193–218. doi:10.1007/BF01908075

Jacob Cohen (1960). "A coefficient of agreement for nominal scales". Educational and Psychological Measurement. 20 (1): 37–46. doi:10.1177/001316446002000104

Paul Jaccard (1912). "The distribution of the flora in the alpine zone". New Phytologist. 11 (2): 37–50. doi:10.1111/j.1469-8137.1912.tb05611.x

Nguyen Xuan Vinh, Julien Epps, and James Bailey (2010). "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance". Journal of Machine Learning Research. 11: 2837–2854. https://jmlr.org/papers/v11/vinh10a.html

Examples

x <- rep(0:5, 5)
y <- c(rep(0:5, 4), rep(0, 6))
# Simple Matching, or Accuracy
mean(x == y)
#> [1] 0.8333333
# Hamming distance
sum(x != y)
#> [1] 5
rand_index(x, y)
#> [1] 0.8735632
adj_rand(x, y)
#> [1] 0.5188537
cohen_kappa(x, y)
#> [1] 0.8
normalized_mi(x, y)
#> [1] 0.7382948
adjusted_mi(x, y)
#> [1] 0.7213417