## Clustering and Contingency Tables

Let *X* be a set of *N* data points {*x*_{1}, *x*_{2}, *x*_{3}, ..., *x*_{N}}. Given two clusterings of *X*, namely *A* = {*A*_{1}, *A*_{2}, *A*_{3}, ..., *A*_{R}} with *R* clusters and *B* = {*B*_{1}, *B*_{2}, *B*_{3}, ..., *B*_{C}} with *C* clusters, the information on cluster overlap between *A* and *B* can be summarized in the form of an *R* × *C* contingency table (CT), as illustrated in Figure 1. Every element of *X* contributes to the cell of the corresponding clusters in both *A* and *B*.
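As a rough illustration of how such a table could be tabulated, the sketch below counts, for every point, the pair (cluster in *A*, cluster in *B*). It assumes each clustering is given as a list of cluster labels, one per data point; the function name `contingency_table` and the toy labels are hypothetical and chosen only for this example.

```python
from collections import Counter

def contingency_table(labels_a, labels_b):
    """Build the R x C contingency table for two clusterings of the same points.

    labels_a[k] and labels_b[k] are the cluster labels of point x_k in A and B
    (an assumed input format, not one prescribed by the text).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same N points")
    rows = sorted(set(labels_a))  # clusters of A, one per row
    cols = sorted(set(labels_b))  # clusters of B, one per column
    counts = Counter(zip(labels_a, labels_b))
    # Cell (i, j) holds n_ij, the number of points in both A_i and B_j.
    return [[counts[(r, c)] for c in cols] for r in rows]

# Toy data: N = 6 points, R = 3 clusters in A, C = 2 clusters in B.
labels_a = [0, 0, 0, 1, 1, 2]
labels_b = [0, 0, 1, 1, 1, 1]
table = contingency_table(labels_a, labels_b)
print(table)  # [[2, 1], [0, 2], [0, 1]]
```

Each point is counted exactly once, so the entries of the table sum to *N*.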

Figure 1: Contingency table. *n*_{ij} denotes the number of elements that are common to clusters *A*_{i} and *B*_{j}.

Focusing on the pairwise agreement, the information in the CT can be further condensed in a mismatch matrix:

Figure 2: Mismatch Matrix. *a*, *b*, *c* and *d* represent counts of unique entity pairs.
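To make the pair counts concrete, here is a minimal brute-force sketch that enumerates all *N*(*N* − 1)/2 unique pairs directly from the label lists. Since the matrix in Figure 2 is not reproduced here, the assignment of letters is an assumption: *a* counts pairs placed together in both *A* and *B*, *b* pairs together in *A* only, *c* pairs together in *B* only, and *d* pairs separated in both. The function name `mismatch_counts` and the toy labels are hypothetical.

```python
from itertools import combinations

def mismatch_counts(labels_a, labels_b):
    """Count unique point pairs by agreement between two clusterings.

    Assumed convention (the source figure may differ):
      a: same cluster in A and same cluster in B
      b: same cluster in A, different clusters in B
      c: different clusters in A, same cluster in B
      d: different clusters in both
    """
    a = b = c = d = 0
    for k, m in combinations(range(len(labels_a)), 2):
        same_a = labels_a[k] == labels_a[m]
        same_b = labels_b[k] == labels_b[m]
        if same_a and same_b:
            a += 1
        elif same_a:
            b += 1
        elif same_b:
            c += 1
        else:
            d += 1
    return a, b, c, d

# Toy data: N = 6 points, so there are 6 * 5 / 2 = 15 unique pairs.
pair_labels_a = [0, 0, 0, 1, 1, 2]
pair_labels_b = [0, 0, 1, 1, 1, 1]
print(mismatch_counts(pair_labels_a, pair_labels_b))  # (2, 2, 5, 6)
```

This enumeration takes time quadratic in *N*, which is fine as a check on small examples but is the main reason closed-form expressions in terms of the CT entries are preferable in practice.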

Explicit formulae for calculating *a*, *b*, *c* and *d* in the mismatch matrix can be constructed using entries in the CT (Hubert & Arabie, 1985):