Contingency Table Fit

Definition

Contingency Table Fit (CTF) measures how well a Bayesian network $B$ represents the Joint Probability Distribution, compared to a complete (i.e., fully connected) network $C$.

BayesiaLab’s CTF is defined as:

$\displaystyle C_B = 100 \times \frac{H_U(\mathcal{D}) - H_B(\mathcal{D})}{H_U(\mathcal{D}) - H_C(\mathcal{D})}$

where

$H_U(\mathcal{D})$ is the entropy of the data with the unconnected network $U$.
$H_B(\mathcal{D})$ is the entropy of the data with the evaluated network $B$.
$H_C(\mathcal{D})$ is the entropy of the data with the complete (i.e., fully connected) network $C$. In the complete network, all nodes are directly connected to all other nodes. Therefore, the complete network $C$ is an exact representation of the chain rule. As such, it does not utilize any conditional independence assumptions for representing the Joint Probability Distribution.
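
The definition above can be illustrated with a minimal sketch on a toy dataset of three binary variables. It assumes that "the entropy of the data with a network" is the average negative log2-probability of the records under that network, with all parameters estimated from the same data by maximum likelihood. The function names (`data_entropy`, `ctf`), the toy records, and the chain structure X -> Y -> Z chosen for $B$ are hypothetical and are not part of BayesiaLab.

```python
import math
from collections import Counter

def data_entropy(records, prob):
    """Average negative log2-probability of the records under a model (in bits).
    Assumption: this is what 'entropy of the data with a network' refers to."""
    return -sum(math.log2(prob(r)) for r in records) / len(records)

def ctf(h_u, h_b, h_c):
    """Contingency Table Fit as defined above, on a 0-100 scale."""
    return 100.0 * (h_u - h_b) / (h_u - h_c)

# Hypothetical toy data: records are (x, y, z) with binary values.
records = (
    [(0, 0, 0)] * 4 + [(0, 0, 1)] + [(0, 1, 1)]
    + [(1, 1, 1)] * 4 + [(1, 1, 0)] + [(1, 0, 0)]
)
n = len(records)

# Counts needed for the maximum-likelihood parameters of each network.
cx = Counter(r[0] for r in records)
cy = Counter(r[1] for r in records)
cz = Counter(r[2] for r in records)
cxy = Counter((r[0], r[1]) for r in records)
cyz = Counter((r[1], r[2]) for r in records)
cxyz = Counter(records)

# Unconnected network U: P(x) P(y) P(z).
def p_u(r):
    return (cx[r[0]] / n) * (cy[r[1]] / n) * (cz[r[2]] / n)

# Evaluated network B (hypothetical chain X -> Y -> Z): P(x) P(y|x) P(z|y).
def p_b(r):
    return (cxy[(r[0], r[1])] / n) * (cyz[(r[1], r[2])] / cy[r[1]])

# Complete network C: the full empirical joint P(x, y, z).
def p_c(r):
    return cxyz[r] / n

h_u = data_entropy(records, p_u)
h_b = data_entropy(records, p_b)
h_c = data_entropy(records, p_c)
score = ctf(h_u, h_b, h_c)
print(f"H_U={h_u:.3f}  H_B={h_b:.3f}  H_C={h_c:.3f}  CTF={score:.1f}")
```

Because every network here is fitted to the same records, the chain's entropy falls between those of the unconnected and the complete network, so the resulting score lies between 0 and 100.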

Interpretation

${C_B}$ is equal to 100 if the Joint Probability Distribution is represented without any approximation, i.e., the entropy of the evaluated network $B$ is the same as the one obtained with the complete network $C$.
${C_B}$ is equal to 0 if the Joint Probability Distribution is represented by considering that all the variables are independent, i.e., the entropy of the evaluated network $B$ is the same as the one obtained with the unconnected network $U$.
${C_B}$ can also be negative if the parameters of network $B$ do not correspond to the dataset, i.e., if $H_B(\mathcal{D}) > H_U(\mathcal{D})$, so that $B$ represents the data worse than the unconnected network does.
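
The three cases above, plus an intermediate one, can be checked numerically with a small sketch. The entropy values (in bits) are made up purely for illustration, and the helper name `ctf` is hypothetical rather than a BayesiaLab function.

```python
def ctf(h_u, h_b, h_c):
    """Contingency Table Fit from the three entropies, on a 0-100 scale."""
    if h_u == h_c:
        # The unconnected network already matches the complete one,
        # so the ratio is undefined.
        raise ValueError("H_U equals H_C: CTF is undefined")
    return 100.0 * (h_u - h_b) / (h_u - h_c)

# Hypothetical entropies for the unconnected and complete networks.
H_U, H_C = 3.0, 2.0

exact = ctf(H_U, 2.0, H_C)        # B matches the complete network
independent = ctf(H_U, 3.0, H_C)  # B is no better than the unconnected network
partial = ctf(H_U, 2.25, H_C)     # B closes three quarters of the gap
mismatched = ctf(H_U, 3.5, H_C)   # B's parameters do not fit the data

print(exact, independent, partial, mismatched)  # 100.0 0.0 75.0 -50.0
```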

Q&A

CTF Value

Question

How much weight should one place on the CTF metric reported in the Data Clustering results as a measure of the “goodness” of a cluster/segmentation solution? Across about a dozen different datasets (using both latent and manifest variables as inputs), I have never seen a CTF higher than about 11%, which is substantially lower than what is mentioned in the Data Clustering webinar.

I am just trying to understand whether this might be a function of the data, the modeling assumptions/structure I am making/imposing, or something else.

Answer

When used as a quality measure for Data Clustering, the CTF measures how well the states of your Cluster node summarize the Joint. The size of the Joint (defined by the number of variables and their states) therefore has a great impact on how difficult it is to obtain a good summary. The second difficulty comes from the strength of the probabilistic interactions between the variables. The weaker the relationships, the more difficult it is to summarize the Joint (as the probabilities are then spread all over the hypercube).

When the CTF is used as a quality measure for an induced Factor (during Multiple Clustering), we usually use a warning threshold of 75% when trying to summarize a Joint defined by 5 manifests. If each manifest has 5 states, the Joint is made of $5^5 = 3125$ cells. A Factor with 5 states that has a CTF of 75% is then representing 75% of these 3125 cells with only 5 states. Note that for this kind of task, the manifests have been clustered together due to the strength of their interactions.
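
The cell count in this example, and how quickly the Joint grows with more variables, can be reproduced with a short sketch. The helper `joint_size` and the variable names are hypothetical. The ten-variable case is added only to illustrate the growth and does not come from the answer above.

```python
import math

def joint_size(state_counts):
    """Number of cells in the Joint: the product of the variables' state counts."""
    return math.prod(state_counts)

manifest_states = [5, 5, 5, 5, 5]  # five manifests with five states each
factor_states = 5                  # states of the induced Factor

cells = joint_size(manifest_states)
cells_per_factor_state = cells / factor_states

# Illustration only: doubling the number of 5-state variables.
cells_ten_variables = joint_size([5] * 10)

print(cells, cells_per_factor_state, cells_ten_variables)  # 3125 625.0 9765625
```

On average, each Factor state stands in for 625 cells of the Joint in the 5-manifest case, which gives a sense of why a high CTF becomes harder to reach as the number of variables or states increases.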