
CTF Value

Question

How much weight should one place on the CTF metric reported in the Data Clustering results as a measure of the "goodness" of a cluster/segmentation solution? Across about a dozen different datasets (using both latent and manifest variables as inputs), I have never seen a CTF higher than about 11%, which is substantially lower than the values mentioned in the Data Clustering webinar.

I am just trying to understand whether this might be a function of the data, of the modeling assumptions or structure I am imposing, or of something else.

Answer

When used as a quality measure for Data Clustering, the CTF (Contingency Table Fit) measures how well the states of your Cluster node summarize the Joint, that is, the joint probability distribution of the clustered variables. Two things make a good summary harder to obtain. The first is the size of the Joint, which is defined by the number of variables and their states. The second is the strength of the probabilistic interactions between the variables: the weaker the relationships, the harder it is to summarize the Joint, because the probabilities are then spread all over the hypercube.
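To make the "spread all over the hypercube" point concrete, here is a minimal Python sketch. It does not compute the CTF and is not how BayesiaLab works internally. It assumes a toy model, invented for illustration, in which 5 binary variables are driven by a hidden 2-state cluster, and each variable agrees with the cluster state with probability `p_match`. The function names are hypothetical. The sketch counts how many cells of the Joint are needed to hold 90% of the probability mass.

```python
from itertools import product


def joint_from_cluster(n_vars, p_match):
    """Joint over n binary variables generated by a hidden binary cluster.

    Toy assumption: the cluster has a uniform prior, and each variable
    independently equals the cluster state with probability p_match.
    """
    joint = {}
    for cell in product((0, 1), repeat=n_vars):
        prob = 0.0
        for cluster in (0, 1):
            p = 0.5  # uniform prior on the hidden cluster
            for value in cell:
                p *= p_match if value == cluster else 1.0 - p_match
            prob += p
        joint[cell] = prob
    return joint


def cells_to_cover(joint, mass=0.9):
    """Smallest number of cells (largest first) whose total probability >= mass."""
    total = 0.0
    for count, prob in enumerate(sorted(joint.values(), reverse=True), start=1):
        total += prob
        if total >= mass:
            return count
    return len(joint)


strong = joint_from_cluster(5, 0.95)  # strong relationships with the cluster
weak = joint_from_cluster(5, 0.60)    # weak relationships with the cluster

print(len(strong), cells_to_cover(strong), cells_to_cover(weak))
```

Under these assumptions, the strongly related variables concentrate 90% of the mass in 9 of the 32 cells, while the weakly related ones need 29 of 32. A two-state summary has much less to work with in the second case.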

When used as a quality measure for an induced Factor (during Multiple Clustering), we usually use a warning threshold of 75% when trying to summarize a Joint defined by 5 manifests. If each manifest has 5 states, the Joint is made of 5^5 = 3,125 cells. A Factor with 5 states and a CTF of 75% is then representing 75% of these 3,125 cells with only 5 states. Note that for this kind of task, the manifests have been clustered together precisely because of the strength of their interactions, so much higher CTF values can be expected there than in Data Clustering over an arbitrary set of variables.
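The size of the Joint is simply the product of the number of states of each variable. The short Python sketch below reproduces the figure for 5 manifests with 5 states each and shows how quickly the Joint grows when more variables are added. The helper name `joint_cells` and the 20-variable case are illustrative assumptions, not values from BayesiaLab.

```python
from math import prod


def joint_cells(states_per_variable):
    """Number of cells in the Joint: the product of each variable's state count."""
    return prod(states_per_variable)


factor_states = 5

# The Multiple Clustering case from the text: 5 manifests, 5 states each.
small_joint = joint_cells([5] * 5)

# A hypothetical Data Clustering case: 20 variables, 5 states each.
large_joint = joint_cells([5] * 20)

print(small_joint, small_joint // factor_states)  # cells, and cells per Factor state
print(large_joint)
```

With 5 manifests, each Factor state stands in for 625 cells on average. With 20 variables the Joint has 5^20 cells, roughly 9.5 × 10^13, so the same small number of Cluster states is being asked to summarize a vastly larger table.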


Copyright © 2025 Bayesia S.A.S., Bayesia USA, LLC, and Bayesia Singapore Pte. Ltd. All Rights Reserved.