K-Means Clustering

Context

This menu allows clustering data in an unsupervised way with the k-means clustering algorithm, in order to find partitions of homogeneous elements.

Originally defined for real-valued variables only, this algorithm is extended here to handle discrete variables as well. Before running the k-means clustering, the user must select the variables to which the algorithm will be applied. These variables can be either continuous or discrete, but they must not be hidden, that is, they must have associated data. The clustering then partitions the data into the chosen number of clusters, so that each observation belongs to the cluster with the nearest mean. K-means is similar to the expectation-maximization algorithm for mixtures of Gaussians: both attempt to find the centers of natural clusters in the data, and both proceed by iterative refinement.
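The iterative refinement described above can be illustrated with a minimal sketch of the textbook k-means procedure (Lloyd's algorithm) written with NumPy. This is not the implementation used by this menu: the function name kmeans, its parameters, and the random initialization are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Textbook Lloyd's algorithm on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct observations picked at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each observation joins the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its current members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers no longer move: converged
        centers = new_centers
    return labels, centers

# Two well-separated groups of three observations each.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans(X, k=2)
```

On this toy data the first three rows end up in one cluster and the last three in the other; which cluster receives index 0 depends on the initialization.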

The values of the selected nodes used by the clustering algorithm are computed as follows, using the first case that applies:

If the node is continuous and there are continuous values in the database associated with this node, these continuous values are used.
If the node has values associated with its states, these values are used.
If the node is continuous, the center of each interval is used.
If the node is discrete with integer or real states, the values of the states are used.
If none of the previous cases applies, integer values are assigned to the states, starting from 0.
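The cases above can be sketched as a small function. The sketch assumes that the cases are tried in the order listed and that the first matching one wins. The Node class and its fields (data_values, state_values, intervals) are hypothetical stand-ins invented for this example; they do not describe the product's data model.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass
class Node:
    """Hypothetical stand-in for a node, for illustration only."""
    continuous: bool
    states: Sequence[str]
    data_values: Optional[Sequence[float]] = None            # raw continuous data, one per row
    state_values: Optional[Sequence[float]] = None           # values attached to the states
    intervals: Optional[Sequence[Tuple[float, float]]] = None  # (low, high) per state

def _parse_number(label):
    """Return the label as a float, or None if it is not numeric."""
    try:
        return float(label)
    except ValueError:
        return None

def clustering_values(node, observed_states):
    """Return one numeric value per row, using the first case that applies."""
    # Case 1: continuous node with continuous values in the database.
    if node.continuous and node.data_values is not None:
        return list(node.data_values)
    # Cases 2 to 5 first map each state to a number.
    if node.state_values is not None:                        # case 2
        per_state = list(node.state_values)
    elif node.continuous and node.intervals is not None:     # case 3
        per_state = [(low + high) / 2 for low, high in node.intervals]
    else:
        parsed = [_parse_number(s) for s in node.states]
        if all(p is not None for p in parsed):               # case 4
            per_state = parsed
        else:                                                # case 5
            per_state = list(range(len(node.states)))
    index = {state: i for i, state in enumerate(node.states)}
    return [per_state[index[s]] for s in observed_states]

binned = Node(continuous=True, states=["<=10", ">10"], intervals=[(0, 10), (10, 30)])
numeric = Node(continuous=False, states=["1", "2.5"])
symbolic = Node(continuous=False, states=["red", "green", "blue"])
```

For example, clustering_values(symbolic, ["blue", "red"]) falls through to the last case and maps the states to their positions in the state list.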

In all cases, Filtered Values are skipped and the data are standardized.

If there are missing values in the database for the selected variables, the corresponding rows are skipped and not used for the clustering. The weights associated with the chosen rows are taken into account.
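The preprocessing described above can be sketched as follows. The text does not say how the standardization is computed, so the sketch assumes a z-score based on the weighted mean and weighted variance of the retained rows. Missing values are represented by NaN, filtered values are not modeled, and the function name prepare is hypothetical.

```python
import numpy as np

def prepare(X, weights):
    """Drop rows with missing values, then standardize each column.

    Assumption: z-scores use the weighted mean and weighted variance
    of the rows that remain.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Rows with a missing value (NaN) in any selected variable are skipped.
    keep = ~np.isnan(X).any(axis=1)
    X_kept, w_kept = X[keep], w[keep]
    mean = np.average(X_kept, axis=0, weights=w_kept)
    var = np.average((X_kept - mean) ** 2, axis=0, weights=w_kept)
    std = np.sqrt(var)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    Z = (X_kept - mean) / std
    return Z, w_kept, keep

X = [[1.0, 10.0],
     [np.nan, 20.0],   # skipped: missing value in the first variable
     [3.0, 30.0],
     [5.0, 50.0]]
weights = [1.0, 1.0, 2.0, 1.0]
Z, w_kept, keep = prepare(X, weights)
```

The boolean mask keep records which rows were used, which is useful afterwards for placing the cluster assignments back on the original rows.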

Once the clustering has finished, a node named Clusters is created, with one state for each cluster found by the algorithm.

The data corresponding to the newly created Clusters node are added to the database. In each row that contains missing values for the selected variables, a missing value is saved.
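Writing the result back to the database can be sketched as below, under the assumption that rows skipped because of missing values receive a missing value (None here) in the Clusters column. The function name cluster_column and the state names C1, C2, ... are hypothetical.

```python
def cluster_column(labels, keep, prefix="C"):
    """Build the Clusters column for every row of the original database.

    labels: zero-based cluster index for each row that was clustered, in order.
    keep:   one boolean per original row, True if the row was clustered.
    """
    remaining = iter(labels)
    # Clustered rows take the next label; skipped rows get a missing value.
    return [f"{prefix}{next(remaining) + 1}" if used else None for used in keep]

column = cluster_column(labels=[0, 1, 0], keep=[True, False, True, True])
```

Here the second row was skipped by the clustering, so its entry in the column is None, while the other three rows carry the state of their cluster.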

The following dialog box allows you to enter the desired number of cluster states. This number must be greater than or equal to 2:
