K-Means Clustering

Context

This menu runs unsupervised clustering with the k-means algorithm to partition the data into groups of homogeneous elements.

Originally used only on real variables, the algorithm has been extended to handle discrete variables as well. Before running k-means clustering, you must select the variables to which the algorithm applies. These variables can be either continuous or discrete, but they must not be hidden and must have associated data. Once the variables are selected, the clustering groups the data into the chosen number of clusters so that each observation belongs to the cluster with the nearest mean. K-means is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data and both use an iterative refinement approach.
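The iterative refinement mentioned above can be illustrated with a minimal sketch of the textbook k-means procedure (Lloyd's algorithm). This is not the product's implementation: the function name `kmeans`, its parameters, and the random initialization are assumptions made for illustration, and row weights are ignored for brevity.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Textbook Lloyd's algorithm on a list of equal-length numeric tuples.

    Hypothetical helper for illustration only; requires len(points) >= k.
    """
    if k < 2:
        raise ValueError("the number of clusters must be at least 2")
    rng = random.Random(seed)
    # Assumed initialization: k rows picked at random as starting centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to the nearest center
        # (squared Euclidean distance).
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])),
            )
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
        # Stop as soon as the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers
```

For example, calling `kmeans(points, 2)` on two well-separated groups of points should place each group in its own cluster, although the cluster indices themselves depend on the random initialization.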

The values of the selected nodes used by the clustering algorithm are determined by the first of the following cases that applies:

  • If the node is continuous and there are continuous values in the database associated with this node, these continuous values are used.
  • If the node has values associated with its states, these values are used.
  • If the node is continuous, the center of each interval is used.
  • If the node is discrete with integer or real states, the values of the states are used.
  • If none of the previous cases applies, integer values are assigned to each state starting from 0.
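The cases above read as a cascade in which the first matching rule wins. A minimal sketch of that logic follows, assuming a hypothetical dictionary representation of a node; the keys `kind`, `raw_values`, `state_values`, `intervals`, and `states` are invented for this example and are not part of the product.

```python
def numeric_values(node):
    """Return ("rows", values) or ("states", values) for a node.

    `node` is a hypothetical dict used only for this illustration.
    """
    # Case 1: continuous node with continuous data in the database.
    if node["kind"] == "continuous" and node.get("raw_values") is not None:
        return ("rows", list(node["raw_values"]))
    # Case 2: values explicitly associated with the states.
    if node.get("state_values") is not None:
        return ("states", list(node["state_values"]))
    # Case 3: continuous node without raw data, use interval centers.
    if node["kind"] == "continuous":
        return ("states", [(lo + hi) / 2 for lo, hi in node["intervals"]])
    # Case 4: discrete node whose state names are integers or reals.
    try:
        return ("states", [float(s) for s in node["states"]])
    except ValueError:
        # Case 5: fall back to integers starting from 0.
        return ("states", list(range(len(node["states"]))))
```

Under these assumptions, a discrete node with states `low`, `mid`, `high` and no associated values would receive 0, 1, and 2.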

In all cases, Filtered Values are skipped and the data are standardized.
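The text does not say which standardization formula is used. The sketch below assumes the common z-score (subtract the mean, divide by the population standard deviation) and assumes that "skipped" means filtered entries are left out of the statistics and passed through unchanged. The `FILTERED` marker and the `standardize` function are hypothetical names.

```python
from statistics import fmean, pstdev

# Hypothetical marker standing in for a Filtered Value.
FILTERED = object()


def standardize(column):
    """Z-score a column of numbers, ignoring FILTERED entries."""
    kept = [v for v in column if v is not FILTERED]
    mean = fmean(kept)
    std = pstdev(kept)
    if std == 0:
        std = 1.0  # constant column: avoid dividing by zero
    # Filtered entries are passed through untouched.
    return [v if v is FILTERED else (v - mean) / std for v in column]
```

Standardizing puts variables measured on different scales on a comparable footing, so that no single variable dominates the distance computation.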

If there are missing values in the database for the selected variables, the corresponding rows are excluded from the clustering. The weights associated with the rows that are used are taken into account.
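As a rough illustration of these two rules, the sketch below drops incomplete rows and then computes a weighted mean, which is how a cluster center would take row weights into account. It assumes missing values are represented by `None`; both function names are hypothetical.

```python
def complete_rows(rows, weights):
    """Keep only rows with no missing (None) entry, paired with their weights."""
    return [(row, w) for row, w in zip(rows, weights) if None not in row]


def weighted_mean(rows_with_weights):
    """Weighted mean of each coordinate over (row, weight) pairs."""
    total = sum(w for _, w in rows_with_weights)
    dims = len(rows_with_weights[0][0])
    return tuple(
        sum(w * row[d] for row, w in rows_with_weights) / total
        for d in range(dims)
    )
```

With rows `(1.0, 2.0)`, `(None, 5.0)`, `(3.0, 4.0)` and weights 1, 2, 3, the second row is dropped and the weighted mean of the remaining two is `(2.5, 3.5)`.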

Once the clustering has finished, a node named Clusters is created, with one state for each cluster found by the algorithm.

The data corresponding to the newly created node Clusters are added to the database. In each row containing missing values, a missing value is saved for the selected variables.

The following dialog box allows you to enter the desired number of cluster states. This number must be greater than or equal to 2:

[Image: dialog box for entering the number of clusters]