Sunday 10 April 2011

Counting clusters: part III

This process uses agglomerative clustering to create clusters.

As before, the process uses the same artificial data as the k-means and DBScan approaches, along with some of the same cluster validity measures.

The agglomerative clustering operator can create its clusters in three modes: single link, average link and complete link. The process iterates over these modes in addition to the value of k, the number of clusters.
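For anyone who wants to experiment outside RapidMiner, here is a minimal Python sketch of the same idea using scikit-learn; the random data is only a stand-in for the artificial data set, and the validity measures are left as a comment.

# A sketch of looping over linkage mode and k, analogous to the RapidMiner
# process; the random data below is a placeholder, not the blog's data set.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # placeholder for the artificial data

for mode in ("single", "average", "complete"):
    for k in range(2, 13):
        labels = AgglomerativeClustering(n_clusters=k, linkage=mode).fit_predict(X)
        # ...compute validity measures on labels here...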

To extract a flat clustering from an agglomerative model, the "Flatten clustering" operator must be used.

To make the process more efficient, the clustering could be run outside the loop operator, but that is an exercise for another day. Using the date_now() function within a "Generate attributes" operator would be one way to timestamp the runs for anyone interested in measuring the improvement.
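As a rough illustration of that efficiency point, in SciPy the hierarchy can be built once with linkage() and then flattened at each value of k with fcluster(), rather than re-clustering inside the loop; this mirrors the "Flatten clustering" idea above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))      # placeholder data again
Z = linkage(X, method="average")   # the hierarchy is built once, outside the loop
for k in range(2, 13):
    # "flatten" the hierarchy into a clustering with at most k clusters
    labels = fcluster(Z, t=k, criterion="maxclust")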

This process also uses the tricky "Pivot" operator, which allows the validity measures for the different modes to be presented as attributes in their own right. Before the pivot can happen, the "Log to data" operator must be used to turn everything that has been logged into an example set.
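To make the pivot step concrete, here is a rough pandas equivalent: the long-format rows stand in for the "Log to data" output (the density values are invented purely for illustration), and pivot() turns the modes into columns.

import pandas as pd

# Long-format rows standing in for the "Log to data" output;
# the density values are made up for illustration only.
log = pd.DataFrame({
    "k":       [2, 2, 2, 3, 3, 3],
    "mode":    ["single", "average", "complete"] * 2,
    "density": [-9.1, -8.7, -8.2, -7.5, -7.1, -6.9],
})
wide = log.pivot(index="k", columns="mode", values="density")
print(wide)  # one column per mode, one row per k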

Here is an example graph to show how the validity measures vary as k varies. As before, the "correct" answer is 8.

Sunday 3 April 2011

How average within cluster distance is calculated for numeric attributes

The performance for each cluster is calculated first. The cluster density performance operator calculates the average distance between points in a cluster and multiplies this by the number of points minus 1. Euclidean distance is used as the distance measure.

So a cluster containing 1 point would have an average distance of 0 because there are no other points.

A cluster containing 2 points would have a density equal to the distance between the points.

Here's a worked example for 3 points. Firstly, here are the points:


This table shows the Euclidean distances between them:


For example, sqrt((-8.625 - 2.468)^2 + (-5.590 - 7.267)^2) = 16.981.

The average of these distances is 13.726, and the average within-cluster distance performance, if these three points were in a cluster, would be (3-1) times that, or 27.452. I'm not quite sure why this multiplication happens (I have done experiments using 4 points to confirm it, and I've looked at the code). I'll leave this comment as a placeholder for another day.
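For anyone who wants to reproduce the arithmetic, here is a small Python sketch of the calculation as described: the mean of all pairwise Euclidean distances multiplied by (number of points - 1). Only two of the three points appear above, so the check uses the two-point case, where the result equals the distance itself.

import numpy as np
from itertools import combinations

def cluster_density(points):
    # Average pairwise Euclidean distance, multiplied by (n - 1),
    # following the description of the cluster density performance operator.
    points = np.asarray(points, dtype=float)
    n = len(points)
    if n < 2:
        return 0.0  # a single point has no pairwise distances
    dists = [np.linalg.norm(a - b) for a, b in combinations(points, 2)]
    return float(np.mean(dists)) * (n - 1)

# The two points from the worked example: the density of a 2-point cluster
# is just the distance between them.
print(cluster_density([[-8.625, -5.590], [2.468, 7.267]]))  # 16.981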

Here is what RapidMiner shows.


The negative value is imposed by RapidMiner. It seems to be because the best density is the one with the smallest absolute value, so negating it makes the best density the maximum, enabling it to be used as a stopping criterion during optimisation.

The performance for all clusters is calculated by summing each cluster's performance weighted by the number of points it contains, and dividing by the total number of examples. For example, three clusters with performances -9.066, 0 and -8.943, containing 2, 1 and 3 points respectively, would give a result of -7.493, i.e. (2*-9.066 + 1*0 + 3*-8.943)/(2+1+3).
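In code, the weighted combination looks like this (a sketch, reproducing the example numbers above):

def overall_performance(performances, sizes):
    # Weight each cluster's performance by its size, then divide by the
    # total number of examples.
    return sum(p * n for p, n in zip(performances, sizes)) / sum(sizes)

print(overall_performance([-9.066, 0.0, -8.943], [2, 1, 3]))  # -7.4935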

What these values mean is a story for another day.

How item distribution performance is calculated for numeric attributes

The item distribution performance operator calculates its performance in the sum of squares case very simply: the number of examples in each cluster is divided by the total number of examples in all clusters, the result is squared, and the squared values for all clusters are summed.

For example, three clusters containing 1, 2 and 3 items (6 in total) would have an item distribution score of 0.389. This is (1/6)^2 + (2/6)^2 + (3/6)^2 = 1/36 + 4/36 + 9/36 = 14/36 = 0.389.

For a situation where one cluster dominates and the other clusters are very small in comparison, this value will tend to 1. For the opposite situation, where the clusters contain equal numbers of examples, the value tends to 1/N, where N is the number of clusters.
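A short sketch confirms both the worked example and the limiting behaviour described above:

def item_distribution(sizes):
    # Sum of squared cluster-size fractions, as described above.
    total = sum(sizes)
    return sum((n / total) ** 2 for n in sizes)

print(item_distribution([1, 2, 3]))       # 14/36 = 0.389 (the worked example)
print(item_distribution([100, 1, 1]))     # one dominant cluster: close to 1
print(item_distribution([10, 10, 10]))    # equal clusters: 1/3, i.e. 1/N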