# Minimum Description Length

"Pluralitas non est ponenda sine necessitate"

- Friar William of Occam

Occam's Razor states that "entities should not be multiplied unnecessarily." One of the major questions when trying to cluster data is how many clusters to create. Sometimes we have an a priori answer to this question based on knowledge of the data set. Other times we have to fit the data into a certain pre-specified number of categories or management units. When exploring a data set for the purpose of discovering relationships within it, however, it is important to avoid preconceived notions about the complexity of the data.

One way to explore the question of how many clusters is to simply try a number of different clusterings and see what provides the most interesting result. We can also use a concept like Occam's Razor to give us guidance. The main point of Occam's Razor applied to clustering is that, at some point, having more clusters for a given data set is not worth the added information it may provide.

The Minimum Description Length (MDL) principle is a mathematical method for applying Occam's Razor to models of data, and a set of clusters is one such model for a given data set. The MDL principle says that the best model for a set of data is the one that takes the fewest bits to represent. In the case of clusters, the description length is the information needed to encode the cluster parameters plus the information needed to encode the representational error, that is, how far the data deviate from the clusters. Adding clusters increases the first term and decreases the second; the optimal number of clusters is the one where their sum is smallest.
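To make the trade-off concrete, here is a minimal sketch of one possible two-part description length for a clustering. It is an assumption for illustration, not the formula WLV uses: the function name `description_length`, the BIC-style parameter cost, and the Gaussian error cost with constant terms dropped are all choices made here. Because constants are dropped, the result can be negative; only differences between clusterings of the same data are meaningful.

```python
import math

def description_length(points, centers, labels):
    """Approximate two-part code length, in bits, of a clustering.

    Hypothetical helper: a BIC-style approximation, not WLV's formula.
    points and centers are tuples of floats; labels[i] indexes centers.
    """
    n, d, k = len(points), len(points[0]), len(centers)
    # Representational error: squared distance from each point to its center.
    rss = sum(
        (x - c) ** 2
        for p, label in zip(points, labels)
        for x, c in zip(p, centers[label])
    )
    variance = max(rss / (n * d), 1e-12)  # floor avoids log2(0)
    model_bits = 0.5 * k * d * math.log2(n)         # cost of the k*d center coordinates
    assign_bits = n * math.log2(k)                  # cost of naming each point's cluster
    error_bits = 0.5 * n * d * math.log2(variance)  # Gaussian residual cost, constants dropped
    return model_bits + assign_bits + error_bits

# Toy 1-D data with two obvious groups, clustered by hand three ways.
points = [(0.0,), (0.1,), (-0.1,), (10.0,), (10.1,), (9.9,)]
one = description_length(points, [(5.0,)], [0, 0, 0, 0, 0, 0])
two = description_length(points, [(0.0,), (10.0,)], [0, 0, 0, 1, 1, 1])
three = description_length(points, [(-0.05,), (0.1,), (10.0,)], [0, 1, 0, 2, 2, 2])
print(one, two, three)  # the two-cluster description is the shortest
```

With one cluster the error term dominates; with three, the extra parameter and assignment bits outweigh the small reduction in error. Note that the variance floor makes this sketch unreliable when a clustering fits the data exactly.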

The MDL tab in WLV allows the user to run an MDL analysis of a particular data set. It computes many clusterings with different numbers of clusters and calculates the description length of each. It then plots the description length values and suggests a range of values for the number of clusters to use. The screenshot below shows a typical result for the AustraliaCoast data set.
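The kind of sweep described above can be sketched outside WLV as follows. Everything here is an assumption for illustration: WLV's clustering algorithm and description length formula are not specified in this document, so the sketch substitutes plain k-means with deterministic farthest-first seeding and a BIC-style description length. The names `kmeans`, `description_length`, and `mdl_sweep` are hypothetical.

```python
import math

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50):
    """Plain k-means; farthest-first seeding keeps the example deterministic."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        # Assign each point to its nearest center, then move centers to the means.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
    return centers, labels

def description_length(points, centers, labels):
    """BIC-style code length in bits (an assumed formula, constants dropped)."""
    n, d, k = len(points), len(points[0]), len(centers)
    rss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    variance = max(rss / (n * d), 1e-12)  # floor avoids log2(0)
    model_and_labels = 0.5 * k * d * math.log2(n) + n * math.log2(k)
    return model_and_labels + 0.5 * n * d * math.log2(variance)

def mdl_sweep(points, k_start, k_end):
    """Cluster once per k in [k_start, k_end] and record each description length."""
    results = {}
    for k in range(k_start, k_end + 1):
        centers, labels = kmeans(points, k)
        results[k] = description_length(points, centers, labels)
    return results

# Synthetic data: three tight 3x3 grids of points centered at 0, 10, and 20.
offsets = (-0.2, 0.0, 0.2)
points = [(c + dx, c + dy) for c in (0.0, 10.0, 20.0) for dx in offsets for dy in offsets]
scores = mdl_sweep(points, 1, 5)
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

On this synthetic data the shortest description is expected at three clusters. Real data rarely has such a sharp minimum, which is why a range of values is more useful than a single number.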

In our experience with the MDL tool, the number of suggested clusters tends to be higher than experts have found useful. However, the low end of the suggested MDL range tends to be acceptable. In the graph of MDL values for this data set, the description lengths from 10 to 16 clusters all fall within a similar range. For this data set, experts have found 10 to 12 clusters to be a useful number. Below that, important features in the coastline get merged together and lost. Above 16, the value of additional clusters to human analysis is unclear.
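One simple way to turn a curve of description lengths into a suggested range is to keep every cluster count whose value lies close to the minimum. The sketch below is an assumption, not WLV's rule: the function `suggest_range`, the 5% tolerance, and the input numbers are all invented for illustration and are not output from WLV or from the AustraliaCoast data set.

```python
def suggest_range(dl_by_k, tolerance=0.05):
    """Return (low, high): the smallest and largest k whose description
    length is within `tolerance` of the curve's spread above the minimum.

    Hypothetical rule; the k values between low and high are not checked.
    """
    best = min(dl_by_k.values())
    spread = max(dl_by_k.values()) - best
    cutoff = best + tolerance * spread
    near = sorted(k for k, dl in dl_by_k.items() if dl <= cutoff)
    return near[0], near[-1]

# Invented description lengths (bits) with a flat region, for illustration only.
dl_by_k = {6: 980.0, 8: 905.0, 10: 861.0, 12: 858.0,
           14: 856.0, 16: 860.0, 18: 900.0, 20: 940.0}
low, high = suggest_range(dl_by_k)
print(low, high)  # 10 16
```

Following the advice above, the low end of the returned range is usually the better starting point for exploration.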

To execute an MDL analysis, first select the data set to analyze. Then enter a starting and ending number of clusters to examine. Note that as the number of clusters gets higher, the clustering takes longer, so keep the End number of clusters as small as is reasonable for faster analysis. Once these fields are set, click on Do MDL. When the analysis is complete, you can click on the MDL File in the list box and then click View to see the chart and plot of the MDL results. Clicking on the Variables button will show you the active variables for that particular MDL analysis. Note that you will usually get different MDL results for different variables.

Use the MDL tool as a guide in your exploration. As with all tools, use your judgement as to whether the results make sense. If they do not make sense, figuring out why can often lead to new insights about the data set.