Kansas Geological Survey, Computer Contributions 38, originally published in 1969

FORTRAN II Programs for 8 Methods of Cluster Analysis (CLUSTAN I)

by David Wishart

University of St. Andrews

Introduction

The intensity of study of numerical classification methods during the last decade is probably due to 3 factors: (a) the difficulty (perhaps impossibility) of defining a general model for a wide range of applications and all types of data, (b) the theoretical problems arising from nonstandard data distributions (the mere existence of a classification problem practically implies heterogeneous data), and (c) the advent of widespread computational facilities. In the absence of a generally accepted theory of classification, considerable emphasis has been placed on intuitive rules and empirically justified procedures. With such weak foundations, it is inevitable that at any indecisive stage there will be recourse to the original trial-and-error techniques. Consequently, researchers who discover new classification problems, or measure new data sets for existing subjects, will often be completely satisfied only by the familiar many-method comparative exercise. Such experiments require computational facilities and ready-made programs because of the enormous number of calculations these procedures involve. CLUSTAN I has been developed to meet these demands.

Once the decision has been made to prepare a comprehensive suite of classification programs, it is immediately apparent that certain generalizations can be made to simplify the design of the system. For example, most clustering methods require the initial computation of a similarity matrix; hence we can confine this calculation to one general routine (CORREL), and use smaller programs for each of the individual methods. In order to allow for the addition to the system of those methods which do not use a similarity matrix, a separate initial routine (FILE) should be used to control all data input, transformations and evaluation of background statistics. Finally, because the results obtained from the clustering programs can be expressed in a standard form, a single routine (RESULT) can be designed to control all necessary cluster interpretation functions. In general, an ideal clustering program library should be sufficiently flexible to be used with all types of data, allow for the introduction of additional clustering programs, handle reasonably sized problems, and be written in a form suitable for use on a wide range of machines.
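
The division of labor described above can be sketched in a few lines of modern Python. This is only an illustration of the design, not the original FORTRAN II code. The function names standardize, distance_matrix, agglomerate and report are hypothetical stand-ins for the FILE, CORREL, clustering-method and RESULT stages, and both the z-score transformation and the use of Euclidean distance as the (dis)similarity measure are assumptions made for the example.

```python
from math import sqrt


def standardize(rows):
    """FILE-like stage (assumed transformation): z-score each variable."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # Population standard deviation; a constant variable gets a divisor of 1.
    sds = [sqrt(sum((x - m) ** 2 for x in col) / n) or 1.0
           for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def distance_matrix(data):
    """CORREL-like stage: compute every pairwise Euclidean distance once."""
    n = len(data)
    return [[sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
             for j in range(n)] for i in range(n)]


def agglomerate(dist, linkage):
    """Method stage: repeatedly merge the two closest clusters.

    `linkage` reduces the pairwise distances between two clusters to one
    number: `min` gives nearest neighbor, `max` gives farthest neighbor.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j]
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


def report(merges):
    """RESULT-like stage: one standard text form for any method's output."""
    return [f"{left} + {right} at distance {d:.2f}"
            for left, right, d in merges]


# The distance matrix is computed once and shared by both methods.
points = [[0.0], [1.0], [3.0], [7.0]]
dist = distance_matrix(points)
single = agglomerate(dist, min)    # nearest neighbor
complete = agglomerate(dist, max)  # farthest neighbor
```

Because each method receives the same matrix and returns merges in the same form, adding another method only requires another small function; the input and reporting stages are untouched.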

The present set of programs (CLUSTAN I) was developed to meet these considerations as a design system from which a more sophisticated version, CLUSTAN II, could evolve. Although it contains some major omissions, such as the provision for 'missing' data, CLUSTAN I is reasonably flexible, and can classify by the following 8 standard methods:

1. Nearest neighbor
2. Farthest neighbor
3. Group average
4. Centroid
5. Median
6. Ward's error sum
7. Lance-Williams flexible
8. Mode analysis
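
In the general literature, the first seven of these methods are commonly written as special cases of one distance-update rule, the Lance-Williams recurrence: when clusters i and j merge, the distance from any other cluster k to the new cluster is a_i*d(k,i) + a_j*d(k,j) + b*d(i,j) + g*|d(k,i) - d(k,j)|. The sketch below tabulates the standard coefficients. It is not taken from the CLUSTAN I source, and whether the programs are organized this way is not stated here. The function and method names are hypothetical, the default beta of -0.25 is only an illustrative value, the centroid, median and Ward coefficients assume squared Euclidean distances, and mode analysis is not covered by the recurrence.

```python
def lance_williams_coefficients(method, n_i, n_j, n_k, beta=-0.25):
    """Return (a_i, a_j, b, g) for merging clusters i and j, seen from k.

    n_i, n_j, n_k are the cluster sizes. `beta` is used only by the
    flexible method; -0.25 is an illustrative default, not a requirement.
    """
    if method == "nearest":
        return 0.5, 0.5, 0.0, -0.5
    if method == "farthest":
        return 0.5, 0.5, 0.0, 0.5
    if method == "group_average":
        n = n_i + n_j
        return n_i / n, n_j / n, 0.0, 0.0
    if method == "centroid":  # squared Euclidean distances assumed
        n = n_i + n_j
        return n_i / n, n_j / n, -n_i * n_j / n ** 2, 0.0
    if method == "median":  # squared Euclidean distances assumed
        return 0.5, 0.5, -0.25, 0.0
    if method == "ward":  # squared Euclidean distances assumed
        n = n_i + n_j + n_k
        return (n_i + n_k) / n, (n_j + n_k) / n, -n_k / n, 0.0
    if method == "flexible":
        a = (1.0 - beta) / 2.0
        return a, a, beta, 0.0
    raise ValueError(f"unknown method: {method}")


def updated_distance(method, d_ki, d_kj, d_ij,
                     n_i=1, n_j=1, n_k=1, beta=-0.25):
    """Distance from cluster k to the cluster formed by merging i and j."""
    a_i, a_j, b, g = lance_williams_coefficients(method, n_i, n_j, n_k, beta)
    return a_i * d_ki + a_j * d_kj + b * d_ij + g * abs(d_ki - d_kj)
```

With d(k,i) = 3, d(k,j) = 2 and d(i,j) = 1, the nearest neighbor coefficients reproduce the smaller of the two distances and the farthest neighbor coefficients the larger, which is a quick check that the table is wired correctly.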

The configuration of methods 1 to 8 is currently being extended to include the following additional methods:

Information analysis
k-dendrogram ultrametric
Association analysis
Divisive and agglomerative group analysis

The present interest in CLUSTAN I lies mainly in testing its ease of implementation and extent of usage. The experience gained from the use of the system, together with suggestions for its improvement, will contribute toward the definition and development of CLUSTAN II in 1970.

The complete text of this report is available as an Adobe Acrobat PDF file.

Read the PDF version (13.8 MB)

Kansas Geological Survey
Placed on web Sept. 11, 2019; originally published 1969.
The URL for this page is http://www.kgs.ku.edu/Publications/Bulletins/CC/38/index.html