Dataset Review, Refinement, and Editing:

Reviewing the values and characteristics of a selected variable
Filtering (including, excluding, and modifying only specific values for a variable)
Transforming variables (systematic mathematical modification of values)
Analyzing relationships (correlation matrices and scatter plots)

Once the geographic range, the cell type(s) and the desired variables have been selected from the environmental database, the variable review and summary page offers a number of options.

Reviewing the values and characteristics of a selected variable

The "info" button immediately following the variable name will display a statistical summary of the data set for the selected variable in the selected region. This includes the mean, standard deviation, maximum and minimum values, the total number of cells selected, the number of cells populated with data for the variable, and similar information. Also displayed is a ten-interval histogram plot of the variable values. The histogram display can be modified to have different limiting values or numbers of intervals.
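For users who want to reproduce this kind of summary off-line on a downloaded data set, the following is a minimal Python sketch. It is not the code behind the "info" button: the function name summarize, the use of the sample (n - 1) standard deviation, and the treatment of -9999 as the unpopulated-cell marker are assumptions made for illustration.

```python
import math

NULL_VALUE = -9999  # assumed marker for cells with no data for the variable


def summarize(values, bins=10, low=None, high=None):
    """Return summary statistics and an equal-width histogram for one variable."""
    populated = [v for v in values if v != NULL_VALUE]
    if not populated:
        raise ValueError("no populated cells to summarize")
    n = len(populated)
    mean = sum(populated) / n
    # Sample standard deviation (n - 1 denominator); the tool may use another form.
    std = math.sqrt(sum((v - mean) ** 2 for v in populated) / (n - 1)) if n > 1 else 0.0
    # Histogram limits default to the data range but can be overridden.
    low = min(populated) if low is None else low
    high = max(populated) if high is None else high
    width = (high - low) / bins
    counts = [0] * bins
    for v in populated:
        if low <= v <= high:
            # A value equal to the upper limit is counted in the last interval.
            idx = min(int((v - low) / width), bins - 1) if width > 0 else 0
            counts[idx] += 1
    return {
        "total_cells": len(values),
        "populated_cells": n,
        "mean": mean,
        "std": std,
        "min": min(populated),
        "max": max(populated),
        "histogram": counts,
    }


# Made-up values for illustration only.
sample = [0, 5, 10, 20, 40, 80, 100, NULL_VALUE]
summary = summarize(sample)
```

Passing different bins, low, or high arguments corresponds to changing the number of intervals or the limiting values of the histogram display.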

The summary information and the histogram permit the user to see if the variable conforms to expectations and/or looks useful for the intended purposes. The histogram also shows whether the distribution of values is normal or highly skewed -- an important consideration for use in clustering exercises or off-line statistical analysis.

Below the histogram is an additional feature that permits the user to view the values as transformed by certain selected functions -- probably the most commonly used are the logarithmic (base 10 and natural) transforms, which can often be used to 'normalize' skewed distributions. This viewing option does not transform the actual data set -- that is done by the transform option discussed below.
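The effect of such a preview can be illustrated with a short Python sketch that compares the skewness of a variable before and after a base-10 log transform. The skewness function (the third standardized moment) and the sample values are assumptions made for illustration; the page itself shows the effect graphically rather than as a number.

```python
import math


def skewness(values):
    """Population skewness: the third central moment divided by std**3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, strongly right-skewed values (all positive, so the log is defined).
raw = [1, 2, 3, 5, 8, 15, 40, 120, 900]

# Preview only: a new list is built and the original values are left untouched.
logged = [math.log10(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
```

A skewness closer to zero after the transform suggests that the log transform is worth applying before clustering.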

Filtering (including, excluding, or modifying only specific values for a variable)

Why filter? There are a number of reasons why the user may wish to modify a dataset by “filtering” -- selecting only a certain range of values for one or more of the variables. Examples include:

  1. To apply climatic rather than geographic definitions of the region to be analyzed – for example, areas where the mean annual precipitation is at least 1000 mm, or where the mean monthly temperature never falls below zero.
  2. To eliminate ‘outlier’ values that dominate the clustering process but are not significant to the analysis of interest.
  3. To tailor a geographic region of analysis to an irregular shape.

The first two examples can conveniently be done either within the database (the on-line filter option) or off-line, in a dataset downloaded for modification and uploading. The geographic range definition is best done off-line at present.
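As an off-line illustration of the first two examples, the sketch below keeps only the cells whose values fall within a chosen range. The record layout, the field names precip_mm and min_month_temp, and the helper include_range are hypothetical; they do not reproduce the on-line operators.

```python
# Hypothetical records; the field names are illustrative, not the database's own.
cells = [
    {"cell_id": 1, "precip_mm": 1450.0, "min_month_temp": 4.2},
    {"cell_id": 2, "precip_mm": 620.0, "min_month_temp": 11.0},
    {"cell_id": 3, "precip_mm": 2100.0, "min_month_temp": -3.5},
    {"cell_id": 4, "precip_mm": 1000.0, "min_month_temp": 0.0},
]


def include_range(rows, field, low=None, high=None):
    """Keep rows whose value for `field` lies within [low, high] (inclusive)."""
    kept = []
    for row in rows:
        value = row[field]
        if low is not None and value < low:
            continue
        if high is not None and value > high:
            continue
        kept.append(row)
    return kept


# Climatic definition of the region: at least 1000 mm of precipitation ...
wet = include_range(cells, "precip_mm", low=1000.0)
# ... and a mean monthly temperature that never falls below zero.
wet_and_frost_free = include_range(wet, "min_month_temp", low=0.0)
kept_ids = [row["cell_id"] for row in wet_and_frost_free]
```

The same helper with a high argument removes outliers above a chosen cutoff.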

On-line filtering: After the variables have been selected, proceed to the variable review page, where a "Filter" button is available next to the listing of each variable. On the same line as the "Info" button is a dropdown menu labelled "operator." For a description of the available operators and their uses, click here.

Repeat the process for as many variables as desired (it is not necessary to filter all variables, but the data set will be treated as filtered if any component is).

This variable review page confirms the geographic range, cell type, and variables selected. It also offers a choice of "No Null" for each of the variables. If this box is checked, any cells that have no value associated with that variable (indicated by -9999 in the database entry) will be dropped from the final data set. In general, elimination of null values will make a more statistically satisfying cluster group, but at the expense of omitting parts of the geographic visualization.
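The effect of the "No Null" choice can be sketched in a few lines of Python. The variable names var_a and var_b and the helper drop_nulls are hypothetical; only the -9999 null marker comes from the description above.

```python
NULL_VALUE = -9999  # null marker used in the database entries


def drop_nulls(rows, no_null_fields):
    """Drop any row that has a null value in one of the checked variables."""
    return [
        row for row in rows
        if all(row[field] != NULL_VALUE for field in no_null_fields)
    ]


# Made-up cells for illustration.
rows = [
    {"cell_id": 1, "var_a": 3.2, "var_b": 10.0},
    {"cell_id": 2, "var_a": NULL_VALUE, "var_b": 12.0},
    {"cell_id": 3, "var_a": 1.1, "var_b": NULL_VALUE},
]

only_a = drop_nulls(rows, ["var_a"])            # "No Null" checked for var_a only
both = drop_nulls(rows, ["var_a", "var_b"])     # checked for both variables
```

Checking the box for more variables removes more cells, which is the trade-off between statistical completeness and geographic coverage noted above.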

Once you have made the decisions at this stage, proceed to the Generate Cluster Data step.

Off-line filtering: Although the data selection process provides basic filter capabilities, it will never be possible to provide every kind of tool that the advanced user might desire.  Fortunately, the LOICZVIEW capability to accept uploaded datasets, in combination with the database download option, permits the user to adjust data sets using relatively simple spreadsheet operations. At present, offline filtering is the only practical way to modify a geographic range to an irregular (non-rectangular) shape. The following example provides a procedural outline.

EXAMPLE:
Clustering of the Australia-New Zealand region yielded poor results for hydrologic variables when the standard geographic region selection (Zones 21 and 26) was used. This was because the rectangular lat-long boxes that include all of Australia and NZ also include portions of Indonesia and New Guinea, with a very different rainfall regime. Use of the coordinate selection boxes cannot solve the problem, because a rectangular box that includes all of Australia (south of 10 degrees S latitude) still clips enough of Indonesia to skew the data distribution.

The following procedure can be used to adjust the Australia-NZ geographic region:

  1. Select the desired variables and cell types for 10-47 S Lat, 110-180 E Long, and select View/Download when the data set is assembled. 
  2. Save the resulting file as a text file (e.g. select all, copy, paste into Notepad or a similar application).
  3. Open a spreadsheet and import the text file. For Excel, open a new worksheet, 'open file' (specify ".txt" for type) on the saved data file, and the Wizard should step through the choices for opening the comma-delimited files into a spreadsheet.
  4. Select the entire sheet, and 'sort data' by Latitude in ascending order, then by Longitude in ascending order.  This places the problem portion of the geographic range (the northwest corner) at the top of the spreadsheet.
  5. Identify from a separate source the latitude and longitude ranges to be excluded (in this case, 10-12 S between 110 and 127 E and between 145 and 155 E; to exclude New Caledonia as well, remove latitudes north of 25 S that lie east of 155 E longitude).
  6. Delete the rows with lat-long values in these ranges (for complex shapes, it may be easiest to re-sort the table to address different parts).
  7. Upload the edited database to LOICZVIEW for clustering, following the instructions at the example upload site or within LOICZVIEW.

Once a geographic template is constructed, it can be applied to future dataset downloads.
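For users who prefer a script to a spreadsheet, the row-deletion step of the procedure above can be sketched in Python as a reusable template. This is an illustration under stated assumptions, not a LOICZVIEW feature: the column names Latitude and Longitude, the use of negative numbers for south latitudes, the inclusive box edges, and the sample rows are all assumed and should be checked against the actual downloaded file.

```python
import csv
import io

# Exclusion boxes as (lat_min, lat_max, lon_min, lon_max), with south latitudes
# written as negative numbers. Sign convention and column names are assumptions.
EXCLUDE_BOXES = [
    (-12.0, -10.0, 110.0, 127.0),
    (-12.0, -10.0, 145.0, 155.0),
]


def in_any_box(lat, lon, boxes):
    """True if the point lies inside (or on the edge of) any exclusion box."""
    return any(
        lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
        for lat_min, lat_max, lon_min, lon_max in boxes
    )


def apply_template(csv_text, boxes, lat_col="Latitude", lon_col="Longitude"):
    """Return the comma-delimited text with all rows inside the boxes removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [
        row for row in reader
        if not in_any_box(float(row[lat_col]), float(row[lon_col]), boxes)
    ]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)
    return out.getvalue()


# Made-up download for illustration.
downloaded = (
    "Latitude,Longitude,Rainfall\n"
    "-10.5,120.5,2400\n"
    "-11.5,150.5,2100\n"
    "-20.5,130.5,350\n"
    "-41.5,174.5,1200\n"
)
edited = apply_template(downloaded, EXCLUDE_BOXES)
```

Keeping the list of boxes in one place makes it easy to reapply the same template to later downloads, or to add a further box such as one covering New Caledonia.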

Transforming variables (systematic mathematical modification of values)

Transforms allow you to modify values in order to reduce the effects of extreme variance or skewed distribution on the clustering. Transforms operate on the entire data set. The presently available transformations are log base 10, natural log, absolute value, and square root. Click here for more detailed descriptions/instructions.
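The four transformations can be reproduced off-line with the Python standard library, as in the sketch below. The short names used as dictionary keys and the choice to return None where a transform is undefined (for example, the log of zero or of a negative value) are assumptions; how the on-line tool handles such values is not described here.

```python
import math

# Short names are illustrative labels, not the tool's own operator names.
TRANSFORMS = {
    "log10": math.log10,
    "ln": math.log,
    "abs": abs,
    "sqrt": math.sqrt,
}


def transform(values, name):
    """Apply the named transform to every value; undefined results become None."""
    func = TRANSFORMS[name]
    result = []
    for v in values:
        try:
            result.append(func(v))
        except ValueError:
            # Raised for the log of zero or a negative value,
            # and for the square root of a negative value.
            result.append(None)
    return result


# Made-up values for illustration.
values = [100.0, 1.0, 0.0, -4.0]
as_log10 = transform(values, "log10")
as_sqrt = transform(values, "sqrt")
as_abs = transform(values, "abs")
```

Because a transform is applied to every value of the variable, it is worth checking the minimum value first to see whether any cells would be undefined under a log or square root transform.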

You may use the transform function to alter datasets after they are filtered or modified, but you cannot filter/modify datasets after they are transformed.

The “Info” button in the “Variable” name box permits you to view the original dataset characteristics and distribution, and to preview some of the filters or transforms you might desire, but it does not describe or view the actual filtered or transformed dataset.

Analyzing relationships (correlation matrices and scatter plots)

At the bottom of the VERIFY, EXAMINE, AND/OR MODIFY SELECTED VARIABLES page is a button labelled "compute correlation matrix" -- this calculates a pairwise correlation coefficient for each pair of variables selected, and displays the results in a matrix, along with a summary of the variables used and any modifications made to them with the 'exclude null' choice or any of the include/exclude, reset, or transform operators.
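A comparable matrix can be computed off-line from a downloaded data set. The sketch below assumes the Pearson product-moment coefficient, which the text above does not name explicitly, and uses made-up variable names and values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the coefficient for variables a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical variables, one value per cell.
data = {
    "rainfall": [100.0, 200.0, 300.0, 400.0],
    "runoff": [12.0, 22.0, 32.0, 42.0],   # increases linearly with rainfall
    "aridity": [4.0, 3.0, 2.0, 1.0],      # decreases as rainfall increases
}
matrix = correlation_matrix(data)
```

Cells with null values in either variable of a pair should be removed before calling pearson, since the -9999 marker would otherwise distort the coefficient.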

Within the matrix, the numerical values of the coefficients are presented as hypertext links. Clicking on these links will produce a scatterplot of the two variables represented.

At the bottom of the correlation matrix display page is a View or download variable correlation file link; clicking this produces a downloadable .csv file of the matrix and supporting information.