License Information

Use of this function requires a license for Whitebox Workflows for Python Professional (WbW-Pro). Please visit www.whiteboxgeo.com to purchase a license.

Description

This tool performs a support vector machine (SVM) binary classification using multiple predictor rasters (inputs), or features, and training data (training). SVMs are a common class of supervised learning algorithms widely applied in many problem domains. This tool can be used to model the spatial distribution of class data, such as land-cover type, soil class, or vegetation type. The training data take the form of an input vector Shapefile containing a set of points or polygons, for which the known class information is contained within a field (field) of the attribute table. Each grid cell defines a stack of feature values (one value for each input raster), which serves as a point within the multi-dimensional feature space. Note that the svm_regression tool can be used to apply the SVM method to the modelling of continuous data.

The user must specify the values of three parameters used in the development of the model: the c parameter (c), gamma (gamma), and the tolerance (tolerance). The c-value is the regularization parameter used in model optimization. The gamma parameter defines the radial basis function (Gaussian) kernel parameter. The tolerance parameter controls the stopping condition used during model optimization.
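As a rough illustration of the role of a kernel parameter such as gamma, the sketch below evaluates a Gaussian (radial basis function) kernel between two feature vectors. It assumes the common exp(-gamma * ||x - y||^2) convention; the tool's internal parameterization is not documented here and may differ. The function name rbf_kernel and the sample vectors are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel, assuming the exp(-gamma * ||x - y||^2) convention."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two hypothetical, already-scaled feature vectors (one value per predictor).
a = [0.2, 0.4, 0.1]
b = [0.3, 0.1, 0.5]

# Under this convention, larger gamma values make similarity fall off faster
# with distance, so each training point influences a smaller neighbourhood.
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(a, b, gamma))
```

Under this convention, identical vectors always have a kernel value of 1, and the value decays towards 0 as the vectors move apart.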

The tool splits the training data into two sets, one for training the classifier and one for testing the classification. These test data are used to calculate the overall accuracy and the Matthews correlation coefficient (MCC). The test_proportion parameter sets the proportion of the input training data used in model testing. For example, if test_proportion = 0.2, 20% of the training data will be set aside for testing, and this subset will be selected randomly. As a result of this random selection of test data, the tool behaves stochastically, and produces a different model each time it is run.
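The hold-out procedure and the two reported statistics can be sketched in plain Python as follows. This is a minimal illustration using the standard definitions of overall accuracy and MCC for a binary problem, not the tool's own code; the helper names split_samples and accuracy_and_mcc, and the seed argument, are hypothetical.

```python
import math
import random


def split_samples(samples, test_proportion, seed=None):
    """Randomly hold out a proportion of the samples for testing."""
    rng = random.Random(seed)  # seed=None gives a different split on each run
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_proportion)
    return shuffled[n_test:], shuffled[:n_test]  # (training, testing)


def accuracy_and_mcc(actual, predicted, positive=1):
    """Overall accuracy and Matthews correlation coefficient for two classes."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(actual)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denominator if denominator > 0 else 0.0
    return accuracy, mcc


training, testing = split_samples(range(100), test_proportion=0.2, seed=1)
print(len(training), len(testing))  # 80 20
print(accuracy_and_mcc([1, 1, 0, 0], [1, 0, 0, 1]))  # (0.5, 0.0)
```

MCC ranges from -1 to 1 and, unlike overall accuracy, stays near 0 for a classifier that does no better than chance on imbalanced classes.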

Note that the output image parameter (output) is optional. When unspecified, the tool will simply report the model accuracy statistics, allowing the user to experiment with different parameter settings and input predictor raster combinations to optimize the model before applying it to classify the whole image data set.

Like all supervised classification methods, this technique relies heavily on proper selection of training data. Training sites are exemplar areas/points of known and representative class value (e.g. land cover type). The algorithm determines the feature signatures of the pixels within each training area. In selecting training sites, care should be taken to ensure that they cover the full range of variability within each class; otherwise, classification accuracy will suffer. If possible, multiple training sites should be selected for each class. It is also advisable to avoid areas near the edges of class objects (e.g. land-cover patches), where mixed pixels may reduce the purity of training site values.

After selecting training sites, the feature value distributions of each class type can be assessed using the evaluate_training_sites tool. In particular, the distribution of class values should ideally be non-overlapping in at least one feature dimension.

The SVM algorithm is based on the calculation of distances in multi-dimensional space. Feature scaling is essential to the application of SVM-based modelling, especially when the ranges of the features differ, for example, when they are measured in different units. Without scaling, features with larger ranges will have greater influence in computing the distances between points. The tool offers three options for feature scaling (scaling): 'None', 'Normalize', and 'Standardize'. Normalization simply rescales each of the features onto a 0-1 range. This is a good option for most applications, but it is highly sensitive to outliers because the rescaling is determined by the minimum and maximum values. Standardization rescales predictors using their means and standard deviations, transforming the data into z-scores. This is a better option than normalization when you know that the data contain outlier values; however, it does assume that the feature data are somewhat normally distributed, or are at least symmetrical in distribution.
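The two rescaling options can be sketched as follows. This is an illustrative implementation of min-max normalization and z-score standardization for a single feature, not the tool's own code; in particular, the use of the population standard deviation is an assumption, and the function names and sample values are hypothetical.

```python
import statistics


def normalize(values):
    """Min-max rescaling of one feature onto the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score rescaling (population standard deviation assumed here)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# A hypothetical feature with one outlier (the last value).
feature = [100.0, 110.0, 120.0, 130.0, 1000.0]

# After normalization the four typical values are squeezed into a narrow band
# near 0, because the outlier alone sets the upper end of the range.
print(normalize(feature))
print(standardize(feature))
```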

Because the SVM algorithm calculates distances in feature-space, like many other related algorithms, it suffers from the curse of dimensionality. Distances become less meaningful in high-dimensional space because, as the number of dimensions grows, the distances between pairs of points become increasingly similar to one another and therefore discriminate less between near and far neighbours. As such, if the predictor list includes insignificant or highly correlated variables, it is advisable to exclude these features during the model-building phase, or to use a dimension reduction technique such as principal_component_analysis to transform the features into a smaller set of uncorrelated predictors.
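The loss of distance contrast in high-dimensional space can be demonstrated with a small experiment on uniformly random points. The sketch below is purely illustrative and unrelated to the tool's internals; the function name relative_contrast, the point count, and the seed are hypothetical choices.

```python
import math
import random


def relative_contrast(n_dims, n_points=200, seed=42):
    """(farthest - nearest) / nearest distance from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))  # Euclidean distance (Python 3.8+)
    return (max(distances) - min(distances)) / min(distances)


# The contrast between the nearest and farthest points shrinks as dimensions
# are added, which is the effect described above.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, relative_contrast(n_dims))
```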

Memory Usage

The peak memory usage of this tool is approximately 8 bytes per grid cell multiplied by the number of predictors.
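As a worked example of this estimate, the following sketch computes the approximate peak memory for a hypothetical raster stack. The function name and the raster dimensions are illustrative only.

```python
def estimate_peak_memory_bytes(rows, columns, n_predictors, bytes_per_value=8):
    """Approximate peak memory: 8 bytes per grid cell for each predictor."""
    return rows * columns * n_predictors * bytes_per_value


# A hypothetical 10,000 x 10,000 cell raster stack with 5 predictors.
n_bytes = estimate_peak_memory_bytes(10_000, 10_000, 5)
print(f"{n_bytes / 1e9:.1f} GB")  # 4.0 GB
```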

See Also

random_forest_classification, knn_classification, parallelepiped_classification, evaluate_training_sites, principal_component_analysis

Project Links

WbW Homepage, User Manual, Support WbW