Use of this function requires a license for Whitebox Workflows for Python Professional (WbW-Pro). Please visit www.whiteboxgeo.com to purchase a license.
This tool performs a supervised k-nearest neighbour (k-NN) classification using multiple predictor rasters (inputs), or features, and training data (training). It can be used to model the spatial distribution of class data, such as land-cover type, soil class, or vegetation type. The training data take the form of an input vector Shapefile containing a set of points or polygons, for which the known class information is contained within a field (field) of the attribute table. Each grid cell defines a stack of feature values (one value for each input raster), which serves as a point within the multi-dimensional feature space. The algorithm works by identifying a user-defined number (k, -k) of feature-space neighbours from the training set for each grid cell. The class assigned to the grid cell in the output raster (output) is then determined as the most common class among this set of neighbours. Note that the knn_regression tool can be used to apply the k-NN method to the modelling of continuous data.
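To make the neighbour-voting step concrete, the following is a minimal, hypothetical NumPy sketch of how a single grid cell could be labelled. It illustrates the k-NN principle only; the feature values, class labels, and function name are invented and this is not the tool's internal implementation.

```python
import numpy as np

# Hypothetical training set: each row is a feature vector (one value per predictor raster)
# with a known class label.
train_features = np.array([[0.2, 0.9], [0.3, 0.8], [0.7, 0.1], [0.8, 0.2]])
train_classes = np.array([1, 1, 2, 2])

def knn_label(cell_features, k=3):
    # Euclidean distance from the grid cell's feature vector to every training sample
    dists = np.linalg.norm(train_features - cell_features, axis=1)
    # Classes of the k nearest training samples
    nearest = train_classes[np.argsort(dists)[:k]]
    # Majority vote: the most common class among the neighbours
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

print(knn_label(np.array([0.25, 0.85])))  # -> 1
```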
The user has the option to clip the training set data (clip). When this option is selected, each training pixel whose k-NN-estimated class does not equal its known class is removed from the training set before proceeding with labelling all grid cells. This has the effect of removing outlier points within the training set and often improves the overall classification accuracy.
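The clipping step can be pictured as a leave-one-out check over the training samples. The sketch below is a hypothetical illustration of that idea (continuing the toy NumPy example above), not the exact procedure used by the tool.

```python
import numpy as np

def clip_training(features, classes, k=3):
    """Drop training samples whose k-NN estimate (excluding the sample itself)
    disagrees with their known class."""
    keep = []
    for i in range(len(features)):
        dists = np.linalg.norm(features - features[i], axis=1)
        # Nearest neighbours, excluding the sample itself
        neighbours = [j for j in np.argsort(dists) if j != i][:k]
        values, counts = np.unique(classes[neighbours], return_counts=True)
        if values[np.argmax(counts)] == classes[i]:
            keep.append(i)
    return features[keep], classes[keep]
```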
The tool splits the training data into two sets, one for training the classifier and one for testing the classification. These test data are used to calculate the overall accuracy and Cohen's kappa index of agreement, as well as to estimate the variable importance. The test_proportion parameter sets the proportion of the input training data reserved for model testing. For example, if test_proportion = 0.2, 20% of the training data will be randomly selected and set aside for testing. Because the test data are selected randomly, the tool behaves stochastically and will produce a different model each time it is run.
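The random hold-out can be sketched as follows; this is only a hypothetical illustration of the splitting logic, and none of the names below belong to the tool's API.

```python
import numpy as np

def split_training(n_samples, test_proportion=0.2, seed=None):
    # Randomly partition sample indices into training and test subsets
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(test_proportion * n_samples))
    return idx[n_test:], idx[:n_test]  # (training indices, test indices)

train_idx, test_idx = split_training(100, test_proportion=0.2)
print(len(train_idx), len(test_idx))  # 80 20
```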
Note that the output image parameter (output) is optional. When unspecified, the tool will simply report the model accuracy statistics and variable importance, allowing the user to experiment with different parameter settings and input predictor raster combinations to optimize the model before applying it to classify the whole image data set.
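For example, an exploratory run might look like the following hypothetical sketch. It assumes the whitebox_workflows WbEnvironment API (read_raster/read_vector), and the file names and class field name are placeholders.

```python
from whitebox_workflows import WbEnvironment

wbe = WbEnvironment()  # a WbW-Pro licence is required for this tool
predictors = [wbe.read_raster(f) for f in ["band1.tif", "band2.tif", "slope.tif"]]
training = wbe.read_vector("training_sites.shp")

# With create_output=False the tool only reports accuracy statistics and variable
# importance, which is useful for tuning k, the scaling method, and the predictor list.
wbe.knn_classification(predictors, training, class_field_name="CLASS",
                       scaling_method="normalize", k=5, test_proportion=0.2,
                       use_clipping=True, create_output=False)
```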
Like all supervised classification methods, this technique relies heavily on proper selection of training data. Training sites are exemplar areas/points of known and representative class value (e.g. land cover type). The algorithm determines the feature signatures of the pixels within each training area. In selecting training sites, care should be taken to ensure that they cover the full range of variability within each class. Otherwise the classification accuracy will be impacted. If possible, multiple training sites should be selected for each class. It is also advisable to avoid areas near the edges of class objects (e.g. land-cover patches), where mixed pixels may impact the purity of training site values.
After selecting training sites, the feature value distributions of each class type can be assessed using the evaluate_training_sites tool. In particular, the distribution of class values should ideally be non-overlapping in at least one feature dimension.
The k-NN algorithm is based on the calculation of distances in multi-dimensional space. Feature scaling is essential to the application of k-NN modelling, especially when the ranges of the features are different, for example, if they are measured in different units. Without scaling, features with larger ranges will have greater influence in computing the distances between points. The tool offers three options for feature-scaling (scaling): 'None', 'Normalize', and 'Standardize'. Normalization simply rescales each of the features onto a 0-1 range. This is a good option for most applications, but it is sensitive to outliers because the rescaling is determined by the minimum and maximum values. Standardization rescales predictors using their means and standard deviations, transforming the data into z-scores. This is a better option than normalization when you know that the data contain outlier values; however, it does assume that the feature data are approximately normally distributed, or at least symmetrical in distribution.
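The two rescaling options correspond to the familiar formulas below, shown here as a short NumPy illustration (the sample values are hypothetical).

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])  # one feature band with an outlier (100.0)

normalized = (x - x.min()) / (x.max() - x.min())   # 'Normalize': rescale onto 0-1
standardized = (x - x.mean()) / x.std()            # 'Standardize': z-scores

print(normalized)    # the outlier compresses the remaining values toward 0
print(standardized)
```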
Because the k-NN algorithm calculates distances in feature space, like many other related algorithms, it suffers from the curse of dimensionality. In high-dimensional spaces, distances between points tend to become increasingly similar and therefore less discriminating. As such, if the predictor list includes insignificant or highly correlated variables, it is advisable to exclude these features during the model-building phase, or to use a dimension reduction technique such as principal_component_analysis to transform the features into a smaller set of uncorrelated predictors.
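One simple way to screen for redundant predictors before model building is to inspect pairwise correlations on a sample of cell values. The sketch below is purely illustrative; the predictor values are synthetic and the 0.95 threshold is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(size=(1000, 4))              # sampled cell values: (n_cells, n_predictors)
sample[:, 3] = 0.98 * sample[:, 0] + 0.02 * rng.normal(size=1000)  # predictor 3 nearly duplicates 0

corr = np.corrcoef(sample, rowvar=False)
redundant = [(i, j) for i in range(4) for j in range(i + 1, 4) if abs(corr[i, j]) > 0.95]
print(redundant)  # e.g. [(0, 3)] -> consider dropping one of these predictors
```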
For a video tutorial on how to use the knn_classification tool, see this YouTube video.
The peak memory usage of this tool is approximately 8 bytes per grid cell × # predictors.
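As a worked example of this estimate (the raster dimensions are hypothetical):

```python
rows, cols, n_predictors = 10_000, 10_000, 6
approx_bytes = rows * cols * 8 * n_predictors
print(f"~{approx_bytes / 1024**3:.1f} GiB")  # ~4.5 GiB peak memory
```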
knn_regression, random_forest_classification, svm_classification, parallelepiped_classification, evaluate_training_sites
def knn_classification(self, input_rasters: List[Raster], training_data: Vector, class_field_name: str, scaling_method: str = "none", k: int = 5, test_proportion: float = 0.2, use_clipping: bool = False, create_output: bool = False) -> Optional[Raster]: ...
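A complete classification run, once the parameters are settled, might look like the following hypothetical sketch; it again assumes the WbEnvironment API (read_raster, read_vector, write_raster) and uses placeholder file names.

```python
from whitebox_workflows import WbEnvironment

wbe = WbEnvironment()
predictors = [wbe.read_raster(f) for f in ["band1.tif", "band2.tif", "slope.tif"]]
training = wbe.read_vector("training_sites.shp")

# create_output=True returns the classified raster, which can then be saved to disk.
classified = wbe.knn_classification(predictors, training, class_field_name="CLASS",
                                    scaling_method="normalize", k=5,
                                    test_proportion=0.2, use_clipping=True,
                                    create_output=True)
wbe.write_raster(classified, "landcover_knn.tif")
```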