Use of this function requires a license for Whitebox Workflows for Python Professional (WbW-Pro). Please visit www.whiteboxgeo.com to purchase a license.
This tool performs a supervised random forest (RF) classification using multiple predictor rasters (`input_rasters`), or features, and training data (`training_data`). It can be used to model the spatial distribution of class data, such as land-cover type, soil class, or vegetation type. The training data take the form of an input vector Shapefile containing a set of points or polygons for which the known class information is contained within a field (`class_field_name`) of the attribute table. Each grid cell defines a stack of feature values (one value for each input raster), which serves as a point within the multi-dimensional feature space. Random forest is an ensemble learning method that works by creating a large number (`n_trees`) of decision trees and using a majority vote to determine estimated class values. Individual trees are created using a random sub-set of predictors. This ensemble approach overcomes the tendency of individual decision trees to overfit the training data. As such, the RF method is a widely and successfully applied machine-learning method in many domains. Note that the `random_forest_regression` tool can be used to apply the RF method to the modelling of continuous data.
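As a minimal sketch of how the tool might be invoked (the file names and the `'CLASS'` field name are hypothetical; the call itself follows the function signature given at the end of this page):

```python
import whitebox_workflows as wbw

# A WbW-Pro license is required for this tool.
wbe = wbw.WbEnvironment()

# Hypothetical predictor rasters (features) and training-site vector.
predictors = [
    wbe.read_raster('band_red.tif'),
    wbe.read_raster('band_nir.tif'),
    wbe.read_raster('dem.tif'),
]
training = wbe.read_vector('training_sites.shp')

# 'CLASS' is assumed to be the attribute field holding the known class values.
classified = wbe.random_forest_classification(
    predictors, training, 'CLASS', n_trees=500, create_output=True
)
if classified is not None:
    wbe.write_raster(classified, 'rf_landcover.tif')
```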
The user must specify the splitting criterion (`split_criterion`) used in training the decision trees. Options for this parameter include 'Gini', 'Entropy', and 'ClassificationError'. The model can also be adjusted through the number of trees (`n_trees`), the minimum number of samples required to be at a leaf node (`min_samples_leaf`), and the minimum number of samples required to split an internal node (`min_samples_split`).
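For intuition, the three criteria are standard node-impurity measures. The following is an illustrative sketch of what each one scores (it is not the tool's internal code):

```python
import math
from collections import Counter
from typing import Sequence

def impurity(labels: Sequence[str], criterion: str = 'gini') -> float:
    """Score the class mixture at a decision-tree node (lower is purer)."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    if criterion == 'gini':
        return 1.0 - sum(p * p for p in proportions)        # Gini impurity
    if criterion == 'entropy':
        return -sum(p * math.log2(p) for p in proportions)  # Shannon entropy
    if criterion == 'classification_error':
        return 1.0 - max(proportions)                       # misclassification rate
    raise ValueError(criterion)

# A node holding 6 'forest' and 2 'water' training samples:
node = ['forest'] * 6 + ['water'] * 2
print(impurity(node, 'gini'))     # 0.375
print(impurity(node, 'entropy'))  # ~0.811
```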
The tool splits the training data into two sets, one for training the classifier and one for testing the model. These test data are used to calculate the overall accuracy and Cohen's kappa index of agreement, as well as to estimate the variable importance. The `test_proportion` parameter sets the proportion of the input training data used in model testing. For example, if `test_proportion` = 0.2, 20% of the training data will be randomly set aside for testing. As a result of this random selection of test data, and of the random selection of features used in decision-tree creation, the tool is inherently stochastic and will produce a different model each time it is run.
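Cohen's kappa corrects the overall accuracy for the agreement expected by chance. A brief sketch of how it is computed from a confusion matrix (illustrative only; the matrix values are made up):

```python
import numpy as np

def cohens_kappa(confusion: np.ndarray) -> float:
    """kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the marginals."""
    total = confusion.sum()
    p_o = np.trace(confusion) / total
    p_e = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total**2
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical 3-class test results (rows: reference, cols: predicted):
cm = np.array([[41, 3, 1],
               [4, 37, 2],
               [0, 5, 27]])
print(round(cohens_kappa(cm), 3))  # ~0.81
```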
Note that the creation of an output image (`create_output`) is optional. When `create_output` is False, the tool will simply report the model accuracy statistics and variable importance, allowing the user to experiment with different parameter settings and input predictor raster combinations to optimize the model before applying it to classify the whole image data set.
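For example, continuing the earlier sketch, a hypothetical tuning workflow might run the tool several times without producing an output raster, compare the reported accuracies, and then classify the full image with the settings judged best:

```python
# Trial runs: accuracy and variable importance are reported, and no
# output raster is produced (create_output=False).
for n_trees in (100, 250, 500, 1000):
    wbe.random_forest_classification(
        predictors, training, 'CLASS', n_trees=n_trees, create_output=False
    )

# Final run using the settings chosen from the trial output.
classified = wbe.random_forest_classification(
    predictors, training, 'CLASS', n_trees=500, create_output=True
)
```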
Like all supervised classification methods, this technique relies heavily on the proper selection of training data. Training sites are exemplar areas/points of known and representative class value (e.g. land-cover type). The algorithm determines the feature signatures of the pixels within each training area. In selecting training sites, care should be taken to ensure that they cover the full range of variability within each class; otherwise, the classification accuracy will be reduced. If possible, multiple training sites should be selected for each class. It is also advisable to avoid areas near the edges of class objects (e.g. land-cover patches), where mixed pixels may degrade the purity of training-site values.
After selecting training sites, the feature-value distributions of each class type can be assessed using the `evaluate_training_sites` tool. In particular, the distributions of class values should ideally be non-overlapping in at least one feature dimension.
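A rough overlap check can also be scripted directly. This sketch assumes the training-pixel values have already been extracted into NumPy arrays (the values shown are made up):

```python
import numpy as np

# Assumed inputs: one row per training pixel; 'features' has one column
# per predictor raster, 'classes' holds each pixel's class label.
features = np.array([[0.12, 0.55], [0.15, 0.60], [0.48, 0.20], [0.52, 0.18]])
classes = np.array(['water', 'water', 'forest', 'forest'])

# Per-class range in each feature dimension; classes whose ranges do not
# overlap in at least one dimension should separate cleanly.
for c in np.unique(classes):
    values = features[classes == c]
    print(c, values.min(axis=0), values.max(axis=0))
```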
RF, like decision trees, does not require feature scaling. That is, unlike the k-NN algorithm and other methods based on the calculation of distances in multi-dimensional space, there is no need to rescale the predictors onto a common scale prior to RF analysis. Because individual trees do not use the full set of predictors, RF is also more robust against the curse of dimensionality than many other machine-learning methods. Nonetheless, there is still debate about whether it is advisable to use a large number of predictors with RF analysis, and it may be better to exclude predictors that are highly correlated with others, or that do not contribute significantly to the model, during the model-building phase. A dimension-reduction technique such as `principal_component_analysis` can be used to transform the features into a smaller set of uncorrelated predictors.
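One way to screen for highly correlated predictors before training is a pairwise correlation check. This is a generic sketch; it assumes the predictor values have already been sampled into 1-D NumPy arrays over the same valid cells (random values stand in for real data):

```python
import numpy as np

# Assumed: each predictor flattened to an array of valid cell values.
predictors_np = {
    'red': np.random.rand(10_000),
    'nir': np.random.rand(10_000),
    'dem': np.random.rand(10_000),
}

names = list(predictors_np)
corr = np.corrcoef([predictors_np[n] for n in names])
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > 0.95:  # candidate for removal
            print(f'{names[i]} vs {names[j]}: r = {corr[i, j]:.2f}')
```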
The peak memory usage of this tool is approximately 8 bytes per grid cell per predictor raster (i.e. 8 × rows × columns × number of predictors bytes). For example, ten predictors of 10,000 × 10,000 cells each require roughly 8 GB at peak.
See also: `random_forest_regression`, `knn_classification`, `svm_classification`, `parallelepiped_classification`, `evaluate_training_sites`, `principal_component_analysis`
```python
def random_forest_classification(
    self,
    input_rasters: List[Raster],
    training_data: Vector,
    class_field_name: str,
    split_criterion: str = "gini",
    n_trees: int = 500,
    min_samples_leaf: int = 1,
    min_samples_split: int = 2,
    test_proportion: float = 0.2,
    create_output: bool = False,
) -> Optional[Raster]: ...
```