Use of this function requires a license for Whitebox Workflows for Python Professional (WbW-Pro). Please visit www.whiteboxgeo.com to purchase a license.
This tool performs a supervised random forest (RF) classification using multiple predictor rasters (inputs), or features, and training data (training). It can be used to model the spatial distribution of class data, such as land-cover type, soil class, or vegetation type. The training data take the form of an input vector Shapefile containing a set of points or polygons, for which the known class information is contained within a field (class_field_name) of the attribute table. Each grid cell defines a stack of feature values (one value for each input raster), which serves as a point within the multi-dimensional feature space. Random forest is an ensemble learning method that works by creating a large number (n_trees) of decision trees and using a majority vote to determine estimated class values. Individual trees are created using a random subset of the predictors. This ensemble approach overcomes the tendency of individual decision trees to overfit the training data. As such, the RF method is a widely and successfully applied machine-learning method in many domains.
Note that this function is part of a set of two tools, including random_forest_classification_fit and random_forest_classification_predict. The random_forest_classification_fit tool should be used first to create the RF model, and random_forest_classification_predict can then be used to apply that model for prediction. The output of the fit tool is a byte array that is a binary representation of the RF model. This model can then be used as the input to the predict tool, along with a list of input raster predictors, which must be in the same order as those used in the fit tool. The output of the predict tool is a classified raster. The workflow is split in this way because you often need to experiment with various input predictor sets and parameter values to create an adequate model. There is no need to generate an output classified raster during this experimentation stage, and because prediction can often be the slowest part of the RF modelling process, it is generally only performed after the final model has been identified. The binary representation of the RF-based model can be serialized (i.e., saved to a file) and then later read back into memory to serve as the input for the prediction step of the workflow (see code example below).
Also note that this tool is for RF-based classification. There is a similar set of fit and predict tools available for performing RF-based regression, including random_forest_regression_fit and random_forest_regression_predict. These tools are more appropriately applied to the modelling of continuous data, rather than categorical data.
The user must specify the splitting criterion (split_criterion) used in training the decision trees. Options for this parameter include 'Gini', 'Entropy', and 'ClassificationError'. The model can also be adjusted through the number of trees (n_trees), the minimum number of samples required to be at a leaf node (min_samples_leaf), and the minimum number of samples required to split an internal node (min_samples_split) parameters.
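The three split_criterion options correspond to standard node-impurity measures. The following sketch computes each measure for a hypothetical set of class proportions at a tree node; it is an illustration of the underlying formulas, not the tool's internal code:

```python
import math

def gini(proportions):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy: -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

def classification_error(proportions):
    """Misclassification error: 1 - max(p_i)."""
    return 1.0 - max(proportions)

p = [0.5, 0.25, 0.25]  # hypothetical class proportions at a node
print(gini(p))                  # 0.625
print(entropy(p))               # 1.5
print(classification_error(p))  # 0.5
```

A pure node (all samples of one class) scores zero under all three measures; candidate splits that most reduce the chosen measure are preferred during training.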
The tool splits the training data into two sets, one for training the classifier and one for testing the model. These test data are used to calculate the overall accuracy and Cohen's kappa index of agreement, as well as to estimate the variable importance. The test_proportion parameter is used to set the proportion of the input training data used in model testing. For example, if test_proportion = 0.2, 20% of the training data will be set aside for testing, and this subset will be selected randomly. As a result of this random selection of test data, and of the random selection of features used in decision tree creation, the tool is inherently stochastic and will produce a different model each time it is run.
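The random hold-out behaviour can be sketched as follows. This is a simplified illustration of the idea, not the tool's actual splitting code; the sample values and seed are hypothetical:

```python
import random

def train_test_split(samples, test_proportion=0.2, seed=None):
    """Randomly partition samples into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_proportion))
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(100))  # hypothetical training samples
train, test = train_test_split(samples, test_proportion=0.2, seed=42)
print(len(train), len(test))  # 80 20
```

Without a fixed seed the split differs on every run, which mirrors the stochastic behaviour of the tool described above.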
Like all supervised classification methods, this technique relies heavily on proper selection of training data. Training sites are exemplar areas/points of known and representative class value (e.g. land cover type). The algorithm determines the feature signatures of the pixels within each training area. In selecting training sites, care should be taken to ensure that they cover the full range of variability within each class. Otherwise the classification accuracy will be impacted. If possible, multiple training sites should be selected for each class. It is also advisable to avoid areas near the edges of class objects (e.g. land-cover patches), where mixed pixels may impact the purity of training site values.
After selecting training sites, the feature value distributions of each class type can be assessed using the evaluate_training_sites tool. In particular, the distribution of class values should ideally be non-overlapping in at least one feature dimension.
RF, like decision trees, does not require feature scaling. That is, unlike the k-NN algorithm and other methods that are based on the calculation of distances in multi-dimensional space, there is no need to rescale the predictors onto a common scale prior to RF analysis. Because individual trees do not use the full set of predictors, RF is also more robust against the curse of dimensionality than many other machine learning methods. Nonetheless, there is still debate about whether or not it is advisable to use a large number of predictors with RF analysis and it may be better to exclude predictors that are highly correlated with others, or that do not contribute significantly to the model during the model-building phase. A dimension reduction technique such as principal_component_analysis can be used to transform the features into a smaller set of uncorrelated predictors.
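If predictors are highly correlated, a PCA-style transform can replace them with a smaller set of uncorrelated components. The sketch below shows the standard eigendecomposition approach on a synthetic feature matrix; it is an illustration of the technique, not the principal_component_analysis tool itself, and the data are hypothetical:

```python
import numpy as np

def pca_transform(X, n_components):
    # Centre the features, then project onto the top eigenvectors
    # of the covariance matrix (components with the largest variance).
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:n_components]]
    return Xc @ components

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))  # hypothetical: 500 cells, 6 predictors
X[:, 3] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=500)  # a correlated band
scores = pca_transform(X, n_components=4)
print(scores.shape)  # (500, 4)
```

The resulting component scores are mutually uncorrelated and could serve as a reduced predictor set for the RF model.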
import os
from whitebox_workflows import WbEnvironment

license_id = 'floating-license-id'
wbe = WbEnvironment(license_id)
try:
    wbe.verbose = True
    wbe.working_directory = "/path/to/data"

    # Read the input raster files into memory
    images = wbe.read_rasters(
        'LC09_L1TP_018030_20220614_20220615_02_T1_B2.TIF',
        'LC09_L1TP_018030_20220614_20220615_02_T1_B3.TIF',
        'LC09_L1TP_018030_20220614_20220615_02_T1_B4.TIF',
        'LC09_L1TP_018030_20220614_20220615_02_T1_B5.TIF'
    )

    # Read the input training polygons into memory
    training_data = wbe.read_vector('training_data.shp')

    # Train the model
    model = wbe.random_forest_classification_fit(
        images,
        training_data,
        class_field_name = 'CLASS',
        split_criterion = "Gini",
        n_trees = 50,
        min_samples_leaf = 1,
        min_samples_split = 2,
        test_proportion = 0.2
    )

    # Example of how to serialize the model, i.e., save the model, which is just binary data
    print('Saving the model to file...')
    file_path = os.path.join(wbe.working_directory, "rf_model.bin")
    with open(file_path, "wb") as file:
        file.write(bytearray(model))

    # Example of how to deserialize the model, i.e., read the model back into memory
    model = []
    with open(file_path, mode='rb') as file:
        model = list(file.read())

    # Use the model to predict
    rf_class_image = wbe.random_forest_classification_predict(images, model)
    wbe.write_raster(rf_class_image, 'rf_classification.tif', compress=True)
    print('All done!')
except Exception as e:
    print("The error raised is: ", e)
finally:
    wbe.check_in_license(license_id)
random_forest_classification_predict, random_forest_regression_fit, random_forest_regression_predict, knn_classification, svm_classification, parallelepiped_classification, evaluate_training_sites
def random_forest_classification_fit(self, input_rasters: List[Raster], training_data: Vector, class_field_name: str, split_criterion: str = "gini", n_trees: int = 500, min_samples_leaf: int = 1, min_samples_split: int = 2, test_proportion: float = 0.2) -> List[int]: ...