Use of this function requires a license for Whitebox Workflows for Python Professional (WbW-Pro). Please visit www.whiteboxgeo.com to purchase a license.
This tool performs a supervised random forest (RF) regression analysis using multiple predictor rasters (inputs
), or features, and training data (training
). It can be used to model the spatial distribution of continuous data, such as soil properties (e.g. percent sand/silt/clay). The training data take the form of an input vector Shapefile containing a set of points, for which the known outcome information is contained within a field (field
) of the attribute table. Each grid cell defines a stack of feature values (one value for each input raster), which serves as a point within the multi-dimensional feature space. Random forest is an ensemble learning method that works by creating a large number (n_trees
) of decision trees and using an averaging of each tree to determine estimated outcome values. Individual trees are created using a random sub-set of predictors. This ensemble approach overcomes the tendency of individual decision trees to overfit the training data. As such, the RF method is a widely and successfully applied machine-learning method in many domains. Note that the random_forest_classification tool can be used to apply the RF method to the modelling of categorical (class) data.
Users must specify the number of trees (n_trees
), the minimum number of samples required to be at a leaf node (min_samples_leaf
), and the minimum number of samples required to split an internal node (min_samples_split
) parameters, which determine the characteristics of the resulting model.
The tool splits the training data into two sets, one for training the model and one for testing the prediction. These test data are used to calculate the regression accuracy statistics, as well as to estimate the variable importance. The test_proportion
parameter is used to set the proportion of the input training data used in model testing. For example, if test_proportion = 0.2
, 20% of the training data will be set aside for testing, and this subset will be selected randomly. As a result of this random selection of test data, as well as the randomness involved in establishing the individual decision trees, the tool in inherently stochastic, and will result in a different model each time it is run.
Note that the output image parameter (output
) is optional. When unspecified, the tool will simply report the model accuracy statistics and variable importance, allowing the user to experiment with different parameter settings and input predictor raster combinations to optimize the model before applying it to model the outcome variable across the whole region defined by image data set.
RF, like decision trees, does not require feature scaling. That is, unlike the k-NN algorithm and other methods that are based on the calculation of distances in multi-dimensional space, there is no need to rescale the predictors onto a common scale prior to RF analysis. Because individual trees do not use the full set of predictors, RF is also more robust against the curse of dimensionality than many other machine learning methods. Nonetheless, there is still debate about whether or not it is advisable to use a large number of predictors with RF analysis and it may be better to exclude predictors that are highly correlated with others, or that do not contribute significantly to the model during the model-building phase. A dimension reduction technique such as principal_component_analysis can be used to transform the features into a smaller set of uncorrelated predictors.
For a video tutorial on how to use the random_forest_regression tool, see this YouTube video.
The peak memory usage of this tool is approximately 8 bytes per grid cell × # predictors.
random_forest_classification, knn_regression, svm_regression, principal_component_analysis