Use of this function requires a license for Whitebox Workflows for Python Professional (WbW-Pro). Please visit www.whiteboxgeo.com to purchase a license.
This function performs a supervised random forest (RF) regression analysis using multiple predictor rasters (input_rasters), or features, and training data (training_data). The training data take the form of an input vector Shapefile containing a set of points, for which the known outcome information is contained within a field (field_name) of the attribute table. Each grid cell defines a stack of feature values (one value for each input raster), which serves as a point within the multi-dimensional feature space.
Note that this function is part of a set of two tools, including random_forest_regression_fit and random_forest_regression_predict. The random_forest_regression_fit tool should be used first to create the RF model, and random_forest_regression_predict can then be used to apply that model for prediction. The output of the fit tool is a byte array that is a binary representation of the RF model. This model can then be used as the input to the predict tool, along with a list of input raster predictors, which must be in the same order as those used in the fit tool. The output of the predict tool is a continuous raster. The reason that the RF workflow is split in this way is that you often need to experiment with various input predictor sets and parameter values to create an adequate model, and there is no need to generate an output raster during this experimentation stage. Because prediction can often be the slowest part of the RF modelling process, it is generally only performed after the final model has been identified. The binary representation of the RF-based model can be serialized (i.e., saved to a file) and then later read back into memory to serve as the input for the prediction step of the workflow (see code example below).
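As a minimal, illustrative sketch of this split workflow, the fitted model (which is simply a list of bytes) might be saved in one session and then deserialized and applied in a later session. The file names, paths, and licence ID below are reused from the full example later on this page and are assumptions, not requirements:

import os
from whitebox_workflows import WbEnvironment

license_id = 'floating-license-id'
wbe = WbEnvironment(license_id)
try:
    wbe.working_directory = "/path/to/data"

    # Deserialize the previously saved model into the list-of-bytes form
    # expected by the predict tool
    model_file = os.path.join(wbe.working_directory, "rf_model.bin")
    with open(model_file, mode='rb') as f:
        model = list(f.read())

    # The predictor rasters must be read in the same order used when fitting
    images = wbe.read_rasters('DEV.tif', 'profile_curv.tif', 'tan_curv.tif', 'slope.tif')

    # Apply the saved model to produce the continuous output raster
    rf_image = wbe.random_forest_regression_predict(images, model)
    wbe.write_raster(rf_image, 'rf_regression.tif', compress=True)
finally:
    wbe.check_in_license(license_id)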
Also note that this tool is for RF-based regression analysis. There is a similar set of fit and predict tools available for performing RF-based classification, including random_forest_classification_fit and random_forest_classification_predict. Those tools are more appropriately applied to the modelling of categorical data, rather than continuous data.
Note: it is very important that the order of feature rasters is the same for both fitting the model and using the model for prediction. It is possible to use a model fitted to one data set to make predictions for another data set; however, the set of feature rasters specified to the prediction tool must be input in the same sequence used for building the model. For example, one may train an RF regressor on one set of land-surface parameters and then apply that model to predict the spatial distribution of a soil property on a land-surface parameter stack derived for a different landscape, but the predictor raster sequence must be the same for the fit and predict tools, otherwise inaccurate predictions will result.
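The following sketch illustrates this point, assuming a WbEnvironment has already been set up as wbe as in the full example below; the '_area2' raster file names for the second study area are hypothetical placeholders, and the only hard requirement is that both raster lists are supplied in the same predictor order:

# Fit the model using the first study area's predictor stack
fit_rasters = wbe.read_rasters('DEV.tif', 'profile_curv.tif', 'tan_curv.tif', 'slope.tif')
training_data = wbe.read_vector('Ottawa_soils_data.shp')
model = wbe.random_forest_regression_fit(fit_rasters, training_data, field_name='Sand')

# Predict on a second landscape: same parameter types, same order,
# but rasters derived for the new study area (hypothetical file names)
new_rasters = wbe.read_rasters('DEV_area2.tif', 'profile_curv_area2.tif',
    'tan_curv_area2.tif', 'slope_area2.tif')
prediction = wbe.random_forest_regression_predict(new_rasters, model)
wbe.write_raster(prediction, 'rf_prediction_area2.tif', compress=True)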
Random forest is an ensemble learning method that works by creating a large number (n_trees) of decision trees and averaging the predictions of the individual trees to determine estimated outcome values. Individual trees are created using a random sub-set of predictors. This ensemble approach overcomes the tendency of individual decision trees to overfit the training data. As such, the RF method is a widely and successfully applied machine-learning method in many domains.
Users must specify the number of trees (n_trees), the minimum number of samples required to be at a leaf node (min_samples_leaf), and the minimum number of samples required to split an internal node (min_samples_split); these parameters determine the characteristics of the resulting model.
The function splits the training data into two sets, one for training the model and one for testing the prediction. These test data are used to calculate the regression accuracy statistics, as well as to estimate the variable importance. The test_proportion parameter is used to set the proportion of the input training data used in model testing. For example, if test_proportion = 0.2, 20% of the training data will be set aside for testing, and this subset will be selected randomly. As a result of this random selection of test data, as well as the randomness involved in establishing the individual decision trees, the tool is inherently stochastic and will result in a different model each time it is run.
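Because of this stochastic behaviour, it can be worth refitting the model a few times with identical parameters to get a sense of how stable the reported accuracy statistics are before settling on a final model. A minimal sketch, again assuming wbe, images, and training_data are set up as in the full example below:

# Sketch only: repeat the fit with identical settings; differences between runs
# reflect the random test split (test_proportion) and random tree construction.
for run in range(3):
    print(f'Fit number {run + 1}:')
    model = wbe.random_forest_regression_fit(
        images,
        training_data,
        field_name = 'Sand',
        n_trees = 500,
        test_proportion = 0.2
    )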
RF, like decision trees, does not require feature scaling. That is, unlike the k-NN algorithm and other methods that are based on the calculation of distances in multi-dimensional space, there is no need to rescale the predictors onto a common scale prior to RF analysis. Because individual trees do not use the full set of predictors, RF is also more robust against the curse of dimensionality than many other machine-learning methods. Nonetheless, there is still debate about whether or not it is advisable to use a large number of predictors with RF analysis, and it may be better to exclude predictors that are highly correlated with others, or that do not contribute significantly to the model, during the model-building phase. A dimension reduction technique such as principal_component_analysis can be used to transform the features into a smaller set of uncorrelated predictors.
For a video tutorial on how to use the RandomForestRegression tool, see this YouTube video.
import os
from whitebox_workflows import WbEnvironment

license_id = 'floating-license-id'
wbe = WbEnvironment(license_id)
try:
    wbe.verbose = True
    wbe.working_directory = "/path/to/data"

    # Read the input raster files into memory
    images = wbe.read_rasters('DEV.tif', 'profile_curv.tif', 'tan_curv.tif', 'slope.tif')

    # Read the input training points into memory
    training_data = wbe.read_vector('Ottawa_soils_data.shp')

    # Train the model
    model = wbe.random_forest_regression_fit(
        images,
        training_data,
        field_name = 'Sand',
        n_trees = 50,
        min_samples_leaf = 1,
        min_samples_split = 2,
        test_proportion = 0.2
    )

    # Example of how to serialize the model, i.e., save the model, which is just binary data
    print('Saving the model to file...')
    file_path = os.path.join(wbe.working_directory, "rf_model.bin")
    with open(file_path, "wb") as file:
        file.write(bytearray(model))

    # Example of how to deserialize the model, i.e., read the model back into memory
    model = []
    with open(file_path, mode='rb') as file:
        model = list(file.read())

    # Use the model to predict
    rf_image = wbe.random_forest_regression_predict(images, model)
    wbe.write_raster(rf_image, 'rf_regression.tif', compress=True)

    print('All done!')
except Exception as e:
    print("The error raised is: ", e)
finally:
    wbe.check_in_license(license_id)
See also: random_forest_regression_predict, random_forest_classification_fit, random_forest_classification_predict, knn_classification, svm_classification, parallelepiped_classification, evaluate_training_sites
def random_forest_regression_fit(self, input_rasters: List[Raster], training_data: Vector, field_name: str, n_trees: int = 500, min_samples_leaf: int = 1, min_samples_split: int = 2, test_proportion: float = 0.2) -> List[int]: ...