Use of this function requires a license for Whitebox Workflows for Python Professional (WbW-Pro). Please visit www.whiteboxgeo.com to purchase a license.
This tool performs an unsupervised DBSCAN clustering operation, based on a series of input rasters (inputs
). Each grid cell defines a stack of feature values (one value for each input raster), which serves as a point within the multi-dimensional feature space. The DBSCAN algorithm identifies clusters in feature space by identifying regions of high density (core points) and the set of points connected to these high-density areas. Points in feature space that are not connected to high-density regions are labeled by the DBSCAN algorithm as 'noise' and the associated grid cell in the output raster (output
) is assigned the nodata value. Areas of high density (i.e. core points) are defined as those points for which the number of neighbouring points within a search distance (search_dist
) is greater than some user-defined minimum threshold (min_points
).
The main advantages of the DBSCAN algorithm over other clustering methods, such as k-means (k_means_clustering), is that 1) you do not need to specify the number of clusters a priori, and 2) that the method does not make assumptions about the shape of the cluster (spherical in the k-means method). However, DBSCAN does assume that the density of every cluster in the data is approximately equal, which may not be a valid assumption. DBSCAN may also produce unsatisfactory results if there is significant overlap among clusters, as it will aggregate the clusters. Finding search distance and minimum core-point density thresholds that apply globally to the entire data set may be very challenging or impossible for certain applications.
The DBSCAN algorithm is based on the calculation of distances in multi-dimensional space. Feature scaling is essential to the application of DBSCAN clustering, especially when the ranges of the features are different, for example, if they are measured in different units. Without scaling, features with larger ranges will have greater influence in computing the distances between points. The tool offers three options for feature-scaling (scaling
), including 'None', 'Normalize', and 'Standardize'. Normalization simply rescales each of the features onto a 0-1 range. This is a good option for most applications, but it is highly sensitive to outliers because it is determined by the range of the minimum and maximum values. Standardization rescales predictors using their means and standard deviations, transforming the data into z-scores. This is a better option than normalization when you know that the data contain outlier values; however, it does does assume that the feature data are somewhat normally distributed, or are at least symmetrical in distribution.
One should keep the impact of feature scaling in mind when setting the search_dist
parameter. For example, if applying normalization, the entire range of values for each dimension of feature space will be bound within the 0-1 range, meaning that the search distance should be smaller than 1.0, and likely significantly smaller. If standardization is used instead, features space is technically infinite, although the vast majority of the data are likely to be contained within the range -2.5 to 2.5.
Because the DBSCAN algorithm calculates distances in feature-space, like many other related algorithms, it suffers from the curse of dimensionality. Distances become less meaningful in high-dimensional space because the vastness of these spaces means that distances between points are less significant (more similar). As such, if the predictor list includes insignificant or highly correlated variables, it is advisable to exclude these features during the model-building phase, or to use a dimension reduction technique such as principal_component_analysis to transform the features into a smaller set of uncorrelated predictors.
The peak memory usage of this tool is approximately 8 bytes per grid cell × # predictors.
k_means_clustering, modified_k_means_clustering, principal_component_analysis