This is an R Markdown for implementation of an extreme gradient boosting technique to a big data set for classifying roadside work zones based on their collision risk. The data set is already prepared and cleaned up. The full code implementatio can be found in Classification_xgboost.R file in this repository. After partitioning the data into training and testing sets, the class imbalance is investigated.
train.ind=createDataPartition(df$collision_id, times = 1, p=0.7, list = FALSE)
training.df=df[train.ind, ]
testing.df=df[-train.ind, ]
This figure shows that class distribution is highly imbalanced. However, the xgBoost algorithm can handle data imbalance with class weights. In general binary classification, the weight of positive class in xgBoost parametrization, in this case ‘’Collision’’, is given as the ratio of number of negative class to the positive class.
label=as.numeric(as.character(training.df$collision_id))
sumwpos=sum(label==1)
sumwneg=sum(label==0)
print(sumwneg/sumwpos)
## [1] 61.35867
Since the training set is big enough to cause memory issues. Both training and testing sets are converted to sparse matrices.
dtest=sparse.model.matrix(collision_id~.-1, data = data.frame(testing.df))
dtrain=sparse.model.matrix(collision_id~.-1, training.df)
For tunning the xgboost hyperparameters, the caret grid search method is used. Learning rate or eta
, maximum depth of trees, and minimum child weight which measures the number of instances in a node before the algorithm decides to partition before. gamma
denotes the minimum loss reduction required to make a further partition. subsample
and colsample_bytree
indicate the proportion of the training data set and features, respectively, used in training the algorithm. Before training the xgboost model, a parallel backend must be registered.
xgb.grid=expand.grid(nrounds=100,
eta=seq(0.1, 1, 0.2),
max_depth=c(3, 5, 10),
gamma = 0,
subsample = 0.7,
min_child_weight = c(1, 3, 5),
colsample_bytree = 1)
myCl=makeCluster(detectCores()-1)
registerDoParallel(myCl)
xgb.control=trainControl(method = "cv",
number = 5,
verboseIter = TRUE,
returnData = FALSE,
returnResamp = "none",
classProbs = TRUE,
allowParallel = TRUE)
xgb.train = train(x = dtrain,
y = factor(label, labels = c("No.Collision", "Collision")),
trControl = xgb.control,
tuneGrid = xgb.grid,
method = "xgbTree",
scale_pos_weight=sumwneg/sumwpos)
stopCluster(myCl)
The result of the grid search for tunning the xgBoost hyperparamaters are given below. This is the best configuration of The best configuration of eta
, max_depth
, and min_child_weight
is given by:
xgb.train$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 16 100 10 0.3 0 1 1 0.1
Using these parameters, a corss-validated training is done with nrounds=500
to identify the best iteration, i.e., nrounds
. The algorithm can stop early if the test accuracy is not improved for early_stopping_rounds
number of iterations.☻
params=list("eta"=xgb.train$bestTune$eta,
"max_depth"=xgb.train$bestTune$max_depth,
"gamma"=xgb.train$bestTune$gamma,
"min_child_weight"=xgb.train$bestTune$min_child_weight,
"nthread"=4,
"objective"="binary:logistic")
xgb.crv=xgb.cv(params = params,
data = dtrain,
nrounds = 500,
nfold = 10,
label = label,
showsd = TRUE,
metrics = "auc",
stratified = TRUE,
verbose = TRUE,
print_every_n = 1L,
early_stopping_rounds = 50,
scale_pos_weight=sumwneg/sumwpos)
xgb.crv$best_iteration
## [1] 103
The model has the highest accuracy in the 103rd interation. The testing and training accuracies of the cross-validated xgBoost fitting follow each other closely, and thus there is no need to consider a gamma > 0
.
To produce feature importances, an instance of the xgBoost learner is run with the optimized parameters of the grid search xgb.train
and the cross-validated training xgb.crv
.
xgb.mod=xgboost(data = dtrain,
label = label,
max.depth=xgb.train$bestTune$max_depth,
eta=xgb.train$bestTune$eta,
nthread=4,
min_child_weight=xgb.train$bestTune$min_child_weight,
scale_pos_weight=sumwneg/sumwpos,
eval_metric="auc",
nrounds=xgb.crv$best_iteration,
objective="binary:logistic")
The top 20 features with the highest importance are selected here.
importance=xgb.importance(feature_names = colnames(dtrain), model = xgb.mod)
importance$Feature[1:20]
## [1] "closure_id1" "work_length" "collision_density11_12"
## [4] "truck_aadt" "road_adt" "closure_coverage"
## [7] "closure_length" "work_duration" "peak_aadt"
## [10] "aadt" "road_speed" "route10"
## [13] "countySJ" "activityM90000" "road_width"
## [16] "work_dayWed" "work_monthSep" "surface_typeC"
## [19] "work_dayTue" "work_monthJul"
xgBoost internally implements a dummy variable generation to transform categorical variables with \(k\) levels to \(k-1\) binary variables. The newly generated dummy variables are names by combining the feature names and its levles. The below code properly renames the top 20 features with the highest importance.
feature.label=importance$Feature[1:20]
feature.label=c("Closure = 1", "Work length", "Collision density", "Truck AADT",
"ADT", "Closure coverage", "Peak AADT", "Work duration", "Closure length",
"AADT", "Design speed", "Route ID = 10", "County = San Jose", "Activity code = M90000",
"Road width", "Surface type = Concrete", "Work month = Sep.", "Work month = Jul.",
"Work day = Wed.", "Route ID = 210")
The top 20 feautre importances are plotted here. xgBoost records three measures of importance for trees; Gain
which measure the contribution of each feature to optimization of the objective function, Cover
counts the number of observations assigned to the feature, and Weight
which denotes the number of times the feature was selected for a tree. The following plot shows the importances with respect to Gain
and has undergon a little ggplot treatment.
(gg=xgb.ggplot.importance(importance_matrix = importance[1:20,]))
It has been shown that these importances are not consistent between different data sets; see Scott Lundberg et. al. (2019) or Scott’s repository. Although, xgBoost is very powerfull in reaching high accuracies in training, it achieves that by using a large number of features. This is particularly problematic if the data set has a lot of features or is comprised of categorical variables with numerous levels. In this example, with 700 features, using prunning parameters such as gamma
, lambda
, alpha
, min_child_weight
, or max_delta_step
does not significantly reduces the number of features used in fitting the model. At the end of training, more than 400 features still remian in the model. Therefore, selecting a small subset of features for interpretation requires arbitrary importance thresholds which as we discussed before are not consistent. This is why some feature selection wrappers such as Boruta or global importance analysis like SHAP values have been proposed; see my own implementation.