Gradient boosting creates an ensemble of weak learners (typically decision trees), where each new tree is fit to a modified version of the original data set. At each stage we have an imperfect model \(F_m\) and add a new estimator \(h_m(x)\), so that ideally \[F_{m+1}(x) = F_m(x) + h_m(x) = y\]
We fit \(h_m\) to the residual, \(h_m(x) = y - F_m(x)\).
The weak models are trained in a gradual, additive, and sequential manner; gradient boosting identifies their shortcomings using the gradients of the loss function.
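A minimal sketch of this residual-fitting loop for squared-error loss, using shallow scikit-learn regression trees as the weak learners (the learning rate, depth, and number of rounds below are illustrative choices, not values from these notes):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    """Toy gradient boosting for squared-error loss: each new tree fits the residual."""
    f0 = float(np.mean(y))                    # F_0: constant prediction
    f = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residual = y - f                      # negative gradient of 1/2 * (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        f = f + learning_rate * tree.predict(X)   # F_{m+1} = F_m + lr * h_m
        trees.append(tree)
    return f0, trees

def predict(f0, trees, X, learning_rate=0.1):
    """Sum the initial constant and the scaled contribution of every tree."""
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```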
Building off the table here
Also see:
Name of parameter | Default value | Ranges | Notes |
---|---|---|---|
objective | regression | regression, binary | Affects other parameters |
metric | null | 20+ different metrics | null means the metric corresponding to the specified objective is used |
boosting | gbdt | gbdt, rf, dart, goss | rf → bagging approach |
categorical_feature | empty string | column indices (or names) | Tells LightGBM which features to treat as categorical |
bagging_freq | 0 | [0, ∞] | 0 disables bagging; k means perform bagging every k iterations |
Name of parameter | Default value | Ranges | Notes | Tuning |
---|---|---|---|---|
bagging_fraction/subsample | 1.0 | [0, 1] | Randomly select a fraction of the data without resampling | Improves generalization and training speed |
feature_fraction | 1.0 | [0, 1] | Fraction of features used for each tree | Speeds up training, helps with overfitting |
num_leaves | 31 | [1, ∞] | Max number of leaves in one tree | Generally num_leaves = 2^(max_depth), but the two should be tuned together |
max_depth | -1 | [-1, ∞] | -1 means no limit; larger is usually better | Too large → overfitting |
max_bin | 255 | [2, ∞] | Each continuous feature is split into discrete bins | Smaller is faster; larger is more accurate |
min_data_in_leaf | 20 | [0, ∞] | Minimum number of samples in one leaf | Larger values help prevent overfitting |
Name of parameter | Default value | Ranges | Notes | Tuning |
---|---|---|---|---|
learning_rate | 0.1 | [0, 1] | Typical: 0.05 | Larger values reduce training time |
num_iterations/num_rounds/nrounds | 100 | [1, ∞] | Number of trees to build | Start small then increase; use a smaller learning_rate with a larger num_iterations |
early_stopping_round | 0 | [0, ∞] | Stops training if the validation metric doesn't improve over the last early_stopping_round rounds | Generally ~10% of num_iterations |
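A minimal sketch of how these parameters map onto the LightGBM Python API (synthetic data; the concrete values are illustrative starting points, and the early-stopping callback assumes a recent LightGBM version):

```python
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic regression data for illustration
X, y = make_regression(n_samples=5000, n_features=20, noise=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

params = {
    "objective": "regression",   # affects which metrics/parameters apply
    "metric": "rmse",            # omit to fall back to the objective's default metric
    "boosting": "gbdt",          # gbdt, rf, dart, goss
    "num_leaves": 31,
    "max_depth": -1,             # -1 = no limit
    "learning_rate": 0.05,
    "feature_fraction": 0.8,     # fraction of features per tree
    "bagging_fraction": 0.8,     # fraction of rows, only active when bagging_freq > 0
    "bagging_freq": 5,           # perform bagging every 5 iterations
    "min_data_in_leaf": 20,
}

train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

booster = lgb.train(
    params,
    train_set,
    num_boost_round=1000,                 # num_iterations
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(100)],  # ~10% of num_iterations
)
```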
CatBoost, LightGBM, and XGBoost are all gradient boosting methods, but they differ in how they grow trees/make splits, handle missing data, and handle categorical data.
LightGBM:
Optionally (not the default) uses gradient-based one-side sampling (GOSS): selects splits using all the instances with large gradients/large errors plus a random sample of the instances with small gradients (see the sketch after this list)
Also introduces a constant multiplier for the data with small gradients to keep the same data distribution when computing the information gain
Reduces number of data instances but keeps accuracy
Quickly finds the most influential cuts
Uses leaf-wise (best-first) tree growth: allows for imbalanced trees by splitting the leaf that gives the largest reduction in loss
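A minimal sketch of the GOSS sampling step described above. The top_rate/other_rate names mirror LightGBM's parameters of the same name, but the function itself is illustrative, not LightGBM's implementation:

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, rng=None):
    """Toy GOSS: keep every large-gradient row, subsample the small-gradient rows,
    and up-weight the sampled rows by (1 - top_rate) / other_rate so the data
    distribution stays roughly unchanged when computing information gain."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(gradients)
    order = np.argsort(np.abs(gradients))[::-1]    # sort by |gradient|, descending
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top_idx = order[:n_top]                        # large-gradient instances: keep all
    sampled_idx = rng.choice(order[n_top:], size=n_other, replace=False)
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1 - top_rate) / other_rate  # constant multiplier for small gradients
    return np.concatenate([top_idx, sampled_idx]), weights
```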
CatBoost:
Grows oblivious trees: trees are grown by testing the same predictor with the same conditions on all nodes at the same level
By default this gives symmetric/balanced trees: the feature-split pair that brings the lowest loss is selected and used for all of the level's nodes (see the sketch below)
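As a small illustration, this growth strategy is exposed through CatBoost's grow_policy parameter (SymmetricTree is the default; the data and parameter values below are purely illustrative):

```python
from catboost import CatBoostRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

# Oblivious (symmetric) trees: the same feature-split pair is applied
# across an entire level of each tree.
model = CatBoostRegressor(
    iterations=200,
    depth=6,
    grow_policy="SymmetricTree",  # alternatives: "Depthwise", "Lossguide"
    verbose=0,
)
model.fit(X, y)
```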
XGBoost:
Similar to LightGBM, it uses the gradients of the different candidate cuts to select the next cut, but also uses the Hessian/second derivative to rank them (split-gain formula below)
Splits trees up to max_depth, then prunes backwards to remove splits without positive gain
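For reference, the split gain XGBoost ranks candidate cuts by (from the original XGBoost paper) combines the summed gradients \(G\) and Hessians \(H\) of the left and right children, \[ \text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma \] where \(\lambda\) is the L2 regularization term and \(\gamma\) is the per-leaf complexity penalty; splits whose gain is not positive are the ones pruned away.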
CatBoost:
Two options: Min and Max
Missing values are processed as the minimum/maximum value for the feature, i.e. given a value less than/greater than all other values (see the sketch after this list)
Guaranteed to consider a split that separates missing values from all other values
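A minimal sketch of this setting, assuming CatBoost's nan_mode parameter with its "Min"/"Max" options (synthetic data, illustrative values):

```python
import numpy as np
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X[rng.random(1000) < 0.1, 0] = np.nan     # ~10% missing values in feature 0
y = 2 * X[:, 1] + rng.normal(size=1000)

# nan_mode="Min": missing values are treated as smaller than every observed value,
# so a split separating them from all other values is always available.
model = CatBoostRegressor(iterations=100, nan_mode="Min", verbose=0)
model.fit(X, y)
```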
LightGBM/XGBoost:
Missing values are ignored when a split is evaluated and are then assigned to whichever side of the split reduces the loss the most
XGBoost:
No built-in categorical handling (historically); categorical features must be encoded before training, e.g. one-hot or label encoding
LightGBM:
Can either encode before training, or use the built-in method (gradient statistics), which finds split values for categorical features without one-hot encoding (see the sketch after this list)
Categories are encoded during training instead of before: at each split decision for a node, the categories are ordered by how much error they contribute in that node
Compute gradients for every observation in the node
Sum the squared gradients by category and use this to order the categories
Evaluate the candidate splits using these gradient statistics
Cons: computationally expensive, uses a lot of memory, gets worse with many categories
This method isn’t necessarily better than encoding
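A minimal sketch of the built-in route, passing a pandas categorical column to LightGBM via categorical_feature rather than one-hot encoding it (synthetic data, illustrative parameters):

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": pd.Categorical(rng.choice(["paris", "tokyo", "lima"], size=1000)),
    "age": rng.integers(18, 80, size=1000),
})
# Target that depends on the categorical column
y = (df["city"] == "tokyo").astype(float) + rng.normal(scale=0.1, size=1000)

# Declare the categorical column; LightGBM finds splits on it directly
train_set = lgb.Dataset(df, label=y, categorical_feature=["city"])
booster = lgb.train({"objective": "regression", "verbose": -1}, train_set, num_boost_round=50)
```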
CatBoost:
Uses a combination of one-hot encoding (for features with a low number of categories) and an advanced mean encoding (see the sketch after this list):
Permute the set of input observations; multiple random permutations are generated
Convert the label from floating point/category to an integer
Iterate sequentially through the observations, respecting the new order: compute the statistic using only observations seen in the past
Statistic (for a classification model): \(\text{avg\_target} = \frac{\text{count\_in\_class} + \text{prior}}{\text{total\_count} + 1}\)
count_in_class = number of times the target was 1 for the current categorical feature value in the past training data
total_count = total number of past training observations with the current categorical feature value
prior = a constant defined by the starting parameters
First observations have high variance because there’s little training data at the start of the iterations
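A minimal sketch of this ordered target statistic for a single permutation of a single categorical column (the prior value and names here are illustrative; this is a simplification, not CatBoost's implementation, which uses multiple permutations):

```python
import numpy as np

def ordered_target_statistic(categories, labels, prior=0.05, seed=0):
    """Encode one categorical column with an ordered target statistic:
    for each row (taken in a random permutation), use only previously seen
    rows of the same category: (count_in_class + prior) / (total_count + 1)."""
    rng = np.random.default_rng(seed)
    encoded = np.empty(len(categories))
    count_in_class = {}   # per category: past rows with label == 1
    total_count = {}      # per category: past rows of any label
    for i in rng.permutation(len(categories)):
        c = categories[i]
        encoded[i] = (count_in_class.get(c, 0) + prior) / (total_count.get(c, 0) + 1)
        count_in_class[c] = count_in_class.get(c, 0) + int(labels[i] == 1)
        total_count[c] = total_count.get(c, 0) + 1
    return encoded

cats = np.array(["a", "b", "a", "a", "b", "c"])
y = np.array([1, 0, 1, 0, 1, 0])
print(ordered_target_statistic(cats, y))  # early rows per category lean on the prior
```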
From reference 1: LightGBM, CatBoost, and XGBoost were ~15x, 5x, and 3x faster than GBM.
From reference 2: CatBoost had the best accuracy, minimum overfitting, and minimum prediction/tuning time → but it only worked well because the data had categorical variables and one_hot_max_size was tuned.