Wanna know more about data science? Make sure to check out my events and my webinar What it's like to be a data scientist and What’s the best way to become a data scientist !

Gradient boosted trees

Gradient boosted trees are one of the most popular techniques in machine learning, and for good reason: they are among the most powerful algorithms available, train quickly and often give very good results. This is one of the reasons why so many libraries implement them, which in turn makes it difficult for a beginner data scientist to choose one.

In this article, we are going to examine the most popular ways to run gradient boosted trees in Python.

Gradient boosted tree libraries

There are four popular libraries for gradient boosted trees in Python:

  1. Scikit-learn
  2. XGBoost
  3. LightGBM
  4. CatBoost

Scikit-learn’s implementation is the least popular of the four, so we will not discuss it here. Let’s move on to the rest.

XGBoost

XGBoost is by far the most popular gradient boosted trees implementation. It is described as “an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable”. XGBoost is available for both R and Python. One drawback is that its native API uses its own data structure, the DMatrix, for loading and processing data. So, if you want to use the native API in Python, you have to do something along the following lines.

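import xgboost as xgb
# read in data (paths are relative to a checkout of the XGBoost repository;
# recent XGBoost releases expect the file format in the URI)
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train?format=libsvm')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test?format=libsvm')
# specify parameters via a dict
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
num_round = 2  # number of boosting rounds
bst = xgb.train(param, dtrain, num_round)
# make predictions (probabilities, since the objective is binary:logistic)
preds = bst.predict(dtest)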

The good news is that there is a scikit-learn wrapper which lets you use XGBoost alongside common scikit-learn functions. So, you can do something along the following lines:

from xgboost import XGBClassifier

# X_train and y_train are assumed to be your training features and labels
model = XGBClassifier()
model.fit(X_train, y_train)

XGBoost is a tried and tested library, and a favourite of Kagglers. That being said, it is not the fastest implementation out there. For speed, we have to look towards LightGBM.

LightGBM

LightGBM was created by Microsoft Research. It is an implementation of gradient boosted trees that aims to train very quickly. The first time I tried LightGBM I was truly amazed by its speed. We won’t go into the details of the algorithm, but there are some fundamental differences in how it handles a few things, like categorical features. Its two main techniques are called Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). Roughly speaking, GOSS keeps the training instances with large gradients and samples from the rest, while EFB bundles features that are rarely non-zero at the same time. You can learn more about them in the LightGBM paper.

LightGBM has become a standard choice for me, since it is so fast and easy to use. It is now usually one of the first algorithms I try. As you might expect, it offers a scikit-learn API, which makes creating a model very easy.

from lightgbm import LGBMClassifier
model = LGBMClassifier()

CatBoost

CatBoost is another gradient boosted trees implementation, developed by researchers at Yandex. CatBoost’s highlight is its special treatment of categorical features.

The CatBoost site shows a benchmark comparison of different algorithms. In that comparison, the different gradient boosted trees implementations have very similar performance.

CatBoost also offers a scikit-learn-style API. It is very easy to import it and create a model.

from catboost import CatBoostClassifier
model = CatBoostClassifier(n_estimators=100)

Which gradient boosted trees library is the best?

Choosing one library is not easy. As you saw, however, the performance of the different libraries is quite similar on many benchmarks. A personal favourite of mine is LightGBM. It is just so fast that it makes experimentation a breeze. However, in terms of predictive performance, any of the above libraries is a great choice. In general, I’ve found that what is important is not so much a particular implementation as the family of algorithms. If you get good performance with LightGBM on a problem, you will probably get good performance with the other gradient boosted trees implementations as well. It’s just that LightGBM is very fast and can quickly help you decide whether to proceed with gradient boosted trees or not.




Dr. Stylianos Kampakis is the owner and author of The Data Scientist.