Gradient boosted trees
Gradient boosted trees are one of the most popular techniques in machine learning, and for good reason: they are powerful, fast to train, and tend to produce strong results out of the box. This popularity is also why so many libraries implement them, which makes it hard for a beginner data scientist to choose one.
In this article, we are going to examine the main ways to run gradient boosted trees in Python.
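Before comparing libraries, it helps to see the core idea in a few lines of plain Python. The sketch below is a toy illustration, not how any of the libraries below actually implement the algorithm (they add regularization, histogram-based splits, and much more): each boosting round fits a depth-1 stump to the residuals of the ensemble so far and adds it with a small learning rate.

```python
def fit_stump(xs, residuals):
    """Find the threshold on a single feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=50, lr=0.1):
    base = sum(ys) / len(ys)            # start from the mean prediction
    pred = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # for squared loss, the negative gradient is simply the residual
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for x, p in zip(xs, pred)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

# the ensemble quickly learns this simple step function
model = gradient_boost([1, 2, 3, 4, 5, 6], [1, 1, 1, 5, 5, 5])
```

Every library in this article is an industrial-strength version of this loop, using full decision trees instead of stumps.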
Gradient boosted tree libraries
There are four popular libraries for gradient boosted trees:
- Scikit-learn
- XGBoost
- LightGBM
- Catboost
Scikit-learn’s implementation is the least popular of the four, so we won’t dwell on it. Let’s move on to the rest.
XGBoost
XGBoost is by far the most popular gradient boosted trees implementation. It is described as “an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable”, and it is supported in both R and Python. One drawback is that XGBoost’s native API uses its own data structure, the DMatrix, for loading and processing data. So, if you want to use the native API in Python, you have to do something along the following lines:
```python
import xgboost as xgb

# read in data
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')

# specify parameters via map
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
num_round = 2
bst = xgb.train(param, dtrain, num_round)

# make prediction
preds = bst.predict(dtest)
```
The good news is that there is a scikit-learn wrapper, which lets you use XGBoost alongside the familiar scikit-learn API. So, you can do something along the following lines:
```python
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train, y_train)
```
XGBoost is a well tried and tested model, and a favourite of Kagglers. That being said, it is not the fastest implementation out there. For speed, we have to look toward LightGBM.
LightGBM
LightGBM was created by Microsoft Research and is an implementation of gradient boosted trees that aims for very fast training. The first time I tried LightGBM I was truly amazed by its speed. We won’t go into the details of the algorithm, but it differs in how it handles a few things, such as categorical features. Its two main techniques are Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB); you can learn more about them in the LightGBM paper.
LightGBM has become a standard choice for me, since it is so fast and easy to use; it is now usually one of the first algorithms I deploy. As you can imagine, it also offers a scikit-learn API, which makes creating a model very easy.
```python
from lightgbm import LGBMClassifier

model = LGBMClassifier()
model.fit(X_train, y_train)
```
CatBoost
CatBoost is another gradient boosted trees implementation, developed by researchers at Yandex. CatBoost’s highlight is its native handling of categorical features: it can consume categorical columns directly, without manual encoding.
CatBoost shows a comparison of different algorithms on its site. As you can see there, the various gradient boosted trees implementations achieve very similar performance.
CatBoost also offers a scikit-learn-style API. It is very easy to import and create a model:
```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(n_estimators=100)
```
Which gradient boosted trees library is the best?
Choosing one library is not easy. As you saw, however, the different libraries perform quite similarly on many benchmarks. A personal favourite of mine is LightGBM: it is so fast that it makes experimentation a breeze. In terms of predictive performance, though, any of the above libraries is a great choice. In my experience, what matters is not so much the particular implementation as the family of algorithms. If you get good performance with LightGBM on a problem, you will probably get good performance with the other gradient boosted trees implementations as well. LightGBM is simply fast enough to help you quickly decide whether gradient boosted trees are worth pursuing at all.
Do you want to become a data scientist?
Do you want to become a data scientist and pursue a lucrative career with a high salary, working from anywhere in the world? I have developed a unique course based on my 10+ years of teaching experience in this area. The course offers the following:
- Learn all the basics of data science (value $10k+)
- Get premium mentoring (valued at $1k/hour)
- We apply to jobs for you and help you land one by preparing you for interviews (valued at $50k+ per year)
- We provide a satisfaction guarantee!
If you want to learn more, book a call with my team now or get in touch.