Wanna know more about data science? Make sure to check out my events and my webinar What it's like to be a data scientist and What’s the best way to become a data scientist !

Getting started with anomaly detection

Anomaly detection is one of the most interesting applications in machine learning. While anomaly detection can be done in a both supervised and unsupervised manner, in most cases, it is done through unsupervised algorithms.

One of the best ways to get started with anomaly detection in Python is the pyod library. The authors decsribe PyOD as follow

PyOD includes more than 30 detection algorithms, from classical LOF (SIGMOD 2000) to the latest COPOD (ICDM 2020). Since 2017, PyOD [AZNL19] has been successfully used in numerous academic researches and commercial products [AGSW19,ALCJ+19,AWDL+19,AZNHL19]. It is also well acknowledged by the machine learning community with various dedicated posts/tutorials, including Analytics VidhyaTowards Data ScienceKDnuggetsComputer Vision News, and awesome-machine-learning.

So, it is clear that pyod is a good way to get started with anomaly detection!

The PyOD library

The PyOD library follows the same syntax as scikit-learn. Let’s say that you want to create a COPOD detector. COPOD is an advanced anomaly detection algorithm which stands for Copula-Based Outlier Detection. You can find the original paper here. You can do so in the following way:

# train the COPOD detector
from pyod.models.copod import COPOD
clf = COPOD()
clf.fit(X_train)

# get outlier scores
y_train_scores = clf.decision_scores_  # raw outlier scores
y_test_scores = clf.decision_function(X_test)  # outlier scores

As you can see this is the same principle as if you were creating a classifier in scikit-learn. As you notice, the model can return a raw outlier score, but also a decision outcome.

A challenge with anomaly detection algorithms, is that they are producing scores, which you then have to interpret. Determning what is an anomaly and what is not an anomaly, can often depend on the context. SO, it is quite likely you will need to use the raw decision score, and then adapt it to your domain.

There are many different outlier detection methods in PyOd, and choosing the right one requires lots of tuning an experimentation. The authors of PyOD have created a very nice graphic comparing how different algorithms would react to the same problem.

Anomaly detection

Anomaly/outlier detection is a popular subdomain of machine learning with many applications in domains like cybersecurity and fraud detection. In this short post we briefly talked about the PyOD library which is probably the best way to get started with anomaly detection in Python. If you are interested in learning more about the subject, also please make sure to check out the video below:


Wanna know more about data science? Besides my events, you should check out my webinars:
  1. If you want to learn data science: What it's like to be a data scientist and What’s the best way to become a data scientist
  2. If you are a CEO: The importance of data strategy


Dr. Stylianos Kampakis is the owner and author of The Data Scientist.