
Data Science Tutorial: A Practical Guide to Supervised Learning Algorithms 



The demand for data scientists has increased dramatically and will keep rising with advancements in AI. According to a recent Harvard Business Review article by Thomas H. Davenport and DJ Patil, data analytics and AI will continue to play significant roles in business and society.

Data science is a multidisciplinary field that uses numerous techniques and tools to extract meaningful insights from data. Combining machine learning (ML) with data science makes it possible to generate valuable information from fast-growing data. Data scientists use ML to predict high-value outcomes, which can inform decisions and drive action with minimal human intervention.

This data science tutorial will explore various supervised algorithms and their practical implementation in Python. The tutorial is designed for beginners to learn supervised learning and implement it in real-world scenarios.

What is Machine Learning?

Machine learning (ML) is a type of artificial intelligence (AI) that enables machines to learn from historical data and enhance their performance without being explicitly programmed. It focuses on developing systems that can access data and use it to learn.

Machine learning aims to allow machines to learn autonomously without human assistance and adjust accordingly.

Types of Machine Learning 

Machine learning (ML) algorithms can be divided into mainly three types, which are:

Supervised Learning 

In supervised learning, we train the machine using a labeled dataset, meaning each input in the training data is already paired with its correct output. First, we train the machine with input data and the corresponding output. Later, we ask it to predict outcomes on a test dataset.

Supervised learning algorithms have interesting real-world applications: risk assessment, churn prediction, spam filtering, fraud detection, etc.

Supervised learning algorithms can be further classified into two types.

  • Regression

Regression models the relationship between the input variable (x) and a continuous output variable (y). Regression algorithms predict continuous outputs, as in weather prediction, house price prediction, market trend analysis, etc.

The mathematical equation for simple linear regression is Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept.
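As a quick worked example of the equation (with made-up values for the intercept and slope):

#Hypothetical coefficients: intercept a = 4, slope b = 2
a, b = 4, 2
X = 10
Y = a + b * X
print(Y)  #24: the predicted output for input X = 10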

  • Classification 

Classification algorithms solve problems where the output variable is categorical, such as churned or non-churned. These algorithms predict which of the categories present in the data each observation belongs to.

Unsupervised Learning 

In unsupervised learning, the machine is trained on an unlabeled dataset and predicts outcomes without supervision. The model is trained with data that is neither labeled nor classified. Unsupervised learning groups or categorizes unsorted data according to similarities, patterns, and differences.

Unsupervised learning can further be divided into two types:

  • Clustering

Clustering groups objects into clusters so that the most similar objects fall in the same cluster and have little similarity with objects in other clusters. In clustering, the intra-cluster similarity of objects is higher than the inter-cluster similarity. For instance, we can group grocery store customers into clusters by purchasing behavior.
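As a taste of what this looks like in code, here is a minimal sketch using scikit-learn's KMeans on made-up customer data (two features per customer: monthly visits and average spend); all numbers are purely illustrative:

import numpy as np
from sklearn.cluster import KMeans

#Made-up customer data: [monthly visits, average spend]
customers = np.array([[2, 15], [3, 20], [25, 200], [30, 180], [4, 18], [28, 210]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  #cluster assignment for each customer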

  • Association rule mining

Association rule mining is an unsupervised learning technique that finds interesting correlations among items in a large dataset. It is used to find relationships between data items; for example, it helps us understand customers' purchasing behavior in grocery stores. By understanding these patterns, businesses can generate more profit. Market basket analysis, web usage mining, and continuous production are applications of association rules.
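scikit-learn does not include association rule mining, but the third-party mlxtend library does. A minimal sketch, assuming mlxtend is installed and using a toy one-hot basket dataset:

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

#Toy basket data: each row is a transaction, each column an item
baskets = pd.DataFrame({'bread': [1, 1, 0, 1], 'butter': [1, 1, 0, 0], 'milk': [0, 1, 1, 1]}, dtype=bool)
frequent = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])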

Reinforcement Learning 

In reinforcement learning, we train models to make a sequence of decisions. An agent interacts with the environment and learns to accomplish a goal in a potentially complex and uncertain setting. The agent receives a reward for every correct decision and a penalty for every wrong action. The goal is to maximize the total reward and minimize the penalties.
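As a concrete (if toy) illustration, here is a minimal tabular Q-learning sketch on a one-dimensional corridor, where the agent earns a reward only for reaching the rightmost cell; all names and parameters are illustrative:

import numpy as np

n_states, n_actions = 5, 2           #corridor cells; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))  #table of learned action values
alpha, gamma, epsilon = 0.1, 0.9, 0.3

rng = np.random.default_rng(0)
for episode in range(300):
    state = 0
    while state != n_states - 1:
        #Epsilon-greedy: explore sometimes, otherwise exploit the best known action
        action = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[state]))
        next_state = max(state - 1, 0) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else 0.0  #reward only at the goal
        #Q-learning update rule
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))  #learned policy: should prefer moving right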

How does supervised learning work?

In supervised machine learning, the model is trained on labeled data and learns the characteristics of each class. Once the training phase is complete, the model is validated on test data, where it predicts the outcome.

The workings of supervised machine learning can be understood easily with the following example.

For instance, suppose we have a dataset containing cat and dog images. The first step is to train our model on each type of image, and in the labeled dataset we provide the labels "cat" and "dog". After training the model, we validate it on the test set, where the task is to predict whether an image shows a dog or a cat based on its features. Having been trained on many images of cats and dogs, the model classifies each new picture into one of the two groups based on similar features and predicts the outcome.

Steps involved in Supervised Learning 

The steps involved in supervised learning are listed below (a code sketch of this workflow follows the list).

  • Determine the type of training dataset.
  • Collect the labeled training data.
  • Split the data into training and test sets.
  • Identify the input features of the training dataset.
  • Select a suitable algorithm.
  • Evaluate the model's performance using evaluation metrics.
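Put together, the workflow looks roughly like this sketch; the file name, feature columns, target column, and choice of classifier are placeholders:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#1. Collect the labeled training data (hypothetical file and column names)
data = pd.read_csv('labeled_data.csv')
X = data.drop('target', axis=1)  #input features
y = data['target']               #labels

#2. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#3. Select a suitable algorithm and train it
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

#4. Evaluate the model's performance with an evaluation metric
print(accuracy_score(y_test, model.predict(X_test)))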

Advantages of Supervised Learning 

  • Supervised learning allows you to gather data and predict based on prior experience.
  • In supervised learning, we are familiar with the target variables we want to predict.
  • Supervised learning algorithms help us to solve numerous real-world problems.

Disadvantages of Supervised Learning 

  • Supervised learning algorithms are not sufficient to handle complex tasks.
  • Training requires a lot of computation time and other resources.
  • Supervised learning can't predict correct outcomes when the test data differs substantially from the training data.
  • Working on big data is a real challenge for supervised algorithms.

Practical Implementation of Supervised Learning Algorithms 

In this tutorial, you will learn to implement various supervised algorithms in Python using scikit-learn, a powerful open-source Python library widely used for machine learning tasks. It provides numerous robust algorithms for regression, classification, clustering, and dimensionality reduction.

Let’s begin!

K-Nearest Neighbors

K-nearest neighbors (KNN) is a simple supervised machine learning algorithm. In KNN, we measure the similarity between a new data point (case) and the available points, commonly with the Euclidean distance formula.
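To make the distance idea concrete, here is a minimal sketch that computes Euclidean distances from a new point to a few stored points and picks the k nearest; all values are made up:

import numpy as np

points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 1.0], [0.0, 0.0]])  #stored training points
new_point = np.array([2.0, 2.0])

#Euclidean distance: square root of the sum of squared coordinate differences
distances = np.sqrt(((points - new_point) ** 2).sum(axis=1))
k = 2
nearest = np.argsort(distances)[:k]  #indices of the k closest points
print(nearest, distances[nearest])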

To implement the algorithm, we use the diabetes dataset available on Kaggle, which is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.

First, we need to import the required libraries to access numerous functions.

#Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

Now we load the diabetes dataset into our notebook by reading the CSV file. The head() function returns the first five rows of the dataset.

#Load the diabetes dataset
my_data = pd.read_csv('diabetes_dataset.csv')
my_data.head()

The dataset contains nine columns, including the target column Outcome. In the Outcome column, 1 indicates a diabetic patient and 0 a non-diabetic patient.

To view the number of rows and columns in the data frame, we use my_data.shape. This dataset contains 768 entries and nine columns.

#Let's view the shape of the dataframe. You should see (768, 9)
my_data.shape

A histogram illustrates important features of the distribution of the data. The hist() function shows the distribution of each numerical column.

my_data.hist(figsize=(8,8)) 

A correlation heatmap reveals potential relationships between variables in the data and displays the strength of those relationships. To draw the heatmap, we use the seaborn plotting library. In this dataset, glucose has the strongest correlation with the outcome, consistent with its central role in diabetes.

plt.figure(figsize=(9,6))
sns.heatmap(my_data.corr(), annot=True, cmap='icefire').set_title('Heatmap Graph')
plt.show()

#Create numpy arrays for the features and the target class
X_val = my_data.drop('Outcome', axis=1).values
y_val = my_data['Outcome'].values

To split the dataset into training and test sets in the required proportion, we import the train_test_split() function.

#import train_test_split

from sklearn.model_selection import train_test_split

We split the dataset into training and test sets. With test_size=0.6, the train_test_split() function reserves 60% of the data for testing and 40% for training.

X_train, X_test, y_train, y_test = train_test_split(X_val, y_val, test_size=0.6, random_state=40, stratify=y_val)

To implement the k-nearest neighbors algorithm, we import KNeighborsClassifier().

#import KNeighborsClassifier

from sklearn.neighbors import KNeighborsClassifier

#Set up arrays to store training and test accuracies
my_neighbors = np.arange(1, 10)
train_acc = np.empty(len(my_neighbors))
test_acc = np.empty(len(my_neighbors))

for i, k in enumerate(my_neighbors):
    #Set up a knn classifier with k neighbors
    knn_model = KNeighborsClassifier(n_neighbors=k)
    #Fit the model
    knn_model.fit(X_train, y_train)
    #Compute accuracy on the training set
    train_acc[i] = knn_model.score(X_train, y_train)
    #Compute accuracy on the test set
    test_acc[i] = knn_model.score(X_test, y_test)

Now we plot the training and testing accuracies to choose the best value of k.

#Generate plot
plt.title('k-NN with different numbers of neighbors')
plt.plot(my_neighbors, test_acc, label='Testing Accuracy')
plt.plot(my_neighbors, train_acc, label='Training Accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy of model')
plt.show()

The graph shows that testing accuracy improves at higher values of k, so we select k = 9 for the final K-NN model.

#Set up the final model with k = 9 neighbors, fit it, and score it
knn_model = KNeighborsClassifier(n_neighbors=9)
knn_model.fit(X_train, y_train)
knn_model.score(X_test, y_test)

The model accuracy is 0.73 when we take the number of neighbors K=9 for this dataset.

#Compute the confusion matrix
from sklearn.metrics import confusion_matrix

#Predictions of the KNN classifier
y_prediction = knn_model.predict(X_test)
confusion_matrix(y_test, y_prediction)

#Let's see the classification report of KNN
from sklearn.metrics import classification_report
print(classification_report(y_test, y_prediction))
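A single train/test split can be noisy, so one optional way to double-check the choice of k is cross-validation. A minimal sketch using scikit-learn's cross_val_score on the same feature arrays:

from sklearn.model_selection import cross_val_score

#5-fold cross-validated accuracy for k = 9
scores = cross_val_score(KNeighborsClassifier(n_neighbors=9), X_val, y_val, cv=5)
print(scores.mean())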

Decision Tree Classifier 

A decision tree is a supervised machine learning algorithm, based on rules or conditions, that is used for classification and regression problems. A decision tree has a hierarchical, tree-like structure consisting of a root node, branches, internal nodes, and leaf nodes. One popular variant of the decision tree is called CART (Classification and Regression Trees).

To implement the algorithm, we use the breast cancer dataset available on Kaggle.

To apply various functions on this dataset, we need to import certain libraries.

#First import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Now we load the dataset into our notebook to perform some operations and get insights from the data.

my_data = pd.read_csv("breast_cancer.csv")
my_data.head()

This dataset contains 33 columns and 569 observations. It also contains null values, which need to be removed before training the model.

my_data.info()

#The line below should display (569, 33)

my_data.shape

As you can see, the dataset contains 33 fields, too many to display at once in the notebook.

To remove the null values from the dataset, we use dropna(); axis=1 drops the columns that contain them.

my_data.dropna(axis=1, inplace=True)

The id column is also unnecessary, so we remove it from the dataset with the drop() function.

my_dt = my_data.drop(["id"], axis=1)
my_dt.head(3)

Every remaining field contributes to predicting the diagnosis, which takes one of two values: malignant (M) or benign (B). The code below selects the rows for each diagnosis.

The_M = my_data[my_data.diagnosis == "M"]
The_M.head()

The_B = my_data[my_data.diagnosis == "B"]
The_B.head(6)

A scatter plot displays the texture mean against the radius mean for benign and malignant tumors.

#Plot both diagnoses, benign and malignant
plt.title("Benign Tumor VS Malignant")
plt.xlabel("Radius_Mean")
plt.ylabel("Texture_Mean")
plt.scatter(The_M.radius_mean, The_M.texture_mean, color="blue", label="Malignant", alpha=0.4)
plt.scatter(The_B.radius_mean, The_B.texture_mean, color="orange", label="Benign", alpha=0.4)
plt.legend()
plt.savefig("importance graph 4", facecolor='w', bbox_inches="tight", pad_inches=0.3, transparent=True)
plt.show()

Now let’s implement the decision tree classifier using the Scikit-learn library.

#Encode the labels and drop the diagnosis and id columns from the features
my_data.diagnosis = [1 if i == "M" else 0 for i in my_data.diagnosis]
x_val = my_data.drop(["diagnosis", "id"], axis=1)
y_val = my_data.diagnosis.values

Min-max normalization rescales each feature to the [0, 1] range, so that features with large values do not dominate, the data is easier to process, and we get higher accuracy.

#Min-max normalization
my_x = (x_val - np.min(x_val)) / (np.max(x_val) - np.min(x_val))
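Alternatively, scikit-learn provides MinMaxScaler, which applies the same transformation; an equivalent sketch (note that fit_transform returns a NumPy array rather than a DataFrame):

from sklearn.preprocessing import MinMaxScaler

#Rescale each column to [0, 1], like the formula above
my_x = MinMaxScaler().fit_transform(x_val)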

Now we split the dataset into training and test sets, using 40% for testing and 60% for training.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(my_x, y_val, test_size=0.4, random_state=41)

We import DecisionTreeClassifier() from the scikit-learn library to access the functionality of the decision tree.

from sklearn.tree import DecisionTreeClassifier

my_dt = DecisionTreeClassifier()

my_dt.fit(x_train, y_train)

The score() method returns the classifier accuracy, which is about 95% on this breast cancer dataset.

#let’s predict the outcomes

my_dt.score(x_test, y_test)
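To see the rules the tree has learned, you can optionally print its structure with scikit-learn's export_text; a brief sketch, assuming my_x was built with the normalization formula above so that x_train is still a pandas DataFrame with named columns:

from sklearn.tree import export_text

#Print the learned decision rules, limited to the top levels for readability
print(export_text(my_dt, feature_names=list(x_train.columns), max_depth=2))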

Linear Regression 

Linear regression is a supervised machine learning algorithm that performs a regression task. It is widely used to find relationships between variables and forecast trends. Regression algorithms predict the result based on the independent variables. Regression models differ in the relationship they assume between the independent and dependent variables, and in the number of independent variables used for a specific problem.

In linear regression, we predict the dependent variable (Y) based on a given independent variable (x). This technique finds the linear relationship between input (x) and output (y) in the given dataset. Linear regression predicts real or continuous numeric variables such as sales, age, salary, price, etc.

There are two major types of linear regression. 

Simple Linear Regression: In simple linear regression, only a single independent variable is used to predict the outcome.

Multiple Linear Regression: In multiple linear regression problems, more than one independent variable is used to predict the result. 

Let’s implement the linear regression model. 

First of all, we import the numpy module and the LinearRegression class from sklearn.linear_model.

import numpy as np

from sklearn.linear_model import LinearRegression

Now we have all the functionalities that we need to implement the linear regression model. We will use the class sklearn.linear_model.LinearRegression to perform different types of regressions and make predictions.

The next step is to define the data to work with. The inputs (x) and output (y) should be NumPy arrays or similar array-like objects. This is a simple way to provide dummy data for the regression.

#Dummy Dataset of X and Y values

my_x = np.array([5, 10, 20, 25, 35, 55, 58, 68]).reshape((-1, 1))

my_y = np.array([5, 20, 14, 22, 34, 48, 56, 58])

Now print my_x and my_y to see the two arrays of inputs (x) and outputs (y). Applying reshape((-1, 1)) to x transforms it into a two-dimensional array with one column and as many rows as needed.

print(my_x)
print(my_y)

You can see that x has two dimensions (x.shape is (8, 1)), while y has only one (y.shape is (8,)).

Next, we create a linear regression model and fit it to the data above. Create an instance of the class LinearRegression.

my_model = LinearRegression()

This code snippet creates the variable my_model as an instance of LinearRegression. We can also pass various parameters to LinearRegression (an example instantiation follows the list).

  • n_jobs is either None or an integer and sets the number of jobs used for parallel computation. The default None means one job, while -1 uses all available processors.
  • copy_X indicates whether to copy (True) or overwrite (False) the input variable. By default, it is True.
  • fit_intercept decides whether to calculate the intercept b0 (True) or treat it as zero (False).
  • normalize, if True, normalizes the input variables; by default it is False. (This parameter has been deprecated and removed in recent scikit-learn versions; use sklearn.preprocessing.StandardScaler for normalization instead.)
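For instance, a hypothetical instantiation with explicit parameters might look like this:

#All parameters at their defaults except n_jobs, which here requests all processors
my_model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=-1)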

Now let’s use the model. First, we need to call the function .fit() on the model.

#Fit the linear regression model

my_model.fit(my_x, my_y)

We use .fit() to calculate the optimal values of b0 and b1, using the input (x) and output (y) as arguments. In a nutshell, .fit() returns self, the variable my_model itself, so the statement below is equivalent to the previous one.

Model_fit = LinearRegression().fit(my_x, my_y)

We have fitted our linear regression model; now let's check its performance. We can obtain the R² value, the coefficient of determination, by simply calling .score().

R_Square = my_model.score(my_x, my_y)
print(f"coefficient of determination: {R_Square}")
print(f"intercept value: {my_model.intercept_}")
print(f"slope value: {my_model.coef_}")

The intercept b0 is about 3.47, which means the model predicts a response of 3.47 when x equals zero. The slope b1 of about 0.83 means the predicted value rises by 0.83 for every one-unit increase in x.
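If you want to sanity-check these values, np.polyfit fits the same straight line and should return (approximately) the same slope and intercept:

#Degree-1 polynomial fit: returns [slope, intercept]
slope, intercept = np.polyfit(my_x.ravel(), my_y, 1)
print(slope, intercept)  #approximately 0.83 and 3.47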

#Now let's predict the outcomes
my_predict = my_model.predict(my_x)
print(f"Let's predict the outcomes:\n{my_predict}")
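Multiple linear regression works the same way: just give the model more than one input column. A minimal sketch with made-up two-feature data:

#Made-up data with two independent variables per row
multi_x = np.array([[1, 10], [2, 20], [3, 15], [4, 30], [5, 25]])
multi_y = np.array([12, 25, 21, 37, 33])
multi_model = LinearRegression().fit(multi_x, multi_y)
print(multi_model.intercept_, multi_model.coef_)  #one coefficient per feature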

Wrapping up  

Supervised learning algorithms let you automate future projections based on labeled data, which can be a significant improvement over manual classification methods. However, overfitting must be avoided when using supervised learning algorithms, and that calls for human expertise.

We have covered several implementations of supervised learning algorithms in this tutorial. Machine learning is a remarkably potent tool for solving complex problems in astronomy, economics, and many other fields. Demand for machine learning engineers has skyrocketed, and no decline is anticipated in the coming years.

Want to secure your dream job, switch to a lucrative career, or improve your knowledge in your field? Why not learn from the masters? Become data literate by building your knowledge in data science, machine learning, AI, and programming languages at your own pace.

We'll empower you with the data skills you need to stay relevant in the rapidly evolving digital economy. Reach out to our team now and discover a world of opportunities in data science, AI, and machine learning.

