Creating machine learning pipelines using scikit-learn

Lerekoqholosha
4 min read · Nov 29, 2021

Getting familiar with ML pipelines

Before we move on to the coding part, I want us to understand the term "pipeline" in machine learning.

  • A machine learning pipeline is a way to codify and automate the workflow it takes to produce a machine learning model.
  • Machine learning pipelines consist of multiple sequential steps that do everything from data extraction and preprocessing to model training and deployment (a minimal sketch follows below).
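
To make this concrete, here is a minimal sketch of a two-step pipeline (scale, then classify) on the same iris data this article uses; the step names 'scaler' and 'model' are just my own labels, not anything scikit-learn requires:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# one estimator that chains preprocessing and modelling
pipe = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
pipe.fit(X, y)              # fits the scaler, transforms X, then fits the model
print(pipe.predict(X[:5]))  # new data is scaled with the fitted scaler before predicting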

Now that we understand what a pipeline is, let's get started.

Step 1: Importing libraries and loading data

# importing libraries
import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

Load data:

iris = load_iris()

Let's look at the features and labels:

# features
iris.data

# target
iris.target
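
If you want to inspect these arrays beyond the raw numbers, a quick look (my own addition, standing in for the original output screenshots) could be:

print(iris.feature_names)  # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.data.shape)     # (150, 4): 150 samples, 4 features
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']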

Step 2: Splitting the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=1)
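
As a quick sanity check (my addition), you can confirm the 80/20 split:

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4): 150 samples split 80/20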

Step 3: Creating pipelines

Pipeline creation:

  1. Preprocess the data using StandardScaler
  2. Reduce dimensionality using PCA
  3. Apply a classifier
# Logistic Regression pipeline
pipeline_lr = Pipeline([('scaler1', StandardScaler()),
                        ('pca1', PCA(n_components=2)),
                        ('lr_classifier', LogisticRegression())])

# Decision Tree pipeline
pipeline_dt = Pipeline([('scaler2', StandardScaler()),
                        ('pca2', PCA(n_components=2)),
                        ('dt_classifier', DecisionTreeClassifier(random_state=0))])

# Random Forest pipeline
pipeline_randomforest = Pipeline([('scaler3', StandardScaler()),
                                  ('pca3', PCA(n_components=2)),
                                  ('rf_classifier', RandomForestClassifier(random_state=0))])
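
A nice side effect of naming each step is that you can reach into a fitted pipeline afterwards. For example (my own illustration, using the 'pca1' step defined above):

pipeline_lr.fit(X_train, y_train)
# fraction of variance captured by the two PCA components
print(pipeline_lr.named_steps['pca1'].explained_variance_ratio_)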

Let’s make a list of pipelines:

pipelines = [pipeline_lr, pipeline_dt, pipeline_randomforest]
# create variables to track the best accuracy and best model
best_accuracy = 0.0
best_classifier = 0
best_pipeline = ""

Create a dictionary of pipelines and classifier types for ease of reference, then fit each pipeline:

pipe_dict = {0: 'Logistic Regression', 1: 'Decision Tree', 2: 'RandomForest'}

# Fit the pipelines
for pipe in pipelines:
    pipe.fit(X_train, y_train)

The next step after fitting our models is to look at the accuracy of each model, i.e. how well each one performed on the test data.

for i, model in enumerate(pipelines):
    print("{} Test Accuracy: {}".format(pipe_dict[i], model.score(X_test, y_test)))
# best classifier
for i, model in enumerate(pipelines):
    if model.score(X_test, y_test) > best_accuracy:
        best_accuracy = model.score(X_test, y_test)
        best_pipeline = model
        best_classifier = i

print('Classifier with best accuracy: {}'.format(pipe_dict[best_classifier]))

Performing hyperparameter tuning on pipelines using GridSearchCV

# Create a pipeline
pipe = Pipeline([("classifier", RandomForestClassifier())])

# Create a dictionary with candidate learning algorithms and their hyperparameters
grid_param = [
    {"classifier": [LogisticRegression()],
     "classifier__penalty": ['l2', 'l1'],
     "classifier__solver": ['liblinear'],  # liblinear supports both L1 and L2 penalties
     "classifier__C": np.logspace(0, 4, 10)},
    {"classifier": [LogisticRegression()],
     "classifier__penalty": ['l2'],  # newton-cg and sag only support the L2 penalty
     "classifier__C": np.logspace(0, 4, 10),
     "classifier__solver": ['newton-cg', 'saga', 'sag', 'liblinear']},
    {"classifier": [RandomForestClassifier()],
     "classifier__n_estimators": [10, 100, 1000],
     "classifier__max_depth": [5, 8, 15, 25, 30, None],
     "classifier__min_samples_leaf": [1, 2, 5, 10, 15, 100],
     "classifier__max_leaf_nodes": [2, 5, 10]}]

# create a grid search over the pipeline, then fit it to find the best model
gridsearch = GridSearchCV(pipe, grid_param, cv=5, verbose=0, n_jobs=-1)

# Fit grid search
best_model = gridsearch.fit(X_train, y_train)
print(best_model.best_estimator_)
print("The mean accuracy of the model is:", best_model.score(X_test, y_test))
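
GridSearchCV also records which hyperparameter combination won; for example (my addition):

print(best_model.best_params_)  # the winning classifier and its hyperparameters
print(best_model.best_score_)   # mean cross-validated accuracy of that combination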

Making pipelines using make_pipeline:

from sklearn.pipeline import make_pipeline

Make a pipeline:

# Create a pipeline; make_pipeline names each step after its class, lowercased
pipe = make_pipeline(RandomForestClassifier())

# Create a dictionary with candidate learning algorithms and their hyperparameters
grid_param = [
    {"randomforestclassifier": [RandomForestClassifier()],
     "randomforestclassifier__n_estimators": [10, 100, 1000],
     "randomforestclassifier__max_depth": [5, 8, 15, 25, 30, None],
     "randomforestclassifier__min_samples_leaf": [1, 2, 5, 10, 15, 100],
     "randomforestclassifier__max_leaf_nodes": [2, 5, 10]}]

# create a grid search over the pipeline, then fit it to find the best model
gridsearch = GridSearchCV(pipe, grid_param, cv=5, verbose=0, n_jobs=-1)

# Fit grid search
best_model = gridsearch.fit(X_train, y_train)
best_model.score(X_test, y_test)
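
The only difference from Pipeline is that make_pipeline names the steps for you by lowercasing each class name, which is why the parameter grid above uses the 'randomforestclassifier__' prefix. You can verify the generated names yourself (my own illustration):

print(pipe.named_steps)                  # {'randomforestclassifier': RandomForestClassifier()}
print(sorted(pipe.get_params().keys()))  # every parameter name the grid can reference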

Conclusion:

Pipelines keep our preprocessing steps together and give a summary of our models, making the machine learning workflow much easier. We can apply more than one preprocessing step before fitting a model in the pipeline if needed. The main benefit for me has been being able to come back to a project and follow the workflow I set with pipelines. This process would take hours before I learned about pipelines. I hope this article can become a helpful resource for learning the pipeline workflow.

Thank you for reading :).

Please let me know if you have any feedback.

Resources:

GitHub

Colab

sklearn documentation

PCA

Loading iris dataset sklearn
