
What is a Machine Learning Pipeline and Hyperparameter Tuning?

What is a pipeline?

1. Almost always, we need to tie together the many different processes used to prepare data for a machine learning model

2. It is paramount that the data transformation stages represented by these processes are standardized

3. The Pipeline class of sklearn simplifies the chaining of the transformation steps and the model

4. Pipeline, along with GridSearchCV, helps search over the hyperparameter space applicable at each stage



Pipelines

1. Sequentially apply a list of transforms and a final estimator.

2. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods (see the sketch after this list).

3. The final estimator only needs to implement fit

4. Helps standardize the modeling project by enforcing consistency across building, testing, and production.
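
To make the ‘transform’ contract concrete, here is a minimal sketch of a custom step that implements fit and transform. The class name and the mean-centering behavior are illustrative choices, not from the original post:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    # Illustrative transformer: subtracts the per-column mean.
    def fit(self, X, y=None):
        # Learn the statistics needed for the transformation
        self.means_ = np.asarray(X).mean(axis=0)
        return self  # fit must return self so it can be chained in a pipeline
    def transform(self, X):
        # Emit transformed data, which becomes the input to the next stage
        return np.asarray(X) - self.means_

Any such class can be used as an intermediate pipeline stage; inheriting from TransformerMixin also provides fit_transform() for free.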



Build a pipeline

1. Import the Pipeline class

  • from sklearn.pipeline import Pipeline

2. Instantiate the class into an object by listing out the transformation steps. In the following example, a scaling function is followed by the logistic regression algorithm

  • pipe = Pipeline([("scaler", MinMaxScaler()), ("lr", LogisticRegression())])


3. Call the fit() function on the pipeline object

  • pipe.fit(X_train, y_train)


4. Call the score() or predict() function on the pipeline object

  • pipe.score(X_test, y_test)


In step 2, the pipeline object is created using a list of (key, value) tuples. The key is specified as a string, e.g. “scaler”, followed by the transformer or estimator to be called.


The key is the name given to a step.
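
Putting steps 1 to 4 together, a minimal end-to-end sketch might look like this (the breast cancer dataset and the default split are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chain the scaler and the classifier into a single estimator
pipe = Pipeline([("scaler", MinMaxScaler()), ("lr", LogisticRegression())])
pipe.fit(X_train, y_train)          # fits the scaler, transforms, then fits the model
print(pipe.score(X_test, y_test))   # transforms the test data, then scores the model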

  • The pipeline object requires all the stages to have both fit() and transform() functions, except for the last stage when it is an estimator

  • The estimator does not have a transform() function because it builds the model using the data from the previous step; it does not transform the data

  • The transform function transforms the input data and emits transformed data as output which becomes the input to the next stage

  • pipeline.fit() calls the fit and transform functions on each stage in sequence. In the last stage, if it is an estimator, only the fit function is called, to create the model.

  • The model becomes a part of the pipeline automatically

  • The pipeline object itself does not need to have a predict function; at a minimum it only needs fit

  • pipeline.predict() calls the transform function of every stage on the given data

  • At the last stage, it skips the estimator's fit step because the model is already built

  • It then executes the predict() function of the model


Note:

  • A pipeline can be constructed purely for data transformation alone, which means it is not mandatory to have an estimator as the final step.
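
For example, a transformation-only pipeline might chain an imputer and a scaler (the chosen transformers are illustrative):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, np.nan], [2.0, 4.0], [3.0, 6.0]])

# Both stages are transformers; there is no estimator at the end
prep = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])
X_prepared = prep.fit_transform(X)   # fits each stage, then transforms in sequence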


make_pipeline

  • Specifying names for the different stages of a pipeline by hand can lead to ambiguities. When there are multiple stages, each stage has to be uniquely named, and we have to be consistent in the naming process: use lowercase letters only, keep each name unique, make the name reflect the purpose of the stage, and so on. Manual naming is prone to error

  • Alternatively, we can use the make_pipeline() function, which creates the pipeline and automatically names each step as the lowercase version of the class used

    ○ from sklearn.pipeline import make_pipeline

    ○ pipe = make_pipeline(MinMaxScaler(), LogisticRegression())

    ○ print("Pipeline steps:\n{}".format(pipe.steps))

  • The advantage of make_pipeline is the consistency in the naming of each stage: we can have multiple stages in a pipeline performing the same transformation, and each stage is still guaranteed a unique, meaningful name.
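
A small sketch of that guarantee, using two stages of the same class (the stage choice is illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# make_pipeline disambiguates repeated classes with numeric suffixes
pipe = make_pipeline(MinMaxScaler(), MinMaxScaler(), LogisticRegression())
print(pipe.steps)
# e.g. [('minmaxscaler-1', ...), ('minmaxscaler-2', ...), ('logisticregression', ...)]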


Hyperparameters & Tuning


1. Hyperparameters are like handles available to the modeler to control the behavior of the algorithm used for modeling

2. Hyperparameters are supplied as arguments to the model algorithms while initializing them, e.g. setting the criterion for decision tree building:

  • dt_model = DecisionTreeClassifier(criterion='entropy')

3. To get a list of hyperparameters for a given algorithm, call the function get_params(), for example, to get the support vector classifier hyperparameters:

  • from sklearn.svm import SVC

  • svc = SVC()

  • svc.get_params()

4. Hyperparameters are not learnt from the data the way other model parameters are. For example, attribute coefficients in a linear model are learnt from data, while the cost of error is supplied as a hyperparameter.


5. Fine-tuning the hyperparameters is done in a sequence of steps:

  • Select the appropriate model type (regressor or classifier, such as sklearn.svm.SVC())

  • Identify the corresponding parameter space

  • Decide the method for searching or sampling the parameter space

  • Decide the cross-validation scheme to ensure the model will generalize

  • Decide a score function to use to evaluate the model

6. Two generic approaches to searching the hyperparameter space are:

  • GridSearchCV, which exhaustively considers all parameter combinations

  • RandomizedSearchCV, which can sample a given number of candidates from a parameter space with a specified distribution


7. While tuning hyperparameters, the data should be split into three parts (training, validation, and testing) to prevent data leakage

8. The testing data should be transformed separately, using the same functions that were used to transform the rest of the data for model building and hyperparameter tuning, as sketched below
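
A minimal sketch of that discipline with a MinMaxScaler (the dataset is an illustrative choice): the transformer is fitted on the training data only, then reused without re-fitting on the held-out data.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler().fit(X_train)   # learn scaling parameters from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)    # same fitted transformation; no re-fitting on test data

A Pipeline enforces this automatically: fit() only ever sees the training data, and predict()/score() reuse the already-fitted transformers.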


Hyperparameters & Tuning (GridSearchCV / RandomizedSearchCV)

GridSearchCV


  • Is the most basic hyperparameter tuning technique.

  • It builds a model for every combination of the given hyperparameter values

  • Each such model is evaluated and ranked.

  • The combination of hyperparameter values that gives the best-performing model is chosen

  • For every combination, cross-validation is used and the average score is calculated

  • This is an exhaustive sampling of the hyperparameter space and can be quite inefficient (see the sketch below)
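
A sketch of grid search over a pipeline (the parameter grid values are illustrative); note the "step__parameter" naming convention for addressing the hyperparameters of each stage:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", MinMaxScaler()), ("svc", SVC())])

# "<step name>__<parameter name>" reaches into the corresponding pipeline stage
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}

grid = GridSearchCV(pipe, param_grid, cv=5)   # 5-fold cross-validation per combination
grid.fit(X_train, y_train)                    # transformers are re-fitted inside each fold
print(grid.best_params_, grid.score(X_test, y_test))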

RandomizedSearchCV

  • Random search differs from grid search: instead of providing a discrete set of values to explore for each hyperparameter (a parameter grid), we provide a statistical distribution

  • Values for the different hyperparameters are picked at random from these combined distributions

  • The motivation to use random search in place of grid search is that in many cases hyperparameters are not equally important; random sampling explores more distinct values of the important ones for the same budget
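
A sketch of randomized search using scipy distributions (the loguniform ranges and the n_iter budget are illustrative):

from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Continuous distributions instead of a discrete grid
param_dist = {"C": loguniform(1e-2, 1e3), "gamma": loguniform(1e-4, 1e1)}

search = RandomizedSearchCV(SVC(), param_dist, n_iter=20, cv=5, random_state=0)
search.fit(X_train, y_train)   # samples 20 candidate combinations at random
print(search.best_params_, search.score(X_test, y_test))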



If you need any help with machine learning pipeline and hyperparameter tuning related problems, send your requirement details to:

realcode4you@gmail.com
