What Are Pipelines
Pipelines are a simple way to keep your data processing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step. Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:
- Cleaner Code: Accounting for data at each step of processing can get messy. With a pipeline, you don't need to manually keep track of your training (and validation) data at each step.
- Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
- Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
- More Options For Model Testing: You will see an example in the next tutorial, which covers cross-validation.
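To make the "Fewer Bugs" point concrete, here is a minimal sketch using scikit-learn's make_pipeline (the function used in the example further down). The toy arrays and the small forest size are made up for illustration. It shows that the imputer inside a fitted pipeline learns its fill values from the training data only, and that predict reuses those values without any extra bookkeeping.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data with missing values
train_X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 30.0]])
train_y = np.array([1.0, 2.0, 3.0, 4.0])
test_X = np.array([[np.nan, 100.0], [2.0, np.nan]])

pipeline = make_pipeline(
    SimpleImputer(),  # default strategy fills gaps with the column mean
    RandomForestRegressor(n_estimators=10, random_state=0),
)
pipeline.fit(train_X, train_y)

# make_pipeline names each step after its lowercased class name
learned_means = pipeline.named_steps["simpleimputer"].statistics_

# predict() fills test_X with the training means; nothing is refit on test data
predictions = pipeline.predict(test_X)
```

Here learned_means holds the per-column means of train_X, so there is no separate transform call on the test data to forget or to accidentally replace with fit_transform.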
Example
We won't focus on the data loading. For now, you can imagine you are at a point where you already have train_X, test_X, train_y and test_y.
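If you want something concrete to run the cells below against, the following is one possible way to build those four variables. Everything here is an assumption made for illustration: the data is synthetic, and the 10% missing-value rate and random seeds are arbitrary. It is not the dataset the tutorial was written for.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 rows, 4 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the feature values so there is something to impute
X[rng.random(X.shape) < 0.1] = np.nan

# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)
```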
You have a modeling process that uses a SimpleImputer to fill in missing values, followed by a RandomForestRegressor to make predictions. (SimpleImputer lives in sklearn.impute; it replaced the older sklearn.preprocessing.Imputer class, which was removed in scikit-learn 0.22.) These can be bundled together with the make_pipeline function as shown below.
In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
# Bundle imputation and the model into a single estimator
my_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())
You can now fit and predict using this pipeline as a fused whole.
In [3]:
# fit() imputes train_X and then trains the forest; predict() applies the same imputation to test_X
my_pipeline.fit(train_X, train_y)
predictions = my_pipeline.predict(test_X)
For comparison, here is the code to do the same thing without pipelines:
In [4]:
my_imputer = SimpleImputer()
my_model = RandomForestRegressor()
# Learn the fill values from the training data, then reuse them on the test data
imputed_train_X = my_imputer.fit_transform(train_X)
imputed_test_X = my_imputer.transform(test_X)
my_model.fit(imputed_train_X, train_y)
predictions = my_model.predict(imputed_test_X)
This particular pipeline was only a small improvement in code elegance. But pipelines become increasingly valuable as your data processing becomes increasingly sophisticated.
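As a rough illustration of what "more sophisticated" can look like, here is a sketch that handles numeric and categorical columns differently inside one pipeline. The column names, the toy housing-style data, and the choice of median imputation and one-hot encoding are all assumptions made for this example; ColumnTransformer and OneHotEncoder are standard scikit-learn classes, but nothing in the tutorial above says they are the intended next step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: two numeric columns with gaps, one categorical column
train_X = pd.DataFrame({
    "rooms": [3.0, np.nan, 4.0, 2.0, 5.0, 3.0],
    "area": [70.0, 55.0, np.nan, 40.0, 120.0, 80.0],
    "district": ["north", "south", "east", "south", "north", "east"],
})
train_y = np.array([200.0, 150.0, 260.0, 110.0, 400.0, 230.0])

numeric_cols = ["rooms", "area"]
categorical_cols = ["district"]

# Route each group of columns to its own preprocessing step
preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # handle_unknown="ignore" encodes unseen categories as all zeros
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

full_pipeline = make_pipeline(
    preprocessor,
    RandomForestRegressor(n_estimators=50, random_state=0),
)
full_pipeline.fit(train_X, train_y)

# New rows go through exactly the same preprocessing before prediction
new_homes = pd.DataFrame({"rooms": [np.nan], "area": [65.0], "district": ["west"]})
predictions = full_pipeline.predict(new_homes)
```

Written without a pipeline, this would need separate fit_transform and transform calls for each column group on both the training and the new data, plus code to stitch the results back together. With the pipeline it is still a single fit and a single predict.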