Saving models#
import dill
import pickle
import numpy as np
from IPython.display import HTML
from sklearn.preprocessing import FunctionTransformer
pickle#
Pickle is the simplest way to save Python objects, including sklearn models and transformers.
Basic#
Here is a simple sklearn transformer that returns an array with the same number of rows as the input array, where every row contains the same fixed string.
def test(X):
    return np.array([["im from pickle"]]*X.shape[0])
transformer_obj = FunctionTransformer(test)
transformer_obj.fit_transform(np.array([[2],[3],[4]]))
array([['im from pickle'],
['im from pickle'],
['im from pickle']], dtype='<U14')
Here is how to save it using pickle. After saving, the object is deleted from Python memory.
with open("saving_models_files/pickled_transformer.pkl", "wb") as f:
    pickle.dump(transformer_obj, f)
del transformer_obj
Now let’s load the transformer from the file; everything works.
with open("saving_models_files/pickled_transformer.pkl", "rb") as f:
    loaded_transformer = pickle.load(f)
loaded_transformer.transform(np.array([[1],[2]]))
array([['im from pickle'],
['im from pickle']], dtype='<U14')
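The same save-and-load round trip works for any picklable Python object, not only sklearn transformers. Below is a minimal sketch that uses only the standard library; the `model_state` dictionary and the temporary file path are hypothetical stand-ins for a real model and a real location.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a fitted model: any picklable object works.
model_state = {"name": "demo", "weights": [0.1, 0.2, 0.3]}

# Write the object to a file in a temporary directory.
path = Path(tempfile.mkdtemp()) / "model_state.pkl"
with open(path, "wb") as f:
    pickle.dump(model_state, f)

# Read it back: the result is an equal but separate copy.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_state)
print(restored is model_state)
```

The loaded object compares equal to the original but is a new object, which is why deleting the original before loading makes no difference.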
Troubles with functions#
When using pickle to save models, there is one nuance: the functions used in your pipeline must be available wherever you deploy it. Pickle stores a function as a reference to its module and name, not as code.
Here the transformer is created and saved as in the previous section, but after saving, both the transformer and the function it uses are deleted.
def test(X):
    return np.array([["im from pickle"]]*X.shape[0])
transformer_obj = FunctionTransformer(test)
transformer_obj.fit_transform(np.array([[2],[3],[4]]))
with open("saving_models_files/pickled_transformer.pkl", "wb") as f:
    pickle.dump(transformer_obj, f)
del transformer_obj, test
Now let’s try to load the transformer from the file. We get an error saying that the requested function is not accessible.
try:
    with open(
        "saving_models_files/pickled_transformer.pkl", "rb"
    ) as f:
        loaded_transformer = pickle.load(f)
except Exception as e:
    print("Got exception:", e)
Got exception: Can't get attribute 'test' on <module '__main__'>
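The error above is consistent with how pickle treats functions: the file holds only the function's module and name, and loading looks that name up again. The sketch below illustrates this with a standard-library function; `json.dumps` is used purely as a convenient stand-in for a user-defined function.

```python
import json
import pickle

# Pickle a function that lives in an importable module.
payload = pickle.dumps(json.dumps)

# The payload contains the module and function names, not the function's code.
print(b"json" in payload, b"dumps" in payload)
print(len(payload), "bytes")

# Loading imports the module and looks the name up again,
# so the result is the very same function object.
restored = pickle.loads(payload)
print(restored is json.dumps)
```

If the name cannot be found at load time, as with the deleted `test` function above, there is nothing for pickle to fall back on.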
dill#
Basic usage#
In the "Troubles with functions" section, I mentioned that pickle doesn’t store the functions used in sklearn constructions. The dill module can help solve this problem. Let’s try the same example using dill instead of pickle.
The following cell creates a sklearn transformer with specific behaviour:
def test(X):
    return np.array([["im from dill"]]*X.shape[0])
transformer_obj = FunctionTransformer(test)
transformer_obj.fit_transform(np.array([[2],[3],[4]]))
array([['im from dill'],
['im from dill'],
['im from dill']], dtype='<U12')
Now let’s save it and immediately delete both the transformer and the function it uses.
with open("saving_models_files/dilled_transformer.pkl", "wb") as f:
    dill.dump(transformer_obj, f)
del transformer_obj, test
After loading it with dill, the transformer still keeps its behaviour:
with open("saving_models_files/dilled_transformer.pkl", "rb") as f:
    transformer_loaded = dill.load(f)
transformer_loaded.transform(np.array([[2],[3]]))
array([['im from dill'],
['im from dill']], dtype='<U12')
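The result above suggests that dill stores the function's definition itself rather than a name to look up. A minimal sketch of the same idea, assuming dill is installed as elsewhere on this page; `make_scaler` is a hypothetical helper that returns a closure with no importable name.

```python
import dill

def make_scaler(factor):
    # The returned lambda is a closure: it has no importable name,
    # so it can only be restored if its definition is stored in the payload.
    return lambda x: x * factor

# Serialize the closure to bytes and restore it.
payload = dill.dumps(make_scaler(3))
restored = dill.loads(payload)

print(restored(14))
```

The restored function still remembers `factor`, so the captured value travels with the payload as well.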
Imported modules#
Problem description#
In the previous section, the function used in the transformer was defined in the same module as the object to be stored. This works fine. But it’s much more common to store some functions in other modules, for example for testing purposes. Let’s reproduce this problem.
The following cell writes the function that our transformer will use to a separate file.
%%writefile saving_models_files/dill_function.py
import numpy as np
def test(X):
    return np.array([["im from dill"]]*X.shape[0])
Overwriting saving_models_files/dill_function.py
Here is a transformer created with the function imported from that module.
from saving_models_files.dill_function import test
transformer_obj = FunctionTransformer(test)
display(transformer_obj.fit_transform(np.array([[2],[3],[4]])))
with open("saving_models_files/dilled_other_module_transformer.pkl", "wb") as f:
    dill.dump(transformer_obj, f)
array([['im from dill'],
['im from dill'],
['im from dill']], dtype='<U12')
If you now try to load the transformer from a script that cannot import test through the same module path, you will get an error.
In the following cell, the code that tries to load the transformer is saved in a different folder, so a path like saving_models_files/dill_function.py does not exist relative to it.
%%writefile saving_models_files/dill_loader.py
import dill
import numpy as np
try:
    with open("dilled_other_module_transformer.pkl", "rb") as f:
        transformer_loaded = dill.load(f)
    print(transformer_loaded.fit_transform(np.array([[2],[3],[4]])))
except Exception as e:
    print("Got exception:", e)
Overwriting saving_models_files/dill_loader.py
Now let’s try to run this code from its folder.
%%bash
cd saving_models_files
python3 dill_loader.py
Got exception: No module named 'saving_models_files'
The result is an error.
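The same failure can be reproduced in a single process with the standard library alone. This is a sketch under assumptions: it uses plain pickle rather than dill, and the package `demo_pkg` with its function `shout` are hypothetical names created in a temporary directory. The mechanism is the one seen above: the payload refers to a module path, and loading fails when that path cannot be imported.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Create a hypothetical package with one function in a temporary directory.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "funcs.py").write_text("def shout(text):\n    return text.upper()\n")

# Make the package importable and pickle a function from it.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
from demo_pkg.funcs import shout
payload = pickle.dumps(shout)

# Simulate an environment where the package cannot be imported.
sys.path.remove(str(root))
for name in ("demo_pkg.funcs", "demo_pkg"):
    sys.modules.pop(name, None)
importlib.invalidate_caches()

try:
    pickle.loads(payload)
    error_message = None
except ModuleNotFoundError as e:
    error_message = str(e)
print("Got exception:", error_message)

# Once the package is importable again, the same payload loads.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
restored = pickle.loads(payload)
print(restored("loaded"))
```

Nothing in the payload changes between the failed and the successful load; only the import path does.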
Solution#
A very similar blog, https://oegedijk.github.io/blog/, describes the same issue; check it out. The solution below is borrowed from there.
The problem is that Python records the module in which each object was defined. The next cell shows the __module__ attribute for an imported function and a locally declared one.
from saving_models_files.dill_function import test
def test2():
    pass
print(
    "test.__module__ - ", test.__module__, "\n",
    "test2.__module__ - ", test2.__module__, sep = ""
)
test.__module__ - saving_models_files.dill_function
test2.__module__ - __main__
It looks like dill only saves the definitions of objects defined in __main__. So one solution is simply to redefine them in the current module. The mainify function in the following cell does exactly that. The cell also shows that the __module__ attribute really changes, and saves a transformer created with the redefined function.
def mainify(obj):
    if obj.__module__ != "__main__":
        import __main__
        import inspect
        s = inspect.getsource(obj)
        co = compile(s, '<string>', 'exec')
        exec(co, __main__.__dict__)
# showing that mainify really changes __module__
# attribute
from saving_models_files.dill_function import test
print(f"before mainify test.__module__=='{test.__module__}'",)
mainify(test)
print(f"after mainify test.__module__=='{test.__module__}'", end = "\n\n")
transformer_obj = FunctionTransformer(test)
with open("saving_models_files/dilled_other_module_transformer.pkl", "wb") as f:
    dill.dump(transformer_obj, f)
del transformer_obj, test
before mainify test.__module__=='saving_models_files.dill_function'
after mainify test.__module__=='__main__'
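The change in `__module__` comes from where the source is executed: a function takes its `__module__` from the `__name__` of the namespace in which its `def` statement runs. The sketch below isolates that step with the standard library; the module `demo_funcs` and the function `double` are hypothetical, and a plain dictionary stands in for `__main__.__dict__`.

```python
import importlib
import inspect
import sys
import tempfile
from pathlib import Path

# Write a hypothetical module to a temporary directory and import from it.
root = Path(tempfile.mkdtemp())
(root / "demo_funcs.py").write_text("def double(x):\n    return 2 * x\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()
from demo_funcs import double

# Re-run the function's source in a namespace whose __name__ is "__main__".
namespace = {"__name__": "__main__"}
source = inspect.getsource(double)
exec(compile(source, "<string>", "exec"), namespace)
redefined = namespace["double"]

print(double.__module__)
print(redefined.__module__)
print(redefined(21))
```

The original function is left untouched; a second function with the same body is created. In the notebook, executing the source in `__main__.__dict__` also rebinds the name `test`, which is why the second print in the cell above reports `__main__`.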
Finally, let’s execute the same code that we tried to execute before.
%%bash
cd saving_models_files
python3 dill_loader.py
[['im from dill']
['im from dill']
['im from dill']]
Now the code runs fine.