So you've trained your spaCy NLP model, you've got it to the highest accuracy you can push it to, and now you want to use it in your project or product. Now, what do you do?
If we take a glance at the official spaCy documentation for packaging, it advises using the spaCy command line interface to build a deployable Python package containing the model. This results in a package that can then be imported as follows:
```python
import my_model_name

nlp = my_model_name.load()
```
However, that isn't good enough for us organization-wise. Imagine having to manage a separate package for each machine learning model that you want to deploy. Just think of those import statements! Oh the horror! *shudders*
The perfect way to organize our machine learning models would be to have an overarching package that we can then import like so:
```python
from my_models import my_nlp_model, my_log_reg_model, my_mv_reg_model
```
So ideally, we would want to have the spaCy model as a sub-package, instead of having it as a whole other package that we have to install as a separate dependency.
Investigating spaCy's Source Code
Most thankfully, spaCy's GitHub repo is quite sanely organized, with their CLI-related scripts mapping one-to-one with each CLI command. What we are interested in is finding out what exactly the package command does to our trained and saved spaCy model. The source code is available here for you to check out yourself.
Notably, we can see that the script basically copies the key model files, and creates the following key files:
- meta.json containing the model metadata
- setup.py for the final Python packaging step
- __init__.py to signify that the model is a package
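For reference, a model's meta.json looks roughly like the following — all field values here are illustrative, not from a real model:

```json
{
  "lang": "en",
  "name": "my_spacy_model",
  "version": "0.0.1",
  "description": "An example NER model",
  "pipeline": ["tagger", "parser", "ner"]
}
```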
We can also see that there are template strings used to create the above files, placed at the bottom of the script. As we want to have the model as a sub-package instead, we are mostly interested in the __init__.py file, as we need to find out how the model is loaded when the package CLI command is used.
We can see the following in the templated __init__.py:

```python
# coding: utf8
from __future__ import unicode_literals

from pathlib import Path
from spacy.util import load_model_from_init_py, get_model_meta

__version__ = get_model_meta(Path(__file__).parent)['version']


def load(**overrides):
    return load_model_from_init_py(__file__, **overrides)
```
Hot on the trail, we need to find out what this load_model_from_init_py function does in the spacy.util module.
Referencing the function definition, we can see the following:
```python
def load_model_from_init_py(init_file, **overrides):
    """Helper function to use in the `load()` method of a model package's
    __init__.py.

    init_file (unicode): Path to model's __init__.py, i.e. `__file__`.
    **overrides: Specific overrides, like pipeline components to disable.
    RETURNS (Language): `Language` class with loaded model.
    """
    model_path = Path(init_file).parent
    meta = get_model_meta(model_path)
    data_dir = "%s_%s-%s" % (meta["lang"], meta["name"], meta["version"])
    data_path = model_path / data_dir
    if not model_path.exists():
        raise IOError(Errors.E052.format(path=path2str(data_path)))
    return load_model_from_path(data_path, meta, **overrides)
```
Basically, what the function does is resolve the model's data directory by path, before handing off the model loading to the load_model_from_path function.
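As a quick sanity check, the templated data directory name is just the metadata fields joined into a string, which we can reproduce ourselves (the metadata values below are made up for illustration):

```python
# Reconstruct the data directory name that the spaCy `package`
# command composes from the model metadata.
# These meta values are illustrative, not from a real model.
meta = {"lang": "en", "name": "my_spacy_model", "version": "0.0.1"}

data_dir = "%s_%s-%s" % (meta["lang"], meta["name"], meta["version"])
print(data_dir)  # en_my_spacy_model-0.0.1
```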
Hence, when we load our spaCy model in our own package, we will need to do the same, sans the templated directory name, as we will not be using the spaCy CLI to do the file copying.
Down to Business: Creating Our SpaCy Sub-Package
Place your trained model data directory into a sub-directory under your main source folder.
For example, if our eventual package is to be called
models, then organize your files as follows:
```
models
|- my_spacy_model
|  |- meta.json
|  |- __init__.py
|  |- ...
|- __init__.py
```
Notice that there are two __init__.py files: one in the main package directory, and another in the SpaCy model directory. This basically signals to Python that you want these directories to be package directories, and Python will execute the __init__.py file whenever you import these packages.
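To see this mechanism in isolation, here's a tiny self-contained demo — the package name demo_pkg is made up for this example:

```python
import sys
import tempfile
from pathlib import Path

# Create a throwaway directory containing a package skeleton.
tmp = Path(tempfile.mkdtemp())
pkg = tmp / "demo_pkg"
pkg.mkdir()

# An __init__.py is all Python needs to treat the directory as a package;
# its contents run on import.
(pkg / "__init__.py").write_text("loaded = True\n")

sys.path.insert(0, str(tmp))
import demo_pkg  # executes demo_pkg/__init__.py

print(demo_pkg.loaded)  # True
```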
In the SpaCy model's __init__.py file, we're going to add in this snippet of code:

```python
from __future__ import unicode_literals

from pathlib import Path

from spacy.compat import path2str
from spacy.errors import Errors
from spacy.util import load_model_from_path, get_model_meta

# __version__ = get_model_meta(Path(__file__).parent)['version']


def load(**overrides):
    model_path = Path(__file__).parent
    meta = get_model_meta(model_path)
    if not model_path.exists():
        raise IOError(Errors.E052.format(path=path2str(model_path)))
    return load_model_from_path(model_path, meta, **overrides)
```
Looks familiar? Hell yeah, of course it does — it is adapted from the "official" SpaCy package script. What we are basically doing is referencing the current __init__.py file, using it to get our model path, retrieving the model metadata, and passing off the model loading to the same load_model_from_path function that SpaCy uses internally. This will happen each time we run my_spacy_model.load(), and the overrides kwargs that we can pass to the model will be exactly the same as if we were loading up the model with spacy.load().
We can then use the model as follows:
```python
from models import my_spacy_model

nlp = my_spacy_model.load()
```
Writing Tests For Our Package (Optional)
Of course, what kind of software developer would we be if we didn't test our desired outcome? Yes, that's right, we would be terrible software developers. Let's listen to the great TDD goat and write some simple tests to check that our model gets packaged correctly.
Add in the following script to a file in your test directory (e.g. test/test_package.py):

```python
import unittest

from models import my_spacy_model


class TestPackage(unittest.TestCase):
    def test_packages_nlp_model_correctly(self):
        # load nlp model
        nlp = my_spacy_model.load()
        text = "this is some testing text"
        doc = nlp(text)
        self.assertIsNotNone(doc)
```
It's a very simple test that loads the nlp model and runs it on a test text string.
Of course, we also need to build and install our package, so let's get down to that. We'll use invoke as our task runner, and use it to do the package installation before our tests run.
In tasks.py, add in the following task:

```python
from invoke import task


@task
def test(c, rebuild=False):
    if rebuild:
        c.run("poetry build")
        c.run("pip3 install ./dist/*.whl")
    # run unit and integration tests
    c.run("python3 -m unittest discover test")
```
Note that we are using the great poetry dependency manager tool to simplify our build process. Highly recommended.
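One caveat worth flagging: the serialized model files are not .py files, so the build tool has to be told to include them in the wheel. With poetry, that means an include entry in pyproject.toml — the package name and glob pattern below are assumptions based on the layout from earlier, so adjust them to your project:

```toml
[tool.poetry]
name = "models"
version = "0.1.0"
description = "Overarching package for our ML models"
authors = ["Your Name <you@example.com>"]
# make sure the non-Python model data files end up in the built wheel
include = ["models/my_spacy_model/**/*"]
```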
If you want to avoid using poetry for any reason (really, just use it, it is miles better than pipenv), you can replace the poetry build step with the following:

```python
c.run("python3 setup.py bdist_wheel")
```
And that's a wrap! Hope this post helps!