Deploying a Trained spaCy Model As a Python Sub-Package

Last updated on October 31, 2020

So you've trained your spaCy NLP model, you've got it to the highest accuracy you can push it to, and now you want to use it in your project or product. Now, what do you do?

If we take a glance at the official spaCy documentation for packaging, it advises using spaCy's command line interface to build a deployable Python package containing the model. The resulting package can then be imported as follows:

import my_model_name

nlp = my_model_name.load()

However, that isn't good enough for us, organization-wise. Imagine having to manage a separate package for every machine learning model that you want to deploy. Just think of those import statements! Oh the horror! (shudders)

The perfect way to organize our machine learning models would be to have an overarching package that we can then import like so:

from my_models import my_nlp_model, my_log_reg_model, my_mv_reg_model

So ideally, we would want to have the spaCy model as a sub-package, instead of having it as a whole other package that we have to install as a separate dependency.

Investigating spaCy's Source Code

Thankfully, spaCy's GitHub repo is quite sanely organized, with the CLI-related scripts mapping one-to-one to the CLI commands. What we are interested in is finding out what exactly the package command does to our trained and saved spaCy model. The source code is available here for you to check out yourself.

The notable thing we can see is that the script basically copies the key model files and creates the following key files:

  • a meta.json containing the model metadata
  • a setup.py file for the final python packaging step
  • a MANIFEST.in file
  • a __init__.py to signify that the model is a package.

We can also see that template strings are used to create the above files, and that they are placed at the bottom of the package script.

As we want to have the model as a sub-package, we are mostly interested in the __init__.py file, because we need to find out how a model packaged with the package CLI command gets loaded when it is imported.

We can see the following under the TEMPLATE_INIT variable:

# coding: utf8
from __future__ import unicode_literals
from pathlib import Path
from spacy.util import load_model_from_init_py, get_model_meta
__version__ = get_model_meta(Path(__file__).parent)['version']
def load(**overrides):
    return load_model_from_init_py(__file__, **overrides)

Hot on the trail, we need to find out what this load_model_from_init_py function does in the util.py module.

Referencing the function definition, we can see the following:

def load_model_from_init_py(init_file, **overrides):
    """Helper function to use in the `load()` method of a model package's
    __init__.py.
    init_file (unicode): Path to model's __init__.py, i.e. `__file__`.
    **overrides: Specific overrides, like pipeline components to disable.
    RETURNS (Language): `Language` class with loaded model.
    """
    model_path = Path(init_file).parent
    meta = get_model_meta(model_path)
    data_dir = "%s_%s-%s" % (meta["lang"], meta["name"], meta["version"])
    data_path = model_path / data_dir
    if not model_path.exists():
        raise IOError(Errors.E052.format(path=path2str(data_path)))
    return load_model_from_path(data_path, meta, **overrides)

Basically, the function takes the directory containing __init__.py, reads the model metadata from it, builds the name of the nested data directory from that metadata, and then hands off the model loading to the load_model_from_path function.

Hence, when we load our spaCy model in our own package, we will need to do the same, sans the templated directory name, as we will not be using the spaCy CLI to do the file copying.
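To make the difference concrete, here is a minimal sketch of the two path calculations. The helper names packaged_data_path and subpackage_data_path, the metadata values and the example path are all made up for illustration; only the "%s_%s-%s" template comes from the spaCy function quoted above.

```python
from pathlib import Path


def packaged_data_path(init_file, meta):
    """Mirror how load_model_from_init_py derives the nested data directory."""
    model_path = Path(init_file).parent
    data_dir = "%s_%s-%s" % (meta["lang"], meta["name"], meta["version"])
    return model_path / data_dir


def subpackage_data_path(init_file):
    """Our variant: the model data sits right next to __init__.py."""
    return Path(init_file).parent


# Hypothetical metadata and install location, for illustration only
meta = {"lang": "en", "name": "my_model", "version": "1.0.0"}
init_file = "site-packages/en_my_model/__init__.py"

print(packaged_data_path(init_file, meta).as_posix())
# site-packages/en_my_model/en_my_model-1.0.0
print(subpackage_data_path(init_file).as_posix())
# site-packages/en_my_model
```

In other words, the CLI-built package expects one extra directory level named after the metadata, while our sub-package will point straight at the directory holding __init__.py.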

Down to Business: Creating Our spaCy Sub-Package

Place your trained model data directory into a sub-directory under your main source folder.

For example, if our eventual package is to be called models, then organize your files as follows:

models
|- my_spacy_model
    |- meta.json
    |- __init__.py
    |- ...
|- __init__.py

Notice that there are two __init__.py files, one in the main package directory and another in the spaCy model directory. These signal to Python that you want both directories to be treated as packages, and Python will execute the corresponding __init__.py file whenever you import one of them.

In the spaCy model's __init__.py file, we're going to add in this snippet of code:
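If you want to see that behaviour for yourself, the sketch below builds a throwaway package tree in a temporary directory and imports it. The package name demo_models and the LOADED and MODEL_DIR variables are invented for this demo and are not part of spaCy or of the layout above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sub = root / "demo_models" / "my_spacy_model"
    sub.mkdir(parents=True)

    # Outer package: its __init__.py runs first on import
    (root / "demo_models" / "__init__.py").write_text("LOADED = 'outer'\n")
    # Inner package: records where it lives, like our load() will do
    (sub / "__init__.py").write_text(
        "from pathlib import Path\n"
        "MODEL_DIR = Path(__file__).parent\n"
    )

    sys.path.insert(0, str(root))
    importlib.invalidate_caches()
    try:
        sub_pkg = importlib.import_module("demo_models.my_spacy_model")
        outer = sys.modules["demo_models"].LOADED
        model_dir_name = sub_pkg.MODEL_DIR.name
    finally:
        sys.path.remove(str(root))

print(outer, model_dir_name)
# outer my_spacy_model
```

Both __init__.py files were executed by a single import, and the inner one could work out its own directory from __file__, which is exactly the trick we rely on next.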

from __future__ import unicode_literals
from pathlib import Path
# Errors and path2str are needed for the error message below
from spacy.compat import path2str
from spacy.errors import Errors
from spacy.util import load_model_from_path, get_model_meta
# __version__ = get_model_meta(Path(__file__).parent)['version']
def load(**overrides):
    # The model data lives in the same directory as this __init__.py
    model_path = Path(__file__).parent
    meta = get_model_meta(model_path)
    if not model_path.exists():
        raise IOError(Errors.E052.format(path=path2str(model_path)))
    return load_model_from_path(model_path, meta, **overrides)

Look familiar? Hell yeah, of course it does: it is adapted from the “official” spaCy package script. What we are basically doing is using the location of the current __init__.py file to get our model path, then retrieving the model metadata and passing off the model loading to the same load_model_from_path function that spaCy uses internally. This happens each time we call my_spacy_model.load(), and the overrides kwargs that we can pass to it are exactly the same as if we were loading the model with spacy.load(MODEL_PATH).
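Since everything hinges on meta.json sitting next to __init__.py, a quick sanity check of the model directory can save some head-scratching. The following is only a sketch: check_model_dir and REQUIRED_META_KEYS are hypothetical names, and the three keys are simply the ones read by the spaCy code quoted earlier, not a full validation of the metadata.

```python
import json
import tempfile
from pathlib import Path

# Assumption: the keys used by the spaCy loader code quoted earlier
REQUIRED_META_KEYS = ("lang", "name", "version")


def check_model_dir(model_path):
    """Return a list of problems found in a model directory (empty if none)."""
    model_path = Path(model_path)
    meta_file = model_path / "meta.json"
    if not meta_file.is_file():
        return ["missing meta.json in %s" % model_path.name]
    meta = json.loads(meta_file.read_text(encoding="utf8"))
    return ["meta.json lacks '%s'" % key for key in REQUIRED_META_KEYS if key not in meta]


# Demo on a throwaway directory rather than a real model
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "my_spacy_model"
    model_dir.mkdir()
    before = check_model_dir(model_dir)
    (model_dir / "meta.json").write_text(
        json.dumps({"lang": "en", "name": "demo"}), encoding="utf8"
    )
    after = check_model_dir(model_dir)

print(before)  # ['missing meta.json in my_spacy_model']
print(after)   # ["meta.json lacks 'version'"]
```

Pointing such a check at Path(__file__).parent from a test is one cheap way to catch a model directory that was copied over incompletely.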

We can then use the model as follows:

from models import my_spacy_model

nlp = my_spacy_model.load()

Writing Tests For Our Package (Optional)

Of course, what kind of software developer would we be if we didn't test our desired outcome? Yes, that's right, we would be terrible software developers. Let's listen to the great TDD goat and write some simple tests to check that our model gets packaged correctly.

Add in the following script to test_package.py:

import unittest
from models import my_spacy_model

class TestPackage(unittest.TestCase):
    def test_packages_nlp_model_correctly(self):
        # load nlp model
        nlp = my_spacy_model.load()
        text = "this is some testing text"
        doc = nlp(text)
        self.assertIsNotNone(doc)

It's a very simple test that loads the nlp model and runs it on a test text string.

Of course, we also need to build and install our package, so let's get down to that. We'll use invoke as our task runner, and use it to do the package installation before our tests run.

In your tasks.py, add in the following task:

from invoke import task

@task
def test(c, rebuild=False):
    if rebuild:
        c.run("poetry build")
        c.run("pip3 install ./dist/*.whl")

    # run unit and integration tests
    c.run("python3 -m unittest discover test")

Note that we are using the great poetry dependency manager tool to simplify our build process. Highly recommended.

If you want to avoid using poetry for any reason (really just use it, it is miles better than pipenv), you can replace the poetry build step with the following:

c.run("python3 setup.py bdist_wheel")
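Whichever build route you take, it is worth confirming that the non-Python model files actually ended up in the wheel, since a wheel is just a zip archive. Here is a minimal sketch of such a check; missing_from_wheel is a hypothetical helper, and the wheel it inspects is a fake one built on the spot, so swap in the real file from ./dist and your own list of model files.

```python
import tempfile
import zipfile
from pathlib import Path


def missing_from_wheel(wheel_path, required):
    """Return the required archive members that are absent from the wheel."""
    with zipfile.ZipFile(wheel_path) as wheel:
        names = set(wheel.namelist())
    return [name for name in required if name not in names]


# Demo: a fake wheel that "forgot" to ship meta.json
with tempfile.TemporaryDirectory() as tmp:
    fake_wheel = Path(tmp) / "models-0.1.0-py3-none-any.whl"
    with zipfile.ZipFile(fake_wheel, "w") as wheel:
        wheel.writestr("models/__init__.py", "")
        wheel.writestr("models/my_spacy_model/__init__.py", "")
    missing = missing_from_wheel(
        fake_wheel,
        [
            "models/my_spacy_model/__init__.py",
            "models/my_spacy_model/meta.json",
        ],
    )

print(missing)  # ['models/my_spacy_model/meta.json']
```

If the real wheel turns out to be missing model files on the setup.py route, the usual suspect is that they were never declared as package data.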

And that's a wrap! Hope this post helps!