About Vantage 101

Welcome to Vantage 101, the machine learning engineering course that bridges the gap between university and the business world. This course is designed to help you quickly get a grasp of all the best practices you wish you had known at the beginning of your career.

Check out the GitHub repo here

Questions or feedback? Let us know!

Chapter 7: Packaging code

Now that you’ve cleaned up your code, organized everything into classes and written tests for your code, it’s time to show off your work to your friends, colleagues and perhaps the entire world. As you’ve surely realized by now, building on code that others have written and made available for the world to use, like Pandas and NumPy, is an essential code development practice. Just imagine the hassle it would be to complete a project without using any import statements. In this chapter we’ll dive into Python packages: what they are, and how you can use them to become a more efficient developer.

7.1 What is a Python package

To understand how packages are built up, it is useful to make a distinction between a few terms.

  • Script: A script is a Python file intended to be executed directly. That is, any file with a .py extension that actually does something when you run it.
  • Module: A module is a Python file intended to be imported into scripts or other modules. Running a module as a standalone file typically has no visible effect, as it only defines things and does not do anything by itself. It generally contains functions, classes, and variables that can be imported and used in other Python scripts or modules.
  • Package: A package is a collection of modules, generally bundled together to provide certain functionality.
  • Library: A library is essentially the same as a package. However, a library generally refers to a large published Python package such as NumPy or Pandas.

For a more in-depth description of Python packages, please continue reading here.

7.2 Python import statements

Modules, packages and libraries are all reusable blocks that can be imported into any other Python file. Let’s say you have defined some classes and functions and put them all into a Python file called my_module.py. Now you’d like to use these classes and functions in my_script.py. In Python, modules and packages are always imported through import statements.

An import statement seems very simple, but quite a lot happens “under the hood” when you execute it. Although you don't need to know this part as long as everything works, it can be very useful to understand what is happening when you end up running into import errors, which unfortunately are part of every Machine Learning Engineer’s life.

When you execute an import statement, e.g. import my_module, your Python interpreter starts looking for a file called my_module.py in a list of directories. This list of directories is composed of three parts.

  1. The folder in which the script with the import statement is located
  2. All the locations defined in your PYTHONPATH environment variable
  3. A "default" list of locations defined when installing Python
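The combined list of locations can be inspected at runtime through sys.path. The snippet below is a minimal sketch of how you might look at it when debugging an import error; the helper name pythonpath_entries is made up for this illustration.

```python
import os
import sys


def pythonpath_entries():
    """Return the directories listed in PYTHONPATH (empty list if unset)."""
    raw_value = os.environ.get("PYTHONPATH", "")
    return [entry for entry in raw_value.split(os.pathsep) if entry]


# sys.path holds the search locations in the order the interpreter tries them.
for position, location in enumerate(sys.path):
    print(position, repr(location))

print("From PYTHONPATH:", pythonpath_entries())
```

When you run a script, the first entry is normally the folder containing that script. In an interactive session it is usually an empty string, which stands for the current working directory.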

The import statement comes in several forms:

  • import <module_name>
  • from <module_name> import <name(s)>
  • from <module_name> import <name> as <alt_name>
  • import <module_name> as <alt_name>
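As an illustration, here are the four forms applied to the standard library's math module. The alias names circle_constant and m are arbitrary choices for this example.

```python
# 1. Import the whole module; names are accessed through the module.
import math

# 2. Import specific names directly into the current namespace.
from math import sqrt

# 3. Import a specific name under an alternative name.
from math import pi as circle_constant

# 4. Import the whole module under an alternative name.
import math as m

print(math.floor(2.7))          # access through the module name
print(sqrt(16))                 # access without a prefix
print(circle_constant)          # same value as math.pi
print(m.ceil(2.1))              # access through the alias
```

All four forms load the same module object; they only differ in which names end up in your own namespace.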

Although it goes beyond the basics of Python packaging, properly understanding what an import statement does can save you a lot of time when you run into an error. A good in-depth explanation of Python import statements can be found here and of course in the documentation.

As a fun sidenote, I'd like to point out two easter eggs in the core Python code. Try out the following two import statements in your Python terminal:

import this
import antigravity

7.3 How to use modules and packages in your project

Now that we have defined what a Python package is, let’s take a look at how you can make your life easier by reusing your own code.

There are three approaches to reusing your code:

  1. Reusing a single module
  2. Reusing multiple modules in a package
  3. Reusing multiple modules by installing a package (recommended)

The first step to reusing your code could be to move the functions that you want to use in multiple scripts into their own Python file, thereby creating a module. You can then import this module into your separate scripts and easily reuse the functions. A common approach is to create a src.py or utils.py file which contains all the functions you’d like to use multiple times. This works fine as a quick and dirty approach, but is not recommended as a long-term solution.

Awesome_data_science_project
├── notebook(1).ipynb
├── notebook(2).ipynb
└── utils.py
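The single-module approach can be sketched as follows. To keep the example self-contained it writes the module into a temporary folder and adds that folder to sys.path; in a real project you would simply create the file next to your notebooks. The module name demo_utils and the function clean_column_name are hypothetical and only serve as an illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical contents of a small reusable module.
MODULE_SOURCE = '''
def clean_column_name(name):
    """Lower-case a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")
'''

# Create the module file in a throwaway project folder.
project_dir = Path(tempfile.mkdtemp())
(project_dir / "demo_utils.py").write_text(MODULE_SOURCE)

# Make the folder visible to the import system, then import the module.
sys.path.insert(0, str(project_dir))
importlib.invalidate_caches()
demo_utils = importlib.import_module("demo_utils")

print(demo_utils.clean_column_name("  Total Sales "))  # prints total_sales
```

Any script or notebook in the same folder as the module could use a plain import demo_utils instead of the sys.path manipulation shown here.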

The next step is to turn this module, or multiple modules, into a package and give it a place of its own. This will keep your code tidy and will allow you to easily copy the package to a different project if you want to use the package there as well. Notice the sudden appearance of the __init__.py file. This is a special file in Python: it tells your Python interpreter that this folder is a package containing one or more modules. The __init__.py file can be completely empty; it just needs to be in the folder, so make sure you add it.

Awesome_data_science_project
├── notebook(1).ipynb
├── notebook(2).ipynb
└── src
    ├── __init__.py
    ├── utils(a).py
    └── utils(b).py
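A minimal sketch of such a package, again built in a temporary folder so the example runs on its own. The package name demo_pkg, its modules text_utils and math_utils, and the functions inside them are all made up for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway project folder containing one package.
project_dir = Path(tempfile.mkdtemp())
package_dir = project_dir / "demo_pkg"
package_dir.mkdir()

# An empty __init__.py is enough to mark the folder as a package.
(package_dir / "__init__.py").write_text("")
(package_dir / "text_utils.py").write_text(
    "def shout(text):\n    return text.upper() + '!'\n"
)
(package_dir / "math_utils.py").write_text(
    "def mean(values):\n    return sum(values) / len(values)\n"
)

# The folder that CONTAINS the package must be on the search path.
sys.path.insert(0, str(project_dir))
importlib.invalidate_caches()

from demo_pkg import math_utils, text_utils

print(text_utils.shout("hello"))   # prints HELLO!
print(math_utils.mean([1, 2, 3]))  # prints 2.0
```

Note that the import uses the package name as a prefix, so two modules in different packages can share the same file name without clashing.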

This structure will work great as long as you keep your package in the same directory as the notebook you're trying to import it into, because when you run the import statement the Python interpreter will look for your package in this folder. However, as projects grow this will soon lead to a messy codebase, and that's not something we want. Therefore, we'd like to have a package that we can place anywhere in the codebase, so that we can organise it in a way that makes sense to us. This can be done by installing the package.

Using setuptools

A common approach to installing packages is using setuptools. There are two new files that we need to introduce to our codebase: pyproject.toml and setup.cfg.

Awesome_data_science_project
├── notebooks
│   ├── notebook(1).ipynb
│   └── notebook(2).ipynb
├── src
│   ├── __init__.py
│   ├── utils(a).py
│   └── utils(b).py
├── setup.cfg
└── pyproject.toml

Minimal configurations for these files are as follows:

pyproject.toml:

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

The pyproject.toml file declares the build system dependencies, and which library will be used to actually do the packaging.

setup.cfg:

[metadata]
name = mypackage
version = 0.0.1

[options]
install_requires =
    requests
    importlib-metadata; python_version < "3.8"

The setup.cfg file declares all other information about the package, such as the version, contents, dependencies, etc. Here we specify the package name and version, and the required dependencies. Read more about configuring a setup.cfg file here.
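Because setup.cfg is an INI-style file, its structure can be explored with the standard library's configparser. The snippet below is only a sketch to show how the sections and the multi-line install_requires value are laid out; it is not how you install the package, and setuptools performs considerably more processing than this.

```python
import configparser

# The same minimal setup.cfg as shown above, embedded as a string.
SETUP_CFG = """\
[metadata]
name = mypackage
version = 0.0.1

[options]
install_requires =
    requests
    importlib-metadata; python_version < "3.8"
"""

parser = configparser.ConfigParser()
parser.read_string(SETUP_CFG)

name = parser["metadata"]["name"]
version = parser["metadata"]["version"]

# Indented lines continue the value of install_requires, one dependency each.
requirements = [
    line.strip()
    for line in parser["options"]["install_requires"].splitlines()
    if line.strip()
]

print(name, version)
for requirement in requirements:
    print("depends on:", requirement)
```

The second dependency carries an environment marker after the semicolon, so it is only installed on Python versions below 3.8.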

Note that all the information about a package that we placed in setup.cfg can also be written down in the same pyproject.toml file that contains the build system dependencies. However, as of writing this article (setuptools version 64.0.1), that feature is still in beta. You can read more about this option here.

Now that we know what these files look like and what they contain, it is of course important to know what use they are to us. Adding these files, which amongst other things contain the instructions on how the package should be built, allows us to actually install the package. This is as easy as navigating to the package directory and running pip install . (note the trailing dot). The . is the path to the package directory, so it is also possible to install the package without navigating to the directory first, as long as you pass the path to that directory instead.

While developing your package it is recommended to install it in editable mode with pip install --editable . (again with a trailing dot). This allows you to change and run your code without reinstalling your package every time. Read more about how this works here.

After you have installed your own package in your own virtual environment, you can import your modules in all your scripts regardless of each script's location. This allows you to keep your codebase clean and organized.
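One way to check from Python whether the installation succeeded is to ask importlib.metadata (standard library, Python 3.8 and newer) for the installed version. This is a small sketch that assumes the distribution name mypackage from the setup.cfg example; the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("mypackage")
if version is None:
    print("mypackage is not installed in this environment")
else:
    print("mypackage version:", version)
```

The lookup uses the name declared under [metadata] in setup.cfg, which is not necessarily the same as the name you use in your import statements.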

Legacy mode with setup.py

Before the pyproject.toml and setup.cfg files became the standard for setuptools (and Python in general), the practice was to use a setup.py file. This approach is no longer recommended, but it is good to understand how it works in case you encounter a setup.py file in the wild. You can read more about the rationale behind the switch from setup.py to pyproject.toml and setup.cfg here.

Similarly to the pyproject.toml and setup.cfg files, a setup.py allows us to write instructions for the pip package manager about how to install our package. A minimal example to get things up and running is this:

from setuptools import setup, find_packages

setup(
    name='src',                # name under which pip registers the package
    version='0.1.0',
    packages=find_packages(),  # discover every folder containing an __init__.py
)

Awesome_data_science_project
├── notebooks
│   ├── notebook(1).ipynb
│   └── notebook(2).ipynb
├── src
│   ├── __init__.py
│   ├── utils(a).py
│   └── utils(b).py
└── setup.py

7.5 Assignment:

  1. Create a package named src and place all your reusable functions and classes into this package. Feel free to separate your code into separate modules as you see fit.
  2. Make a pyproject.toml and setup.cfg file and install your package.
  3. For the remainder of this course, continue adding new logic into this or other new packages in your codebase.