How to Structure Python Packages and Why

In August 2019 Doug reviewed then-current opinions and recommended practices for Python packaging and arrived at a collection of recommendations for the directory and file layout of Python packages developed by the UBC EOAS MOAD group.

Doug periodically reviews the Python packaging landscape and updates this section to keep pace with changes in the packaging tools, opinions, and recommended practices.

The most recent review in the November 2022 to January 2023 time frame resulted in significant changes to our package layout. The NEMO-Cmd package was the first package that was converted from the previously used package layout to what is recommended here, so its files are used as examples below.

Note

There are 3 key things that people who are familiar with Python packaging need to understand about how the UBC EOAS MOAD group uses its internally developed packages:

  1. We rely heavily on “Editable” Installs. That is, we install our packages from Git clones of the repositories using the python3 -m pip install --editable command. That makes the workflow for getting updates into our installed packages a simple git pull in the package repository clone directory.

  2. On our local workstations and laptops we work in conda environments, either the base environment created by Installing Miniforge, or project-specific environments.

  3. On HPC clusters we install our packages using the “user scheme” for installation in combination with “Editable” Installs, that is, python3 -m pip install --user -e. That enables job scripts running on compute nodes to find our package command-line interfaces without having to include the step of activating the environments in which the packages are installed.

Package Layout

Using the NEMO-Cmd package as an example, the directories and files layout of a MOAD package looks like:

NEMO-Cmd/
├── docs/
│   ├── conf.py
│   ├── index.rst
│   ├── Makefile
│   ├── ...
│   ├── _static/
│   │   └── ...
├── envs/
│   ├── environment-dev.yaml
│   ├── environment-rtd.yaml
│   ├── environment-test.yaml
│   └── requirements.txt
├── .github/
│   ├── ...
├── .gitignore
├── LICENSE
├── nemo_cmd/
│   ├── __about__.py
│   ├── __init__.py
│   ├── ...
├── .pre-commit-config.yaml
├── pyproject.toml
├── README.rst
├── .readthedocs.yaml
└── tests/
    ├── ...

In summary (please see the sections below for detailed explanations):

The top-level directory typically contains 6 files and 5 sub-directories.

The 6 files are described in the Package Files section below.

The 5 sub-directories in all packages are:

  • the package code sub-directory (nemo_cmd/ in this example)

  • docs/

  • envs/

  • tests/

  • .github/

The __about__.py file in the Package Code Sub-directory provides the package version identifier string as a variable named __version__.

Top-Level Directory

The name of the top-level directory is the “project name”. It does not have to be the same as the “package name” that you use in import statements. In this example the “project name” is NEMO-Cmd, and the “package name” is nemo_cmd. Other examples of MOAD project and package names are:

  • the moad_tools project's package is named moad_tools

  • the Reshapr project's package is named reshapr

  • the SalishSeaTools project's package is named salishsea_tools

  • the SalishSeaNowcast project's package is named nowcast

The top-level directory “project name” is generally the name of the project’s Git repository.

Package Files

The sub-sections below describe the 6 files that are typically present in the top-level directory of our packages. Three of those files must be present because they contain the information necessary to create a Python package:

  • pyproject.toml File that contains the build system requirements and build backend tools to use for creation of the package, the package metadata, the command-line interface scripts and entry points configuration (if applicable), and configuration for tools used for code QA and package management (e.g. coverage and hatch)

  • README.rst File that provides the long description of the package

  • LICENSE File that contains the legal text of the Apache License, Version 2.0 license for the package

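The other three files are optional, but are present in most packages:

  • .readthedocs.yaml File that configures the documentation build on readthedocs

  • .gitignore File that lists the intentionally untracked files that Git should ignore

  • .pre-commit-config.yaml File that configures the pre-commit tool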

pyproject.toml File

The pyproject.toml file contains:

  • the build system requirements and build backend tools to use for creation of the package

  • the package metadata

  • the command-line interface scripts and entry points configuration (if applicable)

  • configuration for tools used for code QA and package management (e.g. coverage and hatch)

It is documented at https://hatch.pypa.io/latest/config/metadata/.

We use hatchling as our build backend, so the build-system section of our pyproject.toml files always looks like:

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

The project section contains the essential metadata required to build the package. Here is an example from the NEMO-Cmd package:

[project]
name = "NEMO-Cmd"
dynamic = [ "version" ]
description = "NEMO Command Processor"
readme = "README.rst"
requires-python = ">=3.10"
license = { file = "LICENSE" }
authors = [
    { name = "Doug Latornell", email = "dlatornell@eoas.ubc.ca" },
]
keywords = ["automation", "oceanography", "ocean modelling", "UBC-MOAD"]
classifiers = [
    "Development Status :: 5 - Production/Stable",
    "License :: OSI Approved :: Apache Software License",
    "Programming Language :: Python :: Implementation :: CPython",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Operating System :: POSIX :: Linux",
    "Operating System :: Unix",
    "Environment :: Console",
    "Intended Audience :: Science/Research",
    "Intended Audience :: Education",
    "Intended Audience :: Developers",
]
dependencies = [
    # see envs/environment-dev.yaml for conda environment dev installation
    # see envs/requirements.txt for versions most recently used in development
    "arrow",
    "attrs",
    "cliff",
    "f90nml",
    "gitpython",
    "python-hglib",
    "pyyaml",
]

The project.urls section contains a table of important URLs for the package. Not all of our packages have a change log, but the URLs for documentation, issue tracker, and source code should exist and be included in this section.

[project.urls]
"Documentation" = "https://nemo-cmd.readthedocs.io/en/latest/"
"Changelog" = "https://nemo-cmd.readthedocs.io/en/latest/CHANGES.html"
"Issue Tracker" = "https://github.com/SalishSeaCast/NEMO-Cmd/issues"
"Source Code" = "https://github.com/SalishSeaCast/NEMO-Cmd"

The optional project.scripts and project.entry-points.plugin-namespace sections define the package’s command-line interface.

In the project.scripts section, the key is the command name and its value is the code object that it will call.

[project.scripts]
nemo = "nemo_cmd.main:main"
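
The value has the form module:attribute. As a rough sketch of how such a string maps to a callable (this is an illustration, not the code that pip or hatchling actually generate), the hypothetical resolve_entry_point helper below imports the module and looks up the attribute. It uses json:dumps as a stand-in so that it runs without NEMO-Cmd installed.

```python
import importlib
from functools import reduce


def resolve_entry_point(spec):
    """Return the object named by a ``module:attribute`` entry point string."""
    module_name, _, attr_path = spec.partition(":")
    module = importlib.import_module(module_name)
    if not attr_path:
        return module
    # Walk dotted attribute paths such as "Class.method"
    return reduce(getattr, attr_path.split("."), module)


# Stand-in for "nemo_cmd.main:main" that works without NEMO-Cmd installed
func = resolve_entry_point("json:dumps")
print(func([1, 2]))
```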

The NEMO-Cmd package uses the Cliff package to define its CLI via entry points for each of the sub-command plugins. So, its CLI definition also includes a project.entry-points.nemo section to connect the plugin classes to the main nemo CLI script:

[project.entry-points.nemo]
combine = "nemo_cmd.combine:Combine"
deflate = "nemo_cmd.deflate:Deflate"
gather = "nemo_cmd.gather:Gather"
prepare = "nemo_cmd.prepare:Prepare"
run = "nemo_cmd.run:Run"

Packages like Reshapr that use the Click package to define their CLI only require a project.scripts section because sub-commands are defined via Click.

The tool.hatch.version section contains the path/file in which the package’s version identifier is stored. This makes that file the “single source of truth” for the package’s version, facilitates management of the version identifier with the hatch version command, and enables the use of importlib.metadata.version() to access the package’s version in code.

[tool.hatch.version]
path = "nemo_cmd/__about__.py"
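
In code, for example to implement a --version option, the version can be read from the installed package metadata. This is a minimal sketch; the get_version helper and its "unknown" fallback are illustrative choices, not the implementation used in NEMO-Cmd.

```python
from importlib.metadata import PackageNotFoundError, version


def get_version(dist_name):
    """Return the installed version of a distribution, or "unknown" if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "unknown"


# The argument is the project name from pyproject.toml, not the import name
print(get_version("NEMO-Cmd"))
```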

The tool.coverage.run and tool.coverage.report sections provide configuration for the Coverage.py tool that is used to monitor what lines of code the test suite exercises.

[tool.coverage.run]
branch = true
source = [ "nemo_cmd", "tests"]

[tool.coverage.report]
show_missing = true

README.rst File

The README.rst file provides a full description of the package. Take a look at some of the UBC EOAS MOAD repositories to get an idea of typical contents. README.rst should include a copyright and license section.

The README.rst file is included as the “long description” of the package via the

readme = "README.rst"

line in the [project] section of the pyproject.toml File.

README files written using reStructuredText (or Markdown) are automatically rendered to HTML in GitHub web pages.

LICENSE File

The LICENSE file contains the legal license text for the package. We release all of our open code under the Apache License, Version 2.0.

So, you can just copy the LICENSE file from another MOAD repository. The LICENSE file is included in the package metadata via the

license = { file = "LICENSE" }

line in the [project] section of the pyproject.toml File.

.readthedocs.yaml File

For packages that use the readthedocs service to render and host their documentation, we include a .readthedocs.yaml file in the top-level directory (the file name and location are stipulated by readthedocs). That file declares the features of the environment that we want readthedocs to use to build our docs, specifically, a conda environment that we describe in the envs/environment-rtd.yaml file (described below), built using the mambaforge environment and package manager on an Ubuntu Linux virtual machine.

The .readthedocs.yaml file for the NEMO-Cmd package is typical, and looks like:

version: 2

build:
  os: ubuntu-22.04
  tools:
    python: "mambaforge-4.10"

conda:
  environment: envs/environment-rtd.yaml

# Only build HTML and JSON formats
formats: []

.gitignore File

The .gitignore file provides the list of intentionally untracked files that Git should ignore. Having a comprehensive .gitignore file makes commands like git status, git diff, git add, etc. easier to understand and use.

Our .gitignore files are based on files generated by the PyCharm .ignore plugin.

The https://github.com/github/gitignore repository is also a good source for language-specific patterns for .gitignore files.

.pre-commit-config.yaml File

The .pre-commit-config.yaml file provides configuration for the pre-commit tool that is used to manage coding style and other aspects of repository QA in many of our packages. If it is used in a package you should be able to find notes about its use in the Coding Style section of the package’s development docs section; e.g. NEMO-Cmd Coding Style.

Package Sub-Directories

The top-level directory must contain a package sub-directory in which the Python modules that are the package code are stored. There are also usually 3 other sub-directories that contain:

  • the package documentation (docs/)

  • descriptions of the conda environments used for development of the package and building its documentation (envs/)

  • the unit test suite for the package (tests/)

Package Code Sub-directory

The package code sub-directory is where the Python modules that are the package code are stored. Its name is the package name that is used in import statements. In the NEMO-Cmd package the package sub-directory is named nemo_cmd.

Because the package name is used in import statements it must follow the rules that Python imposes on module names:

  • contain only letters, numbers, and underscores

  • not start with a number

By convention, package sub-directory names are all-lowercase, and use underscores when doing so improves readability. A leading underscore is the convention that indicates a private module, variable, etc., so a package name that starts with an underscore would be unusual and confusing.
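
A candidate package name can be checked against these rules with a couple of standard library calls. This is a minimal sketch; is_valid_package_name is a hypothetical helper, and note that str.isidentifier() also accepts non-ASCII letters, which is more permissive than the all-lowercase convention described above.

```python
import keyword


def is_valid_package_name(name):
    """Check that a name can be used in an ``import`` statement."""
    # Reserved words like "class" are identifiers syntactically but cannot be imported
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ("nemo_cmd", "NEMO-Cmd", "2fast", "salishsea_tools"):
    print(candidate, is_valid_package_name(candidate))
```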

The package sub-directory must contain a file called __init__.py (often pronounced “dunder init”). The presence of an __init__.py file is what makes a directory and the Python modules it contains importable.

In MOAD packages the __about__.py file in the package sub-directory contains a declaration of a variable named __version__, for example:

__version__ = "23.1.dev0"

We use a CalVer versioning scheme that conforms to PEP-440. The version identifier format is yy.n[.dev0], where:

  • yy is the (post-2000) year of release

  • n is the number of the release within the year, starting at 1

After a release has been made the value of n is incremented by 1, and .dev0 is appended to the version identifier to indicate changes that will be included in the next release.
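
The increment rule can be sketched in a few lines. This is an illustration under stated assumptions: next_dev_version is a hypothetical helper, it only handles the within-year case described above, and it does not deal with the first release of a new year. In practice the hatch version command is the tool for managing the identifier.

```python
import re

# A release identifier is yy.n with n starting at 1
RELEASE_PATTERN = re.compile(r"^(?P<yy>\d{2})\.(?P<n>[1-9]\d*)$")


def next_dev_version(released):
    """Return the development version identifier that follows a release."""
    match = RELEASE_PATTERN.match(released)
    if match is None:
        raise ValueError(f"not a yy.n release identifier: {released!r}")
    yy, n = match["yy"], int(match["n"])
    return f"{yy}.{n + 1}.dev0"


print(next_dev_version("23.1"))
```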

The __version__ value is included as the version metadata value in the pyproject.toml File via the line:

dynamic = [ "version" ]

in the [project] section, and the tool.hatch.version section:

[tool.hatch.version]
path = "nemo_cmd/__about__.py"

docs/ Sub-directory

The docs/ directory contains the Sphinx source files for the package documentation. This directory is initialized by creating it, then running the sphinx-quickstart command in it.

After initializing the docs/ directory, its conf.py file requires some editing. Please see docs/conf.py in the NEMO-Cmd package for an example of a “finished” file.

The key things that need to be done are:

  • Add:

    import os
    import sys
    
    sys.path.insert(0, os.path.abspath(".."))
    

    to the # -- Path setup ---------- section of the file to make the package code directory tree available to the Sphinx builder for collection of package metadata, automatic generation of documentation from docstrings, etc.

  • Change the project code in the # -- Project information --------- section to:

    import tomllib
    from pathlib import Path
    
    
    with Path("../pyproject.toml").open("rb") as f:
        pkg_info = tomllib.load(f)
    project = pkg_info["project"]["name"]
    

    to get the project name from the project section of the pyproject.toml File.

  • Change the copyright code in the # -- Project information --------- section to something like:

    author = "SalishSeaCast Project Contributors and The University of British Columbia"
    pkg_creation_year = 2013
    copyright = f"{pkg_creation_year} – present, {author}"
    

    to ensure that the copyright year range displayed in the rendered docs ends with present and the copyright holder is correct.

  • Change the version and release code in the # -- Project information --------- section to something like:

    import importlib.metadata
    
    version = importlib.metadata.version(project)
    release = version
    

    to get the package version identifier from the installed package metadata, which is populated from the __version__ variable in the package __about__.py file.

envs/ Sub-directory

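The envs/ sub-directory contains at least 4 files: 3 conda environment description files for the package development, docs building, and test environments, and a requirements.txt file that records the package versions most recently used in development.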

environment-dev.yaml File

The environment-dev.yaml file is the conda environment description file for the package development environment. It includes all of the packages necessary to install, run, develop, test, and document the package.

For example, the environment-dev.yaml file for the NEMO-Cmd package looks like:

# conda environment description file for NEMO-Cmd package
# development environment
#
# Create a conda environment in which the `nemo` command is installed in editable mode
# with:
#
#   $ conda env create -f NEMO-Cmd/envs/environment-dev.yaml
#   $ conda activate nemo-cmd
#
# The environment includes all the tools used to develop,
# test, and document the NEMO-Cmd package.
#
# See the envs/requirements.txt file for an exhaustive list of all the
# packages installed in the environment and their versions used in
# recent development.

name: nemo-cmd

channels:
  - conda-forge
  - nodefaults

dependencies:
  - arrow
  - attrs
  - cliff
  - f90nml
  - gitpython
  - pip
  - python=3.11
  - pyyaml

  # For coding style, repo QA, and pkg management
  - black
  - hatch
  - pre-commit

  # For unit tests
  - pytest
  - pytest-cov
  - pytest-randomly

  # For documentation
  - sphinx
  - sphinx_rtd_theme
  - sphinx-notfound-page

  - pip:
    - python-hglib

    # install NEMO-Cmd package in editable mode
    - --editable ../

  • The comments at the top of the file include a succinct version of the commands required to create the dev environment.

  • The recommended conda channel to get packages from is conda-forge. nodefaults is included in the channels list to speed up the package dependency solver because it is now rare for us to require packages from any source other than conda-forge.

  • Packages that are unavailable from conda channels are installed via pip.

The environment-dev.yaml file is “hand-crafted” rather than being generated via the conda env export command. As such, it contains only the top-level dependency packages, and only version pins that are absolutely necessary. That allows the conda solver to do its job of assembling a consistent set of up-to-date packages to install.

environment-rtd.yaml File

The environment-rtd.yaml file is the conda environment description file for the docs building environment on readthedocs.org. It includes only the packages that are required to build the docs.

For example, the environment-rtd.yaml file for the NEMO-Cmd package looks like:

# conda environment description file for readthedocs build environment

name: nemo-cmd-rtd

channels:
  - conda-forge
  - nodefaults

dependencies:
  - pip
  - python=3.11

  # RTD packages
  - mock
  - pillow
  - sphinx
  - sphinx_rtd_theme
  - sphinx-notfound-page

  - pip:
      # install package so that importlib.metadata functions can work
      - ../

The only reason to add more packages to the dependencies list is if ImportError exceptions that arise in the Sphinx autodoc processing of docstrings can’t be resolved by the use of the autodoc_mock_imports list in conf.py.
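
For reference, autodoc_mock_imports is an ordinary Python list of module names in conf.py. The fragment below is a sketch only; the module names are placeholders chosen for illustration, not the list from any MOAD package.

```python
# In docs/conf.py
# Modules that Sphinx autodoc should replace with mock objects instead of importing.
# These names are placeholders for illustration only.
autodoc_mock_imports = [
    "netCDF4",
    "numpy",
]
```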

environment-test.yaml File

The environment-test.yaml file is the conda environment description file for an environment configured specifically for running the package’s unit test suite, and the sphinx linkcheck command. It is primarily used by GitHub Actions workflows that are run whenever commits are pushed to GitHub. Please see the .github/workflows/ Sub-directory section.

requirements.txt File

The requirements.txt file records the full list of packages in the development environment and their versions used for recent development work. It is generated using the python3 -m pip list --format=freeze command. When new package dependencies are added to the project, or the dev environment is updated via conda env update --file envs/environment-dev.yaml, a new requirements.txt file should be generated and merged with the previously committed version so that the dev environment changes are tracked by Git.
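
To give a sense of what the file records, the sketch below builds a similar name==version list from the metadata of the packages installed in the current environment. It is an approximation for illustration, not a replacement for the pip command above; freeze_lines is a hypothetical helper.

```python
from importlib.metadata import distributions


def freeze_lines():
    """Return sorted ``name==version`` lines for the installed distributions."""
    lines = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


print("\n".join(freeze_lines()))
```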

tests/ Sub-directory

The tests/ sub-directory contains the unit test suite for the package. Its modules match the names of the modules in the Package Code Sub-directory, but with test_ prepended to them. If the Package Code Sub-directory contains sub-directories, those sub-directories are reflected in the tests/ tree.

Neither the tests/ sub-directory nor any other directories that may be created in its tree should contain __init__.py files. Please see the discussion of test layout/import rules in the pytest docs for explanation.
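
The naming convention for test modules can be expressed as a small path mapping. This is a minimal sketch; expected_test_path is a hypothetical helper, and nemo_cmd/utils/paths.py is a made-up module used only to show the nested case.

```python
from pathlib import PurePosixPath


def expected_test_path(module_path, package_dir="nemo_cmd", tests_dir="tests"):
    """Return the tests/ path that mirrors a module in the package code tree."""
    relative = PurePosixPath(module_path).relative_to(package_dir)
    # Keep any sub-directories and prepend "test_" to the module file name
    return PurePosixPath(tests_dir) / relative.with_name(f"test_{relative.name}")


print(expected_test_path("nemo_cmd/run.py"))
# "nemo_cmd/utils/paths.py" is a made-up module used only to show a nested case
print(expected_test_path("nemo_cmd/utils/paths.py"))
```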

.github/ Sub-directory

The .github/ sub-directory contains files that configure automation on the GitHub servers that supports repository QA. In that sub-directory there is typically a dependabot.yaml file and a workflows/ sub-directory.

The dependabot.yaml file configures the GitHub Dependabot tool to check weekly for version updates on GitHub actions packages used in the other automation workflows and open pull requests to apply those updates.

.github/workflows/ Sub-directory

The workflows/ sub-directory contains configuration files for GitHub Actions workflows used for code and docs QA tasks like:

  • static analysis of the code using GitHub CodeQL to detect possible security vulnerabilities

  • running the package test suite with code coverage analysis (continuous integration) and pushing the coverage analysis report to Codecov

  • running the Sphinx linkcheck builder on the documentation files to scan for external links, try to open them, and write a report that highlights any links that are broken or redirected

The workflows are generally run automatically each time that commits are pushed to the repository. Some, like the CodeQL security scan and the Sphinx link check, are also run on a schedule to detect vulnerabilities and broken links that arise between code and docs updates.

In most cases the workflows are based on our collection of shared reusable workflows in the UBC-MOAD/gha-workflows repository.

Rationale

This section explains the rationale for important choices in the packaging layout and methodology described above.

The changes that resulted from Doug’s December 2022 review of then-current opinions and recommended practices for Python packaging are:

  • Start using the pyproject.toml File in packages to contain all of the package metadata. That eliminates the setup.py and setup.cfg files. It also changes how the package name and version are accessed in the conf.py file in the docs/ Sub-directory, and how the package version is accessed elsewhere in code (such as for the --version option in packages with command-line interfaces).

  • Define the package version identifier in the __about__.py file in the Package Code Sub-directory.

  • Change to use hatchling as our package build backend because it is more modern than setuptools and provides additional package management tools like hatch version. It also appears to continue to use the .pth scheme for editable mode installs. In contrast, setuptools>=64.0.0 may use a custom import hook scheme that is not supported by IDEs like PyCharm and VS Code.

While the src/ layout advocated by Hynek Schlawack’s Testing & Packaging blog post and Ionel Cristian Mărieș’s Packaging a python library blog post is now also recommended by the Python Packaging Authority in the Python Packaging User Guide, we have not yet adopted it. The benefits that src/ layout provides are not important to us because we always install our group-developed packages via python3 -m pip install -e, and we don’t use tox to test our packages with different Python versions and interpreters.