Tools and tenets for ML and Python

Predictive Analytics Lab
University of Sussex

Approach

Python tools

ML tools

Further recommendations

Core tenets

Libraries > Snippets
Automation > Manual

Python Tools

Poetry
Hydra
Ray

ML Tools

PyTorch Lightning
Weights & Biases

Not Covered (but recommended)

BentoML (for model-serving)
Optuna (for hyperparameter optimization) – has integration with Hydra

Core tenets

Avoid boilerplate code
- The best code is code you don't have to write.
Put shared code into libraries
- EthicML
- PALkit
Automation > Manual
- Try to make your model run end-to-end
- Keep hard-coding to a minimum

Type-annotate everything
- More readable and less error-prone code
- Use static type checkers like pyright and mypy
Make it easy for others to run your code
- Specify exact dependencies
- Don’t use notebooks
- Run and configure via the command line
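
A minimal sketch of two of the tenets above: type annotations and configuration via the command line. The script, its flags and the names TrainConfig, parse_args and total_steps are hypothetical and only serve as illustration.

```python
import argparse
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TrainConfig:
    """Every field is annotated, so pyright/mypy can check each use."""

    lr: float
    epochs: int
    seed: int


def parse_args(argv: Optional[Sequence[str]] = None) -> TrainConfig:
    """Build the config from command-line flags instead of hard-coded values."""
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--seed", type=int, default=0)
    ns = parser.parse_args(argv)
    return TrainConfig(lr=ns.lr, epochs=ns.epochs, seed=ns.seed)


def total_steps(cfg: TrainConfig, batches_per_epoch: int) -> int:
    """A type checker flags total_steps(cfg, "100") before the code ever runs."""
    return cfg.epochs * batches_per_epoch


if __name__ == "__main__":
    cfg = parse_args()
    print(cfg, total_steps(cfg, batches_per_epoch=100))
```

Saved as a script, this could be run as, for example, `python train.py --lr 0.01 --epochs 3`.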

Make it easy to collaborate
- Use git
- Make PRs and ask for reviews

Python Tools

Poetry — Simple, Conflict-free
Dependency Management

Poetry is a tool for managing Python dependencies

Alternative to setup.py
Automatically resolves transitive dependencies (the dependencies of your dependencies)
Maintains a lock file so that everyone working on the project uses the same versions of all dependencies
Also provides some protection if you forget to activate a venv, because Poetry creates and uses a virtual environment for the project

pyproject.toml replaces setup.py and can also hold the settings of tools such as black and isort, which would otherwise need separate config files (e.g. .isort.cfg)

Install Poetry from its website or via Homebrew

Useful commands

poetry install – install dependencies

poetry update – update dependencies to the newest versions allowed by the constraints in pyproject.toml and refresh the lock file

poetry add <package> – installs a new dependency and records it in pyproject.toml (no need to edit requirements.txt or setup.py by hand)

Analogy (for Rustaceans): Cargo for Python

Hydra — elegant and flexible configuration

Hydra is a tool for configuring complex applications
“Complex” here means roughly: more than 10 command-line flags

Hydra enables configuration via YAML files and allows overriding any configuration value on the command line
Hydra encourages modular configuration
- E.g. the data loading config and the model config are separate
- Config modules can be swapped out
- Supports validation and variable interpolation
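
The idea behind modular configs and command-line overrides can be sketched with the standard library alone. This is not Hydra's implementation: apply_overrides, build_config and all config keys are hypothetical, and the type cast is only a crude stand-in for Hydra's validation. With Hydra itself, the corresponding invocation would look something like `python train.py data=cifar10 model.lr=0.01 data.batch_size=32`.

```python
import copy
from typing import Any


def apply_overrides(config: dict[str, Any], overrides: list[str]) -> dict[str, Any]:
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node[name]  # KeyError if the config group does not exist
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        # Cast to the type of the default value (fine for int/float/str only).
        node[leaf] = type(node[leaf])(raw_value)
    return result


# Separate, swappable config modules for data loading and for the model.
DATA_CONFIGS = {
    "mnist": {"name": "mnist", "batch_size": 64},
    "cifar10": {"name": "cifar10", "batch_size": 128},
}
MODEL_CONFIG = {"hidden_dim": 256, "lr": 1e-3}


def build_config(data: str, overrides: list[str]) -> dict[str, Any]:
    """Pick a data module by name, then apply the value overrides."""
    base = {"data": DATA_CONFIGS[data], "model": MODEL_CONFIG}
    return apply_overrides(base, overrides)


cfg = build_config("cifar10", ["model.lr=0.01", "data.batch_size=32"])
print(cfg)
```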

Hydra supports multiruns, which launch one run for every combination of the given parameter values
Hydra plugins also let you run hyperparameter sweeps with popular HPO libraries (e.g. Optuna)
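
What "combinatorial" means for a multirun can be shown with a small standard-library sketch. With Hydra, a sweep like this would be started with something like `python train.py --multirun model.lr=0.1,0.01 data.batch_size=32,64`; the helper expand_sweep and the parameter names are hypothetical and only reproduce the expansion into individual runs.

```python
from itertools import product
from typing import Any


def expand_sweep(sweep: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Return one parameter dict per combination of the swept values."""
    keys = list(sweep)
    return [dict(zip(keys, values)) for values in product(*(sweep[k] for k in keys))]


jobs = expand_sweep({"model.lr": [0.1, 0.01], "data.batch_size": [32, 64]})

# Two values for each of two parameters give 2 x 2 = 4 runs.
for index, job in enumerate(jobs):
    print(index, job)
```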

Hydra can also instantiate Python objects based on configuration values
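
A simplified sketch of the idea: the config names a class under the key `_target_` (the key Hydra uses), and the remaining keys become constructor arguments. In Hydra this is done by `hydra.utils.instantiate`; the function instantiate_from_config below is a hypothetical standard-library re-implementation of the core step only, and `fractions.Fraction` stands in for what would usually be a model or optimiser class.

```python
import importlib
from typing import Any


def instantiate_from_config(config: dict[str, Any]) -> Any:
    """Import the class named by `_target_` and call it with the other keys."""
    kwargs = dict(config)  # copy, so the caller's config is left untouched
    module_path, _, attr_name = kwargs.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    return target(**kwargs)


frac = instantiate_from_config(
    {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
)
print(frac)  # 3/4
```

Swapping the implementation then only requires changing the `_target_` string in the config, not the code.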

Ray — effortless parallelism

Ray is a tool for easy parallelisation of Python functions
It can be used to run a hyperparameter search over multiple GPUs and multiple machines

Ray usually parallelises a single Python function, but combined with Hydra, it parallelises your whole application
Ray can act as a queue for jobs
Ray is GPU-aware and can distribute jobs over multiple machines according to how many GPUs are available on the machines
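
The job-queue pattern described above can be sketched on a single machine with the standard library. This is an analogue and does not use Ray's API: with Ray, the function would be decorated with `@ray.remote(num_gpus=1)`, called via `.remote(...)` and collected with `ray.get(...)`, and Ray would do the GPU bookkeeping across machines. NUM_GPUS, run_trial and the scoring formula are assumptions for illustration.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

NUM_GPUS = 2  # assumption: pretend this machine has two GPUs

# Pool of free GPU ids; a job must claim one before it may start.
free_gpus: "queue.Queue[int]" = queue.Queue()
for gpu_id in range(NUM_GPUS):
    free_gpus.put(gpu_id)


def run_trial(lr: float) -> dict:
    """Hypothetical trial: claim a free GPU, 'train', then release the GPU."""
    gpu_id = free_gpus.get()  # blocks until a GPU is free
    try:
        score = -((lr - 0.01) ** 2)  # stand-in for a real validation metric
        return {"lr": lr, "gpu": gpu_id, "score": score}
    finally:
        free_gpus.put(gpu_id)


learning_rates = [0.1, 0.03, 0.01, 0.003]

# At most NUM_GPUS trials run at once; the remaining ones wait in the queue.
with ThreadPoolExecutor(max_workers=NUM_GPUS) as pool:
    results = list(pool.map(run_trial, learning_rates))

best = max(results, key=lambda r: r["score"])
print(best["lr"])  # 0.01
```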

ML Tools

PyTorch Lightning — avoid boilerplate code

Lightning provides abstractions for the patterns that recur in PyTorch training code
It gives you many things for free:
- Logging (e.g. to W&B)
- Distributed training
- Automatic LR and batch-size determination

Lightning is made up of 3 key components: a DataModule, a LightningModule, and a Trainer

A DataModule is a container for the train, validation and test dataloaders
A LightningModule is an nn.Module in which you also define the training, validation and test steps
The Trainer abstracts away the boilerplate code of the training loop
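
The division of responsibilities between the three components can be sketched without PyTorch. The classes ToyDataModule, ToyModule and ToyTrainer are hypothetical stand-ins, not Lightning's API; in Lightning the Trainer also runs the backward pass and the optimiser step, which this toy folds into training_step for brevity.

```python
from typing import Iterable, Tuple

Batch = Tuple[float, float]  # (input, target)


class ToyDataModule:
    """Container for the dataloaders (here: plain lists of batches)."""

    def train_dataloader(self) -> Iterable[Batch]:
        return [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x

    def val_dataloader(self) -> Iterable[Batch]:
        return [(4.0, 8.0)]


class ToyModule:
    """The model (one weight) plus what happens in a single step."""

    def __init__(self, lr: float = 0.05) -> None:
        self.weight = 0.0
        self.lr = lr

    def training_step(self, batch: Batch) -> float:
        x, y = batch
        error = self.weight * x - y
        self.weight -= self.lr * 2 * error * x  # gradient step on squared error
        return error**2

    def validation_step(self, batch: Batch) -> float:
        x, y = batch
        return (self.weight * x - y) ** 2


class ToyTrainer:
    """Owns the epoch/batch loop, so user code never has to spell it out."""

    def __init__(self, max_epochs: int) -> None:
        self.max_epochs = max_epochs

    def fit(self, module: ToyModule, datamodule: ToyDataModule) -> float:
        val_loss = float("inf")
        for _ in range(self.max_epochs):
            for batch in datamodule.train_dataloader():
                module.training_step(batch)
            val_loss = sum(
                module.validation_step(b) for b in datamodule.val_dataloader()
            )
        return val_loss


model = ToyModule()
final_val_loss = ToyTrainer(max_epochs=20).fit(model, ToyDataModule())
print(round(model.weight, 3), final_val_loss)
```

The user-written parts are the data and the per-step logic; the loop lives in one reusable place, which is the boilerplate the Trainer removes.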