Tools and tenets for ML and Python

 

Predictive Analytics Lab
University of Sussex

 

Contents

Core tenets

  • Libraries > Snippets
  • Automation > Manual

Python Tools

  • Poetry
  • Hydra
  • Ray

ML Tools

  • PyTorch Lightning
  • Weights & Biases

Not Covered (but recommended)

  • BentoML (for model-serving)
  • Optuna (for hyperparameter optimization) – has integration with Hydra

 

 

 

 

Core tenets

  • Avoid boilerplate code
    • The best code is code you don't have to write.
  • Put shared code into libraries
    • EthicML
    • PALkit
  • Automation > Manual
    • Try to make your model run end-to-end
    • Keep hard-coding to a minimum
  • Type-annotate everything
    • Makes code more readable and less error-prone
    • Use static type checkers such as pyright and mypy
  • Make it easy for others to run your code
    • Specify exact dependencies
    • Don’t use notebooks
    • Run and configure via the command line
  • Make it easy to collaborate
    • Use git
    • Make PRs and ask for reviews

 

 

 

 

Python Tools

Poetry — Simple, Conflict-free
Dependency Management

  • Poetry is a tool for managing Python dependencies
  • Alternative to setup.py
  • Automatically resolves transitive dependencies (the dependencies of your dependencies)
  • Maintains a lock file that ensures that everyone working on the project uses the same versions of dependencies
  • Creates and uses a per-project virtual environment, which provides some protection if you forget to activate a venv

pyproject.toml replaces setup.py and also holds the configuration for tools such as black and isort, so auxiliary files such as .isort.cfg are no longer needed

Install Poetry from the website or with Homebrew

Useful commands

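poetry install – install the dependencies, using the versions pinned in the lock file

poetry update – update dependencies to the newest versions that still satisfy the constraints in pyproject.toml, and refresh the lock file

poetry add <package> – install a new dependency and record it in pyproject.toml (no need to list dependencies by hand, as with requirements.txt/setup.py)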

Analogy (for Rustaceans): Cargo for Python
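
To make "updates that won't break your code" more concrete: by default, poetry add records a caret constraint such as ^1.9.0, and poetry update only moves within that range. The sketch below is a simplified model of the caret rule for full major.minor.patch versions only; it is not Poetry's own resolver, and the helper names are hypothetical.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Parse 'major.minor.patch' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def caret_upper_bound(version: Version) -> Version:
    """Exclusive upper bound: bump the left-most non-zero component."""
    major, minor, patch = version
    if major > 0:
        return (major + 1, 0, 0)
    if minor > 0:
        return (0, minor + 1, 0)
    return (0, 0, patch + 1)


def allowed_by_caret(constraint: str, candidate: str) -> bool:
    """True if `candidate` lies in the range described by e.g. '^1.9.0'."""
    lower = parse(constraint.lstrip("^"))
    return lower <= parse(candidate) < caret_upper_bound(lower)


print(allowed_by_caret("^1.9.0", "1.13.1"))  # same major version: allowed
print(allowed_by_caret("^1.9.0", "2.0.0"))   # next major version: excluded
```

The range assumes the dependency follows semantic versioning; the lock file is what actually guarantees identical versions across machines.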

Hydra — elegant and flexible configuration

 

  • Hydra is a tool for configuring complex applications
  • “Complex” means something like more than 10 flags
  • Hydra enables configuration via YAML files and allows overriding any configuration value on the command line
  • Hydra encourages modular configuration
    • E.g. data loading config and model config are separate
    • Config modules can be swapped out
    • Supports validation and variable interpolation
  • Hydra supports multiruns, which launch one run for every combination of the given parameter values
  • Hydra also has plugins that allow for hyperparameter sweeps to be conducted using popular HPO libraries (e.g. Optuna)
  • Hydra can also instantiate Python objects based on configuration values

Ray — effortless parallelism


  • Ray is a tool for easy parallelisation of Python functions
  • It can be used to run a hyperparameter search over multiple GPUs and multiple machines
  • Ray usually parallelises a single Python function, but combined with Hydra, it parallelises your whole application
  • Ray can act as a queue for jobs
  • Ray is GPU-aware and can distribute jobs over multiple machines according to how many GPUs are available on the machines

 

 

 

 

ML Tools

PyTorch Lightning — avoid boilerplate code


  • Lightning provides abstractions for the patterns that recur in PyTorch training code
  • Gives many things for free
    • Logging (e.g. to W&B)
    • Distributed training
    • Automatic LR and batch-size determination

Lightning is made up of three key components: a DataModule, a LightningModule, and a Trainer

  • DataModule is a container for the train, val and test dataloaders
  • LightningModule is an nn.Module in which you also define the training, val and test steps
  • Trainer abstracts away the boilerplate code of the training loop