1- Launching

We will see how to modify the file main.py to use MLXP through the decorator mlxp.launch. But first, let's introduce the mlxp.Context object produced by this decorator, which gives access to MLXP's logging and configuration functionalities.

The Context object

MLXP uses an object ctx of the class mlxp.Context that is created on the fly during the execution of the program to store information about the run. More precisely, it contains 4 fields:

  • ctx.config: Stores project-specific options provided by the user. These options are loaded from a yaml file called config.yaml located in a directory config_path provided as input to the decorator mlxp.launch.

  • ctx.mlxp: Stores MLXP’s default settings for the project. Its content is loaded from a yaml file mlxp.yaml located in the same directory config_path.

  • ctx.info: Contains information about the current run, such as its status, start time, and hostname.

  • ctx.logger: A logger object that can be used in the code for logging variables (metrics, checkpoints, artifacts). When logging is enabled, these variables are all stored in a uniquely defined directory.
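
To make these fields concrete, here is a minimal sketch of a decorated function that simply inspects its context (the function name show_context is illustrative; config_path points to the configuration directory introduced in the next section):

import mlxp

@mlxp.launch(config_path='./configs')
def show_context(ctx: mlxp.Context) -> None:
    print(ctx.config)  # user options loaded from ./configs/config.yaml
    print(ctx.mlxp)    # MLXP settings loaded from ./configs/mlxp.yaml
    print(ctx.info)    # run information: status, start time, hostname, ...

if __name__ == "__main__":
    show_context()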

General setup

Defining a default config file

The first step is to provide all default options that will be used by the code in a separate YAML file named config.yaml, located in the ./configs directory.

./configs/config.yaml
seed: 0
num_epoch: 10
model:
  num_units: 100
data:
  d_int: 10
  device: 'cpu'
optimizer:
  lr: 10.

Here, we stored all options that were provided as input to the function train in the main.py file (such as the learning rate lr, the number of epochs num_epoch, etc.) in a structured YAML file. The user is free to define their own structure: for instance, here we chose to group the input dimension d_int and the device into the same data group, but other (possibly better) choices are possible. MLXP loads this file by default, just like Hydra, and provides these options as a hierarchical dictionary to be used in the code (more about this later!).
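
As a quick preview, once loaded, these options are accessible with attribute (dot) notation inside the decorated function:

cfg = ctx.config
print(cfg.seed)             # 0
print(cfg.model.num_units)  # 100
print(cfg.optimizer.lr)     # 10.0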

Adapting code for using MLXP

To use MLXP, we only need to slightly modify the main.py file. The first step is to import MLXP and add the decorator mlxp.launch above the function train. We also need to change the signature of train so that it accepts an object ctx of type mlxp.Context as its only argument, instead of the original variables. Note, however, that train is later called without explicitly passing any argument. The remaining modifications are:

  • Using the option values stored in ctx.config as a replacement for the variables provided in the older version of the code (see the old main.py file).

  • Using the logger ctx.logger to store the results of the run (instead of printing them) and saving checkpoints.

Here is what the code looks like:

main.py
import torch
from core import DataLoader, OneHiddenLayer, train_epoch

import mlxp

@mlxp.launch(config_path='./configs')
def train(ctx: mlxp.Context) -> None:

    cfg = ctx.config
    logger = ctx.logger

    start_epoch = 0

    # Build the model, the optimizer and the data loader.
    model = OneHiddenLayer(d_int=cfg.data.d_int,
                           n_units=cfg.model.num_units)
    model = model.to(cfg.data.device)
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=cfg.optimizer.lr)
    dataloader = DataLoader(cfg.data.d_int,
                            cfg.data.device)

    # Training loop: log metrics and a checkpoint at each epoch.
    for epoch in range(start_epoch, cfg.num_epoch):

        train_err = train_epoch(dataloader,
                                model,
                                optimizer)

        logger.log_metrics({'loss': train_err.item(),
                            'epoch': epoch}, log_name='train')

        logger.log_checkpoint({'model': model,
                               'epoch': epoch}, log_name='last_ckpt')

    print(f"Completed training with learning rate: {cfg.optimizer.lr}")

if __name__ == "__main__":
    train()
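
Since a checkpoint is logged at every epoch under log_name='last_ckpt', the run can also be made resumable. Here is a possible sketch, assuming the logger exposes a load_checkpoint method retrieving the objects saved under a given log_name (check the logging documentation for the exact API); it would go just after the model is built:

    start_epoch = 0
    try:
        # Restore the last saved checkpoint, if any (assumed API).
        checkpoint = logger.load_checkpoint(log_name='last_ckpt')
        model = checkpoint['model']
        start_epoch = checkpoint['epoch'] + 1
    except Exception:
        # No checkpoint found: start training from scratch.
        pass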

Seeding code using MLXP

In our example, the model is initialized with random parameters, which might change from one run to another. To avoid this, the user can provide a function seeding_function to the mlxp.launch decorator to set the global seed of whatever random number generator the code uses.

main.py
import mlxp
from core import DataLoader, OneHiddenLayer, train_epoch

def seeding_function(seed):
    import torch
    torch.manual_seed(seed)

@mlxp.launch(config_path='./configs',
            seeding_function=seeding_function)
def train(ctx: mlxp.Context) -> None:

    cfg = ctx.config
    logger = ctx.logger

    ...

if __name__ == "__main__":
    train()

The function seeding_function is called by MLXP before executing the function train. The parameter seed is read from the user-defined option ctx.config.seed. If the field seed is not provided by the user while a seeding function is passed, the code throws an error. Note that the seed passed to seeding_function can be an integer, a dictionary, or any object that can be stored in a YAML file. Of course, it is also possible to perform seeding inside the function train, but seeding_function allows you to do it systematically.
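
Because seeding_function receives whatever object is stored under the seed field, it can also seed several generators at once. Here is a sketch for a codebase that additionally draws randomness from Python's random module and NumPy (an assumption about the codebase):

def seeding_function(seed):
    import random
    import numpy as np
    import torch
    random.seed(seed)        # Python's built-in generator
    np.random.seed(seed)     # NumPy's global generator
    torch.manual_seed(seed)  # PyTorch's generators on all devices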

Launching locally using MLXP

During execution, the default configurations are read from the file config.yaml located in the directory ./configs and passed to the object ctx.config. The code is then executed using these options:

$ python main.py
Completed training with learning rate: 10.0

Just like with Hydra, we can run the code again with different options by overriding the default ones from the command line. For instance, we can select multiple values for the learning rate (say 0.01 and 0.1) by providing them as a comma-separated list to the option optimizer.lr:

$ python main.py optimizer.lr=0.01,0.1
Completed training with learning rate: 0.01
Completed training with learning rate: 0.1

In the above instruction, we added the option optimizer.lr=0.01,0.1, which executes the code twice: once using a learning rate of 0.01 and a second time using 0.1.
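
Multi-valued options can also be combined, in which case one run is launched per element of their cross product. For instance, the following command executes four runs, one per combination of learning rate and seed:

$ python main.py optimizer.lr=0.01,0.1 seed=1,2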

Launching jobs to a scheduler

If you have access to an HPC cluster, then you probably use a job scheduler for submitting jobs. MLXP allows you to combine the 'multirun' capabilities of Hydra with job scheduling to easily submit multiple experiments to a cluster. Currently, MLXP supports the following job schedulers: SLURM, OAR, TORQUE, SGE, MWM and LSF.

Submitting jobs to a job scheduler

Let's say you'd like to submit multiple jobs to a job scheduler. You can do this easily using the mlxpsub command!

The first step is to create a script, e.g. script.sh, in your working directory (here, under my_project/). In this script, you define the resources allocated to your jobs using the syntax of your job scheduler, as well as the python commands executing your main python script. You can then pass different option values to main.py, as discussed earlier in the launching tutorial:

#!/bin/bash

#OAR -l core=1, walltime=6:00:00
#OAR -t besteffort
#OAR -t idempotent

python main.py  optimizer.lr=10.,1. seed=1,2
python main.py  model.num_units=100,200 seed=1,2

The above script creates and executes 8 jobs in total that will be submitted to an OAR job scheduler. The first 4 jobs correspond to the first python command, using all possible combinations of option values for optimizer.lr and seed: (10.,1), (10.,2), (1.,1), (1.,2). The next 4 jobs are for the second command, which varies the options model.num_units and seed.
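
The same workflow applies to the other supported schedulers; only the directives change. For instance, a SLURM version of script.sh might look like the following sketch (the requested resources are assumptions; adapt them to your cluster):

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --time=6:00:00

python main.py  optimizer.lr=10.,1. seed=1,2
python main.py  model.num_units=100,200 seed=1,2

In both cases, submission works identically.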

You only need to run the following command in the terminal:

$ mlxpsub script.sh

What happens under the hood?

Here is what happens:

  1. The mlxpsub command parses the script to extract the scheduler's directives, figures out which scheduler is used, and provides this information as context prior to executing the script.

  2. Hydra performs a cross-product of the provided options and creates as many jobs as needed.

  3. MLXP creates a separate directory for each of these jobs. Each directory is assigned a unique log_id and contains a script to be submitted.

  4. All generated scripts are submitted to the job scheduler.

What should you expect?

MLXP creates one script per job, each corresponding to a single option setting. Each script is located in a directory of the form parent_log/log_id, where log_id is automatically assigned by MLXP for each job. Here is an example of the first created script, in logs/1/script.sh, where the user has set parent_log to logs.

#!/bin/bash
#OAR -n logs/1
#OAR -E /root/logs/1/log.stderr
#OAR -O /root/logs/1/log.stdout
#OAR -l core=1, walltime=6:00:00
#OAR -t besteffort
#OAR -t idempotent

cd /root/workdir/
python main.py  optimizer.lr=10. seed=1

As you can see, MLXP automatically assigns values for the job's name and for the stdout and stderr file paths, so there is no need to specify those in the original script script.sh. The generated scripts contain the same scheduler options as script.sh, in addition to a single python command specific to the option setting: optimizer.lr=10. seed=1. Additionally, MLXP pre-processes the python command to extract the working directory and sets it explicitly in the newly created script before the python command.

Once the job finishes execution, we can double-check that everything went well by inspecting the directory logs/1/, which should contain the usual logs and two additional files, log.stdout and log.stderr:

./logs/
logs/
├── 1/
│   ├── metadata/
│   │   ├── config.yaml
│   │   ├── info.yaml
│   │   └── mlxp.yaml
│   ├── metrics/
│   │   ├── train.json
│   │   └── .keys/
│   │        └── metrics.yaml
│   ├── artifacts/
│   │   └── Checkpoint/
│   │       └── last_ckpt.pkl
│   ├── log.stderr
│   ├── log.stdout
│   └── script.sh
│
├──...