2- Logging

Logging metrics and artifacts

The logger provides the following methods for logging and loading objects (a short usage sketch follows the list):

  • log_metrics: For logging dictionaries of scalars to a JSON file. This method can be used to log the loss and other scalar quantities that evolve during the run; each dictionary is appended to the JSON file.

  • log_artifacts: For logging more complex objects, such as the weights of a network. This method requires passing the desired artifact format (e.g. pickle, image, torch checkpoint) and the name of the artifact.

  • load_artifacts: For loading artifacts.

  • log_checkpoint: A simpler method for logging serializable objects as a pickle file.

  • load_checkpoint: A method for loading a saved checkpoint.
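To see how these methods fit together, here is a minimal sketch pairing log_checkpoint with load_checkpoint. The calls mirror the usage shown later in this tutorial; the surrounding variables (model, epoch) are only illustrative:

# Inside a job decorated with @mlxp.launch, where ctx is the mlxp.Context
logger = ctx.logger

# Save the current state as a pickle checkpoint named 'last_ckpt'
logger.log_checkpoint({'model': model, 'epoch': epoch}, log_name='last_ckpt')

# When restarting the job, recover the last saved state
checkpoint = logger.load_checkpoint()
model, last_epoch = checkpoint['model'], checkpoint['epoch']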

Logging metrics

In the main.py file, we have added a new line to log the loss at each epoch using the method log_metrics. This method takes a dictionary of scalars as input, as well as the name of the JSON file where it will be stored:

main.py
...
logger.log_metrics({'loss': train_err.item(),
                    'epoch': epoch}, log_name='train')
...
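Since log_name determines which JSON file the dictionary is appended to, different quantities can be kept in separate files simply by using different names. A minimal sketch (the validation error test_err is purely illustrative and not part of the tutorial's main.py):

# Training metrics are appended to metrics/train.json
logger.log_metrics({'loss': train_err.item(), 'epoch': epoch}, log_name='train')

# Hypothetical validation metrics go to a separate file, metrics/test.json
logger.log_metrics({'loss': test_err.item(), 'epoch': epoch}, log_name='test')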

Note

Several dictionaries can be stored successively in the same JSON file even if they do not have the same keys.

Warning

The keys of the dictionary must be consistent across runs and within each JSON file.

Logging artifacts

We also added a line to log the model’s weights using the method log_checkpoint. This method is used to log serializable objects as pickle files. The name of the pickle file is provided as an argument.

main.py
...
logger.log_checkpoint({'model': model,
                       'epoch':epoch}, log_name='last_ckpt')
...

For more general artifacts, you can use the method log_artifacts, which takes the artifact format and the name of the artifact as arguments. For instance, below we log the model’s weights as a torch checkpoint:

main.py
...
logger.log_artifacts({'model': model,
                      'epoch':epoch},
                      artifact_name='last_ckpt',
                      artifact_format='torch')
...

The method log_artifacts natively supports the following types: pickle, torch, image, numpy.
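For instance, a NumPy array could be stored with the numpy format. The sketch below is only illustrative: the array and its name are ours, and we assume the numpy format expects a NumPy array as the logged object:

import numpy as np

# Hypothetical example: store the model's final predictions as a numpy artifact
predictions = np.zeros(100)
logger.log_artifacts(predictions,
                     artifact_name='predictions',
                     artifact_format='numpy')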

Registering custom artifacts

If other, unsupported artifact types need to be logged, the user can register custom artifact types. This is done using the method register_artifact_type, which takes three arguments: the name of the artifact type, the function for saving the artifact, and the function for loading it:

main.py
...
def save(obj, path):
    # Called by the logger with the object to store and the target path
    import pickle
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load(path):
    # Called by the logger to load an artifact back from the given path
    import pickle
    with open(path, 'rb') as f:
        return pickle.load(f)


logger.register_artifact_type('my_pickle', save, load)
...

logger.log_artifacts({'model': model,
                      'epoch':epoch},
                      artifact_name='last_ckpt',
                      artifact_format='my_pickle')

The method register_artifact_type must be called before the method log_artifacts is used with the new type.
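As a further illustration, the same mechanism can be used to register a human-readable JSON artifact type. The type name my_json and the save/load functions below are ours, not part of MLXP:

import json

def save_json(obj, path):
    # Write the logged object as a JSON file at the path chosen by the logger
    with open(path, 'w') as f:
        json.dump(obj, f)

def load_json(path):
    with open(path, 'r') as f:
        return json.load(f)

logger.register_artifact_type('my_json', save_json, load_json)

logger.log_artifacts({'best_epoch': 9, 'best_loss': 0.0037},
                     artifact_name='summary',
                     artifact_format='my_json')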

File structure of the logs

When the logger is activated, it stores the results of a run in a sub-directory of a parent log directory. This parent directory is created automatically if it does not exist already. By default it is set to ./logs, but this behavior can be modified (see Customizing the parent log directory).

First, the logger assigns a log_id to the run. Every time main.py is executed with an active logger, a new log_id is assigned by incrementing the previous one, starting from 1. A new sub-directory of ./logs, named after the assigned log_id, is then created. Since we executed the code three times in total, we expect three sub-directories under ./logs, called 1, 2 and 3, all having the same structure:

./logs/
logs/
├── 1/...
├── 2/...
└── 3/...

Each log directory contains three sub-directories: metadata, metrics and artifacts:

./logs/
logs/
├── 1/
│   ├── metadata/
│   │   ├── config.yaml
│   │   ├── info.yaml
│   │   └── mlxp.yaml
│   ├── metrics/
│   │   ├── train.json
│   │   └──.keys/
│   │       └── metrics.yaml
│   └── artifacts/
│       └── pickle/
│           └── last_ckpt.pkl
│
├── 2/...
└── 3/...

Let’s go through these three directories.

The metrics directory

This directory contains JSON files created when calling the logger’s method log_metrics(dict, log_name). Each file is named after the variable log_name and stores the dictionaries provided as input to the log_metrics method.

./logs/1/metrics/train.json
{"loss": 0.030253788456320763, "epoch": 0}
{"loss": 0.02899891696870327, "epoch": 1}
{"loss": 0.026649776846170425, "epoch": 2}
{"loss": 0.023483652621507645, "epoch": 3}
{"loss": 0.019827445968985558, "epoch": 4}
{"loss": 0.01599641889333725, "epoch": 5}
{"loss": 0.012259905226528645, "epoch": 6}
{"loss": 0.008839688263833523, "epoch": 7}
{"loss": 0.005932427477091551, "epoch": 8}
{"loss": 0.003738593542948365, "epoch": 9}

The hidden directory .keys is used by the reader module of MLXP and is not something to worry about here. Instead, we inspect the remaining directories below.
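The reader module is the intended way to query these logs, but since each line of the file is a standalone JSON dictionary, it can also be inspected directly. A minimal sketch, assuming the default parent directory ./logs:

import json

# Each line of train.json is one dictionary passed to log_metrics
with open('./logs/1/metrics/train.json') as f:
    records = [json.loads(line) for line in f]

print(records[-1])  # e.g. {'loss': 0.0037..., 'epoch': 9}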

The metadata directory

The metadata directory contains the yaml files config.yaml, info.yaml and mlxp.yaml. The first two store the content of the corresponding fields of the context object ctx: config.yaml stores the user config of the run, while info.yaml stores general information about the run, such as the assigned log_id and the absolute path to the logs of the run (log_dir).

./logs/1/metadata/config.yaml
seed: 0
num_epoch: 10
model:
 num_units: 100
data:
 d_int: 10
 device: 'cpu'
optimizer:
 lr: 10.
./logs/1/metadata/info.yaml
executable: absolute_path_to/bin/python
cmd: ''
end_date: 20/04/2023
end_time: '16:01:13'
current_file_path: absolute_path_to/main.py
log_dir: absolute_path_to/logs/1
log_id: 1
process_id: 7100
start_date: 20/04/2023
start_time: '16:01:13'
status: COMPLETE
user: marbel
work_dir: absolute_path_to/tutorial
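These metadata files are plain YAML, so they can be parsed with any YAML library if you want to check a run's status or location programmatically. A minimal sketch using PyYAML (loading these files yourself is optional and only shown for illustration):

import yaml  # PyYAML

with open('./logs/1/metadata/info.yaml') as f:
    info = yaml.safe_load(f)

print(info['status'])   # e.g. COMPLETE
print(info['log_dir'])  # absolute path to the logs of this run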

The artifacts directory

The directory artifacts is where all data passed to the logger’s methods log_artifacts and log_checkpoint are stored. The data are placed in different sub-directories depending on the artifact type. In this example, since we used the method log_checkpoint, the logged data are stored as pickle objects, hence the sub-directory pickle. It contains the pickle file last_ckpt.pkl, named after the log_name we provided when calling log_checkpoint in the main.py file.
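To confirm what was stored, the checkpoint can also be opened directly with pickle; normally you would let the logger handle this via load_checkpoint, as in the next section. A minimal sketch, assuming the default parent directory ./logs:

import pickle

# Note: unpickling the model requires its class (defined in core.py) to be importable
with open('./logs/1/artifacts/pickle/last_ckpt.pkl', 'rb') as f:
    checkpoint = pickle.load(f)

print(checkpoint['epoch'])  # last logged epoch, e.g. 9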

Checkpointing

Checkpointing can be particularly useful if you need to restart a job from its latest state without having to re-run it from scratch. To do this, you only need to slightly modify the function train to load the latest checkpoint by default:

main.py
import torch
from core import DataLoader, OneHiddenLayer

import mlxp

@mlxp.launch(config_path='./configs')
def train(ctx: mlxp.Context)->None:

    cfg = ctx.config
    logger = ctx.logger

    # Try loading from the checkpoint
    try:
        checkpoint = logger.load_checkpoint()
        start_epoch = checkpoint['epoch'] + 1
        model = checkpoint['model']
    except Exception:
        # No checkpoint yet (e.g. first run): start training from scratch,
        # rebuilding the model from the fields of the user config
        start_epoch = 0
        model = OneHiddenLayer(d_int=cfg.data.d_int,
                               num_units=cfg.model.num_units)


    model = model.to(cfg.data.device)
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=cfg.optimizer.lr)
    dataloader = DataLoader(cfg.data.d_int,
                            cfg.data.device)

    # Training
    print(f"Starting from epoch: {start_epoch} ")

    for epoch in range(start_epoch,cfg.num_epoch):

        train_err = train_epoch(dataloader,
                                model,
                                optimizer)

        logger.log_metrics({'loss': train_err.item(),
                            'epoch': epoch}, log_name='train')

        logger.log_checkpoint({'model': model,
                               'epoch':epoch}, log_name='last_ckpt')

    print(f"Completed training with learing rate: {cfg.optimizer.lr}")


if __name__ == "__main__":
    train()

Of course, if you execute main.py without further options, the logger will create a new log_id for which there is no checkpoint yet, so it cannot resume a previous job. Instead, you need to force the log_id of the run using the option logger.forced_log_id:

$ python main.py +mlxp.logger.forced_log_id=1
Starting from epoch: 10
Completed training with learning rate: 10.0

Customizing the parent log directory

You can change the parent directory by overriding the option +mlxp.logger.parent_log_dir from the command-line:

$ python main.py +mlxp.logger.parent_log_dir='./new_logs'

Alternatively, the parent directory can be modified directly in the MLXP default settings file configs/mlxp.yaml. This file is created automatically if it doesn’t exist already and contains all the default options for using MLXP in the current project:

./configs/mlxp.yaml
logger:
  ...
  parent_log_dir: ./logs
  ...