.. _logging:

2- Logging
----------

Logging metrics and artifacts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The logger provides five methods for logging and loading objects:

- :samp:`log_metrics`: For logging dictionaries of scalars to a JSON file. This method can be used to log the loss and other scalar quantities that evolve during the run.
- :samp:`log_artifacts`: For logging more complex objects such as the weights of a network. This method requires passing the desired artifact format (e.g. pickle, image, torch checkpoint) and the name of the artifact.
- :samp:`load_artifacts`: For loading previously logged artifacts.
- :samp:`log_checkpoint`: A simpler method for logging serializable objects as a pickle file.
- :samp:`load_checkpoint`: A method for loading a saved checkpoint.

Logging metrics
"""""""""""""""

In the :samp:`main.py` file, we have added a new line to log the loss at each epoch using the method :samp:`log_metrics`. This method takes a dictionary of scalars as input, as well as the name of the JSON file where it will be stored:

.. code-block:: python
   :caption: main.py

   ...
   logger.log_metrics({'loss': train_err.item(),
                       'epoch': epoch},
                      log_name='train')
   ...

.. note:: Several dictionaries can be stored successively in the same JSON file even if they do not have the same keys.

.. warning:: The keys of the dictionary must be consistent across runs and within each JSON file.

Logging artifacts
"""""""""""""""""

We also added a line to log the model's weights using the method :samp:`log_checkpoint`. This method logs serializable objects as pickle files; the name of the pickle file is provided as an argument:

.. code-block:: python
   :caption: main.py

   ...
   logger.log_checkpoint({'model': model, 'epoch': epoch},
                         log_name='last_ckpt')
   ...

For more general artifacts, you can use the method :samp:`log_artifacts`, which takes the artifact format and the name of the artifact as arguments. For instance, below we log the model's weights as a torch checkpoint:

.. code-block:: python
   :caption: main.py

   ...
   logger.log_artifacts({'model': model, 'epoch': epoch},
                        artifact_name='last_ckpt',
                        artifact_format='torch')
   ...

The method :samp:`log_artifacts` natively supports the following types: :samp:`pickle`, :samp:`torch`, :samp:`image` and :samp:`numpy`.

Registering custom artifacts
""""""""""""""""""""""""""""

If artifacts of an unsupported type need to be logged, the user can register custom artifact types. This is done using the method :samp:`register_artifact_type`, which takes three arguments: the name of the artifact type, the method for saving the artifact, and the method for loading the artifact:

.. code-block:: python
   :caption: main.py

   ...
   def save(obj, path):
       import pickle
       with open(path, 'wb') as f:
           pickle.dump(obj, f)

   def load(path):
       import pickle
       with open(path, 'rb') as f:
           return pickle.load(f)

   logger.register_artifact_type('my_pickle', save, load)
   ...
   logger.log_artifacts({'model': model, 'epoch': epoch},
                        artifact_name='last_ckpt',
                        artifact_format='my_pickle')

The method :samp:`register_artifact_type` must be called before :samp:`log_artifacts` is used with the new type.
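As another illustration, the same mechanism can register a human-readable JSON artifact type. This is only a sketch: the type name :samp:`json` and the helpers :samp:`save_json` and :samp:`load_json` are our own choices and not part of MLXP; only :samp:`register_artifact_type` and :samp:`log_artifacts` are used as described above:

.. code-block:: python

   ...
   def save_json(obj, path):
       # Serialize the object to a human-readable JSON file at the path chosen by the logger.
       import json
       with open(path, 'w') as f:
           json.dump(obj, f, indent=2)

   def load_json(path):
       # Read the JSON file back into a Python dictionary.
       import json
       with open(path, 'r') as f:
           return json.load(f)

   logger.register_artifact_type('json', save_json, load_json)
   ...
   logger.log_artifacts({'final_loss': train_err.item(), 'num_epoch': cfg.num_epoch},
                        artifact_name='summary',
                        artifact_format='json')

Once registered, :samp:`json` can be used like the built-in formats; the resulting file should appear under a matching sub-directory of :samp:`artifacts` in the run's log folder (see the file structure below).

File structure of the logs
^^^^^^^^^^^^^^^^^^^^^^^^^^

When the logger is activated, it stores the results of a run in a sub-directory of a parent directory. By default, the parent directory is :samp:`./logs` and is created automatically if it does not exist already; this behavior can be modified (see :ref:`Customizing the parent log directory <custom_log_dir>`).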
First, the logger assigns a :samp:`log_id` to the run: an integer starting from 1 that is incremented by 1 every time :samp:`main.py` is executed with an active logger. Then a new sub-directory of :samp:`./logs` is created and named after the assigned :samp:`log_id`. Since we executed the code three times in total, we should expect three sub-directories under :samp:`./logs` called :samp:`1`, :samp:`2` and :samp:`3`, all having the same structure:

.. code-block:: text
   :caption: ./logs/

   logs/
   ├── 1/...
   ├── 2/...
   └── 3/...

Each log directory contains three sub-directories: :samp:`metadata`, :samp:`metrics` and :samp:`artifacts`:

.. code-block:: text
   :caption: ./logs/

   logs/
   ├── 1/
   │   ├── metadata/
   │   │   ├── config.yaml
   │   │   ├── info.yaml
   │   │   └── mlxp.yaml
   │   ├── metrics/
   │   │   ├── train.json
   │   │   └── .keys/
   │   │       └── metrics.yaml
   │   └── artifacts/
   │       └── pickle/
   │           └── last_ckpt.pkl
   ├── 2/...
   └── 3/...

Let's go through these three directories.

The :samp:`metrics` directory
""""""""""""""""""""""""""""""

This directory contains the JSON files created when calling the logger's method :samp:`log_metrics(dict, log_name)`. Each file is named after the variable :samp:`log_name` and stores the dictionaries provided as input to the :samp:`log_metrics` method, one per line:

.. code-block:: json
   :caption: ./logs/1/metrics/train.json

   {"loss": 0.030253788456320763, "epoch": 0}
   {"loss": 0.02899891696870327, "epoch": 1}
   {"loss": 0.026649776846170425, "epoch": 2}
   {"loss": 0.023483652621507645, "epoch": 3}
   {"loss": 0.019827445968985558, "epoch": 4}
   {"loss": 0.01599641889333725, "epoch": 5}
   {"loss": 0.012259905226528645, "epoch": 6}
   {"loss": 0.008839688263833523, "epoch": 7}
   {"loss": 0.005932427477091551, "epoch": 8}
   {"loss": 0.003738593542948365, "epoch": 9}

The hidden directory :samp:`.keys` is used by the reader module of MLXP and is not something to worry about here. Instead, we inspect the remaining directories below.

The :samp:`metadata` directory
""""""""""""""""""""""""""""""""

The :samp:`metadata` directory contains the yaml files :samp:`config.yaml`, :samp:`info.yaml` and :samp:`mlxp.yaml`. The first two store the content of the corresponding fields of the context object :samp:`ctx`: :samp:`config` stores the user config of the run, while :samp:`info` stores general information about the run, such as the assigned :samp:`log_id` and the absolute path :samp:`log_dir` to the logs of the run.

.. code-block:: yaml
   :caption: ./logs/1/metadata/config.yaml

   seed: 0
   num_epoch: 10
   model:
     num_units: 100
   data:
     d_int: 10
     device: 'cpu'
   optimizer:
     lr: 10.

.. code-block:: yaml
   :caption: ./logs/1/metadata/info.yaml

   executable: absolute_path_to/bin/python
   cmd: ''
   end_date: 20/04/2023
   end_time: '16:01:13'
   current_file_path: absolute_path_to/main.py
   log_dir: absolute_path_to/logs/1
   log_id: 1
   process_id: 7100
   start_date: 20/04/2023
   start_time: '16:01:13'
   status: COMPLETE
   user: marbel
   work_dir: absolute_path_to/tutorial

The :samp:`artifacts` directory
""""""""""""""""""""""""""""""""

The directory :samp:`artifacts` is where all data passed to the logger's methods :samp:`log_artifacts` and :samp:`log_checkpoint` are stored, in sub-directories named after the artifact type. In this example, since we used the method :samp:`log_checkpoint`, the logged data are treated as pickle objects, hence the sub-directory :samp:`pickle`. It contains the pickle file :samp:`last_ckpt.pkl`, whose name is the one we provided when calling :samp:`log_checkpoint` in the :samp:`main.py` file.
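Since each call to :samp:`log_metrics` appends one dictionary per line of the JSON file, the logged metrics can also be inspected manually with a few lines of Python. The sketch below is only for quick inspection (the reader module of MLXP is the intended way to query logs) and assumes the default :samp:`./logs` parent directory:

.. code-block:: python

   import json

   # Path to the metrics of run 1; adjust if the parent log directory was customized.
   metrics_file = './logs/1/metrics/train.json'

   # Each line of the file holds one dictionary passed to log_metrics.
   with open(metrics_file, 'r') as f:
       records = [json.loads(line) for line in f if line.strip()]

   losses = [record['loss'] for record in records]
   print(f"{len(records)} entries logged, final loss: {losses[-1]:.6f}")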
Checkpointing
^^^^^^^^^^^^^

Checkpointing is particularly useful if you need to restart a job from its latest state without re-running it from scratch. To do this, you only need to slightly modify the function :samp:`train` so that it loads the latest checkpoint by default:

.. code-block:: python
   :caption: main.py

   import torch
   from core import DataLoader, OneHiddenLayer
   import mlxp


   @mlxp.launch(config_path='./configs')
   def train(ctx: mlxp.Context) -> None:

       cfg = ctx.config
       logger = ctx.logger

       # Try loading from the checkpoint
       try:
           checkpoint = logger.load_checkpoint()
           start_epoch = checkpoint['epoch'] + 1
           model = checkpoint['model']
       except Exception:
           # No checkpoint yet: start from scratch with a freshly initialized model.
           start_epoch = 0
           model = OneHiddenLayer(d_int=cfg.data.d_int, n_units=cfg.model.num_units)

       model = model.to(cfg.data.device)
       optimizer = torch.optim.SGD(model.parameters(), lr=cfg.optimizer.lr)
       dataloader = DataLoader(cfg.data.d_int, cfg.data.device)

       # Training
       print(f"Starting from epoch: {start_epoch}")
       for epoch in range(start_epoch, cfg.num_epoch):
           train_err = train_epoch(dataloader, model, optimizer)

           logger.log_metrics({'loss': train_err.item(),
                               'epoch': epoch}, log_name='train')
           logger.log_checkpoint({'model': model, 'epoch': epoch},
                                 log_name='last_ckpt')

       print(f"Completed training with learning rate: {cfg.optimizer.lr}")


   if __name__ == "__main__":
       train()

Of course, if you execute :samp:`main.py` without further options, the logger creates a new :samp:`log_id` under which there is no checkpoint yet, so it cannot resume a previous job. Instead, you need to force the :samp:`log_id` using the option :samp:`+mlxp.logger.forced_log_id`:

.. code-block:: console

   $ python main.py +mlxp.logger.forced_log_id=1

   Starting from epoch: 10
   Completed training with learning rate: 1e-3

.. _custom_log_dir:

Customizing the parent log directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can change the parent directory by overriding the option :samp:`+mlxp.logger.parent_log_dir` from the command-line:

.. code-block:: console

   $ python main.py +mlxp.logger.parent_log_dir='./new_logs'

Alternatively, the parent directory can be modified directly in the MLXP default settings file :samp:`configs/mlxp.yaml`. This file is created automatically if it doesn't exist already and contains all the default options for using MLXP in the current project:

.. code-block:: yaml
   :caption: ./configs/mlxp.yaml

   logger:
     ...
     parent_log_dir: ./logs
     ...
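Both overrides can be combined on the same command line. For instance, assuming a run with :samp:`log_id` 1 already exists under :samp:`./new_logs`, the following command should resume it from its last checkpoint while keeping the custom parent directory (a sketch combining the two options introduced above):

.. code-block:: console

   $ python main.py +mlxp.logger.parent_log_dir='./new_logs' +mlxp.logger.forced_log_id=1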