1- Launching¶
We will see how to modify the file main.py to use MLXP through the decorator mlxp.launch. But first, let's introduce the mlxp.Context object produced by the decorator mlxp.launch, which gives access to MLXP's logging and configuration functionalities.
The Context object¶
MLXP uses an object ctx of the class mlxp.Context that is created on the fly during the execution of the program to store information about the run.
More precisely, it contains 4 fields:

- ctx.config: Stores project-specific options provided by the user. These options are loaded from a YAML file called config.yaml located in a directory config_path provided as input to the decorator mlxp.launch.
- ctx.mlxp: Stores MLXP's default settings for the project. Its content is loaded from a YAML file mlxp.yaml located in the same directory config_path.
- ctx.info: Contains information about the current run: e.g. status, start time, hostname, etc.
- ctx.logger: A logger object that can be used in the code for logging variables (metrics, checkpoints, artifacts). When logging is enabled, these variables are all stored in a uniquely defined directory.
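To make this concrete, here is a minimal sketch of a decorated function that inspects these four fields (the function name inspect_context is ours, and the exact keys inside ctx.info and ctx.mlxp may vary across MLXP versions):

import mlxp

@mlxp.launch(config_path='./configs')
def inspect_context(ctx: mlxp.Context) -> None:
    print(ctx.config)  # user options, loaded from ./configs/config.yaml
    print(ctx.mlxp)    # MLXP's default settings, loaded from ./configs/mlxp.yaml
    print(ctx.info)    # run metadata: status, start time, hostname, etc.
    print(ctx.logger)  # logger object for metrics, checkpoints and artifacts

if __name__ == "__main__":
    inspect_context()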
General setup¶
Defining a default config file¶
The first step is to provide all default options that will be used by the code in a separate YAML file named config.yaml contained in the ./configs directory.
seed: 0
num_epoch: 10
model:
  num_units: 100
data:
  d_int: 10
  device: 'cpu'
optimizer:
  lr: 10.
Here, we stored all options that were provided as input to the function train in the main.py file (such as the learning rate lr, the number of epochs num_epoch, etc.) into a structured YAML file. The user has the freedom to define their own structure: for instance, here we chose to group the input dimension d_int and device into the same data group, but other (possibly better) choices are possible.
MLXP will load this file by default, just like in hydra, and provide these options as a hierarchical dictionary to be used in the code (more about this later!).
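Concretely, each level of indentation in config.yaml becomes one level of attribute access on ctx.config. A minimal illustration (the function name show_options is ours; the commented values restate the file above):

import mlxp

@mlxp.launch(config_path='./configs')
def show_options(ctx: mlxp.Context) -> None:
    print(ctx.config.num_epoch)        # 10
    print(ctx.config.model.num_units)  # 100
    print(ctx.config.optimizer.lr)     # 10.0

if __name__ == "__main__":
    show_options()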
Adapting code for using MLXP¶
To use MLXP, we only need to slightly change the main.py file. The first step is to import MLXP and add the decorator mlxp.launch above the function train. We also need to change the signature of the function train so that it accepts an object ctx of type mlxp.Context as an argument instead of the individual variables. Note, however, that train is called later without explicitly passing any argument. The remaining modifications are:
- Using the option values stored in ctx.config as a replacement for the variables provided in the older version of the code (see the old main.py file).
- Using the logger ctx.logger to store the results of the run (instead of printing them) and to save checkpoints.
Here is what the code looks like:
import torch

from core import DataLoader, OneHiddenLayer, train_epoch
import mlxp


@mlxp.launch(config_path='./configs')
def train(ctx: mlxp.Context) -> None:

    cfg = ctx.config
    logger = ctx.logger

    start_epoch = 0

    # Building model, optimizer and data loader.
    model = OneHiddenLayer(d_int=cfg.data.d_int,
                           n_units=cfg.model.num_units)
    model = model.to(cfg.data.device)
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=cfg.optimizer.lr)
    dataloader = DataLoader(cfg.data.d_int,
                            cfg.data.device)

    # Training
    for epoch in range(start_epoch, cfg.num_epoch):
        train_err = train_epoch(dataloader,
                                model,
                                optimizer)
        logger.log_metrics({'loss': train_err.item(),
                            'epoch': epoch}, log_name='train')
        logger.log_checkpoint({'model': model,
                               'epoch': epoch}, log_name='last_ckpt')

    print(f"Completed training with learning rate: {cfg.optimizer.lr}")


if __name__ == "__main__":
    train()
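For this code to run, train_epoch must be importable from the project's core module alongside DataLoader and OneHiddenLayer. Here is a minimal sketch of what such a function could look like; the squared loss and the loop details are our assumptions, not necessarily the tutorial's exact code:

import torch

def train_epoch(dataloader, model, optimizer):
    # One pass over the data; returns the average training error as a
    # tensor (the calling code extracts a float with .item()).
    total_err, n_batches = 0., 0
    for x, y in dataloader:
        optimizer.zero_grad()
        err = torch.nn.functional.mse_loss(model(x), y)  # assumed loss
        err.backward()
        optimizer.step()
        total_err += err.detach()
        n_batches += 1
    return total_err / n_batches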
Seeding code using MLXP¶
In our example, the initialization of the model uses random initial parameters, which might change from one run to another. To avoid this, the user can provide a function seeding_function to the mlxp.launch decorator to set the global seeds of whatever random number generator is used.
import mlxp
from core import DataLoader, Network, Optimizer, Loss


def seeding_function(seed):
    import torch
    torch.manual_seed(seed)


@mlxp.launch(config_path='./configs',
             seeding_function=seeding_function)
def train(ctx: mlxp.Context) -> None:

    cfg = ctx.config
    logger = ctx.logger

    ...


if __name__ == "__main__":
    train()
The function seeding_function will be called by MLXP before executing the function train. The parameter seed is read from the user-defined option ctx.config.seed. If the field seed is not provided by the user while a seeding function is passed, then the code throws an error.
Note that the field seed passed to the seeding_function can be an integer, a dictionary, or any object that can be stored in a YAML file. Of course, it is also possible to perform seeding inside the function train, but seeding_function allows you to do it systematically.
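Since the field seed can be a dictionary, one can seed several generators at once. A minimal sketch, assuming a config.yaml entry of the form seed: {torch: 0, numpy: 0} (this structure is our own choice, not one imposed by MLXP):

def seeding_function(seed):
    # 'seed' is read from ctx.config.seed; here it is assumed to be a
    # dictionary with the keys 'torch' and 'numpy'.
    import numpy as np
    import torch
    torch.manual_seed(seed['torch'])
    np.random.seed(seed['numpy'])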
Launching locally using MLXP¶
During execution, the default configurations will be read from the file config.yaml located in the directory ./configs and passed to the object ctx.config. The code will be executed using these options:
$ python main.py
Completed training with learning rate: 10.0
Just like with hydra, we can run the code again with different options by overriding the default ones from the command line. For instance, we can use different learning rates and even select multiple values for them (say: 1e-2 and 1e-1). We can do this from the command line by providing multiple values (0.01,0.1) to the option optimizer.lr:
$ python main.py optimizer.lr=0.01,0.1
Completed training with learning rate: 0.01
Completed training with learning rate: 0.1
In the above instruction, we added an option optimizer.lr=0.01,0.1, which executes the code twice: once using a learning rate of 0.01 and a second time using 0.1.
Launching jobs to a scheduler¶
If you have access to an HPC cluster, then you probably use a job scheduler for submitting jobs. MLXP allows you to combine the ‘multirun’ capabilities of hydra with job scheduling to easily submit multiple experiments to a cluster. Currently, MLXP supports the following job schedulers: SLURM, OAR, TORQUE, SGE, MWM and LSF.
Submitting jobs to a job scheduler¶
Let's say you'd like to submit multiple jobs to a job scheduler. You can do this easily using the mlxpsub command!
The first step is to create a script (e.g. script.sh) in your working directory (here under my_project/). In this script, you can define the resources allocated to your jobs, using the syntax of your job scheduler, as well as the python command for executing your main python script. You can then pass different option values to your python script main.py as discussed earlier in the launching tutorial:
#!/bin/bash
#OAR -l core=1, walltime=6:00:00
#OAR -t besteffort
#OAR -t idempotent
python main.py optimizer.lr=10.,1. seed=1,2
python main.py model.num_units=100,200 seed=1,2
The above script is meant to create and execute 8 jobs in total that will be submitted to an OAR job scheduler. The first 4 jobs correspond to the first python command using all possible combinations of option values for optimizer.lr and seed: (10.,1), (10.,2), (1.,1), (1.,2). The next 4 jobs are for the second command, which varies the options model.num_units and seed.
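To make the cross-product explicit, here is how the four option settings of the first command can be enumerated in Python (purely illustrative; hydra and MLXP perform this expansion for you):

from itertools import product

lrs = [10., 1.]
seeds = [1, 2]
for lr, seed in product(lrs, seeds):
    print(f"python main.py optimizer.lr={lr} seed={seed}")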
You only need to run the following command in the terminal:

$ mlxpsub script.sh
What happens under the hood?¶
Here is what happens:

- The mlxpsub command parses the script to extract the scheduler's instructions, figures out which scheduler is used, and provides this information as a context prior to executing the script.
- hydra performs a cross-product of the options provided and creates as many jobs as needed.
- MLXP creates a separate directory for each one of these jobs. Each directory is assigned a unique log_id and contains a script to be submitted.
- All generated scripts are submitted to the job scheduler.
What should you expect?¶
MLXP creates a script for each job corresponding to an option setting. Each script is located in a directory of the form parent_log/log_id, where log_id is automatically assigned by MLXP for each job. Here is an example of the first created script in logs/1/script.sh, where the user sets parent_log to logs.
#!/bin/bash
#OAR -n logs/1
#OAR -E /root/logs/1/log.stderr
#OAR -O /root/logs/1/log.stdout
#OAR -l core=1, walltime=6:00:00
#OAR -t besteffort
#OAR -t idempotent
cd /root/workdir/
python main.py optimizer.lr=10. seed=1
As you can see, MLXP automatically assigns values for the job's name and for the stdout and stderr file paths, so there is no need to specify those in the original script script.sh. These scripts contain the same scheduler options as in script.sh, in addition to a single python command specific to the option setting: optimizer.lr=10. seed=1. Additionally, MLXP pre-processes the python command to extract the working directory and sets it explicitly in the newly created script before the python command.
Once the job finishes execution, we can double-check that everything went well by inspecting the directory logs/1/, which should contain the usual logs as well as two additional files log.stdout and log.stderr:
logs/
├── 1/
│ ├── metadata/
│ │ ├── config.yaml
│ │ ├── info.yaml
│ │ └── mlxp.yaml
│ ├── metrics/
│ │ ├── train.json
│ │ └── .keys/
│ │ └── metrics.yaml
│ ├── artifacts/
│ │ └── Checkpoint/
│ │ └── last_ckpt.pkl
│ ├── log.stderr
│ ├── log.stdout
│ └── script.sh
│
├──...