3- Reading¶
We have already stored information about 3 runs so far. We can access this information easily using MLXP’s reader module, which allows querying results, grouping, and aggregating them. Let’s do this interactively!
Creating a result database¶
We first start by creating a reader object that interacts with the logs of multiple runs contained in the same parent directory (here ./logs/):
In [1]: import mlxp
In [2]: # Creates a database of results stored by the logger that is accessible using a reader object.
...: parent_log_dir = './logs/'
...: reader = mlxp.Reader(parent_log_dir)
Under the hood, the reader object creates a JSON file database.json in the directory parent_log_dir and stores metadata about all runs contained in that directory.
logs/
├── 1/...
├── 2/...
├── 3/...
└── database.json
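Since database.json is just a file on disk, you can also peek at it directly. The sketch below assumes only that the file parses as standard JSON; its exact schema is an MLXP internal:

import json

# Inspect the reader's database file directly (its schema is an
# MLXP internal; we only assume it parses as standard JSON).
with open('./logs/database.json') as f:
    db = json.load(f)
print(type(db))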
This database allows, for instance, obtaining general information about the runs contained in the log directory parent_log_dir, such as the number of runs or the list of fields stored in the various files of the log directories (e.g. in config.yaml, info.yaml, or metrics/):
In [3]: # Displaying the number of runs accessible to the reader
...: len(reader)
Out[3]: 3
In [4]: # Displaying all fields accessible in the database.
...: print(reader.fields)
Out[4]:
Type
Fields
config.data.d_int '<class 'int'>'
config.data.device '<class 'str'>'
config.model.num_units '<class 'int'>'
config.num_epoch '<class 'int'>'
config.optimizer.lr '<class 'float'>'
config.seed '<class 'int'>'
info.app '<class 'str'>'
info.cmd '<class 'str'>'
info.end_date '<class 'str'>'
info.end_time '<class 'str'>'
info.exec '<class 'str'>'
info.hostname '<class 'str'>'
info.log_dir '<class 'str'>'
info.log_id '<class 'int'>'
info.process_id '<class 'int'>'
info.start_date '<class 'str'>'
info.start_time '<class 'str'>'
info.status '<class 'str'>'
info.user '<class 'str'>'
info.work_dir '<class 'str'>'
train.epoch 'LAZYDATA'
train.loss 'LAZYDATA'
For instance, the attribute fields displays a table of the existing fields along with their types. You can see that all the user config options are preceded by the prefix config. The table also contains all fields stored in the file info.yaml of each run's metadata directory. Finally, all keys stored by the logger when calling the method log_metrics are also available. Note that these keys are of type LAZYDATA, meaning that the database does not store the data itself but only a reference to it (more on this later).
Querying the database¶
Once the database is created, the reader object allows filtering it by the values taken by some of its fields. Not every field makes a valid query: only those listed under the attribute searchable are accepted:
In [5]: # Displaying the searchable fields (these start with info or config)
...: print(reader.searchable)
Out[5]:
Type
Fields
config.data.d_int '<class 'int'>'
config.data.device '<class 'str'>'
config.model.num_units '<class 'int'>'
config.num_epoch '<class 'int'>'
config.optimizer.lr '<class 'float'>'
config.seed '<class 'int'>'
info.executable '<class 'str'>'
info.cmd '<class 'str'>'
info.end_date '<class 'str'>'
info.end_time '<class 'str'>'
info.current_file_path '<class 'str'>'
info.hostname '<class 'str'>'
info.log_dir '<class 'str'>'
info.log_id '<class 'int'>'
info.process_id '<class 'int'>'
info.start_date '<class 'str'>'
info.start_time '<class 'str'>'
info.status '<class 'str'>'
info.user '<class 'str'>'
info.work_dir '<class 'str'>'
The searchable fields must start with the prefix info. or config., indicating that they correspond to keys in the files config.yaml and info.yaml of each run's metadata directory. Let's make a simple query and use the filter method:
In [6]: # Searching using a query string
...: query = "info.status == 'COMPLETE' & config.optimizer.lr <= 0.1"
...: results = reader.filter(query_string=query, result_format="pandas")
In [7]: # Display the result as a pandas dataframe
...: results
Out[7]:
config.data.d_int ... train.loss
0 10 ... [0.030253788456320763, 0.03025251068174839, 0....
1 10 ... [0.030253788456320763, 0.03024102933704853, 0....
Here, we call the method filter with the option result_format set to pandas. This returns the result as a pandas dataframe whose rows correspond to the runs stored in parent_log_dir that match the query. If the query is an empty string, then all entries of the database are returned.
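For instance, the following sketch (reusing the same reader object) retrieves every run:

# An empty query string matches all runs in the database
all_results = reader.filter(query_string="", result_format="pandas")
print(len(all_results))  # one row per run, here: 3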
The dataframe's column names correspond to the fields contained in reader.fields. These names are constructed as follows:

- The dot-separated flattened keys of the hierarchical options contained in the YAML files config.yaml and info.yaml of each run's metadata directory, preceded by the prefixes config. and info. respectively.
- The keys of the dictionaries stored in the files contained in the metrics directories (here train.json), preceded by the file name (here: train.).
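For example, assuming config.yaml contains a nested option such as the following (shape inferred from the field names above):

optimizer:
  lr: 0.01

it shows up in the dataframe as the flattened column config.optimizer.lr.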
As you can see, the dataframe loads the content of all keys in the files train.json (contained in the metrics directory of each run), which might not be desirable if these files are large. This can be avoided using lazy evaluation, which we describe next.
Lazy evaluation¶
Instead of returning the result of the search as a pandas dataframe, which loads the entire content of the possibly large train.json files, we can return an mlxp.DataDictList object. This object can also be rendered as a dataframe, but it does not load the train.json files into memory unless the corresponding fields are explicitly accessed.
In [8]: # Returning a DataDictList as a result
...: results = reader.filter(query_string=query)
In [9]: # Display the result as a pandas dataframe
...: results
Out[9]:
config.data.d_int config.data.device ... train.epoch train.loss
0 10 cpu ... LAZYDATA LAZYDATA
1 10 cpu ... LAZYDATA LAZYDATA
[2 rows x 39 columns]
As you can see, the content of the columns train.epoch and train.loss is simply marked as LAZYDATA, meaning that it is not loaded yet. If we try to access a specific column (e.g. train.loss), DataDictList will automatically load the desired result:
In [10]: # Access a particular column of the results
...: results[0]['train.loss']
Out[10]:
[0.030253788456320763, 0.03025251068174839, 0.030249962583184242, 0.030246131122112274, 0.03024103306233883, 0.030234655365347862, 0.03022700361907482, 0.030218079686164856, 0.030207885429263115, 0.030196424573659897]
The object results should be viewed as a list of dictionaries. Each element of the list corresponds to a particular run in the parent_log_dir directory, and the keys of each dictionary are the columns of the dataframe. Finally, it is always possible to convert a DataDictList object to a pandas dataframe using the method toPandasDF.
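For instance, converting the lazy results above is a one-liner:

# Convert the lazy DataDictList into a regular pandas dataframe
df = results.toPandasDF()
print(df.shape)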
Grouping and aggregation¶
While it is possible to directly convert the results of a query to a pandas dataframe which supports grouping and aggregation operations, MLXP also provides basic support for these operations. Let’s see how this works:
In [11]: # List of group keys.
...: group_keys = ['config.optimizer.lr']
In [12]: # Grouping the results
...: grouped_results = results.groupBy(group_keys)
...: print(grouped_results)
Out[12]:
config.data.d_int config.data.device ... train.epoch train.loss
config.optimizer.lr ...
0.01 10 cpu ... LAZYDATA LAZYDATA
0.10 10 cpu ... LAZYDATA LAZYDATA
[2 rows x 38 columns]
The output is an object of type GroupedDataDicts. It can be viewed as a dictionary whose keys are given by the different values taken by the group variables. Here, the group variable is the learning rate config.optimizer.lr, which takes the values 0.01 and 0.10; hence, the keys of GroupedDataDicts are 0.01 and 0.10. Each group (for instance, the group with key 0.01) is a DataDictList object containing the different runs belonging to that group.
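Since GroupedDataDicts can be viewed as a dictionary, one might iterate over its groups as sketched below. This assumes it exposes dict-style keys() and indexing, which its description suggests but this tutorial does not show explicitly:

# Hypothetical dict-style iteration over the groups (assumes
# GroupedDataDicts supports keys() and [] access like a dict).
for key in grouped_results.keys():
    group = grouped_results[key]  # a DataDictList with this group's runs
    print(key, len(group))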
Finally, we can aggregate these groups according to some aggregation operations:
In [13]: # Creating the aggregation maps
...: from mlxp.data_structures.contrib.aggregation_maps import AvgStd
...: agg_maps = [AvgStd('train.loss'), AvgStd('train.epoch')]
In [14]: # Aggregating the results
...: agg_results = grouped_results.aggregate(agg_maps)
...: print(agg_results)
Out[14]:
train.loss_avg ... config.optimizer.lr
0 [0.030253788456320763, 0.03024102933704853, 0.... ... 0.1
1 [0.030253788456320763, 0.03025251068174839, 0.... ... 0.01
[2 rows x 3 columns]
Here, we compute the average and standard deviation of the field train.loss, which contains a list of loss values. The loss values are averaged per group, and the result is returned as a DataDictList object whose columns consist of:

- The resulting fields: train.loss_avg and train.loss_std.
- The original group key: config.optimizer.lr.
Of course, one can always convert these structures to a pandas dataframe at any time!
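For instance, the aggregated results convert with the same toPandasDF method used earlier:

# The aggregated DataDictList converts to pandas like any other
agg_df = agg_results.toPandasDF()
print(agg_df.columns)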