Data structures

These are the data structures provided by MLXP to handle configuration options and data.

The schemas Classes

Structures for validating the configurations.

class mlxp.data_structures.schemas.ConfigVersionManager(name: str = '???')[source]

Bases: object

Structure of the config file for the version manager.

name: str

Name of the version manager’s class.

class mlxp.data_structures.schemas.ConfigGitVM(name: str = 'mlxp.GitVM', parent_work_dir: str = './.work_dir', compute_requirements: bool = False)[source]

Bases: ConfigVersionManager

Configs for using the GitVM version manager.

It inherits the structure of the class ConfigVersionManager.

name: str

Name of the version manager’s class.

parent_work_dir: str

The target parent directory of the new working directory returned by the version manager.

compute_requirements: bool

When set to true, the version manager stores a list of requirements and their versions.

class mlxp.data_structures.schemas.ConfigLogger(name: str = 'mlxp.DefaultLogger', parent_log_dir: str = './logs', forced_log_id: int = -1, log_streams_to_file: bool = False)[source]

Bases: object

Structure of the config file for the logs.

The outputs for each run are saved in a directory of the form ‘parent_log_dir/log_id’ which is stored in the variable ‘path’ during execution.

name: str

Class name of the logger to use. (default “mlxp.DefaultLogger”)

parent_log_dir: str

Absolute path of the parent directory where the logs of a run are stored. (default “./logs”)

forced_log_id: int

An id optionally provided by the user for the run. If forced_log_id is positive, then the logs of the run will be stored under ‘parent_log_dir/forced_log_id’. Otherwise, the logs will be stored in a directory ‘parent_log_dir/log_id’ where ‘log_id’ is assigned uniquely for the run during execution.

log_streams_to_file: bool

If true, logs the system stdout and stderr of a run to files named “log.stdout” and “log.stderr” in the log directory.
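The rule for choosing the log directory can be sketched as follows. This is a minimal illustration, not MLXP's implementation: LoggerSettings is a stand-in dataclass that only mirrors the ConfigLogger fields, and resolve_log_dir and its auto_log_id argument are hypothetical names standing for the id that MLXP assigns during execution.

```python
import os
from dataclasses import dataclass


@dataclass
class LoggerSettings:
    """Illustrative stand-in mirroring the ConfigLogger fields (not MLXP's class)."""
    name: str = "mlxp.DefaultLogger"
    parent_log_dir: str = "./logs"
    forced_log_id: int = -1
    log_streams_to_file: bool = False


def resolve_log_dir(settings: LoggerSettings, auto_log_id: int) -> str:
    """Hypothetical helper: use forced_log_id when positive, else the assigned id."""
    log_id = settings.forced_log_id if settings.forced_log_id > 0 else auto_log_id
    return os.path.join(settings.parent_log_dir, str(log_id))


forced_dir = resolve_log_dir(LoggerSettings(forced_log_id=7), auto_log_id=3)
auto_dir = resolve_log_dir(LoggerSettings(), auto_log_id=3)
```

With the default forced_log_id of -1, the automatically assigned id is used; a positive forced_log_id overrides it.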

class mlxp.data_structures.schemas.Info(status: str = 'STARTING', current_file_path: str = '', executable: str = '', hostname: str = '', process_id: int = -1, start_date: Any = '', start_time: Any = '', end_date: Any = '', end_time: Any = '', work_dir: str = '/tmp/tmpork1so9r/a96cd66dd690399be0478e073810d9a43d58851c/docs', logger: Any | None = None, scheduler: Any | None = None, version_manager: Any | None = None)[source]

Bases: object

A structure storing general information about the run.

The following variables are assigned during execution.

status: str

Status of a job. The status can take the following values:

  • STARTING: The metadata for the run have been created.

  • RUNNING: The experiment is currently running.

  • COMPLETE: The run is complete and did not throw any error.

  • FAILED: The run stopped due to an error.

current_file_path: str

Name of the python file being executed.

executable: str

Path to the python executable used for executing the code.

hostname: str

Name of the host from which code is executed.

process_id: int

Id of the process assigned to the job during execution.

start_date: Any

Date at which job started.

start_time: Any

Time at which job started.

end_date: Any

Date at which job ended.

end_time: Any

Time at which job ended.

logger: Any

Logger info, whenever used.

scheduler: Any

Scheduler info, whenever used.

version_manager: Any

Version manager info, whenever used.

class mlxp.data_structures.schemas.MLXPConfig(logger: ~mlxp.data_structures.schemas.ConfigLogger = <factory>, version_manager: ~mlxp.data_structures.schemas.ConfigVersionManager = <factory>, use_version_manager: bool = False, use_scheduler: bool = False, use_logger: bool = True, interactive_mode: bool = True, resolve: bool = True, as_ConfigDict: bool = True)[source]

Bases: object

Default settings of MLXP.

logger: ConfigLogger

The logger’s settings. (default ConfigLogger)

version_manager: ConfigVersionManager

The version_manager’s settings. (default ConfigGitVM)

use_version_manager: bool

If true, uses the version manager. (default False)

use_scheduler: bool

If true, uses the scheduler. (default False)

use_logger: bool

If true, uses the logger. (default True)

interactive_mode: bool

A variable controlling MLXP’s interactive mode.

  1. If ‘interactive_mode==True’, MLXP uses the interactive mode whenever applicable:

    • When ‘use_version_manager==True’: Asks the user:

      • If untracked files should be added.

      • If uncommitted changes should be committed.

      • If a copy of the current repository based on the latest commit should be made (if not already existing) to execute the code from there. Otherwise, code is executed from the current directory.

  2. If ‘interactive_mode==False’, MLXP does not prompt the user and applies the following defaults:

    • When ‘use_version_manager==True’:

      • Existing untracked files or uncommitted changes are ignored.

      • A copy of the code is made based on the latest commit (if not already existing) and code is executed from there.

resolve: bool

If true, resolves the configurations prior to starting the job. (default True)

as_ConfigDict: bool

If true, converts the configurations from an omegaconf.dictconfig.DictConfig object to the custom mlxp.ConfigDict object. Once converted, the object becomes mutable and all its values are resolved. (default True)

class mlxp.data_structures.schemas.Metadata(info: ~mlxp.data_structures.schemas.Info = <factory>, mlxp: ~mlxp.data_structures.schemas.MLXPConfig = <factory>, config: ~typing.Any | None = None)[source]

Bases: object

The structure of the config file.

info: Info

Contains config information of the run (hostname, command, application, etc) (default Info)

mlxp: MLXPConfig

Default settings of MLXP. (default MLXPConfig)

config: Any

Contains the user’s defined configs that are specific to the run.

The config_dict Classes

A dictionary-like structure for storing the configurations.

class mlxp.data_structures.config_dict.ConfigDict(*args, **kwargs)[source]

Bases: dict

A subclass of the dict class containing the configuration options.

The value corresponding to a key can be accessed as an attribute: self.key

to_dict() Dict[str, Any][source]

Convert the object into a simple dictionary.

Returns:

A dictionary containing the same information as self

Return type:

Dict[str,Any]

update(new_dict: Dict[str, Any]) None[source]

Update the dictionary based on an input dictionary-like object.

Parameters:

new_dict (Dict[str, Any]) – Dictionary-like object.
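The attribute-style access described for ConfigDict can be illustrated with a small dict subclass. The AttrDict class below is a hypothetical stand-in written for this example; it is not MLXP's ConfigDict and may differ from it in details such as how nested values are handled.

```python
from typing import Any, Dict


class AttrDict(dict):
    """Illustrative dict with attribute access (not MLXP's ConfigDict)."""

    def __getattr__(self, key: str) -> Any:
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[key]
        except KeyError as err:
            raise AttributeError(key) from err

    def __setattr__(self, key: str, value: Any) -> None:
        self[key] = value

    def to_dict(self) -> Dict[str, Any]:
        # Recursively convert nested AttrDict values back to plain dicts.
        return {k: v.to_dict() if isinstance(v, AttrDict) else v for k, v in self.items()}


cfg = AttrDict(optimizer=AttrDict(lr=0.1), seed=0)
cfg.seed = 1              # attribute assignment writes the key
plain = cfg.to_dict()     # plain nested dictionaries
```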

mlxp.data_structures.config_dict.convert_dict(src_dict: ~typing.Any, src_class: ~typing.Type = <class 'omegaconf.dictconfig.DictConfig'>, dst_class: ~typing.Type = <class 'mlxp.data_structures.config_dict.ConfigDict'>) Any[source]

Convert a dictionary-like object from a source class to a dictionary-like object of a destination class.

Parameters:
  • src_dict (Any) – The source dictionary to be converted.

  • src_class (Type) – The type of the source dictionary.

  • dst_class (Type) – The destination type of the returned dictionary-like object.

Returns:

A dictionary-like instance of the dst_class copying the data from the src_dict.

Return type:

Any
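A recursive conversion of this kind can be sketched with standard-library types only. The function convert_mapping is a hypothetical illustration; it uses OrderedDict and dict in place of DictConfig and ConfigDict, and MLXP's convert_dict may handle nesting or lists differently.

```python
from collections import OrderedDict
from typing import Any, Type


def convert_mapping(src: Any, src_class: Type = OrderedDict, dst_class: Type = dict) -> Any:
    """Hypothetical recursive converter between two dictionary-like classes."""
    if isinstance(src, src_class):
        # Rebuild the container in the destination class, converting nested values.
        return dst_class({k: convert_mapping(v, src_class, dst_class) for k, v in src.items()})
    if isinstance(src, list):
        return [convert_mapping(v, src_class, dst_class) for v in src]
    return src


nested = OrderedDict(model=OrderedDict(depth=3), seeds=[OrderedDict(value=1)])
converted = convert_mapping(nested)
```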

The Artifact Class

Artifact objects that can be saved by a Logger object.

The dataframe Module

Data structures returned by a Reader object.

class mlxp.data_structures.dataframe.DataDict(flattened_dict, parent_dir=None)[source]

Bases: Mapping

A dictionary of key-value pairs where some values are loaded lazily from a specific path whenever they are accessed.

keys()[source]

Return keys of the dictionary.

items()[source]

Return items of the dictionary.

update(new_dict)[source]

Update the dictionary with values from another dictionary.
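The lazy-loading idea behind DataDict can be sketched as a Mapping whose values are produced on first access. LazyDict is a hypothetical class for illustration only: it takes loader callables instead of file paths, and it does not reproduce how MLXP locates or reads files.

```python
from collections.abc import Mapping
from typing import Any, Callable, Dict, Iterator


class LazyDict(Mapping):
    """Illustrative lazy mapping (not MLXP's DataDict): loaders run on first access."""

    def __init__(self, eager: Dict[str, Any], loaders: Dict[str, Callable[[], Any]]):
        self._values = dict(eager)
        self._loaders = dict(loaders)

    def __getitem__(self, key: str) -> Any:
        if key not in self._values and key in self._loaders:
            # Load once, then cache the result for later accesses.
            self._values[key] = self._loaders[key]()
        return self._values[key]

    def __iter__(self) -> Iterator[str]:
        return iter({**self._values, **self._loaders})

    def __len__(self) -> int:
        return len({**self._values, **self._loaders})


calls = []
row = LazyDict({"config.lr": 0.1}, {"train.loss": lambda: calls.append("load") or [1.0, 0.5]})
keys_before = sorted(row.keys())   # listing keys does not trigger loading
loss = row["train.loss"]           # first access runs the loader
loss_again = row["train.loss"]     # second access uses the cached value
```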

class mlxp.data_structures.dataframe.DataFrame(iterable: List[DataDict])[source]

Bases: list

A list of elements of type DataDict.

This list can be viewed as a dataframe where each row represents a given entry of type DataDict and columns represent the keys of the DataDict objects. This structure allows some columns to be loaded lazily: the content of these columns is loaded from their corresponding file only when that column is explicitly accessed.

It is displayed as a pandas dataframe and can be converted to it using the method toPandas.

diff(start_key: str = 'config') List[str][source]

Return a list of column keys starting with ‘start_key’ and whose values vary in the dataframe.

Parameters:

start_key (str (default 'config')) – A string with which all column names to be considered must start.

Returns:

A list of strings containing the column names starting with ‘start_key’ and whose values vary in the dataframe.

Return type:

List[str]
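What diff computes can be illustrated on a plain list of dictionaries. The helper varying_columns is hypothetical and only mirrors the behaviour described above; it assumes every row has the same keys as the first one.

```python
from typing import Any, Dict, List


def varying_columns(rows: List[Dict[str, Any]], start_key: str = "config") -> List[str]:
    """Hypothetical helper: columns starting with start_key that take more than one value."""
    candidates = [k for k in rows[0] if k.startswith(start_key)] if rows else []
    # repr() lets unhashable values such as lists be compared through a set.
    return [k for k in candidates if len({repr(r.get(k)) for r in rows}) > 1]


runs = [
    {"config.lr": 0.1, "config.seed": 0, "info.status": "COMPLETE"},
    {"config.lr": 0.01, "config.seed": 0, "info.status": "FAILED"},
]
changed = varying_columns(runs)
```

Here only config.lr is reported: config.seed is constant across rows, and info.status does not start with ‘config’.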

toPandas(lazy: bool = True) DataFrame[source]

Convert the list into a pandas dataframe.

Parameters:

lazy – If true, the pandas dataframe does not contain the results of data loaded lazily.

Returns:

A pandas dataframe containing logs (configs and data) of the DataFrame object.

Return type:

pd.DataFrame

merge(new_df: DataFrame) DataFrame[source]

Merge a target dataframe to the current dataframe.

The target dataframe must have the same number of rows as the current one.

Params new_df:

The target dataframe to merge

Returns:

A new merged dataframe.

Return type:

DataFrame

keys() List[str][source]

Return a list of column names of the dataframe.

Returns:

List of strings containing the column names of the dataframe

Return type:

List[str]

groupby(group_keys: str | List[str]) GroupedDataFrame[source]

Perform a groupby operation on the dataframe according to a list of column names (group_keys).

Returns an object of the class GroupedDataFrame

Params group_keys:

A string or list of strings containing the names of the columns to be grouped.

Returns:

A dictionary of dataframes grouped by the values of the columns provided to group_keys. Each key of the dictionary is a tuple of values taken by the columns in group_keys.

Return type:

GroupedDataFrame

Raises:

InvalidKeyError – if one of the provided keys does not match any column of the dataframe.
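The grouping described above, where each group is keyed by the tuple of values taken by the group columns, can be sketched on plain dictionaries. The helper group_rows is hypothetical; it raises KeyError as a stand-in for InvalidKeyError, and as a simplification it leaves the group columns inside each row.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple, Union

Row = Dict[str, Any]


def group_rows(rows: List[Row], group_keys: Union[str, List[str]]) -> Dict[Tuple[Any, ...], List[Row]]:
    """Hypothetical helper: group rows by the tuple of values of group_keys."""
    keys = [group_keys] if isinstance(group_keys, str) else list(group_keys)
    missing = [k for k in keys if any(k not in r for r in rows)]
    if missing:
        # Stand-in for the InvalidKeyError mentioned above.
        raise KeyError(f"Unknown column(s): {missing}")
    groups: Dict[Tuple[Any, ...], List[Row]] = defaultdict(list)
    for r in rows:
        groups[tuple(r[k] for k in keys)].append(r)
    return dict(groups)


runs = [{"lr": 0.1, "seed": 0}, {"lr": 0.1, "seed": 1}, {"lr": 0.01, "seed": 0}]
by_lr = group_rows(runs, "lr")
by_lr_and_seed = group_rows(runs, ["lr", "seed"])
```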

aggregate(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]]) DataFrame[source]

Perform aggregation of columns of a dataframe according to some aggregation maps and return a new dataframe with a single row.

This function returns a DataFrame object with a single row containing the results of the aggregation maps.

Params maps:

Either an element of type Map or a list of elements of type Map. A Map is a tuple with signature Tuple[Callable, Tuple[str, …], Optional[Tuple[str, …]]]:

  • The first element is a Callable[[List[Any]], Union[Any,Tuple[Any,…]]] that must take a list of all values of a given column in the dataframe. It must reduce the list into a single element of arbitrary type, which is stored as the value of a single output column in a dataframe with a single row.

  • The second element of the Map tuple represents the list of columns in the dataframe on which the map is applied columnwise.

  • The third element represents the optional names of the output columns.

Returns:

A DataFrame object containing the result of the aggregation maps.

Return type:

DataFrame

Raises:

InvalidMapError – if the maps are not of type List[Map] or Map.
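The shape of an aggregation Map can be illustrated on a list of dictionaries. The helper aggregate_rows is a hypothetical sketch: in particular, the naming used when the third element is None is an assumption made for this example and may not match MLXP's default column names.

```python
from statistics import mean
from typing import Any, Callable, Dict, List, Optional, Tuple

Row = Dict[str, Any]
Map = Tuple[Callable, Tuple[str, ...], Optional[Tuple[str, ...]]]


def aggregate_rows(rows: List[Row], maps: List[Map]) -> List[Row]:
    """Hypothetical helper: reduce each listed column to one value, yielding a single row."""
    out: Row = {}
    for func, columns, names in maps:
        # Assumed default naming when the third element is None.
        names = names or tuple(f"{func.__name__}({c})" for c in columns)
        for column, name in zip(columns, names):
            out[name] = func([r[column] for r in rows])
    return [out]


runs = [{"loss": 0.5, "acc": 0.75}, {"loss": 0.25, "acc": 1.0}]
summary = aggregate_rows(runs, [(mean, ("loss",), ("mean_loss",)), (max, ("acc",), None)])
```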

apply(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]], map_type='Generic') DataFrame[source]

Applies a generic map or list of maps to a dataframe.

This function returns a DataFrame object containing the results of applying the maps to the dataframe.

Params maps:

Either an element of type Map or a list of elements of type Map. A Map is a tuple with signature Tuple[Callable, Tuple[str, …], Optional[Tuple[str, …]]]:

  • The first element of a Map is a callable to be applied to the dataframe.

  • The second element of the Map represents the list of columns in the dataframe to provide as input to the callable.

  • The third element represents the optional names of the output columns.

Params map_type:

Specifies the type of maps to be applied: ‘Generic’, ‘Columnwise’, ‘Rowwise’ or ‘Pointwise’:

  • ‘Pointwise’: The apply method is equivalent to the map method. It applies the maps pointwise to each value corresponding to a row and a selected column.

  • ‘Columnwise’: The apply method is equivalent to either the transform or the aggregate method. It applies the maps columnwise and expects the output either to preserve the number of rows of the initial dataframe (as in the transform method) or to reduce each column to a single value (as in the aggregate method).

  • ‘Rowwise’: Applies a map rowwise. The apply method returns a dataframe with the same number of rows as the initial one. The signature of the callable (the first element of the Map tuple) must be Callable[[Union[Any,Tuple[Any,…]]], Any]. It takes the values of some specific columns at a single row and returns an output for that row. The column names on which the map operates are provided as the second element of the Map tuple.

  • ‘Generic’: Extends the transform and aggregate methods to support operations that are not columnwise. The input to the callable (the first element of the Map tuple) must be either List[Any] or Tuple[List[Any],…]. The callable must have the same return type as the callables used in the transform or aggregate methods: either Union[Any,Tuple[Any,…]] or Union[List[Any],Tuple[List[Any],…]]. It takes lists of values of some specific columns, applies the map to them, and returns transformed outputs that can be either lists of values (as in the transform method) or single values (as in the aggregate method).

Returns:

A DataFrame object containing the result of the maps.

Return type:

DataFrame

Raises:

InvalidMapError – if the maps are not of type List[Map] or Map.
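Of the four map types, ‘Rowwise’ is the only one without a dedicated method, so a small sketch may help. The helper apply_rowwise is hypothetical and works on plain dictionaries. How the selected values are passed to the callable (a bare value for one column, a tuple for several) is an assumption based on the signature given above.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def apply_rowwise(rows: List[Row], func: Callable, columns: Tuple[str, ...], name: str) -> List[Row]:
    """Hypothetical helper: call func once per row on the values of the selected columns."""
    out = []
    for r in rows:
        values = tuple(r[c] for c in columns)
        # Assumption: a single column is passed as a bare value, several as a tuple.
        out.append({name: func(values[0] if len(values) == 1 else values)})
    return out


runs = [{"train": 0.5, "val": 0.75}, {"train": 0.25, "val": 0.5}]
gaps = apply_rowwise(runs, lambda v: v[1] - v[0], ("train", "val"), "gap")
```

The output has one row per input row, since each row is processed on its own.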

transform(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]]) DataFrame[source]

Applies a map columnwise to a dataframe while preserving the number of rows.

This function returns a DataFrame object containing the results of the transformation maps. The new dataframe has the same number of rows as the initial dataframe on which the transform is applied. This method extends the map method to support operations that are not pointwise and can depend on values from different rows of the same column.

Params maps:

Either an element of type Map or a list of elements of type Map. A Map is a tuple with signature Tuple[Callable, Tuple[str, …], Optional[Tuple[str, …]]]:

  • The first element is a Callable[[List[Any]], Union[List[Any],Tuple[List[Any],…]]] that must take a list of all values of a given column in the dataframe. It must return a list or a tuple of lists of elements of arbitrary types. The size of the returned lists must be the same as the input list. Each element of the returned lists corresponds to a transformation of the value at a given row and column of the original dataframe.

  • The second element of the Map tuple represents the list of columns in the dataframe on which the map is applied columnwise.

  • The third element represents the optional names of the output columns.

Returns:

A DataFrame object containing the result of the columnwise maps.

Return type:

DataFrame

Raises:

InvalidMapError – if the maps are not of type List[Map] or Map.
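A columnwise transform that depends on several rows, such as a running minimum, can be sketched as follows. The helper transform_rows is hypothetical and operates on plain dictionaries; reusing the input column names when the third element is None is an assumption of this example.

```python
from itertools import accumulate
from typing import Any, Callable, Dict, List, Optional, Tuple

Row = Dict[str, Any]
Map = Tuple[Callable, Tuple[str, ...], Optional[Tuple[str, ...]]]


def transform_rows(rows: List[Row], maps: List[Map]) -> List[Row]:
    """Hypothetical helper: apply each map to whole columns, keeping the number of rows."""
    out: List[Row] = [{} for _ in rows]
    for func, columns, names in maps:
        names = names or columns  # assumed default: reuse the input column names
        for column, name in zip(columns, names):
            new_values = func([r[column] for r in rows])
            if len(new_values) != len(rows):
                raise ValueError("A transform must preserve the number of rows.")
            for target, value in zip(out, new_values):
                target[name] = value
    return out


runs = [{"loss": 3.0}, {"loss": 1.0}, {"loss": 2.0}]
running_min = lambda xs: list(accumulate(xs, min))
best_so_far = transform_rows(runs, [(running_min, ("loss",), ("best_loss",))])
```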

map(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]]) DataFrame[source]

Applies a map pointwise to each value corresponding to specified columns of the dataframe.

This function returns a DataFrame object containing the results of the pointwise maps. The new dataframe has the same number of rows as the initial dataframe on which the map is applied. Each row is processed independently of the others.

Params maps:

Either an element of type Map or a list of elements of type Map. A Map is a tuple with signature Tuple[Callable, Tuple[str, …], Optional[Tuple[str, …]]]:

  • The first element is a Callable[[Any], Any] that must take a value corresponding to a given row and column in the dataframe and transform it.

  • The second element of the Map tuple represents the list of columns in the dataframe on which the map is applied pointwise.

  • The third element represents the optional names of the output columns.

Returns:

A DataFrame object containing the result of the pointwise map.

Return type:

DataFrame

Raises:

InvalidMapError – if the maps are not of type List[Map] or Map.
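In contrast with a columnwise transform, a pointwise map sees one value at a time. A minimal sketch on plain dictionaries, using the hypothetical helper map_rows and assuming that output columns keep the input names when the third element is None:

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Row = Dict[str, Any]
Map = Tuple[Callable, Tuple[str, ...], Optional[Tuple[str, ...]]]


def map_rows(rows: List[Row], maps: List[Map]) -> List[Row]:
    """Hypothetical helper: apply each callable to every single value of the listed columns."""
    out: List[Row] = [{} for _ in rows]
    for func, columns, names in maps:
        names = names or columns  # assumed default: reuse the input column names
        for column, name in zip(columns, names):
            for target, r in zip(out, rows):
                target[name] = func(r[column])
    return out


runs = [{"status": "complete"}, {"status": "failed"}]
upper = map_rows(runs, [(str.upper, ("status",), None)])
```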

filter(filter_map: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None], bygroups: None | str | List[str] = None) DataFrame[source]

Returns a new dataframe containing a subset of rows of the initial dataframe that pass a given filter.

Params filter_map:

An element of type Map. A Map is a tuple with signature Tuple[Callable, Tuple[str, …], Optional[Tuple[str, …]]]:

  • The first element of the filter map is a function with signature Callable[[Union[List[Any], Tuple[List[Any],…]]], List[Any]] that can take a list or a tuple of lists. Each input list contains all values of one of the columns of the dataframe listed in the second element of the Map tuple. The filter map must return a list of booleans of the same size as the input lists, each boolean value corresponding to the element of the input lists at the same position. Only rows of the dataframe for which the returned boolean value is true pass the filter.

  • The second element of the Map tuple represents the list of columns that the filter map takes as input.

  • The third element of the Map tuple is never used.

Params bygroups:

Optionally apply the filter by groups when bygroups is either a column name or list of column names by which the dataframe must be grouped. Once the filter is applied by group, the groups are merged together into a single ungrouped dataframe. This is equivalent to performing self.groupby(bygroups).filter(filter_map).ungroup()

Returns:

A DataFrame object containing a filtered version of the initial dataframe.

Return type:

DataFrame

Raises:

InvalidMapError – if the filter map is not of type Map.
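The boolean-mask convention of a filter map can be sketched on plain dictionaries. The helper filter_rows is hypothetical; it passes a single list when one column is selected and a tuple of lists otherwise, following the signature described above.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Row = Dict[str, Any]
Map = Tuple[Callable, Tuple[str, ...], Optional[Tuple[str, ...]]]


def filter_rows(rows: List[Row], filter_map: Map) -> List[Row]:
    """Hypothetical helper: keep the rows whose mask entry is true."""
    func, columns, _ = filter_map  # the third element is unused, as stated above
    inputs = [[r[c] for r in rows] for c in columns]
    mask = func(inputs[0]) if len(inputs) == 1 else func(tuple(inputs))
    if len(mask) != len(rows):
        raise ValueError("The filter must return one boolean per row.")
    return [r for r, keep in zip(rows, mask) if keep]


runs = [{"seed": 0, "loss": 0.5}, {"seed": 1, "loss": 0.25}, {"seed": 2, "loss": 0.75}]
is_best = lambda losses: [v == min(losses) for v in losses]
best = filter_rows(runs, (is_best, ("loss",), None))
```

Because the callable receives the whole column, the filter can compare each row against a statistic of all rows, here the minimum loss.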

sort(by: str | List[str], ascending: bool = True) DataFrame[source]

Returns a sorted dataframe according to a list of columns.

Params by:

Either a column name or a list of column names by which the dataframe must be sorted.

Params ascending:

Sorting either by increasing values (ascending=True) or decreasing values (ascending=False) of the specified columns.

Returns:

A sorted DataFrame object.

Return type:

DataFrame
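Sorting by one or several columns can be sketched with the built-in sorted function. The helper sort_rows is hypothetical and works on plain dictionaries; it compares rows by the tuple of values of the requested columns.

```python
from typing import Any, Dict, List, Union

Row = Dict[str, Any]


def sort_rows(rows: List[Row], by: Union[str, List[str]], ascending: bool = True) -> List[Row]:
    """Hypothetical helper: return a new list sorted by the given column(s)."""
    keys = [by] if isinstance(by, str) else list(by)
    return sorted(rows, key=lambda r: tuple(r[k] for k in keys), reverse=not ascending)


runs = [{"lr": 0.1, "seed": 1}, {"lr": 0.01, "seed": 0}, {"lr": 0.1, "seed": 0}]
ordered = sort_rows(runs, ["lr", "seed"])
descending = sort_rows(runs, "lr", ascending=False)
```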

class mlxp.data_structures.dataframe.GroupedDataFrame(group_keys: List[str], grouped_dict: Dict[Tuple[str, ...], GroupedDataFrame | DataFrame])[source]

Bases: object

A dictionary where each key represents the tuple of values taken by the grouped column of some processed dataframe.

The values corresponding to each key are objects of type DataFrame containing a group. This object is usually obtained as the output of the groupby method of the class DataFrame. It is displayed as a hierarchical pandas dataframe and can be converted to it using the toPandas method.

group_keys: List[str]

A list of strings containing the column names used for grouping.

Note

It is possible to directly access the keys and values of self.grouped_dict by identifying self with self.grouped_dict:

  • Using self[key] instead of self.grouped_dict[key] to access the value of self.grouped_dict at a given key

  • Using self.keys() to get all keys of self.grouped_dict.

  • Using self.items() to iterate over the key/value pairs of self.grouped_dict.

items() ItemsView[source]

Return the items of the grouped dictionary.

Returns:

items of the dictionary

Return type:

ItemsView

keys() KeysView[source]

Return the keys of the grouped dictionary.

Returns:

keys of the dictionary

Return type:

KeysView

ungroup() DataFrame[source]

Concatenates the dataframes representing each group into a single dataframe. The group keys are added as columns to the resulting dataframe.

Returns:

A dataframe representing the ungrouped version of the grouped dictionary.

Return type:

DataFrame
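Ungrouping can be sketched as concatenating the groups while writing the group-key values back as columns. The helper ungroup_rows is hypothetical and uses a plain dictionary of lists in place of a GroupedDataFrame.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def ungroup_rows(group_keys: List[str], grouped: Dict[Tuple[Any, ...], List[Row]]) -> List[Row]:
    """Hypothetical helper: flatten groups and add the group keys as columns."""
    rows: List[Row] = []
    for key_values, group in grouped.items():
        key_columns = dict(zip(group_keys, key_values))
        for r in group:
            rows.append({**key_columns, **r})
    return rows


grouped = {(0.1,): [{"loss": 0.5}, {"loss": 0.25}], (0.01,): [{"loss": 0.75}]}
flat = ungroup_rows(["config.lr"], grouped)
```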

toPandas(lazy=True) DataFrame[source]

Convert the list into a pandas dataframe.

Params lazy:

If true, the pandas dataframe does not contain the results of data loaded lazily.

Returns:

A pandas dataframe containing logs (configs and data) of the DataFrame object.

Return type:

pd.DataFrame

apply(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]], map_type='Generic', ungroup: bool = False) GroupedDataFrame[source]

Applies a generic transformation to the dataframe representing each group (see DataFrame.apply). Returns a grouped dataframe of type GroupedDataFrame that is optionally ungrouped into a dataframe object of type DataFrame.

Params maps:

Either a single instance of tuple Map or a list of tuples of type Map.

Params map_type:

Type of the transformation to apply (see DataFrame.apply): ‘Generic’, ‘Columnwise’, ‘Rowwise’ or ‘Pointwise’.

Params ungroup:

Optionally returns an ungrouped version of the result.

Returns:

An object containing the result of the applied transformations.

Return type:

Union[GroupedDataFrame,DataFrame]

Raises:

InvalidMapError – if the maps are not of type List[Map] or Map.

aggregate(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]], ungroup: bool = False) GroupedDataFrame | DataFrame[source]

Performs aggregation of the dataframe corresponding to each group (see DataFrame.aggregate). Returns a grouped dataframe of type GroupedDataFrame that is optionally ungrouped into a dataframe object of type DataFrame.

Params maps:

Either a single instance of tuple Map or a list of tuples of type Map.

Params ungroup:

Optionally returns an ungrouped version of the result.

Returns:

An object containing the result of the aggregation.

Return type:

Union[GroupedDataFrame,DataFrame]

Raises:

InvalidMapError – if the maps are not of type List[Map] or Map.

filter(filter_map: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None], bygroups: None | str | List[str] = None, ungroup: bool = False) GroupedDataFrame | DataFrame[source]

Filters the dataframe of each group (see DataFrame.filter). Returns a grouped dataframe of type GroupedDataFrame that is optionally ungrouped into a dataframe object of type DataFrame.

Params filter_map:

An instance of tuple Map.

Params ungroup:

Optionally returns an ungrouped version of the result.

Params bygroups:

Optionally apply the filter by groups when bygroups is either a column name or list of column names by which the dataframe must be grouped. Once the filter is applied by group, the groups are merged together into a single ungrouped dataframe. This is equivalent to performing self.groupby(bygroups).filter(filter_map).ungroup()

Returns:

An object containing the result of the filtering.

Return type:

Union[GroupedDataFrame,DataFrame]

Raises:

InvalidMapError – if the map is not of type Map.

transform(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]], ungroup: bool = False) GroupedDataFrame | DataFrame[source]

Applies a columnwise transformation to the dataframe corresponding to each group (see DataFrame.transform). Returns a grouped dataframe of type GroupedDataFrame that is optionally ungrouped into a dataframe object of type DataFrame.

Params maps:

Either a single instance of tuple Map or a list of tuples of type Map.

Params ungroup:

Optionally returns an ungrouped version of the result.

Returns:

An object containing the result of the transformation.

Return type:

Union[GroupedDataFrame,DataFrame]

Raises:

InvalidMapError – if the maps are not of type List[Map] or Map.

map(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]], ungroup: bool = False) GroupedDataFrame | DataFrame[source]

Applies a pointwise transformation to the dataframe corresponding to each group (see DataFrame.map). Returns a grouped dataframe of type GroupedDataFrame that is optionally ungrouped into a dataframe object of type DataFrame.

Params maps:

Either a single instance of tuple Map or a list of tuples of type Map.

Params ungroup:

Optionally returns an ungrouped version of the result.

Returns:

An object containing the result of the pointwise transformation.

Return type:

Union[GroupedDataFrame,DataFrame]

Raises:

InvalidMapError – if the maps are not of type List[Map] or Map.

select(key_list: List[Tuple[str, ...]]) GroupedDataFrame[source]

Extracts the subgroups of a grouped object that correspond to specific values of the group keys.

Params key_list:

List of keys representing the groups to select.

Returns:

An object containing the selected groups.

Return type:

GroupedDataFrame

sort(by: List[str], ascending: bool = True, ungroup: bool = False) GroupedDataFrame | DataFrame[source]

Returns a grouped object where the dataframe of each group is sorted according to a list of columns (see DataFrame.sort). Optionally ungroups the grouped object into a single dataframe of type DataFrame.

Params by:

Either a column name or a list of column names by which the dataframe must be sorted.

Params ascending:

Sorting either by increasing values (ascending=True) or decreasing values (ascending=False) of the specified columns.

Params ungroup:

Optionally returns an ungrouped version of the result.

Returns:

A sorted object.

Return type:

Union[GroupedDataFrame,DataFrame]