Data structures¶
These are the data structures provided by MLXP to handle configuration options and data.
The schemas
Classes¶
Structures for validating the configurations.
- class mlxp.data_structures.schemas.ConfigVersionManager(name: str = '???')[source]¶
Bases:
object
Structure of the config file for the version manager.
- class mlxp.data_structures.schemas.ConfigGitVM(name: str = 'mlxp.GitVM', parent_work_dir: str = './.work_dir', compute_requirements: bool = False)[source]¶
Bases:
ConfigVersionManager
Configs for using the GitVM version manager.
It inherits the structure of the class ConfigVersionManager.
- class mlxp.data_structures.schemas.ConfigLogger(name: str = 'mlxp.DefaultLogger', parent_log_dir: str = './logs', forced_log_id: int = -1, log_streams_to_file: bool = False)[source]¶
Bases:
object
Structure of the config file for the logs.
The outputs for each run are saved in a directory of the form ‘parent_log_dir/log_id’ which is stored in the variable ‘path’ during execution.
- parent_log_dir: str¶
Absolute path of the parent directory where the logs of a run are stored. (default “./logs”)
- forced_log_id: int¶
An id optionally provided by the user for the run. If forced_log_id is positive, then the logs of the run will be stored under ‘parent_log_dir/forced_log_id’. Otherwise, the logs will be stored in a directory ‘parent_log_dir/log_id’ where ‘log_id’ is assigned uniquely for the run during execution.
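The directory rule above can be sketched in plain Python. This is a minimal illustration, not MLXP's code: the helper `resolve_log_dir` is hypothetical, and picking "largest existing id plus one" as the unique id is an assumption, since the page only says the id is assigned uniquely.

```python
from pathlib import Path
import tempfile


def resolve_log_dir(parent_log_dir: str, forced_log_id: int = -1) -> Path:
    """Hypothetical helper mirroring the 'parent_log_dir/log_id' rule."""
    parent = Path(parent_log_dir)
    parent.mkdir(parents=True, exist_ok=True)
    if forced_log_id > 0:
        # A positive forced_log_id is used as the directory name.
        log_id = forced_log_id
    else:
        # Assumption: use the next unused integer; MLXP may choose ids differently.
        existing = [int(p.name) for p in parent.iterdir() if p.is_dir() and p.name.isdigit()]
        log_id = max(existing, default=0) + 1
    path = parent / str(log_id)
    path.mkdir(exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    first = resolve_log_dir(tmp)       # no forced id
    second = resolve_log_dir(tmp)      # next free id
    forced = resolve_log_dir(tmp, 7)   # forced id is respected
    names = [first.name, second.name, forced.name]
```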
- class mlxp.data_structures.schemas.Info(status: str = 'STARTING', current_file_path: str = '', executable: str = '', hostname: str = '', process_id: int = -1, start_date: Any = '', start_time: Any = '', end_date: Any = '', end_time: Any = '', work_dir: str = '/tmp/tmpork1so9r/a96cd66dd690399be0478e073810d9a43d58851c/docs', logger: Any | None = None, scheduler: Any | None = None, version_manager: Any | None = None)[source]¶
Bases:
object
A structure storing general information about the run.
The following variables are assigned during execution.
- status: str¶
Status of a job. The status can take the following values:
STARTING: The metadata for the run have been created.
RUNNING: The experiment is currently running.
COMPLETE: The run is complete and did not throw any error.
FAILED: The run stopped due to an error.
- start_date: Any¶
Date at which job started.
- start_time: Any¶
Time at which job started.
- end_date: Any¶
Date at which job ended.
- end_time: Any¶
Time at which job ended.
- logger: Any¶
Logger info, whenever used.
- scheduler: Any¶
Scheduler info, whenever used.
- version_manager: Any¶
Version manager info, whenever used.
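As a rough illustration of how the status and the start/end fields could evolve during a run, here is a self-contained sketch. `RunInfo` and `run_with_status` are hypothetical stand-ins (they are not the MLXP classes), and the date and time formats are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Callable


@dataclass
class RunInfo:
    """Illustrative stand-in for a few fields of Info."""
    status: str = "STARTING"
    start_date: Any = ""
    start_time: Any = ""
    end_date: Any = ""
    end_time: Any = ""


def run_with_status(task: Callable[[], Any]) -> RunInfo:
    info = RunInfo()  # STARTING: the metadata has been created
    now = datetime.now()
    info.start_date, info.start_time = now.strftime("%d/%m/%Y"), now.strftime("%H:%M:%S")
    info.status = "RUNNING"
    try:
        task()
        info.status = "COMPLETE"  # finished without raising
    except Exception:
        info.status = "FAILED"    # stopped because of an error
    now = datetime.now()
    info.end_date, info.end_time = now.strftime("%d/%m/%Y"), now.strftime("%H:%M:%S")
    return info


ok = run_with_status(lambda: None)
failed = run_with_status(lambda: 1 / 0)
```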
- class mlxp.data_structures.schemas.MLXPConfig(logger: ~mlxp.data_structures.schemas.ConfigLogger = <factory>, version_manager: ~mlxp.data_structures.schemas.ConfigVersionManager = <factory>, use_version_manager: bool = False, use_scheduler: bool = False, use_logger: bool = True, interactive_mode: bool = True, resolve: bool = True, as_ConfigDict: bool = True)[source]¶
Bases:
object
Default settings of MLXP.
- logger: ConfigLogger¶
The logger’s settings. (default ConfigLogger)
- version_manager: ConfigVersionManager¶
The version_manager’s settings. (default ConfigGitVM)
- interactive_mode: bool¶
A variable controlling MLXP’s interactive mode.
If ‘interactive_mode==True’, MLXP uses the interactive mode whenever applicable:
When ‘use_version_manager==True’: Asks the user:
If untracked files should be added.
If uncommitted changes should be committed.
If a copy of the current repository based on the latest commit should be made (if not already existing) to execute the code from there. Otherwise, code is executed from the current directory.
If ‘interactive_mode==False’, no interactive mode is used and current options are used:
When ‘use_version_manager==True’:
Existing untracked files or uncommitted changes are ignored.
A copy of the code is made based on the latest commit (if not already existing) and code is executed from there.
- class mlxp.data_structures.schemas.Metadata(info: ~mlxp.data_structures.schemas.Info = <factory>, mlxp: ~mlxp.data_structures.schemas.MLXPConfig = <factory>, config: ~typing.Any | None = None)[source]¶
Bases:
object
The structure of the config file.
- info: Info¶
Contains config information of the run (hostname, command, application, etc) (default Info)
- mlxp: MLXPConfig¶
Default settings of MLXP. (default MLXPConfig)
- config: Any¶
Contains the user’s defined configs that are specific to the run.
The config_dict
Classes¶
A dictionary-like structure for storing the configurations.
- class mlxp.data_structures.config_dict.ConfigDict(*args, **kwargs)[source]¶
Bases:
dict
A subclass of the dict class containing the configuration options.
The value corresponding to a key can be accessed as an attribute: self.key
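The attribute-style access described above can be illustrated with a small dict subclass. `AttrDict` is a hypothetical name used only for this sketch; it shows the general technique and is not MLXP's ConfigDict implementation.

```python
class AttrDict(dict):
    """Illustrative dict subclass whose keys are also readable as attributes."""

    def __getattr__(self, key):
        # Called only when normal attribute lookup fails.
        try:
            return self[key]
        except KeyError as err:
            raise AttributeError(key) from err

    def __setattr__(self, key, value):
        self[key] = value


cfg = AttrDict(lr=0.1, optimizer=AttrDict(name="sgd"))
cfg.epochs = 10  # stored as a regular dictionary entry
```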
- mlxp.data_structures.config_dict.convert_dict(src_dict: ~typing.Any, src_class: ~typing.Type = <class 'omegaconf.dictconfig.DictConfig'>, dst_class: ~typing.Type = <class 'mlxp.data_structures.config_dict.ConfigDict'>) Any [source]¶
Convert a dictionary-like object from a source class to a dictionary-like object of a destination class.
- Parameters:
src_dict (Any) – The source dictionary to be converted
src_class (Type) – The type of the src dictionary
dst_class (Type) – The destination type of the returned dictionary-like object.
- Returns:
A dictionary-like instance of the dst_class copying the data from the src_dict.
- Return type:
Any
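A recursive conversion of this kind might look like the sketch below. To stay self-contained it converts plain `dict` objects into `OrderedDict` objects instead of using omegaconf; `convert_mapping` is a hypothetical name, and values nested inside lists are left untouched for brevity.

```python
from collections import OrderedDict
from typing import Any, Type


def convert_mapping(src: Any, src_class: Type = dict, dst_class: Type = OrderedDict) -> Any:
    """Recursively copy nested src_class mappings into dst_class instances."""
    if isinstance(src, src_class):
        return dst_class(
            (key, convert_mapping(value, src_class, dst_class))
            for key, value in src.items()
        )
    return src  # leaves (numbers, strings, ...) are returned unchanged


nested = {"model": {"depth": 3}, "seed": 0}
converted = convert_mapping(nested)
```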
The Artifact
Class¶
Artifact objects that can be saved by a Logger object.
The dataframe
Module¶
Data structures returned by a Reader object.
- class mlxp.data_structures.dataframe.DataDict(flattened_dict, parent_dir=None)[source]¶
Bases:
Mapping
A dictionary of key-value pairs where some values are loaded lazily from a specific path whenever they are accessed.
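The lazy-loading idea can be sketched with a small `Mapping` subclass. `LazyDict` is a hypothetical class written for this illustration; storing the lazy values as JSON files and the constructor arguments are assumptions that do not reflect DataDict's actual signature.

```python
from collections.abc import Mapping
import json
import os
import tempfile


class LazyDict(Mapping):
    """Illustrative Mapping whose lazy keys are read from disk on first access."""

    def __init__(self, eager: dict, lazy_files: dict):
        self._eager = dict(eager)
        self._lazy_files = dict(lazy_files)  # key -> path of a JSON file
        self._cache = {}

    def __getitem__(self, key):
        if key in self._eager:
            return self._eager[key]
        if key not in self._cache:
            with open(self._lazy_files[key]) as f:  # KeyError for unknown keys
                self._cache[key] = json.load(f)
        return self._cache[key]

    def __iter__(self):
        yield from self._eager
        yield from self._lazy_files

    def __len__(self):
        return len(self._eager) + len(self._lazy_files)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.json")
    with open(path, "w") as f:
        json.dump([0.9, 0.5, 0.2], f)
    row = LazyDict({"config.lr": 0.1}, {"train.loss": path})
    cache_before = dict(row._cache)  # nothing has been read yet
    losses = row["train.loss"]       # the file is read here
```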
- class mlxp.data_structures.dataframe.DataFrame(iterable: List[DataDict])[source]¶
Bases:
list
A list of elements of type DataDict.
This list can be viewed as a dataframe where each row represents a given entry of type DataDict and the columns represent the keys of the DataDict objects. This structure allows some columns to be loaded lazily: the content of these columns is loaded from their corresponding file only when that column is explicitly accessed.
It is displayed as a pandas dataframe and can be converted to it using the method toPandas.
- diff(start_key: str = 'config') List[str] [source]¶
Return a list of column keys starting with ‘start_key’ and whose value varies in the dataframe.
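On a plain list of dictionaries, the behaviour described for diff could be approximated as follows. `varying_keys` is a hypothetical helper, not the MLXP method, and comparing values through `repr` is a simplification chosen so that unhashable values also work.

```python
from typing import Any, Dict, List


def varying_keys(rows: List[Dict[str, Any]], start_key: str = "config") -> List[str]:
    """Return the keys starting with start_key whose value differs across rows."""
    keys: List[str] = []
    for row in rows:
        for key in row:
            if key.startswith(start_key) and key not in keys:
                keys.append(key)
    return [k for k in keys if len({repr(row.get(k)) for row in rows}) > 1]


rows = [
    {"config.lr": 0.1, "config.seed": 0, "info.status": "COMPLETE"},
    {"config.lr": 0.01, "config.seed": 0, "info.status": "FAILED"},
]
changed = varying_keys(rows)
```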
- toPandas(lazy: bool = True) DataFrame [source]¶
Convert the list into a pandas dataframe.
- Parameters:
lazy – If true, the pandas dataframe does not contain the results of data loaded lazily.
- Returns:
A pandas dataframe containing logs (configs and data) of the DataFrame object
- Return type:
pd.DataFrame
- merge(new_df: DataFrame) DataFrame [source]¶
Merge a target dataframe to the current dataframe.
The target dataframe must have the same number of rows as the current one.
- Params new_df:
The target dataframe to merge
- Returns:
A new merged dataframe.
- Return type:
DataFrame
- keys() List[str] [source]¶
Return a list of column names of the dataframe.
- Returns:
List of strings containing the column names of the dataframe
- Return type:
List[str]
- groupby(group_keys: str | List[str]) GroupedDataFrame [source]¶
Perform a groupby operation on the dataframe according to a list of column names (group_keys).
Returns an object of the class GroupedDataFrame
- Params group_keys:
A string or list of strings containing the names of the columns to be grouped.
- Returns:
A dictionary of dataframes grouped by the values of the columns provided to group_keys. Each key of the dictionary is a tuple of values taken by the columns in group_keys.
- Return type:
GroupedDataFrame
- Raises:
InvalidKeyError – if one of the provided keys does not match any column of the dataframe.
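The grouping described above, a dictionary keyed by tuples of column values, can be sketched on a plain list of dictionaries. `group_rows` is a hypothetical helper, and it raises a standard `KeyError` where MLXP documents an InvalidKeyError.

```python
from typing import Any, Dict, List, Tuple, Union


def group_rows(rows: List[Dict[str, Any]], group_keys: Union[str, List[str]]) -> Dict[Tuple, List[Dict[str, Any]]]:
    """Group rows by the tuple of values taken by the columns in group_keys."""
    if isinstance(group_keys, str):
        group_keys = [group_keys]
    columns = {key for row in rows for key in row}
    missing = [k for k in group_keys if k not in columns]
    if missing:
        raise KeyError(f"unknown column(s): {missing}")
    groups: Dict[Tuple, List[Dict[str, Any]]] = {}
    for row in rows:
        groups.setdefault(tuple(row.get(k) for k in group_keys), []).append(row)
    return groups


rows = [{"lr": 0.1, "seed": 0}, {"lr": 0.1, "seed": 1}, {"lr": 0.01, "seed": 0}]
groups = group_rows(rows, "lr")
```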
- aggregate(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]]) DataFrame [source]¶
Perform aggregation of the columns of a dataframe according to some aggregation maps and return a new dataframe with a single row.
This function returns a DataFrame object with a single row containing the results of the aggregation maps.
- Params maps:
Either an element of type Map or a list of elements of type Map. A Map is a tuple with signature Tuple[Callable, Tuple[str, …], Optional[Tuple[str, …]]].
The first element is a Callable[[List[Any]], Union[Any,Tuple[Any,…]]] that must take a list of all values of a given column in the dataframe. It must reduce the list into a single element of arbitrary type which is stored as the value of a single output column in a dataframe with a single row.
The second element of the Map tuple represents the list of columns in the dataframe on which the map is applied columnwise.
The third element represents the optional names of the output columns.
- Returns:
A DataFrame object containing the result of the aggregation maps.
- Return type:
DataFrame
- Raises:
InvalidMapError – if the maps are not of type List[Map] or Map.
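To make the Map tuple concrete, here is a plain-Python analogue of a columnwise aggregation over a list of dictionaries. `aggregate_rows` is a hypothetical helper; the default output names used when the third element is None are an assumption, and callables that return tuples are not handled in this sketch.

```python
from statistics import mean
from typing import Any, Dict, List


def aggregate_rows(rows: List[Dict[str, Any]], maps) -> List[Dict[str, Any]]:
    """Reduce each selected column to one value; the result has a single row."""
    if isinstance(maps, tuple):  # a single Map was given
        maps = [maps]
    out: Dict[str, Any] = {}
    for fn, columns, names in maps:
        # Assumption: derive a name when none is provided.
        names = names or tuple(f"{fn.__name__}({c})" for c in columns)
        for column, name in zip(columns, names):
            out[name] = fn([row[column] for row in rows])
    return [out]


rows = [{"loss": 0.25, "acc": 0.5}, {"loss": 0.75, "acc": 1.0}]
summary = aggregate_rows(rows, [
    (mean, ("loss",), ("avg_loss",)),  # Map with explicit output name
    (max, ("loss", "acc"), None),      # Map applied to two columns
])
```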
- apply(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]], map_type='Generic') DataFrame [source]¶
Applies a generic map or list of maps to a dataframe.
This function returns a DataFrame object containing the results of applying the maps to the dataframe.
- Params maps:
Either an element of type Map or a list of elements of type Map. A Map is a tuple with signature Tuple[Callable, Tuple[str, …], Optional[Tuple[str, …]]].
The first element of a Map is a callable to be applied to the dataframe.
The second element of the Map represents the list of columns in the dataframe to provide as input to the callable.
The third element represents the optional names of the output columns.
- Params map_type:
Specifies the type of maps to be applied: ‘Generic’, ‘Columnwise’, ‘Rowwise’ or ‘Pointwise’.
‘Pointwise’: The apply method is equivalent to the map method. It applies the maps pointwise to each value corresponding to a row and a selected column.
‘Columnwise’: The apply method is equivalent to either the transform or the aggregate method. It applies the maps columnwise and expects the output either to preserve the number of rows of the initial dataframe (as in the transform method) or to reduce it to a single value (as in the aggregate method).
‘Rowwise’: Applies a map rowwise. The apply method returns a dataframe with the same number of rows as the initial one. The signature of the callable (the first element of the Map tuple) must be Callable[[Union[Any,Tuple[Any,…]]], Any]. It takes the values of some specific columns at a single row and returns an output for that row. The column names on which the map operates are provided as the second element of the Map tuple.
‘Generic’: Extends the transform and aggregate methods to support operations that are not columnwise. The input to the callable (the first element of the Map tuple) must be either List[Any] or Tuple[List[Any],…]. The callable must have the same return type as the callables used in the transform or aggregate methods: either Union[Any,Tuple[Any,…]] or Union[List[Any],Tuple[List[Any],…]]. It takes lists of values of some specific columns, applies the map to them, and returns transformed outputs that can be either lists of values (as in the transform method) or single values (as in the aggregate method).
- Returns:
A DataFrame object containing the result of the maps.
- Return type:
DataFrame
- Raises:
InvalidMapError – if the maps are not of type List[Map] or Map.
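As an illustration of the ‘Rowwise’ case, the sketch below applies a callable to the values of selected columns one row at a time. `apply_rowwise` is a hypothetical helper working on a list of dictionaries. Passing a single value for one column and a tuple of values for several columns is one reading of the documented signature Callable[[Union[Any,Tuple[Any,…]]], Any], not a confirmed description of MLXP's behaviour.

```python
from typing import Any, Callable, Dict, List, Sequence


def apply_rowwise(rows: List[Dict[str, Any]], fn: Callable[[Any], Any],
                  columns: Sequence[str], name: str) -> List[Dict[str, Any]]:
    """Apply fn to the selected column values of each row, one output per row."""
    out = []
    for row in rows:
        values = tuple(row[c] for c in columns)
        # One column: pass the value itself; several columns: pass a tuple.
        out.append({name: fn(values[0] if len(values) == 1 else values)})
    return out


rows = [
    {"train_loss": 0.5, "test_loss": 0.75},
    {"train_loss": 0.25, "test_loss": 0.25},
]
gaps = apply_rowwise(rows, lambda v: v[1] - v[0], ("train_loss", "test_loss"), "gap")
```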
- transform(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]]) DataFrame [source]¶
Applies a map columnwise to a dataframe while preserving the number of rows.
This function returns a DataFrame object containing the results of the transformation maps. The new dataframe has the same number of rows as the initial dataframe on which the transform is applied. This method extends the map method to support operations that are not pointwise and can depend on values from different rows of the same column.
- Params maps:
Either an element of type Map or a list of elements of type Map. A Map is a tuple with signature Tuple[Callable, Tuple[str, …], Optional[Tuple[str, …]]].
The first element is a Callable[[List[Any]], Union[List[Any],Tuple[List[Any],…]]] that must take a list of all values of a given column in the dataframe. It must return a list or a tuple of lists of elements of arbitrary types. The size of the returned lists must be the same as the input list. Each element of the returned lists corresponds to a transformation of the value at a given row and column of the original dataframe.
The second element of the Map tuple represents the list of columns in the dataframe on which the map is applied columnwise.
The third element represents the optional names of the output columns.
- Returns:
A DataFrame object containing the result of the columnwise maps.
- Return type:
DataFrame
- Raises:
InvalidMapError – if the maps are not of type List[Map] or Map.
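A columnwise transform that preserves the number of rows can be sketched as follows. `transform_rows` is a hypothetical helper on a list of dictionaries; reusing the input column names when no output names are given is an assumption, and callables returning tuples of lists are not covered.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence


def transform_rows(rows: List[Dict[str, Any]], fn: Callable[[List[Any]], List[Any]],
                   columns: Sequence[str], names: Optional[Sequence[str]] = None) -> List[Dict[str, Any]]:
    """Apply fn to whole columns; the output keeps one value per original row."""
    names = names or columns  # assumption: default to the input names
    new_rows: List[Dict[str, Any]] = [{} for _ in rows]
    for column, name in zip(columns, names):
        values = fn([row[column] for row in rows])
        if len(values) != len(rows):
            raise ValueError("a transform must preserve the number of rows")
        for new_row, value in zip(new_rows, values):
            new_row[name] = value
    return new_rows


def center(values: List[float]) -> List[float]:
    """Subtract the column mean, so each output depends on every row."""
    m = sum(values) / len(values)
    return [v - m for v in values]


rows = [{"loss": 1.0}, {"loss": 2.0}, {"loss": 3.0}]
centered = transform_rows(rows, center, ("loss",), ("centered_loss",))
```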
- map(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]]) DataFrame [source]¶
Applies a map pointwise to each value corresponding to specified columns of the dataframe.
This function returns a DataFrame object containing the results of the pointwise maps. The new dataframe has the same number of rows as the initial dataframe on which the map is applied. Each row is processed independently from the others.
- Params maps:
Either an element of type Map or a list of elements of type Map. A Map is a tuple with signature Tuple[Callable, Tuple[str, …], Optional[Tuple[str, …]]].
The first element is a Callable[[Any], Any] that must take a value corresponding to a given row and column in the dataframe and transform it.
The second element of the Map tuple represents the list of columns in the dataframe on which the map is applied.
The third element represents the optional names of the output columns.
- Returns:
A DataFrame object containing the result of the pointwise map.
- Return type:
DataFrame
- Raises:
InvalidMapError – if the maps are not of type List[Map] or Map.
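A pointwise map processes each value independently of the other rows. The following sketch shows this on a list of dictionaries; `map_rows` is a hypothetical helper, and defaulting the output names to the input column names is an assumption.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence


def map_rows(rows: List[Dict[str, Any]], fn: Callable[[Any], Any],
             columns: Sequence[str], names: Optional[Sequence[str]] = None) -> List[Dict[str, Any]]:
    """Apply fn to every value of the selected columns, row by row."""
    names = names or columns  # assumption: default to the input names
    return [
        {name: fn(row[column]) for column, name in zip(columns, names)}
        for row in rows
    ]


rows = [{"losses": [0.9, 0.5]}, {"losses": [0.8, 0.4, 0.3]}]
steps = map_rows(rows, len, ("losses",), ("num_steps",))
```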
- filter(filter_map: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None], bygroups: None | str | List[str] = None) DataFrame [source]¶
Returns a new dataframe containing a subset of rows of the initial dataframe that pass a given filter.
- Params filter_map:
An element of type Map. A Map is a tuple with signature Tuple[Callable, Tuple[str, …], Optional[Tuple[str, …]]].
The first element of the filter map is a function with signature Callable[[Union[List[Any], Tuple[List[Any],…]]], List[Any]] that can take a list or a tuple of lists. Each input list contains all values of one of the columns of the dataframe defined in the second element of the Map tuple. The filter map must return a list of booleans of the same size as the initial lists, each boolean value corresponding to an element of the initial lists at the same location. Only rows of the dataframe for which the returned boolean value is true pass the filter.
The second element of the Map tuple represents the list of columns that the filter map takes as input.
The third element of the Map tuple is never used.
- Params bygroups:
Optionally apply the filter by groups when bygroups is either a column name or list of column names by which the dataframe must be grouped. Once the filter is applied by group, the groups are merged together into a single ungrouped dataframe. This is equivalent to performing self.groupby(bygroups).filter(filter_map).ungroup()
- Returns:
A DataFrame object containing a filtered version of the initial dataframe.
- Return type:
DataFrame
- Raises:
InvalidMapError – if the filter map is not of type Map.
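The boolean-mask contract of a filter map can be illustrated on a list of dictionaries. `filter_rows` is a hypothetical helper; passing one list for a single column and a tuple of lists for several columns follows the documented signature as read here, and the sketch ignores the bygroups option.

```python
from typing import Any, Callable, Dict, List, Sequence


def filter_rows(rows: List[Dict[str, Any]], fn: Callable[[Any], List[bool]],
                columns: Sequence[str]) -> List[Dict[str, Any]]:
    """Keep the rows whose entry in the boolean mask returned by fn is true."""
    lists = tuple([row[c] for row in rows] for c in columns)
    mask = fn(lists[0] if len(lists) == 1 else lists)
    if len(mask) != len(rows):
        raise ValueError("the filter must return one boolean per row")
    return [row for row, keep in zip(rows, mask) if keep]


def is_best(losses: List[float]) -> List[bool]:
    """Mark the rows that reach the smallest loss."""
    best = min(losses)
    return [loss == best for loss in losses]


rows = [{"lr": 0.1, "loss": 0.5}, {"lr": 0.01, "loss": 0.2}, {"lr": 1.0, "loss": 0.7}]
best_rows = filter_rows(rows, is_best, ("loss",))
```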
- sort(by: str | List[str], ascending: bool = True) DataFrame [source]¶
Returns a sorted dataframe according to a list of columns.
- Params by:
Either a column name or a list of column names by which the dataframe must be sorted.
- Params ascending:
Sorting either by increasing values (ascending=True) or decreasing values (ascending=False) of the specified columns.
- Returns:
A sorted DataFrame object.
- Return type:
DataFrame
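Sorting by one or several columns has a direct plain-Python analogue, sketched below on a list of dictionaries. `sort_rows` is a hypothetical helper and not the MLXP method.

```python
from typing import Any, Dict, List, Union


def sort_rows(rows: List[Dict[str, Any]], by: Union[str, List[str]],
              ascending: bool = True) -> List[Dict[str, Any]]:
    """Return a new list sorted by the given column(s)."""
    if isinstance(by, str):
        by = [by]
    return sorted(rows, key=lambda row: tuple(row[k] for k in by), reverse=not ascending)


rows = [{"lr": 0.1, "seed": 1}, {"lr": 0.01, "seed": 0}, {"lr": 0.1, "seed": 0}]
ascending_rows = sort_rows(rows, ["lr", "seed"])
descending_rows = sort_rows(rows, "lr", ascending=False)
```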
- class mlxp.data_structures.dataframe.GroupedDataFrame(group_keys: List[str], grouped_dict: Dict[Tuple[str, ...], GroupedDataFrame | DataFrame])[source]¶
Bases:
object
A dictionary where each key represents the tuple of values taken by the grouped column of some processed dataframe.
The values corresponding to each key are objects of type DataFrame containing a group. This object is usually obtained as the output of the groupby method of the class DataFrame. It is displayed as a hierarchical pandas dataframe and can be converted to it using the toPandas method.
Note
It is possible to directly access the keys and values of self.grouped_dict by identifying self with self.grouped_dict:
Using self[key] instead of self.grouped_dict[key] to access the value of self.grouped_dict at a given key
Using self.keys() to get all keys of self.grouped_dict.
Using self.items() to iterate over the key/value pairs of self.grouped_dict.
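The delegation described in this note can be sketched with a small wrapper class. `Grouped` is a hypothetical name used for illustration only; it shows how an object can forward indexing, keys() and items() to an inner dictionary and is not the GroupedDataFrame implementation.

```python
from typing import Any, Dict, List, Tuple


class Grouped:
    """Illustrative wrapper that forwards dictionary access to grouped_dict."""

    def __init__(self, group_keys: List[str], grouped_dict: Dict[Tuple, Any]):
        self.group_keys = group_keys
        self.grouped_dict = grouped_dict

    def __getitem__(self, key):
        return self.grouped_dict[key]

    def keys(self):
        return self.grouped_dict.keys()

    def items(self):
        return self.grouped_dict.items()


grouped = Grouped(["lr"], {(0.1,): [{"loss": 0.5}], (0.01,): [{"loss": 0.7}]})
first_group = grouped[(0.1,)]  # same object as grouped.grouped_dict[(0.1,)]
```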
- items() ItemsView [source]¶
Return the items of the grouped dictionary.
- Returns:
items of the dictionary
- Return type:
ItemsView
- keys() KeysView [source]¶
Return the keys of the grouped dictionary.
- Returns:
keys of the dictionary
- Return type:
KeysView
- ungroup() DataFrame [source]¶
Concatenates the dataframes representing each group into a single dataframe. The group keys are added as columns to the resulting dataframe.
- Returns:
A dataframe representing the ungrouped version of the grouped dictionary.
- Return type:
DataFrame
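Ungrouping, as described above, concatenates the groups and restores the group keys as columns. A minimal sketch on plain dictionaries follows; `ungroup_rows` is a hypothetical helper, and placing the group-key columns first is an arbitrary choice of this illustration.

```python
from typing import Any, Dict, List, Tuple


def ungroup_rows(group_keys: List[str],
                 grouped: Dict[Tuple, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Concatenate all groups and add the group keys back as columns."""
    rows = []
    for key, group in grouped.items():
        key_columns = dict(zip(group_keys, key))
        for row in group:
            rows.append({**key_columns, **row})
    return rows


grouped = {(0.1,): [{"loss": 0.5}, {"loss": 0.4}], (0.01,): [{"loss": 0.7}]}
flat = ungroup_rows(["lr"], grouped)
```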
- toPandas(lazy=True) DataFrame [source]¶
Convert the grouped dataframe into a pandas dataframe.
- Params lazy:
If true, the pandas dataframe does not contain the results of data loaded lazily.
- Returns:
A pandas dataframe containing logs (configs and data) of the grouped dataframe
- Return type:
pd.DataFrame
- apply(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]], map_type='Generic', ungroup: bool = False) GroupedDataFrame [source]¶
Applies a generic transformation to the dataframe representing each group (see DataFrame.apply). Returns a grouped dataframe of type GroupedDataFrame that is optionally ungrouped into a dataframe object of type DataFrame.
- Params maps:
Either a single instance of tuple Map or a list of tuple of type Map.
- Params map_type:
Type of the transformation to apply (see DataFrame.apply): ‘Generic’, ‘Columnwise’, ‘Rowwise’ or ‘Pointwise’.
- Params ungroup:
Optionally returns an ungrouped version of the result.
- Returns:
An object containing the result of the applied transformations.
- Return type:
Union[GroupedDataFrame,DataFrame]
- Raises:
InvalidMapError – if the maps are not of type List[Map] or Map.
- aggregate(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]], ungroup: bool = False) GroupedDataFrame | DataFrame [source]¶
Performs aggregation of the dataframe corresponding to each group (see DataFrame.aggregate). Returns a grouped dataframe of type GroupedDataFrame that is optionally ungrouped into a dataframe object of type DataFrame.
- Params maps:
Either a single instance of tuple Map or a list of tuple of type Map.
- Params ungroup:
Optionally returns an ungrouped version of the result.
- Returns:
An object containing the result of the aggregation.
- Return type:
Union[GroupedDataFrame,DataFrame]
- Raises:
InvalidMapError – if the maps are not of type List[Map] or Map.
- filter(filter_map: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None], bygroups: None | str | List[str] = None, ungroup: bool = False) GroupedDataFrame | DataFrame [source]¶
Filters the dataframe of each group (see DataFrame.filter). Returns a grouped dataframe of type GroupedDataFrame that is optionally ungrouped into a dataframe object of type DataFrame.
- Params filter_map:
An instance of tuple Map.
- Params ungroup:
Optionally returns an ungrouped version of the result.
- Params bygroups:
Optionally apply the filter by groups when bygroups is either a column name or list of column names by which the dataframe must be grouped. Once the filter is applied by group, the groups are merged together into a single ungrouped dataframe. This is equivalent to performing self.groupby(bygroups).filter(filter_map).ungroup()
- Returns:
An object containing the result of the filtering.
- Return type:
Union[GroupedDataFrame,DataFrame]
- Raises:
InvalidMapError – if the map is not of type Map.
- transform(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]], ungroup: bool = False) GroupedDataFrame | DataFrame [source]¶
Applies a columnwise transformation to the dataframe corresponding to each group (see DataFrame.transform). Returns a grouped dataframe of type GroupedDataFrame that is optionally ungrouped into a dataframe object of type DataFrame.
- Params maps:
Either a single instance of tuple Map or a list of tuple of type Map.
- Params ungroup:
Optionally returns an ungrouped version of the result.
- Returns:
An object containing the result of the transformation.
- Return type:
Union[GroupedDataFrame,DataFrame]
- Raises:
InvalidMapError – if the maps are not of type List[Map] or Map.
- map(maps: Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None] | List[Tuple[Callable, Tuple[str, ...], Tuple[str, ...] | None]], ungroup: bool = False) GroupedDataFrame | DataFrame [source]¶
Applies a pointwise transformation to the dataframe corresponding to each group (see DataFrame.map). Returns a grouped dataframe of type GroupedDataFrame that is optionally ungrouped into a dataframe object of type DataFrame.
- Params maps:
Either a single instance of tuple Map or a list of tuple of type Map.
- Params ungroup:
Optionally returns an ungrouped version of the result.
- Returns:
An object containing the result of the pointwise transformation.
- Return type:
Union[GroupedDataFrame,DataFrame]
- Raises:
InvalidMapError – if the maps are not of type List[Map] or Map.
- select(key_list: List[Tuple[str, ...]]) GroupedDataFrame [source]¶
Extracts subgroups from a group object that correspond to specific values of the group keys.
- Params key_list:
List of keys representing the groups to select.
- Returns:
An object containing the selected groups
- Return type:
GroupedDataFrame
- sort(by: List[str], ascending: bool = True, ungroup: bool = False) GroupedDataFrame | DataFrame [source]¶
Returns a grouped object where the dataframe of each group is sorted according to a list of columns (see DataFrame.sort). Optionally ungroups the grouped object into a single dataframe of type DataFrame.
- Params by:
Either a column name or a list of column names by which the dataframe must be sorted.
- Params ascending:
Sorting either by increasing values (ascending=True) or decreasing values (ascending=False) of the specified columns.
- Params ungroup:
Optionally returns an ungrouped version of the result.
- Returns:
A sorted object.
- Return type:
Union[GroupedDataFrame,DataFrame]