Data
The data module of TopoBenchmarkX consists of several submodules:
datasets
load
preprocess
utils
Datasets
Dataset class for US County Demographics dataset.
- class topobenchmarkx.data.datasets.us_county_demos_dataset.USCountyDemosDataset(root: str, name: str, parameters: DictConfig)
Dataset class for US County Demographics dataset.
- Parameters:
- root (str): Root directory where the dataset will be saved.
- name (str): Name of the dataset.
- parameters (DictConfig): Configuration parameters for the dataset.
- Attributes:
- URLS (dict): Dictionary containing the URLs for downloading the dataset.
- FILE_FORMAT (dict): Dictionary containing the file formats for the dataset.
- RAW_FILE_NAMES (dict): Dictionary containing the raw file names for the dataset.
- download() -> None
Download the dataset from a URL and save it to the raw directory.
- Raises:
- FileNotFoundError: If the dataset URL is not found.
- process() -> None
Handle the data for the dataset.
This method loads the US county demographics data, applies any preprocessing transformations if specified, and saves the processed data to the appropriate location.
- property processed_dir: str
Return the path to the processed directory of the dataset.
- Returns:
- str: Path to the processed directory.
- property processed_file_names: str
Return the processed file name for the dataset.
- Returns:
- str: Processed file name.
Load
Abstract Loader class.
- class topobenchmarkx.data.loaders.base.AbstractLoader(parameters: DictConfig)
Abstract class that provides an interface to load data.
- Parameters:
- parameters (DictConfig): Configuration parameters.
Data loaders.
- class topobenchmarkx.data.loaders.loaders.CellComplexLoader(parameters: DictConfig)
Loader for cell complex datasets.
- Parameters:
- parameters (DictConfig): Configuration parameters.
- class topobenchmarkx.data.loaders.loaders.GraphLoader(parameters: DictConfig, **kwargs)
Loader for graph datasets.
- Parameters:
- parameters (DictConfig): Configuration parameters.
- **kwargs (dict): Additional keyword arguments.
Notes
The parameters must contain the following keys:
- data_dir (str): The directory where the dataset is stored.
- data_name (str): The name of the dataset.
- data_type (str): The type of the dataset.
- split_type (str): The type of split to be used. It can be “fixed”, “random”, or “k-fold”.
If split_type is “random”, the parameters must also contain the following keys:
- data_seed (int): The seed for the split.
- data_split_dir (str): The directory where the split is stored.
- train_prop (float): The proportion of the training set.
If split_type is “k-fold”, the parameters must also contain the following keys:
- data_split_dir (str): The directory where the split is stored.
- k (int): The number of folds.
- data_seed (int): The seed for the split.
The parameters can be defined in a YAML file and then loaded using omegaconf.OmegaConf.load('path/to/dataset/config.yaml').
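The key requirements in the notes above can be checked before a loader is constructed. The following is a minimal sketch, not part of TopoBenchmarkX: the helper name missing_loader_keys and the placeholder values are hypothetical, and a plain dict stands in for the DictConfig.

```python
# Hypothetical helper, not part of TopoBenchmarkX: it checks the keys that the
# GraphLoader notes list as required for each split_type.
BASE_KEYS = {"data_dir", "data_name", "data_type", "split_type"}
EXTRA_KEYS = {
    "fixed": set(),
    "random": {"data_seed", "data_split_dir", "train_prop"},
    "k-fold": {"data_split_dir", "k", "data_seed"},
}


def missing_loader_keys(parameters):
    """Return the sorted list of required keys absent from `parameters`."""
    split_type = parameters.get("split_type")
    if split_type is not None and split_type not in EXTRA_KEYS:
        raise ValueError(f"Unknown split_type: {split_type!r}")
    required = BASE_KEYS | EXTRA_KEYS.get(split_type, set())
    return sorted(required - set(parameters))


# Placeholder values for illustration only.
config = {
    "data_dir": "datasets/graph",
    "data_name": "example",
    "data_type": "example_type",
    "split_type": "random",
    "data_seed": 0,
}
print(missing_loader_keys(config))  # ['data_split_dir', 'train_prop']
```

Running a check like this before loading surfaces a missing key as one clear message rather than as a later attribute error.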
- class topobenchmarkx.data.loaders.loaders.HypergraphLoader(parameters: DictConfig)
Loader for hypergraph datasets.
- Parameters:
- parameters (DictConfig): Configuration parameters.
Preprocess
Preprocessor for datasets.
- class topobenchmarkx.data.preprocessor.preprocessor.PreProcessor(dataset, data_dir, transforms_config=None, **kwargs)
Preprocessor for datasets.
- Parameters:
- dataset (list): List of data objects.
- data_dir (str): Path to the directory containing the data.
- transforms_config (DictConfig, optional): Configuration parameters for the transforms (default: None).
- **kwargs (optional): Optional additional arguments.
- instantiate_pre_transform(data_dir, transforms_config) -> Compose
Instantiate the pre-transforms.
- Parameters:
- data_dir (str): Path to the directory containing the data.
- transforms_config (DictConfig): Configuration parameters for the transforms.
- Returns:
- torch_geometric.transforms.Compose: Pre-transform object.
- load(path: str) -> None
Load the dataset from the given file path.
- Parameters:
- path (str): The path to the processed data.
- load_dataset_splits(split_params) -> tuple[DataloadDataset, DataloadDataset | None, DataloadDataset | None]
Load the dataset splits.
- Parameters:
- split_params (dict): Parameters for loading the dataset splits.
- Returns:
- tuple: A tuple containing the train, validation, and test datasets.
- property processed_dir: str
Return the path to the processed directory.
- Returns:
- str: Path to the processed directory.
- property processed_file_names: str
Return the name of the processed file.
- Returns:
- str: Name of the processed file.
- set_processed_data_dir(pre_transforms_dict, data_dir, transforms_config) -> None
Set the processed data directory.
- Parameters:
- pre_transforms_dict (dict): Dictionary containing the pre-transforms.
- data_dir (str): Path to the directory containing the data.
- transforms_config (DictConfig): Configuration parameters for the transforms.
Utils
Data IO utilities.
- topobenchmarkx.data.utils.io_utils.download_file_from_drive(file_link, path_to_save, dataset_name, file_format='tar.gz')
Download a file from a Google Drive link and save it to the specified path.
- Parameters:
- file_link (str): The Google Drive link of the file to download.
- path_to_save (str): The path where the downloaded file will be saved.
- dataset_name (str): The name of the dataset.
- file_format (str, optional): The format of the downloaded file (default: "tar.gz").
- Raises:
- None
- topobenchmarkx.data.utils.io_utils.get_file_id_from_url(url)
Extract the file ID from a Google Drive file URL.
- Parameters:
- url (str): The Google Drive file URL.
- Returns:
- str: The file ID extracted from the URL.
- Raises:
- ValueError: If the provided URL is not a valid Google Drive file URL.
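To illustrate what extracting a file ID involves, here is a minimal sketch that handles two common Google Drive URL shapes (".../file/d/<ID>/..." and "...?id=<ID>"). It is not the library's implementation: the function name extract_drive_file_id, the patterns, and the example IDs are assumptions made for illustration.

```python
import re

# Assumed URL shapes: ".../file/d/<ID>/..." and "...?id=<ID>".
_PATTERNS = (
    re.compile(r"/file/d/([A-Za-z0-9_-]+)"),
    re.compile(r"[?&]id=([A-Za-z0-9_-]+)"),
)


def extract_drive_file_id(url):
    """Return the file ID from a Google Drive URL, or raise ValueError."""
    if "drive.google.com" not in url:
        raise ValueError(f"Not a Google Drive file URL: {url}")
    for pattern in _PATTERNS:
        match = pattern.search(url)
        if match:
            return match.group(1)
    raise ValueError(f"Not a Google Drive file URL: {url}")


# Made-up ID for illustration.
print(extract_drive_file_id("https://drive.google.com/file/d/abc123_XYZ/view?usp=sharing"))
# abc123_XYZ
```

Raising ValueError for unrecognized URLs mirrors the documented contract, so a caller can catch a single exception type.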
- topobenchmarkx.data.utils.io_utils.load_hypergraph_pickle_dataset(cfg)
Load hypergraph datasets from pickle files.
- Parameters:
- cfg (DictConfig): Configuration parameters.
- Returns:
- torch_geometric.data.Data: Hypergraph dataset.
- topobenchmarkx.data.utils.io_utils.read_us_county_demos(path, year=2012, y_col='Election')
Load US County Demos dataset.
- Parameters:
- path (str): Path to the dataset.
- year (int, optional): Year for which to load the features (default: 2012).
- y_col (str, optional): Column to use as label. Can be one of ['Election', 'MedianIncome', 'MigraRate', 'BirthRate', 'DeathRate', 'BachelorRate', 'UnemploymentRate'] (default: "Election").
- Returns:
- torch_geometric.data.Data: Data object of the graph for the US County Demos dataset.
Split utilities.
- topobenchmarkx.data.utils.split_utils.assing_train_val_test_mask_to_graphs(dataset, split_idx)
Split the graph dataset into train, validation, and test datasets.
- Parameters:
- dataset (torch_geometric.data.Dataset): Considered dataset.
- split_idx (dict): Dictionary containing the train, validation, and test indices.
- Returns:
- list: List containing the train, validation, and test datasets.
- topobenchmarkx.data.utils.split_utils.k_fold_split(labels, parameters)
Return train and valid indices as in K-Fold Cross-Validation.
If the split already exists, it is loaded automatically; otherwise, the split file is created for subsequent runs.
- Parameters:
- labels (torch.Tensor): Label tensor.
- parameters (DictConfig): Configuration parameters.
- Returns:
- dict: Dictionary containing the train, validation, and test indices, with keys “train”, “valid”, and “test”.
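The load-or-create behaviour described for k_fold_split can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: the function name k_fold_indices, the cache file name, and the JSON format are hypothetical; the folds are plain (unstratified) and ignore labels; and only "train" and "valid" indices are produced, because how the "test" key is filled is not specified here.

```python
import json
import os
import random
import tempfile


def k_fold_indices(n_samples, k, seed, split_dir):
    """Load cached K-fold index splits from split_dir, or create and cache them."""
    # Hypothetical cache file name; the library's naming scheme may differ.
    path = os.path.join(split_dir, f"{k}-fold_seed{seed}.json")
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    splits = []
    for i in range(k):
        # Fold i is held out for validation; the rest is used for training.
        train = sorted(j for fold_id, fold in enumerate(folds) if fold_id != i for j in fold)
        splits.append({"train": train, "valid": sorted(folds[i])})

    os.makedirs(split_dir, exist_ok=True)
    with open(path, "w") as handle:
        json.dump(splits, handle)
    return splits


with tempfile.TemporaryDirectory() as tmp:
    first = k_fold_indices(10, k=5, seed=0, split_dir=tmp)   # creates the file
    second = k_fold_indices(10, k=5, seed=0, split_dir=tmp)  # loads the file
    print(first == second)  # True
```

Caching the split on disk keeps the folds identical across runs, which is what makes results from separate runs comparable.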
- topobenchmarkx.data.utils.split_utils.load_coauthorship_hypergraph_splits(data, parameters, train_prop=0.5)
Load the split generated by the rand_train_test_idx function.
- Parameters:
- data (torch_geometric.data.Data): Graph dataset.
- parameters (DictConfig): Configuration parameters.
- train_prop (float): Proportion of training data (default: 0.5).
- Returns:
- torch_geometric.data.Data: Graph dataset with the specified split.
- topobenchmarkx.data.utils.split_utils.load_inductive_splits(dataset, parameters)
Load multiple-graph datasets with the specified split.
- Parameters:
- dataset (torch_geometric.data.Dataset): Graph dataset.
- parameters (DictConfig): Configuration parameters.
- Returns:
- list: List containing the train, validation, and test splits.
- topobenchmarkx.data.utils.split_utils.load_transductive_splits(dataset, parameters)
Load the graph dataset with the specified split.
- Parameters:
- dataset (torch_geometric.data.Dataset): Graph dataset.
- parameters (DictConfig): Configuration parameters.
- Returns:
- list: List containing the train, validation, and test splits.
- topobenchmarkx.data.utils.split_utils.random_splitting(labels, parameters, global_data_seed=42)
Randomly split the labels into train/valid/test splits.
Adapted from CUAI/Non-Homophily-Benchmarks.
- Parameters:
- labels (torch.Tensor): Label tensor.
- parameters (DictConfig): Configuration parameters.
- global_data_seed (int): Seed for the random number generator (default: 42).
- Returns:
- dict: Dictionary containing the train, validation, and test indices, with keys “train”, “valid”, and “test”.
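A seeded random split of this kind can be sketched as follows. This is an illustration only, not the library's implementation: the function name random_split_indices and the valid_prop argument are hypothetical (only train_prop is documented above), and the sketch works on index positions rather than a label tensor.

```python
import random


def random_split_indices(n_samples, train_prop, valid_prop, seed=42):
    """Shuffle indices with a fixed seed and cut them into three parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(train_prop * n_samples)
    n_valid = int(valid_prop * n_samples)  # valid_prop is an assumed parameter
    return {
        "train": indices[:n_train],
        "valid": indices[n_train:n_train + n_valid],
        "test": indices[n_train + n_valid:],
    }


split = random_split_indices(100, train_prop=0.5, valid_prop=0.25)
print({name: len(idx) for name, idx in split.items()})
# {'train': 50, 'valid': 25, 'test': 25}
```

Using a dedicated random.Random(seed) instance keeps the split reproducible without touching the global random state.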
Data utilities.
- topobenchmarkx.data.utils.utils.ensure_serializable(obj)
Ensure that the object is serializable.
- Parameters:
- obj (object): Object to make serializable.
- Returns:
- object: Object that is serializable.
- topobenchmarkx.data.utils.utils.generate_zero_sparse_connectivity(m, n)
Generate a zero sparse connectivity matrix.
- Parameters:
- m (int): Number of rows.
- n (int): Number of columns.
- Returns:
- torch.sparse_coo_tensor: Zero sparse connectivity matrix.
- topobenchmarkx.data.utils.utils.get_complex_connectivity(complex, max_rank, signed=False)
Get the connectivity matrices for the complex.
- Parameters:
- complex (toponetx.CellComplex or toponetx.SimplicialComplex): Cell complex or simplicial complex.
- max_rank (int): Maximum rank of the complex.
- signed (bool, optional): If True, return signed connectivity matrices (default: False).
- Returns:
- dict: Dictionary containing the connectivity matrices.
- topobenchmarkx.data.utils.utils.load_cell_complex_dataset(cfg)
Load cell complex datasets.
- Parameters:
- cfg (DictConfig): Configuration parameters.
- topobenchmarkx.data.utils.utils.load_manual_graph()
Create a manual graph for testing purposes.
- Returns:
- torch_geometric.data.Data: Manual graph.