Data#

The data module of TopoBenchmarkX consists of several submodules:

  1. datasets

  2. load

  3. preprocess

  4. utils

Datasets#

Dataset class for US County Demographics dataset.

class topobenchmarkx.data.datasets.us_county_demos_dataset.USCountyDemosDataset(root: str, name: str, parameters: DictConfig)[source]#

Dataset class for US County Demographics dataset.

Parameters:
root : str

Root directory where the dataset will be saved.

name : str

Name of the dataset.

parameters : DictConfig

Configuration parameters for the dataset.

Attributes:
URLS (dict): Dictionary containing the URLs for downloading the dataset.
FILE_FORMAT (dict): Dictionary containing the file formats for the dataset.
RAW_FILE_NAMES (dict): Dictionary containing the raw file names for the dataset.
download() → None[source]#

Download the dataset from a URL and save it to the raw directory.

Raises:

FileNotFoundError – If the dataset URL is not found.

process() → None[source]#

Process the data for the dataset.

This method loads the US county demographics data, applies any pre-processing transformations if specified, and saves the processed data to the appropriate location.

property processed_dir: str#

Return the path to the processed directory of the dataset.

Returns:
str

Path to the processed directory.

property processed_file_names: str#

Return the processed file name for the dataset.

Returns:
str

Processed file name.

property raw_dir: str#

Return the path to the raw directory of the dataset.

Returns:
str

Path to the raw directory.

property raw_file_names: list[str]#

Return the raw file names for the dataset.

Returns:
list[str]

List of raw file names.
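
A minimal usage sketch for this dataset class, assuming it is driven by an omegaconf configuration; the configuration keys and values shown here (data_name, year, task_variable) are illustrative assumptions, not the canonical TopoBenchmarkX config.

from omegaconf import OmegaConf

from topobenchmarkx.data.datasets.us_county_demos_dataset import USCountyDemosDataset

# Hypothetical configuration values; the exact keys come from the project configs.
parameters = OmegaConf.create(
    {
        "data_name": "US-county-demos",  # assumed key/value
        "year": 2012,                    # assumed key/value
        "task_variable": "Election",     # assumed key/value
    }
)

dataset = USCountyDemosDataset(
    root="./datasets",
    name="US-county-demos",
    parameters=parameters,
)
print(dataset.raw_dir, dataset.processed_dir, dataset.raw_file_names)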

Load#

Abstract Loader class.

class topobenchmarkx.data.loaders.base.AbstractLoader(parameters: DictConfig)[source]#

Abstract class that provides an interface to load data.

Parameters:
parameters : DictConfig

Configuration parameters.

abstract load() → Data[source]#

Load data into a torch_geometric.data.Data object.

Raises:
NotImplementedError

If the method is not implemented.

Data loaders.

class topobenchmarkx.data.loaders.loaders.CellComplexLoader(parameters: DictConfig)[source]#

Loader for cell complex datasets.

Parameters:
parameters : DictConfig

Configuration parameters.

load() → Dataset[source]#

Load cell complex dataset.

Returns:
torch_geometric.data.Dataset

Dataset object containing the loaded data.

class topobenchmarkx.data.loaders.loaders.GraphLoader(parameters: DictConfig, **kwargs)[source]#

Loader for graph datasets.

Parameters:
parameters : DictConfig

Configuration parameters.

**kwargs : dict

Additional keyword arguments.

Notes

The parameters must contain the following keys:

- data_dir (str): The directory where the dataset is stored.
- data_name (str): The name of the dataset.
- data_type (str): The type of the dataset.
- split_type (str): The type of split to be used. It can be “fixed”, “random”, or “k-fold”.

If split_type is “random”, the parameters must also contain the following keys:

- data_seed (int): The seed for the split.
- data_split_dir (str): The directory where the split is stored.
- train_prop (float): The proportion of the training set.

If split_type is “k-fold”, the parameters must also contain the following keys:

- data_split_dir (str): The directory where the split is stored.
- k (int): The number of folds.
- data_seed (int): The seed for the split.

The parameters can be defined in a YAML file and then loaded using omegaconf.OmegaConf.load(‘path/to/dataset/config.yaml’); see the configuration sketch after this class entry.

load() → tuple[Dataset, str][source]#

Load graph dataset.

Returns:
tuple[torch_geometric.data.Dataset, str]

Tuple containing the loaded data and the data directory.
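
To make the key requirements above concrete, the sketch below builds the configuration in code rather than a YAML file and loads a dataset with GraphLoader. The key names follow the Notes; the values (dataset name, directories, split settings) are illustrative assumptions.

from omegaconf import OmegaConf

from topobenchmarkx.data.loaders.loaders import GraphLoader

# Keys mirror the Notes above; the values are placeholders, not a tested configuration.
parameters = OmegaConf.create(
    {
        "data_dir": "./datasets/graph",        # where the dataset is stored
        "data_name": "Cora",                   # illustrative dataset name
        "data_type": "cocitation",             # illustrative dataset type
        "split_type": "random",                # "fixed", "random", or "k-fold"
        # required because split_type == "random":
        "data_seed": 0,
        "data_split_dir": "./datasets/splits",
        "train_prop": 0.5,
    }
)

loader = GraphLoader(parameters)
dataset, data_dir = loader.load()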

class topobenchmarkx.data.loaders.loaders.HypergraphLoader(parameters: DictConfig)[source]#

Loader for hypergraph datasets.

Parameters:
parameters : DictConfig

Configuration parameters.

load() → Dataset[source]#

Load hypergraph dataset.

Returns:
torch_geometric.data.Dataset

Dataset object containing the loaded data.

class topobenchmarkx.data.loaders.loaders.SimplicialLoader(parameters: DictConfig)[source]#

Loader for simplicial datasets.

Parameters:
parameters : DictConfig

Configuration parameters.

load() → Dataset[source]#

Load simplicial dataset.

Returns:
torch_geometric.data.Dataset

Dataset object containing the loaded data.

Preprocess#

Preprocessor for datasets.

class topobenchmarkx.data.preprocessor.preprocessor.PreProcessor(dataset, data_dir, transforms_config=None, **kwargs)[source]#

Preprocessor for datasets.

Parameters:
dataset : list

List of data objects.

data_dir : str

Path to the directory containing the data.

transforms_config : DictConfig, optional

Configuration parameters for the transforms (default: None).

**kwargs : optional

Optional additional arguments.

instantiate_pre_transform(data_dir, transforms_config) → Compose[source]#

Instantiate the pre-transforms.

Parameters:
data_dir : str

Path to the directory containing the data.

transforms_config : DictConfig

Configuration parameters for the transforms.

Returns:
torch_geometric.transforms.Compose

Pre-transform object.

load(path: str) → None[source]#

Load the dataset from the given file path.

Parameters:
path : str

The path to the processed data.

load_dataset_splits(split_params) → tuple[DataloadDataset, DataloadDataset | None, DataloadDataset | None][source]#

Load the dataset splits.

Parameters:
split_params : dict

Parameters for loading the dataset splits.

Returns:
tuple

A tuple containing the train, validation, and test datasets.

process() → None[source]#

Process the data.

property processed_dir: str#

Return the path to the processed directory.

Returns:
str

Path to the processed directory.

property processed_file_names: str#

Return the name of the processed file.

Returns:
str

Name of the processed file.

save_transform_parameters() → None[source]#

Save the transform parameters.

set_processed_data_dir(pre_transforms_dict, data_dir, transforms_config) → None[source]#

Set the processed data directory.

Parameters:
pre_transforms_dict : dict

Dictionary containing the pre-transforms.

data_dir : str

Path to the directory containing the data.

transforms_config : DictConfig

Configuration parameters for the transforms.
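
A sketch of the typical preprocessing flow, assuming the dataset comes from a loader such as GraphLoader and that the split parameters follow the keys used by the split utilities below; the learning_setting key and all values are assumptions.

from omegaconf import OmegaConf

from topobenchmarkx.data.loaders.loaders import GraphLoader
from topobenchmarkx.data.preprocessor.preprocessor import PreProcessor

# Illustrative loader configuration (see the GraphLoader sketch in the Load section).
loader_params = OmegaConf.create(
    {
        "data_dir": "./datasets/graph",
        "data_name": "Cora",
        "data_type": "cocitation",
        "split_type": "random",
        "data_seed": 0,
        "data_split_dir": "./datasets/splits",
        "train_prop": 0.5,
    }
)
dataset, data_dir = GraphLoader(loader_params).load()

# No lifting/transforms applied here; pass a DictConfig to enable them.
preprocessor = PreProcessor(dataset, data_dir, transforms_config=None)

# Split parameters; the key names are assumptions mirroring the split utilities below.
split_params = OmegaConf.create(
    {
        "learning_setting": "transductive",  # assumed key
        "split_type": "random",
        "data_seed": 0,
        "data_split_dir": "./datasets/splits",
        "train_prop": 0.5,
    }
)
train_data, val_data, test_data = preprocessor.load_dataset_splits(split_params)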

Utils#

Data IO utilities.

topobenchmarkx.data.utils.io_utils.download_file_from_drive(file_link, path_to_save, dataset_name, file_format='tar.gz')[source]#

Download a file from a Google Drive link and save it to the specified path.

Parameters:
file_link : str

The Google Drive link of the file to download.

path_to_save : str

The path where the downloaded file will be saved.

dataset_name : str

The name of the dataset.

file_format : str, optional

The format of the downloaded file. Defaults to “tar.gz”.

Raises:
None

topobenchmarkx.data.utils.io_utils.get_file_id_from_url(url)[source]#

Extract the file ID from a Google Drive file URL.

Parameters:
url : str

The Google Drive file URL.

Returns:
str

The file ID extracted from the URL.

Raises:
ValueError

If the provided URL is not a valid Google Drive file URL.
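
For illustration only (this is not the library's implementation), one way to pull the file ID out of the two common Google Drive URL shapes:

import re


def extract_drive_file_id(url: str) -> str:
    """Return the file ID from a Google Drive URL, or raise ValueError."""
    # Matches .../file/d/<ID>/... and ...?id=<ID> style links.
    match = re.search(r"(?:/d/|[?&]id=)([A-Za-z0-9_-]+)", url)
    if match is None:
        raise ValueError(f"Not a recognizable Google Drive file URL: {url}")
    return match.group(1)


print(extract_drive_file_id("https://drive.google.com/file/d/1AbC-xyz_123/view"))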

topobenchmarkx.data.utils.io_utils.load_hypergraph_pickle_dataset(cfg)[source]#

Load hypergraph datasets from pickle files.

Parameters:
cfg : DictConfig

Configuration parameters.

Returns:
torch_geometric.data.Data

Hypergraph dataset.

topobenchmarkx.data.utils.io_utils.read_us_county_demos(path, year=2012, y_col='Election')[source]#

Load US County Demos dataset.

Parameters:
path : str

Path to the dataset.

year : int, optional

Year to load the features (default: 2012).

y_col : str, optional

Column to use as label. Can be one of [‘Election’, ‘MedianIncome’, ‘MigraRate’, ‘BirthRate’, ‘DeathRate’, ‘BachelorRate’, ‘UnemploymentRate’] (default: “Election”).

Returns:
torch_geometric.data.Data

Data object of the graph for the US County Demos dataset.

Split utilities.

topobenchmarkx.data.utils.split_utils.assing_train_val_test_mask_to_graphs(dataset, split_idx)[source]#

Split the graph dataset into train, validation, and test datasets.

Parameters:
dataset : torch_geometric.data.Dataset

Considered dataset.

split_idx : dict

Dictionary containing the train, validation, and test indices.

Returns:
list

List containing the train, validation, and test datasets.

topobenchmarkx.data.utils.split_utils.k_fold_split(labels, parameters)[source]#

Return train and valid indices as in K-Fold Cross-Validation.

If the split already exists, it is loaded automatically; otherwise, the split file is created for subsequent runs.

Parameters:
labels : torch.Tensor

Label tensor.

parameters : DictConfig

Configuration parameters.

Returns:
dict

Dictionary containing the train, validation and test indices, with keys “train”, “valid”, and “test”.
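
Not the routine above itself, but a sketch of how such k-fold train/valid indices can be generated (here with scikit-learn and toy labels); the caching of the split file described above is omitted.

import numpy as np
from sklearn.model_selection import StratifiedKFold

labels = np.arange(100) % 3  # toy labels with three balanced classes

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
folds = [
    {"train": train_idx, "valid": valid_idx}  # a "test" entry would be added by the library routine
    for train_idx, valid_idx in skf.split(np.zeros(len(labels)), labels)
]
print(len(folds), folds[0]["valid"][:5])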

topobenchmarkx.data.utils.split_utils.load_coauthorship_hypergraph_splits(data, parameters, train_prop=0.5)[source]#

Load the split generated by the rand_train_test_idx function.

Parameters:
data : torch_geometric.data.Data

Graph dataset.

parameters : DictConfig

Configuration parameters.

train_prop : float

Proportion of training data.

Returns:
torch_geometric.data.Data

Graph dataset with the specified split.

topobenchmarkx.data.utils.split_utils.load_inductive_splits(dataset, parameters)[source]#

Load datasets consisting of multiple graphs with the specified split.

Parameters:
dataset : torch_geometric.data.Dataset

Graph dataset.

parameters : DictConfig

Configuration parameters.

Returns:
list

List containing the train, validation, and test splits.

topobenchmarkx.data.utils.split_utils.load_transductive_splits(dataset, parameters)[source]#

Load the graph dataset with the specified split.

Parameters:
dataset : torch_geometric.data.Dataset

Graph dataset.

parameters : DictConfig

Configuration parameters.

Returns:
list

List containing the train, validation, and test splits.

topobenchmarkx.data.utils.split_utils.random_splitting(labels, parameters, global_data_seed=42)[source]#

Randomly split labels into train/valid/test splits.

Adapted from CUAI/Non-Homophily-Benchmarks.

Parameters:
labels : torch.Tensor

Label tensor.

parameters : DictConfig

Configuration parameters.

global_data_seed : int

Seed for the random number generator.

Returns:
dict

Dictionary containing the train, validation and test indices with keys “train”, “valid”, and “test”.

Data utilities.

topobenchmarkx.data.utils.utils.ensure_serializable(obj)[source]#

Ensure that the object is serializable.

Parameters:
obj : object

Object to make serializable.

Returns:
object

Object that is serializable.

topobenchmarkx.data.utils.utils.generate_zero_sparse_connectivity(m, n)[source]#

Generate a zero sparse connectivity matrix.

Parameters:
m : int

Number of rows.

n : int

Number of columns.

Returns:
torch.sparse_coo_tensor

Zero sparse connectivity matrix.
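
A minimal sketch of what an all-zero m × n sparse matrix looks like in PyTorch; it illustrates the kind of object described above rather than reproducing the library function.

import torch

m, n = 4, 6
zero_connectivity = torch.sparse_coo_tensor(
    indices=torch.empty((2, 0), dtype=torch.long),  # no non-zero entries
    values=torch.empty(0),
    size=(m, n),
)
print(zero_connectivity)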

topobenchmarkx.data.utils.utils.get_complex_connectivity(complex, max_rank, signed=False)[source]#

Get the connectivity matrices for the complex.

Parameters:
complex : toponetx.CellComplex or toponetx.SimplicialComplex

Cell or simplicial complex.

max_rank : int

Maximum rank of the complex.

signed : bool, optional

If True, return signed connectivity matrices.

Returns:
dict

Dictionary containing the connectivity matrices.

topobenchmarkx.data.utils.utils.load_cell_complex_dataset(cfg)[source]#

Load cell complex datasets.

Parameters:
cfg : DictConfig

Configuration parameters.

topobenchmarkx.data.utils.utils.load_manual_graph()[source]#

Create a manual graph for testing purposes.

Returns:
torch_geometric.data.Data

Manual graph.

topobenchmarkx.data.utils.utils.load_simplicial_dataset(cfg)[source]#

Load simplicial datasets.

Parameters:
cfg : DictConfig

Configuration parameters.

Returns:
torch_geometric.data.Data

Simplicial dataset.

topobenchmarkx.data.utils.utils.make_hash(o)[source]#

Make a hash from a dictionary, list, tuple, or set, nested to any level, that contains only other hashable types.

Parameters:
o : dict, list, tuple, or set

Object to hash.

Returns:
int

Hash of the object.
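
Illustrative only (not the function's actual source): a recursive hash over nested containers in the spirit of make_hash.

def hash_nested(o):
    """Hash nested dicts/lists/tuples/sets made of hashable leaves."""
    if isinstance(o, (list, tuple)):
        return hash(tuple(hash_nested(e) for e in o))
    if isinstance(o, set):
        return hash(frozenset(hash_nested(e) for e in o))
    if isinstance(o, dict):
        return hash(tuple(sorted((k, hash_nested(v)) for k, v in o.items())))
    return hash(o)


print(hash_nested({"a": [1, 2, {3, 4}], "b": ("x", {"y": 5})}))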