{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train a hypergraph neural network using UniGCNII layers\n",
    "\n",
    "This tutorial consists of three main steps:\n",
    "1. Loading the CiCitationCora dataset and lifting it to the hypergraph domain.\n",
    "2. Defining a hypergraph neural network (HGNN) which utlilizes the UniGCNII layer, and\n",
    "3. Training the obtained neural network on the training data and evaluating it on test data.\n",
    "\n",
    "First, we import the neccessary packages.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import torch\n",
    "import torch_geometric.datasets as geom_datasets\n",
    "from torch_geometric.utils import to_undirected\n",
    "\n",
    "from topomodelx.nn.hypergraph.unigcnii import UniGCNII\n",
    "\n",
    "torch.manual_seed(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If GPUs are available, we want to make use of them, otherwise the model is run on CPU."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cuda\n"
     ]
    }
   ],
   "source": [
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "print(device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pre-processing\n",
    "\n",
    "## Import data ##\n",
    "\n",
    "The first step is to import the dataset, Cora, a benchmark classification datase. We then lift the graph into our domain of choice, a hypergraph.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/lib/python3.11/site-packages/torch_geometric/data/in_memory_dataset.py:284: UserWarning: It is not recommended to directly access the internal storage format `data` of an 'InMemoryDataset'. If you are absolutely certain what you are doing, access the internal storage via `InMemoryDataset._data` instead to suppress this warning. Alternatively, you can access stacked individual attributes of every graph via `dataset.{attr_name}`.\n",
      "  warnings.warn(msg)\n"
     ]
    }
   ],
   "source": [
    "cora = geom_datasets.Planetoid(root=\"tmp/\", name=\"cora\")\n",
    "data = cora.data\n",
    "\n",
    "x_0s = data.x\n",
    "y = data.y\n",
    "edge_index = data.edge_index\n",
    "\n",
    "train_mask = data.train_mask\n",
    "val_mask = data.val_mask\n",
    "test_mask = data.test_mask"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define neighborhood structures and lift into hypergraph domain. ##\n",
    "\n",
    "Now we retrieve the neighborhood structure (i.e. their representative matrice) that we will use to send messges from node to hyperedges. In the case of this architecture, we need the boundary matrix (or incidence matrix) $B_1$ with shape $n_\\text{nodes} \\times n_\\text{edges}$.\n",
    "\n",
    "In citation Cora dataset we lift graph structure to the hypergraph domain by creating hyperedges from 1-hop graph neighbourhood of each node. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ensure the graph is undirected (optional but often useful for one-hop neighborhoods).\n",
    "edge_index = to_undirected(edge_index)\n",
    "\n",
    "# Create a list of one-hop neighborhoods for each node.\n",
    "one_hop_neighborhoods = []\n",
    "for node in range(data.num_nodes):\n",
    "    # Get the one-hop neighbors of the current node.\n",
    "    neighbors = data.edge_index[1, data.edge_index[0] == node]\n",
    "\n",
    "    # Append the neighbors to the list of one-hop neighborhoods.\n",
    "    one_hop_neighborhoods.append(neighbors.numpy())\n",
    "\n",
    "# Detect and eliminate duplicate hyperedges.\n",
    "unique_hyperedges = set()\n",
    "hyperedges = []\n",
    "for neighborhood in one_hop_neighborhoods:\n",
    "    # Sort the neighborhood to ensure consistent comparison.\n",
    "    neighborhood = tuple(sorted(neighborhood))\n",
    "    if neighborhood not in unique_hyperedges:\n",
    "        hyperedges.append(list(neighborhood))\n",
    "        unique_hyperedges.add(neighborhood)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Additionally we print the statictis associated with obtained incidence matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can load the loaded dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hyperedge statistics: \n",
      "Number of hyperedges without duplicated hyperedges 2581\n",
      "min = 1, \n",
      "max = 168, \n",
      "mean = 4.003099573808601, \n",
      "median = 3.0, \n",
      "std = 5.327622607829558, \n",
      "Number of hyperedges with size equal to one = 412\n"
     ]
    }
   ],
   "source": [
    "# Calculate hyperedge statistics.\n",
    "hyperedge_sizes = [len(he) for he in hyperedges]\n",
    "min_size = min(hyperedge_sizes)\n",
    "max_size = max(hyperedge_sizes)\n",
    "mean_size = np.mean(hyperedge_sizes)\n",
    "median_size = np.median(hyperedge_sizes)\n",
    "std_size = np.std(hyperedge_sizes)\n",
    "num_single_node_hyperedges = sum(np.array(hyperedge_sizes) == 1)\n",
    "\n",
    "# Print the hyperedge statistics.\n",
    "print(\"Hyperedge statistics: \")\n",
    "print(\"Number of hyperedges without duplicated hyperedges\", len(hyperedges))\n",
    "print(f\"min = {min_size}, \")\n",
    "print(f\"max = {max_size}, \")\n",
    "print(f\"mean = {mean_size}, \")\n",
    "print(f\"median = {median_size}, \")\n",
    "print(f\"std = {std_size}, \")\n",
    "print(f\"Number of hyperedges with size equal to one = {num_single_node_hyperedges}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_edges = len(hyperedges)\n",
    "incidence_1 = np.zeros((x_0s.shape[0], max_edges))\n",
    "for col, neighibourhood in enumerate(hyperedges):\n",
    "    for row in neighibourhood:\n",
    "        incidence_1[row, col] = 1\n",
    "\n",
    "assert all(incidence_1.sum(0) > 0) is True, \"Some hyperedges are empty\"\n",
    "assert all(incidence_1.sum(1) > 0) is True, \"Some nodes are not in any hyperedges\"\n",
    "incidence_1 = torch.Tensor(incidence_1).to_sparse_coo()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a neural network"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the network that initializes the base model and sets up the readout operation.\n",
    "Different downstream tasks might require different pooling procedures.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Network(torch.nn.Module):\n",
    "    \"\"\"Network class that initializes the base model and readout layer.\n",
    "\n",
    "    Base model parameters:\n",
    "    ----------\n",
    "    Reqired:\n",
    "    in_channels : int\n",
    "        Dimension of the input features.\n",
    "    hidden_channels : int\n",
    "        Dimension of the hidden features.\n",
    "\n",
    "    Optitional:\n",
    "    **kwargs : dict\n",
    "        Additional arguments for the base model.\n",
    "\n",
    "    Readout layer parameters:\n",
    "    ----------\n",
    "    out_channels : int\n",
    "        Dimension of the output features.\n",
    "    task_level : str\n",
    "        Level of the task. Either \"graph\" or \"node\".\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(\n",
    "        self, in_channels, hidden_channels, out_channels, task_level=\"graph\", **kwargs\n",
    "    ):\n",
    "        super().__init__()\n",
    "\n",
    "        # Define the model\n",
    "        self.base_model = UniGCNII(\n",
    "            in_channels=in_channels, hidden_channels=hidden_channels, **kwargs\n",
    "        )\n",
    "\n",
    "        # Readout\n",
    "        self.linear = torch.nn.Linear(hidden_channels, out_channels)\n",
    "        self.out_pool = task_level == \"graph\"\n",
    "\n",
    "    def forward(self, x_0, incidence_1):\n",
    "        # Base model\n",
    "        x_0, x_1 = self.base_model(x_0, incidence_1)\n",
    "\n",
    "        # Pool over all nodes in the hypergraph\n",
    "        x = torch.max(x_0, dim=0)[0] if self.out_pool is True else x_0\n",
    "\n",
    "        return self.linear(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Base model hyperparameters\n",
    "in_channels = x_0s.shape[1]\n",
    "hidden_channels = 128\n",
    "n_layers = 2\n",
    "mlp_num_layers = 1\n",
    "input_drop = 0.5\n",
    "\n",
    "# Readout hyperparameters\n",
    "out_channels = torch.unique(y).shape[0]\n",
    "task_level = \"graph\" if out_channels == 1 else \"node\"\n",
    "\n",
    "\n",
    "model = Network(\n",
    "    in_channels=in_channels,\n",
    "    hidden_channels=hidden_channels,\n",
    "    out_channels=out_channels,\n",
    "    n_layers=n_layers,\n",
    "    input_drop=input_drop,\n",
    "    task_level=task_level,\n",
    ").to(device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the neural network\n",
    "\n",
    "First, we specify the hyperparameters of the training process."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note: The number of epochs below have been kept low to facilitate debugging and testing. Real use cases should likely require more epochs.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_epochs = 5\n",
    "test_interval = 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we can generate the corresponding model, optimizer, and loss function with corresponding sizes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optimizer and loss\n",
    "optimizer = torch.optim.Adam(model.parameters(), lr=0.01)\n",
    "\n",
    "# Categorial cross-entropy loss\n",
    "loss_fn = torch.nn.CrossEntropyLoss()\n",
    "\n",
    "\n",
    "# Accuracy\n",
    "def acc_fn(y, y_hat):\n",
    "    return (y == y_hat).float().mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we are ready to train the created model and evaluate the performance on the validation set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_7355/1422611997.py:1: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).\n",
      "  x_0s = torch.tensor(x_0s)\n",
      "/tmp/ipykernel_7355/1422611997.py:5: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).\n",
      "  torch.tensor(y, dtype=torch.long).to(device),\n"
     ]
    }
   ],
   "source": [
    "x_0s = torch.tensor(x_0s)\n",
    "x_0s, incidence_1, y = (\n",
    "    x_0s.float().to(device),\n",
    "    incidence_1.float().to(device),\n",
    "    torch.tensor(y, dtype=torch.long).to(device),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 0               Train_loss: 6.3311, Train_acc: 0.1429               Test_loss: 8.4650, Test_acc: 0.1430\n",
      "Epoch: 5               Train_loss: 1.7255, Train_acc: 0.6000               Test_loss: 1.7909, Test_acc: 0.4460\n",
      "Epoch: 10               Train_loss: 1.4779, Train_acc: 0.6714               Test_loss: 1.6235, Test_acc: 0.5080\n",
      "Epoch: 15               Train_loss: 1.2046, Train_acc: 0.7786               Test_loss: 1.5192, Test_acc: 0.5790\n",
      "Epoch: 20               Train_loss: 0.8618, Train_acc: 0.8500               Test_loss: 1.4219, Test_acc: 0.6350\n",
      "Epoch: 25               Train_loss: 0.6326, Train_acc: 0.8643               Test_loss: 1.4752, Test_acc: 0.6700\n"
     ]
    }
   ],
   "source": [
    "for epoch in range(num_epochs):\n",
    "    # set model to training mode\n",
    "    model.train()\n",
    "\n",
    "    y_hat = model(x_0s, incidence_1)\n",
    "    loss = loss_fn(y_hat[train_mask], y[train_mask])\n",
    "\n",
    "    loss.backward()\n",
    "    optimizer.step()\n",
    "    optimizer.zero_grad()\n",
    "\n",
    "    if epoch % test_interval == 0:\n",
    "        model.eval()\n",
    "        y_hat = model(x_0s, incidence_1)\n",
    "\n",
    "        train_loss = loss_fn(y_hat[train_mask], y[train_mask])\n",
    "        loss = loss_fn(y_hat[test_mask], y[test_mask])\n",
    "        print(\n",
    "            f\"Epoch: {epoch} \\\n",
    "              Train_loss: {train_loss:.4f}, Train_acc: {acc_fn(y_hat[train_mask].argmax(1), y[train_mask]):.4f} \\\n",
    "              Test_loss: {loss:.4f}, Test_acc: {acc_fn(y_hat[test_mask].argmax(1), y[test_mask]):.4f}\",\n",
    "            flush=True,\n",
    "        )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.11.6 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "97e7f600578393f7b22fad5e1bb04e54aa849deabd28651fd7e27af1b0c8a034"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}