SGL: Scalable Graph Learning¶
SGL is a Graph Neural Network (GNN) toolkit targeting scalable graph learning, which supports deep graph learning on extremely large datasets. SGL allows users to easily implement scalable graph neural networks and evaluate their performance on various downstream tasks like node classification, node clustering, and link prediction. Further, SGL supports auto neural architecture search functionality based on OpenBox. SGL is designed and developed by the graph learning team from the DAIR Lab at Peking University.
Library Highlights¶
High scalability: Following the scalable design paradigm SGAP proposed in PaSca, SGL scales to graph data with billions of nodes and edges.
Auto neural architecture search: Automatically choose decent neural architectures according to specific tasks and pre-defined objectives (e.g., inference time).
Ease of use: User-friendly interfaces for implementing existing scalable GNNs and executing various downstream tasks.
License¶
The entire codebase is under the MIT license.
Overview¶
Main Functionalities¶
A handy platform for implementing and evaluating scalable GNNs.
Scalable learning on various graph-related tasks, including node classification, node clustering, and link prediction.
Auto neural architecture search on given tasks, datasets and objectives.
Training paradigm¶
The main design goal of SGL is to support scalable graph learning. SGL adopts the scalable training paradigm SGAP (Scalable Graph Architecture Paradigm) proposed in PaSca. SGAP splits the conventional GNN training process into three independent stages (Preprocessing, Training, and Postprocessing), which can be represented as follows:
- Preprocessing: \(\textbf{M}=graph\_propagate(\textbf{A}, \textbf{X})\); \(\textbf{X}'=message\_aggregate(\textbf{M})\)
SGAP propagates and aggregates information at the graph level.
- Training: \(\textbf{Y}=model\_train(\textbf{X}')\)
SGAP feeds the propagated and aggregated information into a machine learning model (e.g., SVM, MLP) for training.
- Postprocessing: \(\textbf{M}'=graph\_propagate(\textbf{A},\textbf{Y})\); \(\textbf{Y}'=message\_aggregate(\textbf{M}')\)
SGAP again propagates and aggregates the outputs of the previous stage at the graph level.
Note
The first \(message\_aggregate\) operation in the Preprocessing stage will be moved to the Training stage if it contains learnable parameters; the second \(message\_aggregate\) operation in the Postprocessing stage is prohibited from containing learnable parameters.
Compared to the conventional GNN training process, SGAP has two main advantages:
The time- and resource-consuming propagation operation is executed only twice during the full training process, while in the conventional GNN training process the number of propagation executions equals the number of training epochs, which is usually far greater than two.
The dependencies between training examples are fully handled in the Preprocessing stage. Thus, the training examples can be freely split into small batches to feed into the model in the Training stage, which boosts both the efficiency and the scalability of the training process.
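To make the paradigm concrete, below is a minimal NumPy sketch of the two Preprocessing operations (a parameter-free propagation and a simple mean aggregation); it illustrates the structure of SGAP, not SGL's actual implementation:

import numpy as np

def graph_propagate(adj, x, prop_steps=3):
    # Multiply by the (normalized) adjacency matrix prop_steps times,
    # collecting one propagated matrix per depth.
    messages, h = [], x
    for _ in range(prop_steps):
        h = adj @ h
        messages.append(h)
    return messages

def message_aggregate(messages):
    # A parameter-free aggregation: average the matrices over depths.
    return np.mean(messages, axis=0)

# Preprocessing runs exactly once; the resulting x_prime can then be
# mini-batched freely and fed to any model (MLP, SVM, ...) for training.
# x_prime = message_aggregate(graph_propagate(adj, x))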
Model construction paradigm¶
Corresponding to its training paradigm, SGAP, SGL needs to define the behaviors of the two \(graph\_propagate\) and two \(message\_aggregate\) operations for each GNN model. To fulfill this goal, SGL designs three important modules:
Graph Operator: to carry out the functionality of \(graph\_propagate\). It receives the adjacency matrix \(\textbf{A}\) and the node representation matrix \(\textbf{X}\), and outputs a list of propagated information matrices of different propagation depths.
Message Operator: to carry out the functionality of \(message\_aggregate\). It receives a list of propagated information matrices and aggregates the matrices according to pre-defined behaviors. The final output of each Message Operator is a single matrix.
Base Model: to carry out the functionality of \(model\_train\). It can be not only a deep learning model like an MLP, but also a traditional machine learning method like an SVM or a random forest.
To construct a GNN model in SGL, users only need to fill in some blanks with pre-/user-defined Graph Operators, Message Operators, and Base Models. Please refer to the models part for the detailed API for constructing models. SGL also provides simple interfaces for defining new Graph Operators and Message Operators; please refer to the operators part for more details.
Installation¶
Some datasets in SGL are constructed based on PyG.
Please follow the instructions at https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html to install PyG before installing SGL.
Install from pip¶
Once PyG has been installed, SGL can be installed from PyPI by:
pip install sgl-dair
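As a quick sanity check, the package should import under the name sgl, which is the name used by all of the examples below:

python -c "import sgl"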
Quick Start¶
In this short tutorial, we will quickly go through the basic and advanced usage of SGL. The tutorial is composed of the following parts:
Basic usage¶
In this part, we will introduce the basic usage of SGL, including how to execute graph-related tasks and how to use the NAS (Neural Architecture Search) functionality; a minimal end-to-end example is sketched below.
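The sketch below, adapted from the pattern in SGL's README, trains SGC on a citation dataset with the node classification task. The module paths (sgl.dataset.Planetoid, sgl.models.homo.SGC, sgl.tasks.NodeClassification) and the hyperparameters are assumptions that should be verified against the installed version:

import torch
from sgl.dataset import Planetoid          # assumed dataset entry point
from sgl.models.homo import SGC            # assumed model entry point
from sgl.tasks import NodeClassification   # assumed task entry point

# Download and process the Pubmed citation dataset.
dataset = Planetoid("pubmed", "./", "official")
# A 3-step SGC sized to the dataset's features and classes.
model = SGC(prop_steps=3, feat_dim=dataset.num_features,
            output_dim=dataset.num_classes)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Running the task trains the model and reports test accuracy.
test_acc = NodeClassification(dataset, model, lr=0.1, weight_decay=5e-5,
                              epochs=200, device=device).test_acc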
Auto neural architecture search (TODO)¶
Advanced usage¶
In this part, we will introduce the advanced usage of SGL, including adopting user-defined datasets, building models under the SGAP paradigm, and implementing new Graph Operators and Message Operators.
Adopt user-defined datasets¶
SGL designs two base classes, NodeDataset and HeteroNodeDataset, for homogeneous graph datasets and heterogeneous graph datasets, respectively.
Below, we take implementing a homogeneous graph dataset as an example to explain how to adopt user-defined datasets.
To implement a new homogeneous graph dataset, one first has to inherit the base class NodeDataset, whose detailed introduction can be found in the data part.
Then, there are two important virtual functions to implement:
- download: download the raw files of the dataset from the Internet and store them in pre-defined places;
- process: process the raw files fetched by download and store the processed data in the data class Graph.
The data class Graph is designed to store the critical data of a homogeneous graph; the corresponding data class for heterogeneous graphs is HeteroGraph.
To instantiate Graph, one needs to provide at least the following information:
- row: the row indices of the edges in the graph;
- col: the column indices of the edges in the graph;
- edge_weight: the weights of the edges in the graph;
- edge_type: the types of the edges in the graph;
- num_node: the total number of nodes in the graph;
- node_type: the types of the nodes in the graph.
The datasets in the datasets part all follow the same construction scheme.
Please refer to the data part for a more detailed introduction to the two base classes, NodeDataset and HeteroNodeDataset. A minimal sketch of a user-defined dataset follows.
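The sketch below assembles a toy homogeneous dataset; the import paths and the exact NodeDataset/Graph constructor signatures are assumptions made for illustration and should be checked against the data part:

import numpy as np

# Hypothetical import paths; see the data part for the actual layout.
from sgl.data.base_data import Graph
from sgl.data.base_dataset import NodeDataset


class ToyDataset(NodeDataset):
    """A toy homogeneous graph: two nodes joined by one undirected edge."""

    def download(self):
        # A real dataset would fetch its raw files here and store them
        # in a pre-defined place; this toy example has nothing to fetch.
        pass

    def process(self):
        # Assemble the six required fields and store them in Graph
        # (keyword names mirror the list above, not SGL's exact signature).
        row = np.array([0, 1])
        col = np.array([1, 0])
        edge_weight = np.ones(2)
        return Graph(row=row, col=col, edge_weight=edge_weight,
                     edge_type="edge", num_node=2, node_type="node")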
Build models under the SGAP paradigm¶
SGL adopts SGAP (Scalable Graph Architecture Paradigm) as its training paradigm. Correspondingly, its model construction paradigm differs from the conventional message-passing paradigm. A detailed introduction to the model construction paradigm of SGL is provided in the overview. Below, we explain how to build SGC in SGL.
As introduced in the overview, a GNN model in SGL is composed of five parts:
pre_graph_op, pre_msg_op: Graph Operator and Message Operator for the Preprocessing stage;
base_model: Base Model for the Training stage;
post_graph_op, post_msg_op: Graph Operator and Message Operator for the Postprocessing stage.
Thus, after inheriting the base class BaseSGAPModel, users only have to assign each module a pre-/user-defined Graph Operator, Message Operator, or Base Model when building models.
The behaviors of the adopted Graph Operators, Message Operators, and Base Models determine the behavior of the resulting GNN model.
The code of building SGC is provided below:
from sgl.models.base_model import BaseSGAPModel
from sgl.models.simple_models import LogisticRegression
from sgl.operators.graph_op import LaplacianGraphOp
from sgl.operators.message_op import LastMessageOp
class SGC(BaseSGAPModel):
    def __init__(self, prop_steps, feat_dim, output_dim):
        super(SGC, self).__init__(prop_steps, feat_dim, output_dim)

        # Preprocessing: propagate with a symmetrically normalized
        # adjacency matrix (r=0.5), keeping only the last propagated matrix.
        self._pre_graph_op = LaplacianGraphOp(prop_steps, r=0.5)
        self._pre_msg_op = LastMessageOp()
        # Training: logistic regression as the Base Model.
        self._base_model = LogisticRegression(feat_dim, output_dim)
Note
The LaplacianGraphOp, LastMessageOp, and LogisticRegression are a pre-defined Graph Operator, Message Operator, and Base Model, respectively.
Note
SGC does not have a Postprocessing stage in its training process. Thus, the modules used for the Postprocessing stage do not appear in the construction of SGC.
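As a quick usage illustration, the model above can be instantiated directly; the dimensions below are hypothetical (roughly Cora-sized features with seven classes):

# Hypothetical dimensions, chosen only for illustration.
model = SGC(prop_steps=3, feat_dim=1433, output_dim=7)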
In the following parts of this tutorial, we will introduce ways to implement new Graph Operators and Message Operators.
Implement new Graph Operators¶
As introduced in the overview, the behavior of a Graph Operator is represented as \(\textbf{M}=graph\_propagate(\textbf{A}, \textbf{X})\). Thus, the critical part of implementing a new Graph Operator is to determine the value of the propagation matrix \(\textbf{A}\).
In SGL, after inheriting the base class GraphOp, users only need to implement the virtual function _construct_adj, which takes in the original adjacency matrix of the graph and outputs the desired propagation matrix.
Below is the implementation of the PPR (Personalized PageRank) Graph Operator:
import scipy.sparse as sp

# GraphOp and adj_to_symmetric_norm are provided by SGL's operator utilities.
class PprGraphOp(GraphOp):
    def __init__(self, prop_steps, r=0.5, alpha=0.15):
        super(PprGraphOp, self).__init__(prop_steps)
        self.__r = r          # normalization exponent
        self.__alpha = alpha  # PPR teleport probability

    def _construct_adj(self, adj):
        # Normalize the adjacency matrix, then form the PPR-style
        # propagation matrix (1 - alpha) * A_norm + alpha * I.
        adj_normalized = adj_to_symmetric_norm(adj, self.__r)
        adj_normalized = (1 - self.__alpha) * adj_normalized + self.__alpha * sp.eye(adj.shape[0])
        return adj_normalized.tocsr()
Please refer to the operators part for a more detailed introduction.
Implement new Message Operators¶
Similar to implementing new Graph Operators, implementing new Message Operators is easy in SGL. Users need to determine the behavior of the new Message Operator, represented as \(\textbf{X}'=message\_aggregate(\textbf{M})\).
Practically speaking, users have to implement the virtual function _combine after inheriting the base class MessageOp.
The code below provides the implementation of the ConcatMessageOp in SGL:
import torch

class ConcatMessageOp(MessageOp):
    def __init__(self, start, end):
        super(ConcatMessageOp, self).__init__(start, end)
        self._aggr_type = "concat"

    def _combine(self, feat_list):
        # Concatenate the propagated matrices of depths [start, end)
        # along the feature dimension.
        return torch.hstack(feat_list[self._start:self._end])
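As a quick illustration, the snippet below calls the protected _combine directly; in practice the operator is invoked through SGL's aggregation pipeline:

feats = [torch.randn(4, 8) for _ in range(3)]  # one matrix per propagation depth
op = ConcatMessageOp(start=0, end=3)
print(op._combine(feats).shape)                # torch.Size([4, 24])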
Please refer to the operators part for a more detailed introduction.
sgl.data¶
sgl.datasets¶
sgl.operators.graph_op¶
sgl.operators.message_op¶
- class sgl.operators.message_op.IterateLearnableWeightedMessageOp(start, end, combination_type, *args)¶
  Bases: MessageOp
- class sgl.operators.message_op.LearnableWeightedMessageOp(start, end, combination_type, *args)¶
  Bases: MessageOp
- class sgl.operators.message_op.ProjectedConcatMessageOp(start, end, feat_dim, hidden_dim, num_layers)¶
  Bases: MessageOp