1 GNN

Representation

Node embedding: Each node is embedded as a vector, and the entire graph is represented by an adjacency matrix $A \in R^{n \times n}$ and a feature matrix (one attribute vector per node) $X \in R^{n \times d}$
Graph embedding: Entire graph represented as a vector.

Properties

Permutation invariance (Graph embedding), $f (P A P^{T}, P X) = f (A, X)$ : Permuting the nodes in the adjacency matrix and feature matrix has no effect on the output. When we want to predict something about the entire graph, for example classifying a molecule, we want the model to be permutation invariant.
Permutation equivariance (Node embedding), $f (P A P^{T}, P X) = P f (A, X)$ : Permuting the adjacency matrix and feature matrix and then applying the function is equivalent to first applying the function and then permuting the output. In other words, just as translating the input of a CNN translates its output, permuting the adjacency matrix means the output of f is permuted in a consistent way.
We need to satisfy one of the two, i.e. invariance or equivariance, depending on the task.
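
Both properties can be checked numerically. The sketch below is only an illustration: `node_level` and `graph_level` are hypothetical toy functions (a neighbour sum and its total over nodes), not a trained model, and $X$ is assumed to have one row per node.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3

# Random undirected graph: symmetric 0/1 adjacency matrix, no self-loops.
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
A = (upper + upper.T).astype(float)
X = rng.normal(size=(n, d))          # one d-dimensional feature row per node

# Random permutation matrix P (row i of P @ X is row perm[i] of X).
perm = rng.permutation(n)
P = np.eye(n)[perm]

def node_level(A, X):
    """Equivariant: each node sums its own and its neighbours' features."""
    return (A + np.eye(len(A))) @ X

def graph_level(A, X):
    """Invariant: sum the node-level outputs into a single vector."""
    return node_level(A, X).sum(axis=0)

equivariant_ok = np.allclose(node_level(P @ A @ P.T, P @ X), P @ node_level(A, X))
invariant_ok = np.allclose(graph_level(P @ A @ P.T, P @ X), graph_level(A, X))
print(equivariant_ok, invariant_ok)  # True True
```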

Why do we not model a graph with an MLP?
Why do we not use a CNN? What are the similarities?

Similarity: Locality, Weight Sharing, arbitrary input size
Difference: Abstract shape

1.1 Message Passing

Take a graph $G = (V, E)$ along with a set of node features $X \in R^{∣ V ∣ \times d}$ (one row per node), and generate node embeddings $z_{u}, \forall u \in V$ .

What can node features be?

It depends on the problem we’re solving. For molecular graphs, this can be information about each atom in the molecule; for social graphs, it can be information about each individual member. When nodes have no individual features, the input can still be statistics of the node in the graph, and can even contain some information about the graph itself.
To break permutation equivariance, we can assign a positional encoding to each node, for example a one-hot vector.

Aggregate: in each round k, each node aggregates the messages (feature descriptions) from its neighbours and uses them to update its embedding. $m_{N (v)}^{(k)} = Agg^{(k)} (\{h_{u}^{(k - 1)} : u \in N (v)\})$

Initial embedding is set to be features of the node: $h_{u}^{(0)} = x_{u}, \forall u \in V$
The aggregator is a differentiable multiset function. Because its input is an unordered multiset of neighbour embeddings, the resulting layer is permutation equivariant.
Common choices: Sum, Mean, Max/Min.
$m_{N (v)}^{(k)} = MLP_{2} (\sum_{u \in N (v)} MLP_{1} (h_{u}, h_{v}))$ : Universal approximation of multiset functions.
The receptive field of each node grows with each iteration, as information from further away in the graph is aggregated.
Messages from the neighbours can encode structural information like degree of the neighbour node, useful in problems like analysing molecular graphs. Or can also encode feature-based information from local neighbourhood of the graph analogous to how CNNs aggregate feature information from spatially-defined patches.

Update: $h_{v}^{(k)} = Update^{(k)} (h_{v}^{(k - 1)}, m_{N (v)}^{(k)})$ .

$h_{v}^{(k)} = σ (W_{self} h_{v}^{(k - 1)} + W_{neigh} m_{N (v)}^{(k)} + b^{(k)})$ , where $W_{self}, W_{neigh} \in R^{d^{(k)} \times d^{(k - 1)}}$ and $b^{(k)} \in R^{d^{(k)}}$
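
A minimal sketch of one aggregate-and-update round, assuming sum aggregation and ReLU as $σ$. The function name `message_passing_round`, the toy path graph and the random weights are all made up for illustration.

```python
import numpy as np

def message_passing_round(neighbours, h, W_self, W_neigh, b):
    """One round: sum-aggregate neighbour embeddings, then update each node.

    neighbours: dict mapping node id -> list of neighbour ids
    h: dict mapping node id -> embedding of shape (d_in,)
    """
    new_h = {}
    for v, nbrs in neighbours.items():
        # Aggregate: sum over the multiset of neighbour embeddings.
        m = np.sum([h[u] for u in nbrs], axis=0) if nbrs else np.zeros_like(h[v])
        # Update: sigma(W_self h_v + W_neigh m + b), with sigma = ReLU here.
        new_h[v] = np.maximum(0.0, W_self @ h[v] + W_neigh @ m + b)
    return new_h

# Toy path graph 0 - 1 - 2 with 2-dimensional input features.
neighbours = {0: [1], 1: [0, 2], 2: [1]}
h0 = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([1.0, 1.0])}

rng = np.random.default_rng(0)
d_in, d_out = 2, 4
W_self = rng.normal(size=(d_out, d_in))
W_neigh = rng.normal(size=(d_out, d_in))
b = np.zeros(d_out)

h1 = message_passing_round(neighbours, h0, W_self, W_neigh, b)
```

Calling the function K times in a row (with per-round weights) gives each node a K-hop receptive field.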

Readout: $h_{G} = READOUT (\{h_{v}^{(K)} : v \in G\})$ : Outputs the final graph-level result after the K-th (final) iteration, like pooling in CNNs.
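
A readout can be as simple as a permutation-invariant reduction over the final node embeddings. The sketch below assumes the embeddings are stacked as rows of a matrix `H`; the `readout` helper and its `mode` argument are hypothetical names.

```python
import numpy as np

def readout(H, mode="sum"):
    """Collapse final node embeddings H (n x d) into one graph vector (d,)."""
    if mode == "sum":
        return H.sum(axis=0)
    if mode == "mean":
        return H.mean(axis=0)
    if mode == "max":
        return H.max(axis=0)
    raise ValueError(f"unknown readout mode: {mode}")

H = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]])
shuffled = H[[2, 0, 1]]   # same nodes in a different order

print(readout(H, "sum"))  # [9. 6.]
```

Reordering the nodes (`shuffled`) leaves every one of these readouts unchanged, which is what makes the graph embedding permutation invariant.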

We can also define a graph-level equation for aggregate and update, and we can even batch aggregate and update in one equation using self-loops:

$H^{(k)} = σ (A H^{(k - 1)} W_{neigh}^{(k)} + H^{(k - 1)} W_{self}^{(k)} + b^{(k)})$

$H^{(k)} = σ ((A + I) H^{(k - 1)} W^{(k)})$
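
A small sketch of the graph-level form, assuming ReLU as $σ$ and one row of $H$ per node (so the weight matrices here have shape $d^{(k - 1)} \times d^{(k)}$, the transpose of the per-node form). It also checks that the self-loop version is the special case where self and neighbour weights are shared and the bias is dropped.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def layer(A, H, W_self, W_neigh, b):
    """sigma(A H W_neigh + H W_self + b), with one row of H per node."""
    return relu(A @ H @ W_neigh + H @ W_self + b)

def layer_self_loops(A, H, W):
    """sigma((A + I) H W): shared weights for self and neighbours, no bias."""
    return relu((A + np.eye(len(A))) @ H @ W)

rng = np.random.default_rng(1)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # path graph 0 - 1 - 2
H = rng.normal(size=(3, 2))
W = rng.normal(size=(2, 4))

full = layer(A, H, W_self=W, W_neigh=W, b=np.zeros(4))
shared = layer_self_loops(A, H, W)
print(np.allclose(full, shared))  # True
```

Sharing the weights saves parameters, at the cost of no longer distinguishing a node's own embedding from its neighbours' messages.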

This is similar to an MLP in which each node holds a vector instead of a scalar: aggregate and update together play the role of the linear layer followed by the pointwise nonlinearity. So a GNN can capture structure that a simple MLP cannot.

In every iteration the embedding of a single node is updated using its neighbours, but will that d-dimensional vector saturate after many iterations? Will a single node embedding still contain any meaningful information about its neighbourhood?

Over-smoothing: The representations of all the nodes in the graph can become very similar to one another. This makes it hard to build deeper GNN models.

Formally, define the influence of a node’s input feature $h_{u}^{(0)} = x_{u}$ on the final-layer embedding of all other nodes in the graph $h_{v}^{(K)}, \forall v \in V$ . For any pair of nodes u, v in the graph, the influence of u on v is quantified using the Jacobian $I_{K} (u, v) = 1^{T} (\frac{\partial h _{v}^{(K)}}{\partial h _{u}^{(0)}}) 1$ .

Building deeper models can hurt performance of GNN models as with each added layer, information loss about local neighbourhood increases and learned embeddings are over-smoothed.
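
Over-smoothing is easy to reproduce in a stripped-down setting. The sketch below assumes a purely linear GNN with mean aggregation over each node and its neighbours (no weights, no nonlinearity) on a hypothetical 6-node cycle, and tracks how far apart the node embeddings are after each round.

```python
import numpy as np

# Cycle graph on 6 nodes.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

A_hat = A + np.eye(n)                         # add self-loops
M = A_hat / A_hat.sum(axis=1, keepdims=True)  # row-normalised: mean over N(v) and v

rng = np.random.default_rng(0)
H = rng.normal(size=(n, 3))

def spread(H):
    """Largest gap between any two nodes in any single feature."""
    return float((H.max(axis=0) - H.min(axis=0)).max())

spreads = [spread(H)]
for _ in range(50):
    H = M @ H          # linear message passing, no weights or nonlinearity
    spreads.append(spread(H))

print(spreads[0], spreads[5], spreads[50])
```

Each round replaces a node's embedding by an average, so the spread can never grow, and on a connected graph it shrinks towards zero: after enough rounds every node carries roughly the same vector.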

More on the influence of self-update and deeper models can be found in the GRL book and Xu et al.

1.2 Generalisations

1.2.1 Aggregate

Normalisation: Mean normalisation $m_{N (u)} = \frac{\sum _{v \in N (u)} h _{v}}{∣ N ( u )∣}$ or symmetric normalisation $m_{N (u)} = \sum_{v \in N (u)} \frac{h _{v}}{\sqrt{∣ N ( u )∣ ∣ N ( v )∣}}$ .
- Why does symmetric normalisation work better than mean?
- Normalisation leads to a loss of structural information, and is provably less powerful than sum aggregation. Normalisation helps most when feature information is more useful than structural information.
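
The loss of structural information can be seen on two star graphs whose nodes all carry the same feature. This is a toy sketch: the `aggregate` and `star` helpers are hypothetical, and the symmetric variant divides each neighbour's embedding by $\sqrt{∣ N (u) ∣ ∣ N (v) ∣}$.

```python
import numpy as np

def aggregate(A, H, mode):
    """Aggregate neighbour embeddings for every node; A is n x n, H is n x d."""
    deg = A.sum(axis=1)                       # |N(u)| for each node u
    if mode == "sum":
        return A @ H
    if mode == "mean":
        return (A @ H) / deg[:, None]
    if mode == "symmetric":
        # Entry (u, v) of A is scaled by 1 / sqrt(|N(u)| |N(v)|).
        scale = 1.0 / np.sqrt(np.outer(deg, deg))
        return (A * scale) @ H
    raise ValueError(mode)

def star(k):
    """Star graph: node 0 is the centre, nodes 1..k are leaves."""
    A = np.zeros((k + 1, k + 1))
    A[0, 1:] = A[1:, 0] = 1.0
    return A

# Every node carries the same feature, so only structure differs.
results = {}
for mode in ("sum", "mean", "symmetric"):
    centre_2 = aggregate(star(2), np.ones((3, 1)), mode)[0, 0]
    centre_4 = aggregate(star(4), np.ones((5, 1)), mode)[0, 0]
    results[mode] = (centre_2, centre_4)
```

Sum aggregation gives the two centres messages 2 and 4, mean gives 1 and 1 (the degree is lost), and the symmetric variant gives $\sqrt{2}$ and 2, so it keeps part of the degree information.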
Set pooling: Use a universal set function approximator that can approximate any permutation-invariant aggregator function: $m_{N (v)}^{(k)} = MLP_{2} (\sum_{u \in N (v)} MLP_{1} (h_{u}, h_{v}))$
Janossy Pooling: Permutation sensitive function averaged over permutations
Attention: $m_{N (v)} = \sum_{u \in N (v)} α_{v u} h_{u}$ : weighted aggregation. The attention weights $α_{v u}$ encode how important each neighbour is, which adds an inductive bias about which neighbours matter.
Multiple attention heads: Compute K distinct attention weights using independently parametrised attention layers. Aggregate all messages by projection and concatenation.
- $m_{N (u)} = [a_{1} \oplus a_{2} \oplus \dots \oplus a_{K}]$
- $a_{k} = W_{k} \sum_{v \in N (u)} α_{v, u, k} h_{v}$
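
A sketch of attention-based aggregation with several heads. How the attention weights are scored is not specified above, so the bilinear score $h_{v}^{T} W h_{u}$ followed by a softmax over the neighbours is an assumption made here purely for illustration, as are all function and variable names.

```python
import numpy as np

def attention_aggregate(h, v, nbrs, W_att):
    """Single head: m_v = sum_u alpha_{v,u} h_u with softmax-normalised weights.

    Scores use a bilinear form h_v^T W_att h_u; this scoring function is an
    assumption for the sketch, not something fixed by the notes.
    """
    scores = np.array([h[v] @ W_att @ h[u] for u in nbrs])
    scores -= scores.max()                   # for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    m = sum(a * h[u] for a, u in zip(alpha, nbrs))
    return m, alpha

def multi_head_aggregate(h, v, nbrs, W_atts, W_projs):
    """Project each head's message with W_k, then concatenate the heads."""
    heads = [W_k @ attention_aggregate(h, v, nbrs, W_a)[0]
             for W_a, W_k in zip(W_atts, W_projs)]
    return np.concatenate(heads)

rng = np.random.default_rng(0)
d, d_head, num_heads = 4, 2, 3
h = rng.normal(size=(5, d))                  # embeddings for 5 nodes
nbrs = [1, 2, 4]                             # neighbours of node 0

m, alpha = attention_aggregate(h, 0, nbrs, rng.normal(size=(d, d)))
W_atts = [rng.normal(size=(d, d)) for _ in range(num_heads)]
W_projs = [rng.normal(size=(d_head, d)) for _ in range(num_heads)]
m_multi = multi_head_aggregate(h, 0, nbrs, W_atts, W_projs)
```

With three heads of size 2, the concatenated message `m_multi` has 6 entries; the weights `alpha` of each head sum to 1 over the neighbours.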

1.2.2 Update

Skip connections: Counter over-smoothing by directly preserving information from previous rounds of message passing.
- $Update_{concat} (h_{u}, m_{N (u)}) = [Update_{base} (h_{u}, m_{N (u)}) \oplus h_{u}]$
- $Update_{interpolate} (h_{u}, m_{N (u)}) = α_{1} \circ Update_{base} (h_{u}, m_{N (u)}) + α_{2} \circ h_{u}$ , where $\circ$ is element-wise multiplication, $α_{1}, α_{2} \in [0, 1]^{d}$ and $α_{2} = 1 - α_{1}$ , and $α_{1}$ can be learned jointly with the other parameters.
- Because GNNs share analogous properties with CNNs, concatenation and skip connections as described in He et al. produce similar benefits.
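
A sketch of the two skip-connection variants. It assumes a ReLU base update and reads the interpolation variant as an element-wise gate between the base update and the previous embedding; in a real model `alpha_1` would be learned rather than passed in by hand.

```python
import numpy as np

def update_base(h_u, m, W_self, W_neigh):
    """Basic update: ReLU(W_self h_u + W_neigh m)."""
    return np.maximum(0.0, W_self @ h_u + W_neigh @ m)

def update_concat(h_u, m, W_self, W_neigh):
    """Concatenate the base update with the previous embedding."""
    return np.concatenate([update_base(h_u, m, W_self, W_neigh), h_u])

def update_interpolate(h_u, m, W_self, W_neigh, alpha_1):
    """Element-wise gate between the base update and the previous embedding."""
    return alpha_1 * update_base(h_u, m, W_self, W_neigh) + (1.0 - alpha_1) * h_u

rng = np.random.default_rng(0)
d = 3
h_u, m = rng.normal(size=d), rng.normal(size=d)
W_self, W_neigh = rng.normal(size=(d, d)), rng.normal(size=(d, d))

out_concat = update_concat(h_u, m, W_self, W_neigh)
out_keep = update_interpolate(h_u, m, W_self, W_neigh, alpha_1=np.zeros(d))
out_new = update_interpolate(h_u, m, W_self, W_neigh, alpha_1=np.ones(d))
```

With `alpha_1` at zero the node keeps its old embedding untouched, and with `alpha_1` at one the skip connection disappears, so the gate controls how much of the previous round survives.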

1.2.3 Features and Relationships

Edge attributes: $m_{N (v)} = \sum_{u \in N (v)} MLP^{(k)} (h_{u}^{(k - 1)}, h_{v}^{(k - 1)}, w_{uv})$
Multi-relational: aggregation can depend on the relationship between nodes.

1.2.4 Generalised Message Passing

$h_{(u, v)}^{(k)} = Update_{edge} (h_{(u, v)}^{(k - 1)}, h_{u}^{(k - 1)}, h_{v}^{(k - 1)}, h_{G}^{(k - 1)})$

$m_{N (u)} = Aggregate_{node} (\{h_{(u, v)}^{(k)} \forall v \in N (u)\})$

$h_{u}^{(k)} = Update_{node} (h_{u}^{(k - 1)}, m_{N (u)}, h_{G}^{(k - 1)})$

$h_{G}^{(k)} = Update_{graph} (h_{G}^{(k - 1)}, \{h_{u}^{(k)} \forall u \in V\}, \{h_{(u, v)}^{(k)} \forall (u, v) \in E\})$

The main improvement over baseline message passing is that during each iteration the model generates a hidden embedding for every edge in the graph, and an overall embedding for the entire graph. This helps differentiate between edge-level, node-level and graph-level features. We can also define different loss functions for different types of embeddings and tasks.
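
One round of the edge, node and graph updates can be sketched as follows. Everything here is a placeholder: all embeddings are assumed to share one dimension, and each Update/Aggregate function is replaced by a plain sum followed by tanh, where a real model would use learned functions.

```python
import numpy as np

def generalised_round(edges, h_node, h_edge, h_graph):
    """One round of edge -> node -> graph updates on a directed edge list.

    All four functions below are simple sums followed by tanh; a real model
    would use learned Update/Aggregate functions instead.
    """
    # 1. Edge update from the edge, its two endpoints and the graph embedding.
    new_edge = {(u, v): np.tanh(h_edge[(u, v)] + h_node[u] + h_node[v] + h_graph)
                for (u, v) in edges}
    # 2. Node aggregation over the updated embeddings of incident edges.
    m = {u: np.zeros_like(h_graph) for u in h_node}
    for (u, v) in edges:
        m[u] = m[u] + new_edge[(u, v)]
    # 3. Node update.
    new_node = {u: np.tanh(h_node[u] + m[u] + h_graph) for u in h_node}
    # 4. Graph update from all updated node and edge embeddings.
    new_graph = np.tanh(h_graph
                        + sum(new_node.values())
                        + sum(new_edge.values()))
    return new_node, new_edge, new_graph

d = 2
# Undirected triangle stored as both directions of each edge.
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)]
h_node = {u: np.full(d, 0.1 * (u + 1)) for u in range(3)}
h_edge = {e: np.zeros(d) for e in edges}
h_graph = np.zeros(d)

h_node, h_edge, h_graph = generalised_round(edges, h_node, h_edge, h_graph)
```

The order matters: edges are updated first, nodes aggregate the fresh edge embeddings, and the graph embedding is updated last from the fresh node and edge embeddings.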

1.3 Approximation Theory

Graph Isomorphisms: Given two graphs $G_{1}, G_{2}$ , decide whether they are isomorphic. Formally, we say two graphs with adjacency matrices $A_{1}, A_{2}$ and feature matrices $X_{1}, X_{2}$ are isomorphic if and only if there exists a permutation matrix P such that $P A_{1} P^{T} = A_{2}$ and $P X_{1} = X_{2}$ . Informally, they have the same structure but differ in the ordering of nodes in their adjacency matrices.
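
The definition translates directly into a brute-force check over all permutation matrices. This is only a sketch for tiny graphs (it tries all $n!$ orderings); `are_isomorphic` and the example graphs are made up for illustration.

```python
import itertools
import numpy as np

def are_isomorphic(A1, X1, A2, X2):
    """Brute force over all n! permutations; only feasible for tiny graphs."""
    n = len(A1)
    if A2.shape != A1.shape or X2.shape != X1.shape:
        return False
    for perm in itertools.permutations(range(n)):
        P = np.eye(n)[list(perm)]
        if np.array_equal(P @ A1 @ P.T, A2) and np.array_equal(P @ X1, X2):
            return True
    return False

# Path graph 0 - 1 - 2, and the same path with the nodes relabelled.
A_path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X_path = np.array([[1.0], [2.0], [3.0]])
order = [1, 2, 0]
P = np.eye(3)[order]
A_relabelled, X_relabelled = P @ A_path @ P.T, P @ X_path

A_triangle = np.ones((3, 3)) - np.eye(3)

print(are_isomorphic(A_path, X_path, A_relabelled, X_relabelled))  # True
print(are_isomorphic(A_path, X_path, A_triangle, X_path))          # False
```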

WL test and "How Powerful are Graph Neural Networks?"

Weisfeiler-Lehman isomorphism test
Distinguishing capacity of GNNs
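
As a reminder of what the 1-dimensional Weisfeiler-Lehman test does, here is a small colour-refinement sketch. It assumes unlabelled nodes (uniform initial colour) and keeps colours as nested tuples instead of hashing them, which is fine for a few rounds on toy graphs; the function names are hypothetical.

```python
from collections import Counter

def wl_colour_histogram(neighbours, rounds=3):
    """1-dimensional WL colour refinement; returns the final colour counts.

    neighbours: dict mapping node id -> list of neighbour ids.
    Colours are nested tuples, so they are comparable across graphs.
    """
    colour = {v: () for v in neighbours}          # uniform initial colour
    for _ in range(rounds):
        colour = {
            v: (colour[v], tuple(sorted(colour[u] for u in neighbours[v])))
            for v in neighbours
        }
    return Counter(colour.values())

def wl_test(g1, g2, rounds=3):
    """False means provably non-isomorphic; True means 'possibly isomorphic'."""
    return wl_colour_histogram(g1, rounds) == wl_colour_histogram(g2, rounds)

path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star4 = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

# Classic failure case: both graphs are 2-regular on 6 nodes.
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}

print(wl_test(path4, star4))            # False
print(wl_test(cycle6, two_triangles))   # True
```

The second result is the interesting one: a 6-cycle and two disjoint triangles are not isomorphic, yet every node looks the same to the test in every round. Each refinement step has the same shape as an aggregate-and-update round, which is why this test is the usual yardstick for the distinguishing capacity of message-passing GNNs.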

1.4 Problems I’m Seeing:

A d-dimensional embedding vector exists for each node in the graph, so memory scales as $O (N d)$ . The number of edges also makes the update function expensive to compute, although it is embarrassingly parallel.
How to choose depth K?
How do you update the graph structure as you process more? Can we prune or connect more edges and nodes?
