
In the GT (Graph Transformer) model presented by Dwivedi & Bresson in "A Generalization of Transformer Networks to Graphs", the following equations are used to update the node and edge representations at each layer (here $h_i^\ell$ is the representation of node $i$ at layer $\ell$, $e_{ij}^\ell$ is an edge representation, the double vertical bar denotes concatenation over attention heads, and all capital-letter symbols are learnable weight matrices):

[Image: the GT layer update equations, including equation (12) for the pre-softmax attention weights $\hat{w}_{ij}^{k,\ell}$]

But I am confused as to how $\hat{w}_{ij}^{k,\ell}$ is computed. Each pre-softmax attention weight $\hat{w}_{ij}^{k,\ell}$ should be a scalar; however, the dot product $$Q^{k,\ell}h_i^\ell\cdot K^{k,\ell}h_j^\ell$$ is clearly a scalar, while $E^{k,\ell}e^\ell_{ij}$ is a vector. So if one interprets equation (12) as taking the dot product of the query and key representations and then multiplying this scalar by the edge representation vector, the result $\hat{w}_{ij}^{k,\ell}$ is a vector (when we want a scalar). Surely the authors do not intend $Q^{k,\ell}h_i^\ell\cdot K^{k,\ell}h_j^\ell\cdot E^{k,\ell}e^\ell_{ij}$ to mean some sort of three-way dot product (this is not well defined). What am I misunderstanding about equation (12)?
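
To spell out the shape bookkeeping behind my confusion (writing $d_k$ for the per-head dimension): $$Q^{k,\ell}h_i^\ell,\; K^{k,\ell}h_j^\ell,\; E^{k,\ell}e^\ell_{ij} \in \mathbb{R}^{d_k}, \qquad Q^{k,\ell}h_i^\ell\cdot K^{k,\ell}h_j^\ell \in \mathbb{R}, \qquad \left(Q^{k,\ell}h_i^\ell\cdot K^{k,\ell}h_j^\ell\right) E^{k,\ell}e^\ell_{ij} \in \mathbb{R}^{d_k},$$ so under the "scalar times edge vector" reading, the result is a $d_k$-dimensional vector rather than the scalar that the subsequent softmax expects.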

Beyond the question of this particular graph transformer model, I am also wondering more generally: what are the simplest ways one can incorporate edge type/edge feature information into a basic multi-head attention model (in which each node attends to all other nodes) for learning on graph data?

I have seen other papers (the Graphormer model) which do things like biasing the attention with an average of the edge representations taken over a shortest path between two nodes, or (the GraphGPS model) combining global attention with an MPNN that uses edge features in its message passing. But I am looking for a more "minimalistic" approach to incorporating edge types. I want the simplest approach which is (in some non-rigorous/heuristic sense) the most natural extension of vanilla self-attention on node features to vanilla self-attention on edge-typed graphs; I am not concerned with finding the most powerful or highest-performing approach.

One way I was thinking of doing this is to embed the edge types into vectors with the same dimensionality as the node features, and simply run nn.MultiheadAttention on the set $\{$edge representations$\} \cup \{$node representations$\}$, but the issue with this is that the model would have no way to distinguish between edge features and node features. (Maybe I could solve this by having separate query, key, and value weight matrices for the edge vectors and the node vectors?)
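
For concreteness, here is a minimal sketch of what I have in mind (the class name `NodeEdgeSetAttention` is just a placeholder I made up, and the learned "token type" embedding is one guess at how to let the attention distinguish node tokens from edge tokens, as an alternative to separate projection matrices):

```python
import torch
import torch.nn as nn

class NodeEdgeSetAttention(nn.Module):
    """Sketch: embed edge types to the node-feature dimension, add a learned
    "token type" vector so node and edge tokens are distinguishable, then run
    one nn.MultiheadAttention over the combined set of node + edge tokens."""

    def __init__(self, d_model, num_edge_types, num_heads=4):
        super().__init__()
        self.edge_embed = nn.Embedding(num_edge_types, d_model)  # edge type -> vector
        self.token_type = nn.Embedding(2, d_model)                # 0 = node, 1 = edge
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, node_feats, edge_types):
        # node_feats: (batch, n_nodes, d_model); edge_types: (batch, n_edges), long
        edge_feats = self.edge_embed(edge_types)                  # (batch, n_edges, d_model)
        nodes = node_feats + self.token_type.weight[0]            # mark as node tokens
        edges = edge_feats + self.token_type.weight[1]            # mark as edge tokens
        tokens = torch.cat([nodes, edges], dim=1)                 # one flat set of tokens
        out, _ = self.attn(tokens, tokens, tokens)                # everything attends to everything
        n = node_feats.size(1)
        return out[:, :n], out[:, n:]                             # updated node / edge tokens

# toy usage: 2 graphs, 5 nodes and 7 edges each, 16-dim features, 3 edge types
model = NodeEdgeSetAttention(d_model=16, num_edge_types=3)
h = torch.randn(2, 5, 16)
et = torch.randint(0, 3, (2, 7))
new_h, new_e = model(h, et)
```

This ignores the graph structure entirely (every token attends to every other token), matching the "each node attends to all other nodes" setup above; restricting attention to incident node–edge pairs would require passing an attention mask via nn.MultiheadAttention's attn_mask argument.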

Bonus points if anyone can suggest a model which is very easy to implement using PyTorch's nn.MultiheadAttention, or if there is an available implementation on GitHub.
