
In the GT (Graph Transformer) model presented by Dwivedi & Bresson in "A Generalization of Transformer Networks to Graphs", the following equations are used to update the node and edge representations at each layer (here $h_i^\ell$ is the representation of node $i$ at layer $\ell$, $e_{ij}^\ell$ is an edge representation, the double vertical bar denotes concatenation over attention heads, and all capital-letter symbols are learnable weight matrices):

[Image: the GT layer update equations, including equation (12) for the pre-softmax attention weights $\hat{w}_{ij}^{k,\ell}$]

But I am confused as to how $\hat{w}_{ij}^{k,\ell}$ is computed. Each pre-softmax attention weight $\hat{w}_{ij}^{k,\ell}$ should be a scalar; however, the dot product $$Q^{k,\ell}h_i^\ell\cdot K^{k,\ell}h_j^\ell$$ is clearly a scalar, while $E^{k,\ell}e^\ell_{ij}$ is a vector. So if one interprets equation (12) as taking the dot product of the query and key representations and then multiplying this scalar by the edge representation vector, the result $\hat{w}_{ij}^{k,\ell}$ is a vector (when we want a scalar). Surely the authors do not intend $Q^{k,\ell}h_i^\ell\cdot K^{k,\ell}h_j^\ell\cdot E^{k,\ell}e^\ell_{ij}$ to mean some sort of three-way dot product (this is not well defined). What am I misunderstanding about equation (12)?
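
To spell out the shape bookkeeping behind my confusion (writing $d_k$ for the per-head dimension): $$Q^{k,\ell}h_i^\ell,\; K^{k,\ell}h_j^\ell,\; E^{k,\ell}e^\ell_{ij} \in \mathbb{R}^{d_k}, \qquad Q^{k,\ell}h_i^\ell\cdot K^{k,\ell}h_j^\ell \in \mathbb{R}, \qquad \left(Q^{k,\ell}h_i^\ell\cdot K^{k,\ell}h_j^\ell\right) E^{k,\ell}e^\ell_{ij} \in \mathbb{R}^{d_k},$$ so under the "scalar times edge vector" reading, the result is a $d_k$-dimensional vector rather than the scalar that the subsequent softmax expects.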

Beyond the question of this particular graph transformer model, I am also wondering more generally: what are the simplest ways one can incorporate edge type/edge feature information into a basic multi-head attention model (in which each node attends to all other nodes) for learning on graph data?

I have seen other papers (the Graphormer model) which do things like biasing the attention with an average of the edge representations taken over a shortest path between two nodes, or (the GraphGPS model) combining global attention with an MPNN that uses edge features in its message passing. But I am looking for a more "minimalistic" approach to incorporating edge types. I want the simplest approach which is (in some non-rigorous/heuristic sense) the most natural extension of vanilla self-attention on node features to vanilla self-attention on edge-typed graphs; I am not concerned with finding the most powerful or highest-performing approach.

One way I was thinking of doing this is to embed the edge types into vectors with the same dimensionality as the node features, and simply run nn.MultiheadAttention on the set $\{$edge representations$\} \cup \{$node representations$\}$, but the issue with this is that the model would have no way to distinguish between edge features and node features. (Maybe I could solve this by having separate query, key, and value weight matrices for the edge vectors and the node vectors?)
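
For concreteness, here is a minimal sketch of what I have in mind (the class name `NodeEdgeSetAttention` is just a placeholder I made up, and the learned "token type" embedding is one guess at how to let the attention distinguish node tokens from edge tokens, as an alternative to separate projection matrices):

```python
import torch
import torch.nn as nn

class NodeEdgeSetAttention(nn.Module):
    """Sketch: embed edge types to the node-feature dimension, add a learned
    "token type" vector so node and edge tokens are distinguishable, then run
    one nn.MultiheadAttention over the combined set of node + edge tokens."""

    def __init__(self, d_model, num_edge_types, num_heads=4):
        super().__init__()
        self.edge_embed = nn.Embedding(num_edge_types, d_model)  # edge type -> vector
        self.token_type = nn.Embedding(2, d_model)                # 0 = node, 1 = edge
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, node_feats, edge_types):
        # node_feats: (batch, n_nodes, d_model); edge_types: (batch, n_edges), long
        edge_feats = self.edge_embed(edge_types)                  # (batch, n_edges, d_model)
        nodes = node_feats + self.token_type.weight[0]            # mark as node tokens
        edges = edge_feats + self.token_type.weight[1]            # mark as edge tokens
        tokens = torch.cat([nodes, edges], dim=1)                 # one flat set of tokens
        out, _ = self.attn(tokens, tokens, tokens)                # everything attends to everything
        n = node_feats.size(1)
        return out[:, :n], out[:, n:]                             # updated node / edge tokens

# toy usage: 2 graphs, 5 nodes and 7 edges each, 16-dim features, 3 edge types
model = NodeEdgeSetAttention(d_model=16, num_edge_types=3)
h = torch.randn(2, 5, 16)
et = torch.randint(0, 3, (2, 7))
new_h, new_e = model(h, et)
```

This ignores the graph structure entirely (every token attends to every other token), matching the "each node attends to all other nodes" setup above; restricting attention to incident node–edge pairs would require passing an attention mask via nn.MultiheadAttention's attn_mask argument.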

Bonus points if anyone can suggest a model which is very easy to implement using PyTorch's nn.MultiheadAttention, or if there is an available implementation on GitHub.
