
Question about the architecture (graphTransformer) #87

Open
Forbu opened this issue Jan 30, 2024 · 2 comments

@Forbu
Forbu commented Jan 30, 2024

I was looking at your implementation of attention here:
https://github.com/cvignac/DiGress/blob/main/src/models/transformer_model.py#L158

I have some questions about the code:

Q = Q.unsqueeze(2)  # (bs, 1, n, n_head, df)
K = K.unsqueeze(1)  # (bs, n, 1, n_head, df)

# Compute unnormalized attentions. Y is (bs, n, n, n_head, df)
Y = Q * K

Here I have a question: in the classic attention mechanism, Y has dimension (bs, n, n, n_head), not a per-feature one. I don't know whether this is what the authors wanted (this is an element-wise multiplication, not the usual dot product over the feature dimension).
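For comparison, here is a minimal shape-only sketch of the two variants (random tensors, not the actual DiGress code; the dimension sizes are made up):

import torch

bs, n, n_head, df = 2, 5, 4, 8
Q = torch.randn(bs, n, n_head, df)
K = torch.randn(bs, n, n_head, df)

# Per-feature scores as in the snippet above: element-wise product, the df axis is kept
Y_featurewise = Q.unsqueeze(2) * K.unsqueeze(1)                  # (bs, n, n, n_head, df)

# Standard scaled dot-product scores: sum over df, one scalar per head and node pair
Y_standard = torch.einsum('bihd,bjhd->bijh', Q, K) / df ** 0.5   # (bs, n, n, n_head)

print(Y_featurewise.shape)  # torch.Size([2, 5, 5, 4, 8])
print(Y_standard.shape)     # torch.Size([2, 5, 5, 4])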

Also, a few lines later we have:

attn = masked_softmax(Y, softmax_mask, dim=2)  # bs, n, n, n_head
print("attn.shape:", attn.shape)  # I added this line

The attention shape I obtain is (bs, n, n, n_head, df), contrary to the comment.
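A quick way to check this (a plain torch.softmax stands in for DiGress's masked_softmax here, since masking does not change the shape):

import torch

bs, n, n_head, df = 2, 5, 4, 8
Q = torch.randn(bs, n, n_head, df)
K = torch.randn(bs, n, n_head, df)
Y = Q.unsqueeze(2) * K.unsqueeze(1)   # (bs, n, n, n_head, df)

# Softmax over the key axis (dim=2) is applied independently for every feature,
# so the df axis survives and the result is not (bs, n, n, n_head).
attn = torch.softmax(Y, dim=2)
print(attn.shape)  # torch.Size([2, 5, 5, 4, 8])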
So the code does not really implement "standard" graph transformer attention, unlike other implementations such as:
https://docs.dgl.ai/_modules/dgl/nn/pytorch/gt/egt.html#EGTLayer

But since your code gives me better results than the one above (which uses a proper attention mechanism), I wonder whether the authors did this intentionally.

@cvignac
Owner

cvignac commented Jan 30, 2024 via email

@Forbu
Author

Forbu commented Jan 30, 2024

I am running some experiments on my own graph dataset. Your implementation seems to perform better than the standard graph transformer (at least the one I tried from the DGL library); yours clearly generates more plausible edges.
I am running more experiments to confirm this (for now I only have "visual" clues and noisy loss curves to back this claim).

Your implementation is equivalent to a classic graph transformer that uses as many heads as there are feature dimensions, so you end up with heads of dimension one (I mean that with df = 1 you would obtain the same results). A quick check of this equivalence is sketched below.
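Here is a minimal sketch of that equivalence (tensor names and sizes are made up for illustration): the per-feature scores are exactly the dot-product scores of n_head * df heads of dimension one, up to a reshape.

import torch

bs, n, n_head, df = 2, 5, 4, 8
Q = torch.randn(bs, n, n_head, df)
K = torch.randn(bs, n, n_head, df)

# Per-feature scores as in the DiGress snippet above
Y = Q.unsqueeze(2) * K.unsqueeze(1)               # (bs, n, n, n_head, df)

# The same numbers, viewed as dot-product scores of n_head * df heads of dimension 1
Q1 = Q.reshape(bs, n, n_head * df, 1)
K1 = K.reshape(bs, n, n_head * df, 1)
scores = torch.einsum('bihd,bjhd->bijh', Q1, K1)  # (bs, n, n, n_head * df)

print(torch.allclose(Y.reshape(bs, n, n, n_head * df), scores))  # True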
