Scaled Dot-Product Attention module (Attention Is All You Need by Vaswani et al., 2017) with optional residual attention from the previous layer (RealFormer: Transformer Likes Residual Attention by He et al., 2020) and locality self-attention (Vision Transformer for Small-Size Datasets by Lee et al., 2021).
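For reference, the sketch below shows how the pieces named above fit together: standard scaled dot-product attention (softmax(QK^T / sqrt(d_head)) V), an optional residual-attention path that adds the previous layer's pre-softmax scores (RealFormer), and an optional locality self-attention flag that turns the softmax temperature into a learnable parameter (LSA's diagonal masking can be supplied through the attention mask). This is a minimal illustration under those assumptions, not the module's actual implementation; the class name ScaledDotProductAttentionSketch and the prev / res_attention / lsa arguments are placeholders chosen here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotProductAttentionSketch(nn.Module):
    # Sketch only: scaled dot-product attention with optional residual attention
    # (previous-layer scores added before softmax) and locality self-attention
    # (learnable softmax temperature). Names are illustrative, not from this repo.
    def __init__(self, d_model, n_heads, attn_dropout=0.0, res_attention=False, lsa=False):
        super().__init__()
        head_dim = d_model // n_heads
        # LSA (Lee et al., 2021): learnable temperature; otherwise the usual fixed 1/sqrt(d_head)
        self.scale = nn.Parameter(torch.tensor(head_dim ** -0.5), requires_grad=lsa)
        self.attn_dropout = nn.Dropout(attn_dropout)
        self.res_attention = res_attention

    def forward(self, q, k, v, prev=None, attn_mask=None):
        # q, k, v: [bs x n_heads x seq_len x d_head]
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale  # [bs x n_heads x q_len x k_len]
        if prev is not None:                           # residual attention (He et al., 2020):
            attn_scores = attn_scores + prev           # reuse the previous layer's pre-softmax scores
        if attn_mask is not None:                      # boolean mask, True = masked position
            attn_scores = attn_scores.masked_fill(attn_mask, float("-inf"))
        attn_weights = self.attn_dropout(F.softmax(attn_scores, dim=-1))
        output = torch.matmul(attn_weights, v)         # [bs x n_heads x q_len x d_head]
        if self.res_attention:
            return output, attn_weights, attn_scores   # scores feed the next layer's `prev`
        return output, attn_weights

# usage sketch
q = k = v = torch.randn(4, 2, 10, 256)                 # [bs x n_heads x seq_len x d_head]
attn = ScaledDotProductAttentionSketch(d_model=512, n_heads=2, res_attention=True)
out, weights, scores = attn(q, k, v)                   # pass `scores` as `prev` to the next layer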
import torch
# tAPE and Attention_Rel_Scl are assumed to be defined in (or imported from) this repo's modules

## test with patches: [bs * c_in x num_patches x d_model]
d_model = 512
c_in = 2
num_patches = 10
x_emb = torch.randn(4 * c_in, num_patches, d_model)
abs_position = tAPE(d_model, seq_len=num_patches)   # time Absolute Position Encoding
x_emb_pos = abs_position(x_emb)                     # patch embeddings with positional information added
model = Attention_Rel_Scl(
    d_model=d_model,
    n_heads=2,            # number of attention heads
    seq_len=num_patches,  # sequence length or num patches
)
out, attn_weights = model(x_emb_pos)                # attention output and attention weights