What happens if you compute Q, K, V = mlp(x).split(3) instead of linear(x).split(3)? Has anyone tried this?
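To make the question concrete, here is a rough NumPy sketch of the two variants side by side. Everything in it is an assumption for illustration: the sizes, the random weights, the single head, the GELU activation, and the helper names (`qkv_linear`, `qkv_mlp`) are all made up. I'm reading `split(3)` as "three equal chunks along the last axis" (in PyTorch that would be `chunk(3, dim=-1)`). It only shows the wiring and says nothing about which one trains better.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, seq_len = 8, 32, 5  # hypothetical sizes, illustration only


def linear(x, W, b):
    return x @ W + b


def gelu(x):
    # tanh approximation of GELU (the activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def qkv_linear(x, p):
    # Usual fused projection: one linear layer, output cut into three equal parts.
    return np.split(linear(x, p["W"], p["b"]), 3, axis=-1)


def qkv_mlp(x, p):
    # Variant from the question: a two-layer MLP, output cut into three equal parts.
    h = gelu(linear(x, p["W1"], p["b1"]))
    return np.split(linear(h, p["W2"], p["b2"]), 3, axis=-1)


def attention(q, k, v):
    # Plain single-head scaled dot-product attention, no mask.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


lin_params = {
    "W": rng.standard_normal((d_model, 3 * d_model)) * 0.1,
    "b": np.zeros(3 * d_model),
}
mlp_params = {
    "W1": rng.standard_normal((d_model, d_hidden)) * 0.1,
    "b1": np.zeros(d_hidden),
    "W2": rng.standard_normal((d_hidden, 3 * d_model)) * 0.1,
    "b2": np.zeros(3 * d_model),
}

x = rng.standard_normal((seq_len, d_model))
out_linear = attention(*qkv_linear(x, lin_params))
out_mlp = attention(*qkv_mlp(x, mlp_params))
```

In the linear version the attention logits are bilinear in x; in the MLP version Q and K are nonlinear functions of x, at the cost of extra parameters and compute per token.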