Thought I might share my solution for the Mouse Triplets task, in case someone finds it interesting.
The model is a BERT-style encoder trained with masked span modelling, similar to SpanBERT. I preferred masked modelling because an autoregressive/seq2seq setup wouldn’t give each frame both past and future context. After some experimentation I chose a window size of 80 frames.
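To make the masking step concrete, here is a minimal sketch of span masking over one 80-frame window. Only the window size comes from my setup as described above; the mask ratio, the geometric span-length distribution and the maximum span length are assumptions borrowed from the SpanBERT defaults, and sample_span_mask is a hypothetical helper name.

```python
import random

def sample_span_mask(num_frames=80, mask_ratio=0.15, max_span=10, p=0.2, seed=None):
    """Return a list of booleans, True where a frame is masked.

    mask_ratio, max_span and p are assumed values (SpanBERT-style defaults),
    not necessarily the ones used in the actual solution.
    """
    rng = random.Random(seed)
    budget = max(1, round(num_frames * mask_ratio))  # number of frames to mask
    mask = [False] * num_frames
    masked = 0
    while masked < budget:
        # Draw a span length from a geometric distribution, clipped to max_span.
        length = 1
        while length < max_span and rng.random() > p:
            length += 1
        # Never mask more frames than the remaining budget allows.
        length = min(length, budget - masked)
        start = rng.randrange(0, num_frames - length + 1)
        for i in range(start, start + length):
            if not mask[i]:
                mask[i] = True
                masked += 1
    return mask

mask = sample_span_mask(seed=0)
print(sum(mask), "of", len(mask), "frames masked")
```

Masking contiguous spans rather than single frames matters here because neighbouring frames are highly correlated, so a single masked frame could be filled in by simple interpolation.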
Each mouse gets its own input sequence, and the sequences are joined with a SEP token, so the model can attend over the actions of the other mice when solving the masked modelling task.
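A rough sketch of how such an input could be laid out, assuming each frame is a feature vector and the SEP token is a single vector of the same size (in a real model it would be a learned embedding). build_input and the mouse-id convention of -1 for SEP are hypothetical names chosen for illustration.

```python
def build_input(mouse_tracks, sep_vector):
    """Concatenate per-mouse frame sequences, inserting SEP between mice.

    mouse_tracks: list of tracks, one per mouse; each track is a list of
                  per-frame feature vectors.
    Returns the joined sequence and a parallel list of mouse ids
    (-1 marks a SEP position), which can later be used to pool per mouse.
    """
    sequence, mouse_ids = [], []
    for idx, track in enumerate(mouse_tracks):
        if idx > 0:
            sequence.append(sep_vector)
            mouse_ids.append(-1)
        sequence.extend(track)
        mouse_ids.extend([idx] * len(track))
    return sequence, mouse_ids

# Three mice, 80 frames each, 4 placeholder features per frame.
tracks = [[[float(m)] * 4 for _ in range(80)] for m in range(3)]
sequence, mouse_ids = build_input(tracks, sep_vector=[0.0] * 4)
print(len(sequence))  # 3 * 80 frames + 2 SEP positions
```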
Having separate sequences gives us an embedding for each mouse, which we can use to predict per-mouse features (speed, body length, etc.).
We can then create representations of pairs of mice by subtracting the embedding of one mouse from that of another. From this difference embedding we can predict pairwise features (distance between the mice, angle from one to the other, etc.).
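As an illustration of the pairwise step, here is a small sketch that builds a difference embedding for every ordered pair of mice. The direction of the subtraction (target minus source) and the helper name pairwise_differences are assumptions; ordered pairs are used because features like "angle from one to another" are not symmetric.

```python
from itertools import permutations

def pairwise_differences(mouse_embeddings):
    """Map each ordered pair (i, j) to embedding[j] - embedding[i]."""
    pairs = {}
    for i, j in permutations(range(len(mouse_embeddings)), 2):
        source, target = mouse_embeddings[i], mouse_embeddings[j]
        pairs[(i, j)] = [t - s for s, t in zip(source, target)]
    return pairs

# Toy 2-dimensional embeddings for three mice.
embeddings = [[1.0, 2.0], [4.0, 6.0], [0.0, 0.0]]
pairs = pairwise_differences(embeddings)
print(pairs[(0, 1)])  # [3.0, 4.0]
```

Each difference vector would then be fed to a small prediction head for the pairwise features.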
We pool all the outputs using attention and apply a contrastive objective, where positive pairs are two sub-sequences randomly sampled from the same sequence and negative pairs are the other members of the batch. We can also predict “global” features from this pooled representation (area of the triangle formed by the mice).
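For anyone unfamiliar with this kind of objective, below is a minimal pure-Python sketch of an InfoNCE-style loss with in-batch negatives, plus the triangle-area target computed with the shoelace formula. The temperature value, the one-directional form of the loss and the function names are assumptions, as is the choice of one (x, y) point per mouse; this is not necessarily the exact loss used in my code.

```python
import math

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce_loss(view_a, view_b, temperature=0.1):
    """InfoNCE with in-batch negatives.

    view_a[i] and view_b[i] are pooled embeddings of two sub-sequences
    sampled from the same sequence (the positive pair); every other
    row of view_b acts as a negative for view_a[i].
    """
    a = [_normalize(v) for v in view_a]
    b = [_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of the anchor with every candidate, scaled.
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature
                  for other in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits))
        total += log_denominator - logits[i]  # cross-entropy, target index i
    return total / len(a)

def triangle_area(p1, p2, p3):
    """Area of the triangle with vertices p1, p2, p3 (shoelace formula)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return 0.5 * abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))

print(triangle_area((0, 0), (4, 0), (0, 3)))  # 6.0
```

The loss is small when each embedding is closest to its own partner and large when it is closer to another batch member, which is what pushes sub-sequences from the same sequence together.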
To create the final embeddings we run the model over overlapping windows, so that every embedding has additional context, and then concatenate the embeddings for each mouse.
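A sketch of the windowing logic, assuming a stride of half the window and a final window aligned to the end of the sequence so that every frame is covered. The stride, the end alignment and both helper names are assumptions made for illustration; only the window size of 80 comes from the description above.

```python
def window_starts(num_frames, window=80, stride=40):
    """Start indices of overlapping windows that cover every frame."""
    if num_frames <= window:
        return [0]
    starts = list(range(0, num_frames - window + 1, stride))
    # Add a final window flush with the end if the stride left a gap.
    if starts[-1] != num_frames - window:
        starts.append(num_frames - window)
    return starts

def concat_mouse_embeddings(per_mouse_embeddings):
    """Concatenate the per-mouse embeddings into one final vector."""
    return [x for embedding in per_mouse_embeddings for x in embedding]

print(window_starts(210))  # [0, 40, 80, 120, 130]
print(concat_mouse_embeddings([[1, 2], [3, 4], [5, 6]]))  # [1, 2, 3, 4, 5, 6]
```

With a stride smaller than the window, most frames fall in more than one window, so an embedding can be taken from a window where the frame is not right at the edge.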
Congratulations to all the winners and participants, and thank you to the organisers for an interesting problem!
Please note: the code could do with some cleaning up, which I haven’t got round to yet.