Thank you for participating in the Multi Agent Behaviour Challenge 2022.
We would love to know which solutions you tried for this challenge, for both the video and the tracking data.
Select all the methods that you tried, even if they weren't part of your final solution.
LSTM (or other RNNs)
CNNs
Transformers
Graph Neural Nets
Others (Mention below)
We would love to hear the details of your solutions. This is also a good opportunity for you to organise your work and contribute to the research community. Every team sharing their detailed solution in the comments below before August 14th 2022 will get $200 in AWS credits.
First I tried basic CNNs and plain feed-forward networks, but they didn't work very well. Then I tried RNNs (LSTMs), which worked much better but still wasn't enough. Then I tried Transformers and added attention layers, which worked. Now I am working on improving the model's accuracy.
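For readers who want a concrete starting point, here is a minimal sketch of the kind of Transformer encoder over keypoint sequences this describes, in PyTorch. All dimensions and names are illustrative assumptions, not the commenter's actual model, and positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class KeypointTransformer(nn.Module):
    """Self-attention over a window of tracked keypoints (illustrative)."""
    def __init__(self, n_keypoints=12, d_model=128, n_layers=4):
        super().__init__()
        self.proj = nn.Linear(n_keypoints * 2, d_model)   # (x, y) per keypoint
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, seq):                  # seq: (batch, frames, n_keypoints*2)
        return self.encoder(self.proj(seq))  # per-frame embeddings via attention

model = KeypointTransformer()
seq = torch.rand(2, 50, 12 * 2)              # 2 clips, 50 frames, 12 keypoints
emb = model(seq)                             # (2, 50, 128) per-frame embeddings
```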
Our main solution consisted of three parts: a large pre-trained vision transformer model (microsoft/beit-large-patch16-512 · Hugging Face), a modified version of the baseline SimCLR model, and a large number of hand-crafted features (computed from the keypoints). These were combined by weighted PCA, where the weights were applied both column-wise (different weights for the three parts above) and row-wise (more weight for frames with a lot of movement).
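A minimal sketch of this weighted-PCA fusion in NumPy. The array names (beit_feats, simclr_feats, handcrafted, movement), the block weights, and the embedding size are hypothetical placeholders inferred from the description, not the team's actual code.

```python
import numpy as np

def weighted_pca(X, row_w, n_components):
    """PCA on X (frames x dims) with per-frame sample weights row_w."""
    w = row_w / row_w.sum()                      # normalise frame weights
    mean = (w[:, None] * X).sum(axis=0)          # weighted mean
    Xc = X - mean
    cov = (Xc * w[:, None]).T @ Xc               # weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues ascending
    comps = eigvecs[:, ::-1][:, :n_components]   # keep top components
    return Xc @ comps                            # project to embedding

# Placeholder features standing in for the three parts of the solution.
rng = np.random.default_rng(0)
n_frames = 1000
beit_feats = rng.normal(size=(n_frames, 1024))   # vision transformer features
simclr_feats = rng.normal(size=(n_frames, 128))  # modified SimCLR features
handcrafted = rng.normal(size=(n_frames, 60))    # keypoint-derived features
movement = rng.random(n_frames)                  # per-frame movement score

# Column-wise weights: one scalar per feature block (values are assumptions).
blocks = [(beit_feats, 1.0), (simclr_feats, 0.5), (handcrafted, 2.0)]
X = np.hstack([w * f for f, w in blocks])

# Row-wise weights: frames with more movement get more influence on the fit.
row_w = 1.0 + movement / movement.max()

embedding = weighted_pca(X, row_w, n_components=64)  # embedding size assumed
```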
My solution was made early in the competition and uses only video data. It initially held first place, though it was surpassed later on. In hindsight, I think I would have benefited from using the keypoints as well.
I used an ensemble of pre-trained vision models (resnet18 and MobileNetV3-Small), concatenating their output features. This yields a large vector, which for the beetles challenge is reduced to size 64 by PCA to give the final embedding. For the mouse part of the challenge I instead reduce the size to 32 in the same way, and then concatenate the difference between the feature vector 40 frames in the past and 40 frames in the future (a window size of 80, as in the solution of @edhayes1) so that dynamic information is also present in the embedding.
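A minimal sketch of this pipeline, assuming torchvision backbones and scikit-learn PCA. The placeholder frame batch, variable names, and the choice to take the temporal difference on the PCA-reduced features (the description leaves this ambiguous) are my assumptions, not the author's code.

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision import models
from sklearn.decomposition import PCA

# Pre-trained backbones with their classifier heads stripped off.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = nn.Identity()                        # 512-d features
mobile = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.DEFAULT)
mobile.classifier = nn.Identity()                # 576-d features
resnet.eval(), mobile.eval()

@torch.no_grad()
def extract(frames):                             # frames: (N, 3, H, W), preprocessed
    return torch.cat([resnet(frames), mobile(frames)], dim=1).numpy()

frames = torch.rand(100, 3, 224, 224)            # placeholder batch of frames
feats = extract(frames)                          # (N, 1088) concatenated ensemble
static = PCA(n_components=32).fit_transform(feats)   # mouse variant: 32-d

# Dynamic part: difference between features 40 frames in the future and
# 40 frames in the past (window size 80); edges clamp to the clip bounds.
idx = np.arange(len(static))
fwd = static[np.minimum(idx + 40, len(static) - 1)]
bwd = static[np.maximum(idx - 40, 0)]
embedding = np.concatenate([static, fwd - bwd], axis=1)  # (N, 64)
```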