Trying something different in the spirit of ‘working in public’, I’ve made a sort of video diary of the first few hours on this challenge that led to my current entry. It’s up on YouTube: Come along for the ride as I dive into an ML contest (MABe 2 on AICROWD) - YouTube (skip to 18:00, where I decide to focus on this challenge).
To summarise the approach:
- Feed a sequence of frames into a Perceiver model (e.g. 120 frames @ 10 fps) and have it try to predict N frames ahead. This works as a self-supervised task.
- Extract the latent array from the Perceiver and use this to derive our embeddings.
- Combine these with some hand-crafted features like body angles, mouse separation, etc.
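To make the first two steps concrete, here’s a minimal sketch of the idea in PyTorch: a learned latent array cross-attends to a window of frames, a head predicts the next N frames (the self-supervised objective), and the latent array itself is what gets pooled into embeddings. All names and sizes here are illustrative placeholders, not the actual model from the entry.

```python
import torch
import torch.nn as nn

class TinyPerceiver(nn.Module):
    """Toy Perceiver-style encoder: latents cross-attend to the frame
    sequence, then a linear head predicts the next n_future frames.
    Sizes are hypothetical, chosen only to make the sketch run."""
    def __init__(self, frame_dim=28, latent_dim=40, n_latents=16, n_future=5):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, latent_dim))
        self.in_proj = nn.Linear(frame_dim, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)
        self.self_attn = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(latent_dim * n_latents, frame_dim * n_future)
        self.n_future, self.frame_dim = n_future, frame_dim

    def forward(self, frames):                 # frames: (B, T, frame_dim)
        B = frames.shape[0]
        lat = self.latents.unsqueeze(0).expand(B, -1, -1)
        kv = self.in_proj(frames)
        lat, _ = self.cross_attn(lat, kv, kv)  # latents attend to the frames
        lat, _ = self.self_attn(lat, lat, lat) # latents mix with each other
        pred = self.head(lat.flatten(1))
        return pred.view(B, self.n_future, self.frame_dim), lat

# One self-supervised step: predict the next frames from the preceding window.
model = TinyPerceiver()
seq = torch.randn(8, 120, 28)    # e.g. 120 frames @ 10 fps of keypoint features
target = torch.randn(8, 5, 28)   # the next 5 frames (random stand-in data)
pred, latents = model(seq)
loss = nn.functional.mse_loss(pred, target)
loss.backward()
# `latents` (B, n_latents, latent_dim) is the array we'd pool/flatten
# into the per-frame embedding.
```

After training, you’d discard the prediction head and keep only the latent array as the representation.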
Using a latent dimension of 40 for the Perceiver, training very briefly, and submitting just the reduced latent array as the representation: 18/0.128. Using just the hand-crafted features: 18/0.135. Combining both with a few extra sprinkles: 13/0.173.
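For the combining step, one simple option (not necessarily what the entry does) is to concatenate the flattened latent array with the hand-crafted features, standardise, and project down to the 100-dim budget with PCA. A NumPy-only sketch with made-up shapes:

```python
import numpy as np

# Hypothetical shapes: random stand-ins for real per-frame features.
rng = np.random.default_rng(0)
latent_emb = rng.normal(size=(1000, 640))  # e.g. a flattened 16x40 latent array
hand_feats = rng.normal(size=(1000, 30))   # body angles, mouse separation, ...

# Concatenate, then standardise so neither source dominates the PCA.
X = np.concatenate([latent_emb, hand_feats], axis=1)
X = (X - X.mean(0)) / (X.std(0) + 1e-8)

# PCA via SVD down to the 100-dim submission representation.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
embedding = X @ Vt[:100].T                 # (1000, 100)
```

Standardising first matters here: the latent dims and hand-crafted features are on different scales, and PCA on the raw concatenation would mostly just recover whichever block has the larger variance.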
I can see that the trick with this contest is going to be packing that 100-dim representation with as many useful values as possible. What ideas do you have? Any other interesting SSL approaches? And has anyone had luck using the demo tasks to create useful embeddings? Let’s brainstorm