Challenge organiser @dominik.rehse presented ideas on how to develop successful purchasing functions for the new round. You can view and download his presentation over here.
@dipam from AIcrowd Research talked about the Data Shapley metric and state-of-the-art methods for data valuation. He also gave a walk-through of the new, revised baseline for Round 2. You can find the paper Dipam discussed over here. Here’s the baseline he explained.
Thanks for putting this online! I totally didn’t assume labels were noisy. When looking at some images I did wonder where, for example, some dents were supposed to be, but because the data was generated synthetically I just assumed the labels would be 100% correct. Definitely going to take this into account now.
Here are some links to interesting academic research, which might be a good starting point to develop clever purchasing functions.
Purchase labels in one go
One way to build a “label shopping list” is data valuation. You could determine the value of individual image-label pairs in the training set and then somehow extrapolate to the unlabeled images, for instance, in terms of the “expected value” from having an image being labelled.
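The "extrapolate to the unlabeled images" step could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes you already have a feature vector per image (for example from a pretrained network) and an in-sample value per labelled image, and it estimates the expected value of an unlabelled image as the mean value of its k nearest labelled neighbours. The function names, the toy features and the nearest-neighbour rule are all hypothetical choices, not part of the challenge starter kit.

```python
import math

def expected_value(feature, labeled_feats, labeled_values, k=3):
    """Estimate the value of labelling an image as the mean value of
    its k nearest labelled neighbours in feature space (an assumption)."""
    by_distance = sorted(
        (math.dist(feature, f), v) for f, v in zip(labeled_feats, labeled_values)
    )
    nearest = by_distance[:k]
    return sum(v for _, v in nearest) / len(nearest)

def shopping_list(unlabeled, labeled_feats, labeled_values, budget, k=3):
    """Return the ids of the `budget` unlabelled images with the
    highest estimated value, best first."""
    scores = {
        image_id: expected_value(f, labeled_feats, labeled_values, k)
        for image_id, f in unlabeled.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:budget]

# Toy data: 2-D features and made-up in-sample values.
labeled_feats = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
labeled_values = [0.1, 0.2, 0.8, 0.9]
unlabeled = {"img_a": (0.05, 0.0), "img_b": (0.95, 1.0), "img_c": (0.5, 0.5)}

to_buy = shopping_list(unlabeled, labeled_feats, labeled_values, budget=2, k=2)
print(to_buy)
```

In this toy setup the images closest to the highly valued labelled examples end up at the top of the list. Whether feature-space proximity is a good proxy for value on the real data is an open question you would need to check.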
Based on Shapley values
@dipam introduced the idea of using Shapley values to estimate the in-sample value of a particular image-label pair. The respective research papers, code examples and videos can be found here:
Ghorbani, Amirata, and James Zou. “Data Shapley: Equitable valuation of data for machine learning.” International Conference on Machine Learning. PMLR, 2019. Paper, Video, Code
Jia, Ruoxi, et al. “Towards efficient data valuation based on the Shapley value.” The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019. Paper
Ghorbani, Amirata, Michael Kim, and James Zou. “A distributional framework for data valuation.” International Conference on Machine Learning. PMLR, 2020. Paper, Code
Kwon, Yongchan, Manuel A. Rivas, and James Zou. “Efficient computation and analysis of distributional Shapley values.” International Conference on Artificial Intelligence and Statistics. PMLR, 2021. Paper, Code
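To make the Shapley idea a bit more concrete, here is a minimal sketch of a Monte Carlo permutation estimate: sample random orderings of the training points, add them one by one, and credit each point with the change in validation performance it causes. This is a simplified illustration, not the code from the papers above. The 1-nearest-neighbour utility, the 1-D toy features and all names are assumptions made for the example, and the truncation trick used in the Data Shapley paper is left out.

```python
import random

def knn_accuracy(train, valid):
    """Utility: accuracy of a 1-nearest-neighbour classifier on `valid`.
    An empty training set gets utility 0 (an assumption of this sketch)."""
    if not train:
        return 0.0
    correct = 0
    for x, y in valid:
        nearest = min(train, key=lambda point: abs(point[0] - x))
        correct += nearest[1] == y
    return correct / len(valid)

def monte_carlo_shapley(train, valid, utility, n_permutations=200, seed=0):
    """Estimate each training point's Shapley value by averaging its
    marginal contribution over random orderings of the training set."""
    rng = random.Random(seed)
    values = [0.0] * len(train)
    for _ in range(n_permutations):
        order = list(range(len(train)))
        rng.shuffle(order)
        subset = []
        previous = utility(subset, valid)
        for i in order:
            subset.append(train[i])
            current = utility(subset, valid)
            values[i] += current - previous
            previous = current
    return [v / n_permutations for v in values]

# Toy (feature, label) pairs; the last label is deliberately suspicious.
train = [(0.1, 0), (0.2, 0), (0.9, 1), (0.8, 1), (0.15, 1)]
valid = [(0.0, 0), (0.05, 0), (0.25, 0), (0.3, 0), (0.7, 1), (0.85, 1), (1.0, 1)]

values = monte_carlo_shapley(train, valid, knn_accuracy)
for point, value in zip(train, values):
    print(point, round(value, 3))
```

By construction the estimated values sum to the utility of the full training set minus the utility of the empty set. With a real model each utility call means retraining, which is why the papers above focus on making this cheaper.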
Based on reinforcement learning
Another way to do in-sample data valuation is to frame it as a reinforcement learning problem. The related research paper can be found here:
Yoon, Jinsung, Sercan Arik, and Tomas Pfister. “Data valuation using reinforcement learning.” International Conference on Machine Learning. PMLR, 2020. Paper, Blog post, Code
Iteratively purchase labels
Purchasing labels iteratively is its own area of machine learning research: active learning. The Wikipedia page provides a reasonably good entry point to this literature. You might also want to check GitHub for relevant repositories, some with plenty of stars and/or references to practitioner handbooks (e.g. 1, 2)…
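As a small illustration of what one active learning step might look like, the sketch below implements uncertainty sampling: given the current model's predicted class probabilities for the unlabelled images, it picks the images with the highest predictive entropy. The function names and the toy probabilities are hypothetical. In an iterative setup you would buy these labels, retrain, and repeat until the budget is used up.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(predictions, budget):
    """Return the ids of the `budget` images whose predictions have the
    highest entropy, most uncertain first."""
    ranked = sorted(predictions, key=lambda i: entropy(predictions[i]), reverse=True)
    return ranked[:budget]

# Toy predicted probabilities per unlabelled image id (made up).
predictions = {
    0: [0.98, 0.02],
    1: [0.55, 0.45],
    2: [0.70, 0.30],
    3: [0.50, 0.50],
}

to_buy = select_most_uncertain(predictions, budget=2)
print(to_buy)
```

Entropy is only one of several common acquisition scores; margin-based or disagreement-based criteria would slot into the same loop.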
And if you stumble over more interesting approaches, please share to qualify for our cool community prizes!
I tried this method in round 1 (locally) and it worked pretty well:
It’s sold as an active learning method, but it really selects labels in one go. However, it is essential that it uses a model that was trained in an unsupervised fashion, like Facebook’s DINO. I tried using the vision transformer that comes with torchvision and an EfficientNet that was fine-tuned on the given data. Neither worked. Since DINO is not among the supported pretrained weights, it’s not an option in this competition.
I also think that, while it may work, it’s likely not the best-performing method.