Hi! As you know, Round 2 of the Data Purchasing Challenge went live recently. To discuss the motivation for this challenge and share key updates and a baseline for the new round, we held a live community event. Top participants also presented their approaches and baselines. Watch the Town Hall here 🎥 Data Purchasing Challenge Town Hall | How to apply AI to choose which training data to purchase? - YouTube
The Town Hall covered the following:
Challenge organiser @dominik.rehse presented ideas on how to develop successful purchasing functions for the new round. You can view and download his presentation over here.
@dipam from AIcrowd Research talked about the Data Shapley metric and state-of-the-art methods for data valuation. He also gave a walk-through of the new, revised baseline for Round 2. You can find the paper Dipam discussed over here. Here’s the baseline he explained.
@gaurav_singhal explained his new baseline and explainer for Round 2. You can find the notebook over here.
@leocd shared his experience from the first round, along with his models and approach. You can find his presentation over here.
@sergey_zlobin also shared his thought process and approach from the previous round. Explore his presentation over here.
Data Purchasing Town Hall panelists!
You can find all the resources, presentations and research papers over here.
Have questions about the new round or need help troubleshooting? Drop your question in the comments below.
Thanks for putting this online! I totally didn’t assume labels were noisy. When looking at some images I did wonder where, for example, some dents were supposed to be, but because the data was generated synthetically I just assumed labels would be 100% correct. Definitely going to take this into account now.
Here are some links to interesting academic research, which might be a good starting point to develop clever purchasing functions.
Purchase labels in one go
One way to build a “label shopping list” is data valuation. You could determine the value of individual image-label pairs in the training set and then extrapolate to the unlabeled images, for instance in terms of the “expected value” of having an image labelled.
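How the extrapolation step is done is left open above. As one possible sketch, assuming each image has already been mapped to a feature vector (for example by a pretrained network) and each labelled image already has an estimated value, the expected value of an unlabeled image could be approximated by averaging the values of its nearest labelled neighbours. The function names, the toy features and the choice of k-nearest-neighbour averaging are all illustrative assumptions, not something presented at the Town Hall.

```python
import math

def knn_expected_value(labeled_features, labeled_values, query, k=2):
    """Average the data values of the k labelled points closest to `query`."""
    by_distance = sorted(
        (math.dist(f, query), v) for f, v in zip(labeled_features, labeled_values)
    )
    nearest = by_distance[:k]
    return sum(v for _, v in nearest) / len(nearest)

def shopping_list(labeled_features, labeled_values, unlabeled_features, budget, k=2):
    """Return indices of the `budget` unlabeled points with the highest expected value."""
    scored = [
        (knn_expected_value(labeled_features, labeled_values, q, k), i)
        for i, q in enumerate(unlabeled_features)
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [i for _, i in scored[:budget]]

# Toy 2-D "embeddings" (hypothetical): one low-value and one high-value cluster.
labeled_features = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labeled_values = [0.01, 0.02, 0.0, 0.3, 0.25, 0.35]
unlabeled_features = [(0.2, 0.1), (5.4, 5.1), (2.0, 2.5), (5.1, 5.9)]

to_buy = shopping_list(labeled_features, labeled_values, unlabeled_features, budget=2)
```

Here the two unlabeled points that sit in the high-value cluster end up on the list. Any regressor from features to value could replace the neighbour average.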
Based on Shapley values
@dipam introduced the idea of using Shapley values to estimate the in-sample value of a particular image-label pair. The respective research papers, code examples and videos can be found here:
Ghorbani, Amirata, and James Zou. “Data shapley: Equitable valuation of data for machine learning.” International Conference on Machine Learning. PMLR, 2019: Paper, Video, Code
Jia, Ruoxi, et al. “Towards efficient data valuation based on the shapley value.” The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019. Paper
Ghorbani, Amirata, Michael Kim, and James Zou. “A distributional framework for data valuation.” International Conference on Machine Learning. PMLR, 2020. Paper, Code
Kwon, Yongchan, Manuel A. Rivas, and James Zou. “Efficient computation and analysis of distributional Shapley values.” International Conference on Artificial Intelligence and Statistics. PMLR, 2021. Paper, Code
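To make the idea concrete, here is a minimal sketch of a permutation-sampling (Monte Carlo) estimate of Shapley values for training points. It is a simplified illustration, not the code from the papers above: the utility is the validation accuracy of a toy 1-nearest-neighbour classifier on scalar features, and all names and data are made up for the example. A point's value is its average marginal contribution to the utility over random orderings of the training set.

```python
import random

def accuracy_1nn(train, val):
    """Utility: validation accuracy of a 1-nearest-neighbour classifier.
    `train` and `val` are lists of (scalar_feature, label) pairs."""
    if not train:
        return 0.0
    correct = 0
    for x, y in val:
        nearest = min(train, key=lambda p: abs(p[0] - x))
        correct += nearest[1] == y
    return correct / len(val)

def monte_carlo_shapley(train, val, utility, n_permutations=500, seed=0):
    """Estimate each training point's Shapley value by averaging its
    marginal contribution over random permutations of the training set."""
    rng = random.Random(seed)
    n = len(train)
    values = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset = []
        prev = utility(subset, val)
        for i in order:
            subset.append(train[i])
            curr = utility(subset, val)
            values[i] += curr - prev  # marginal contribution of point i
            prev = curr
    return [v / n_permutations for v in values]

# Toy data: the last training point carries a deliberately wrong label.
train = [(0.0, 0), (0.2, 0), (1.0, 1), (1.2, 1), (0.12, 1)]
val = [(0.05, 0), (0.15, 0), (0.9, 1), (1.1, 1)]
values = monte_carlo_shapley(train, val, accuracy_1nn)
```

In this toy setup the mislabelled point should receive the lowest value, which is also why such valuations can help spot noisy labels. Retraining a real image model for every prefix of every permutation is expensive, which is what the truncation and approximation techniques in the papers address.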
Based on reinforcement learning
Another way to do in-sample data valuation is to frame it as a reinforcement learning problem. The related research paper can be found here:
Yoon, Jinsung, Sercan Arik, and Tomas Pfister. “Data valuation using reinforcement learning.” International Conference on Machine Learning. PMLR, 2020. Paper, Blog post, Code
Iteratively purchase labels
Purchasing labels iteratively is its own area of machine learning research: active learning. The Wikipedia page provides a reasonably good entry point to this literature. You might also want to check GitHub for relevant repositories, some with plenty of stars and/or references to practitioner handbooks (e.g. 1, 2)…
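A common starting point in active learning is uncertainty sampling: buy the labels the current model is least sure about, retrain, and repeat. The sketch below shows only the selection step, assuming a multi-label setting where the model outputs one independent probability per label for each unlabeled image. The function names and the probabilities are hypothetical.

```python
import math

def binary_entropy(p, eps=1e-12):
    """Entropy of a single yes/no prediction; largest at p = 0.5."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def select_most_uncertain(probabilities, budget):
    """Rank unlabeled images by summed per-label entropy and return
    the indices of the `budget` most uncertain ones."""
    scores = [sum(binary_entropy(p) for p in row) for row in probabilities]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]

# Hypothetical predicted probabilities for two labels per image.
predicted = [
    [0.98, 0.01],  # confident on both labels
    [0.55, 0.48],  # unsure on both labels
    [0.90, 0.40],  # unsure on the second label
    [0.02, 0.03],  # confident on both labels
]
to_buy = select_most_uncertain(predicted, budget=2)
```

In an iterative purchase loop, this selection would run once per round with a slice of the budget, followed by retraining on the enlarged labelled set before the next round.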
And if you stumble over more interesting approaches, please share to qualify for our cool community prizes!
I tried this method in round 1 (locally) and it worked pretty well:
It’s sold as an active learning method, but it really selects labels in one go. However, it is essential that it uses a model that was trained in an unsupervised fashion, like Facebook’s DINO. I tried using the vision transformer that ships with torchvision and an EfficientNet that was fine-tuned on the given data. Neither worked. Since DINO is not among the supported pretrained weights, it’s not an option in this competition.
I also think that, while it may work, it’s likely not the best-performing method.
I tried this locally too!
But it was still beaten by naively buying labels based on the predicted dent label.