🚀 Share your solutions! 🚀

Hi Everyone,

Thank you for participating in the Data Purchasing Challenge. It has been a unique journey for this one-of-a-kind challenge, and we’re excited to hear your ideas and solutions. Regardless of whether you won, any ideas you share on the discussion forum are highly appreciated. :smiley:

Sharing your solutions also helps you reflect upon your learnings throughout the competition. :rocket:

I’ll summarize all the shared solutions in this post, as I’m sure they’ll be tremendously useful for participants of future Data Purchasing Challenges (yes, there will be more :wink:).

Please also share failed ideas; as all machine learning aficionados know, negative samples are just as important.

Looking forward to your solutions! :raised_hands:

First of all, I would like to say a huge thank you to AIcrowd for this unique and super fun challenge! Also, congrats to the other winners! I consider myself very lucky to have landed first place and would love to share my solution and learnings here!

My solution is very simple and straightforward. It is basically “iteratively purchase the next batch of data with the best possible model until the purchase budget runs out”. One of the biggest challenges in this competition, imho, is that you cannot get very reliable performance scores locally or on the public leaderboard. So if we can filter more noise out of the weak signal in the scores, the chance of overfitting may be much lower. During my experiments I focused on simple strategies, mainly because more complex strategies require more tuning, which means more decisions to make and a higher risk of overfitting (since every time we make a decision, we refer to the same local and public scores, over and over again).

OK, enough hypothesis and high-level talk! Here are the details (code):

Most importantly, the purchase strategy:

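import numpy as np

def decide_what_to_purchase(probpred_ary, purchased_labels, num_labels):
    """Purchase strategy given the predicted probabilities."""
    # Clip so that np.log never sees exactly 0 or 1.
    prob = np.clip(probpred_ary, 1e-7, 1 - 1e-7)
    oneminusprob = 1 - prob
    # Negative mean binary entropy per image: ascending sort puts the most uncertain first.
    neg_entropy = (oneminusprob * np.log(oneminusprob) + prob * np.log(prob)).mean(axis=1)
    already_purchased = set(purchased_labels)  # build the set once, not per element
    topk_prob_ind = [x for x in np.argsort(neg_entropy) if x not in already_purchased][:num_labels]
    return set(topk_prob_ind)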

basically, select the most uncertain samples based on entropy.
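
Written out, the quantity being sorted is the mean over classes of p log p + (1 - p) log(1 - p), which is the negative of the average per-class binary entropy. The notation below is introduced here for illustration only: $p_{i,c}$ is the predicted probability of class $c$ for image $i$, $C$ is the number of classes, and $H$ is the binary entropy.

```latex
s_i = \frac{1}{C}\sum_{c=1}^{C}\Big[\,p_{i,c}\log p_{i,c} + (1-p_{i,c})\log(1-p_{i,c})\Big]
    = -\frac{1}{C}\sum_{c=1}^{C} H(p_{i,c})
```

Sorting $s_i$ in ascending order therefore puts the images with the highest average entropy first, and $s_i$ is smallest when every $p_{i,c}$ is 0.5.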

For each iteration, the number of labels to purchase is decided on the fly given the compute and purchase budgets:

    def get_iteration_numbers(self, purchase_budget, compute_budget):
        for ratio in range(30, 100, 5):
            ratio /= 100
            num_labels_list = self.get_purchase_numbers(purchase_budget, ratio=ratio)
            for epochs in ZEWDPCBaseRun.generate_valid_epoch_comb(num_labels_list):
                epoch_time = self.calculate_total_time_given_epochs(epochs, num_labels_list)
                #print("ratio and time takes!", ratio, epoch_time)
                if epoch_time <= compute_budget:
                    print(f"settle with ratio {ratio}")
                    return num_labels_list, epochs
        return [purchase_budget], [10]

    def get_purchase_numbers(self, purchase_budget, ratio):
        start_number = int(1000 * ratio)
        if start_number >= purchase_budget:
            return [purchase_budget]
        num_labels_list = [start_number]
        remain_budget = purchase_budget - start_number
        while remain_budget > 0:
            label_to_purchase = min(remain_budget, int(ratio * 1000))
            # Record each batch; without this the list would only hold the first one.
            num_labels_list.append(label_to_purchase)
            remain_budget -= label_to_purchase
        return num_labels_list

Basically, we first check whether we can purchase 300 images per iteration and still exhaust the purchase budget before we run out of time. If not, we increase the batch to 350 images (so fewer iterations) and check again, then 400 images, and so on. We redo this planning at every iteration and only use the first element of the resulting purchase list. So we may have decided on 300 images per iteration in the last round and raise that to 400 images in this one. The reason is that we can't accurately predict how long the next iteration's model will take to train, so we re-estimate each time whether we can still finish in time. In fact, I used a moving average (with an extra 0.1 time buffer) to estimate how long the next iteration may take to train.

self.train_time_per_image = self.train_time_per_image * 0.8 + (train_end_time - train_start_time) / len(tr_dset) / num_epochs * 0.3
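
To make the update above easier to play with, here is a minimal standalone sketch of the same rule. The function name and argument names are made up for illustration; only the 0.8 and 0.3 weights come from the line above.

```python
def update_time_per_image(prev_estimate, elapsed_seconds, num_images, num_epochs,
                          keep=0.8, blend=0.3):
    """Blend the previous per-image, per-epoch estimate with the latest measurement.

    keep + blend = 1.1, which is the extra 0.1 buffer on top of a plain moving average.
    """
    measured = elapsed_seconds / num_images / num_epochs
    return prev_estimate * keep + measured * blend


# Hypothetical numbers: the last round took 100 s for 1000 images and 10 epochs.
estimate = update_time_per_image(0.01, 100.0, 1000, 10)
```

One property worth knowing: because the weights sum to 1.1 rather than 1, repeated updates with a constant measurement do not settle at 1.1 times that measurement but at 0.3 / (1 - 0.8) = 1.5 times it, so the safety margin grows over the iterations.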

Now within each iteration, we need to train the model.

The model I ended up with is

def load_model():
    """load a pretrained model"""
    model = models.regnet_y_8gf(pretrained=True)
    model.fc = nn.Linear(2016, 6)
    return model.to(device)

basically, the most complex model that is still reasonable to train.

As with any other computer vision problem, data augmentation is key:

self.train_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.5, contrast=0.8, saturation=0.5, hue=0.5),
    transforms.RandomAdjustSharpness(sharpness_factor=2, p=0.25),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
self.test_transform = transforms.Compose([
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

I also tried test-time augmentation, but prediction took too much time, so it might not be worth it.

And the learning rate scheduler:

scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=len(dataloader)*NEPOCH, T_mult=1)

NEPOCH was a tuning parameter; I tried 5, 7, 10, and 15. 5 or 7 didn’t seem to be enough, 15 seemed to be a bit too much, and 10 seemed to be pretty good.
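
For intuition about what this schedule does, here is a small pure-Python sketch of the cosine annealing formula from the PyTorch documentation, applied to a single cycle. It assumes the scheduler is stepped once per batch, which is what T_0 = len(dataloader) * NEPOCH suggests but is not shown above; the batch count and base learning rate are made-up numbers.

```python
import math


def cosine_lr(step, total_steps, base_lr, eta_min=0.0):
    """Learning rate at `step` within one cosine cycle of length `total_steps`."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / total_steps))


# Hypothetical setup: 50 batches per epoch, 10 epochs, base learning rate 1e-3.
batches_per_epoch, n_epochs, base_lr = 50, 10, 1e-3
total_steps = batches_per_epoch * n_epochs
schedule = [cosine_lr(s, total_steps, base_lr) for s in range(total_steps)]
```

With T_0 equal to the total number of training steps, the learning rate starts at its base value, passes half of it midway through training, and approaches eta_min at the end, without any restart inside the run.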

So basically, the flow works like this:

  1. train the model, with data aug, 10 epochs, and 1 cycle of cosine annealing lr scheduler
  2. decide how many images to purchase based on compute and purchase budget
  3. do one round of prediction to get probabilities and then purchase the most uncertain images based on entropy
  4. collect the just purchased images, further train the model (load the model and optimizer checkpoints)
  5. repeat until no purchase budget left
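
The steps above can be sketched as a small driver loop. This is only an outline under stated assumptions: the five callables are hypothetical stand-ins for training, planning, prediction, selection, and purchasing, not the actual competition API.

```python
def active_learning_loop(train, predict, pick_uncertain, plan_batch_size, purchase,
                         purchase_budget):
    """Run train -> plan -> predict -> purchase until the budget is spent.

    All five callables are hypothetical placeholders for the steps listed above.
    """
    purchased = {}  # image index -> label
    train(purchased)  # step 1: initial training on the existing labelled data
    while purchase_budget > 0:
        # step 2: decide how many images to buy this round
        batch_size = min(plan_batch_size(purchase_budget), purchase_budget)
        if batch_size <= 0:
            break
        probs = predict()  # step 3: probabilities for the unlabelled pool
        chosen = pick_uncertain(probs, set(purchased), batch_size)
        if not chosen:
            break  # nothing left to buy
        purchased.update(purchase(chosen))
        purchase_budget -= len(chosen)
        train(purchased)  # step 4: continue training with the new labels
    return purchased
```

Passing the steps in as functions keeps the loop itself independent of the model and of the purchase strategy, so either can be swapped without touching the control flow.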

Hopefully this is helpful! And please ask any questions if you have any!


Hello, I want to share my solution.

The competition was very interesting and unusual. It was my first competition on the AIcrowd platform, and the guides, pages, and discussions were very helpful to me. So thanks to the organizers!!!

Actually my solution is very similar to xiaozhou_wang’s.

I have two strategies. The first is based on the idea of collecting samples from “hard” classes (it carried over from Round 1). Suppose we have a trained model and we know the F1 score for each of the six classes on validation. We sum the class predictions with weights equal to 1 - f1_validation, and then choose the samples with the largest weighted sums.

def choose_unlabelled_by_sum_probs(self, unlabelled_indices, unlabelled_preds, choose_size):
    assert len(unlabelled_indices) == len(unlabelled_preds)

    if len(unlabelled_indices) <= choose_size:
        return unlabelled_indices

    _, best_f1s = self.best_states['best_thrs_0']

    choose_scores = unlabelled_preds[:, 0] * (1 - best_f1s[0])
    for x in range(1, n_classes):
        choose_scores += unlabelled_preds[:, x] * (1 - best_f1s[x])
    sorted_indices = np.argsort(-choose_scores)
    return [unlabelled_indices[x] for x in sorted_indices[:choose_size]]

The second strategy is to collect samples with higher uncertainty. I consider a prediction of 0.5 to be the most uncertain, so I just sum 0.5 - |prediction - 0.5| over all classes and pick the samples with the largest sums.

def choose_unlabelled_by_uncertainty(self, unlabelled_indices, unlabelled_preds, choose_size):
    assert len(unlabelled_indices) == len(unlabelled_preds)

    if len(unlabelled_indices) <= choose_size:
        return unlabelled_indices

    _, best_f1s = self.best_states['best_thrs_0']

    choose_scores = np.sum(0.5 - np.abs(unlabelled_preds - 0.5), axis=1)
    sorted_indices = np.argsort(-choose_scores)
    return [unlabelled_indices[x] for x in sorted_indices[:choose_size]]

I also considered the third strategy from the hosts, “match labels to target distribution”, but results were worse with it than without it. PS to the organizers: this code is still in my solution because I experimented with it, but it selects very few samples, so I don't think it matters for the score.

I tried several ratios between the first two strategies, but I didn’t see an obvious advantage for either of them. So in the end I used both strategies with an equal budget.
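
The post does not show how the two selections were merged, so the following is only one possible way to split a batch evenly between two rankings. The function name and the tie-handling (skip duplicates, top up from the first ranking if the second runs short) are assumptions made for illustration.

```python
def split_budget_between_rankings(ranked_a, ranked_b, choose_size):
    """Take about half of `choose_size` from each ranking, without duplicates."""
    half = choose_size // 2
    chosen = list(ranked_a[:half])
    taken = set(chosen)
    # Fill the rest from the second ranking, skipping anything already chosen.
    for idx in ranked_b:
        if len(chosen) >= choose_size:
            break
        if idx not in taken:
            chosen.append(idx)
            taken.add(idx)
    # If the second ranking ran short, top up from the rest of the first.
    for idx in ranked_a[half:]:
        if len(chosen) >= choose_size:
            break
        if idx not in taken:
            chosen.append(idx)
            taken.add(idx)
    return chosen
```

Here `ranked_a` and `ranked_b` would be the outputs of the two `choose_unlabelled_by_*` functions, each called with the full `choose_size` so that enough candidates remain after duplicates are dropped.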

I saw the idea of “Active Learning” in one of the papers and decided to run several iterations (let’s say L).

  1. Train a model with current known samples
  2. Take ~purchase_budget//L samples using the two strategies (the last batch can be bigger by 1).

The problem was how to choose the number of iterations L. My approach is not as clever as xiaozhou_wang’s. I noticed that ~300 samples are enough for one iteration. Moreover, in my experiments more iterations sometimes worsened the result. I looked at the submissions table to estimate training time and inference time. So I arrived at the following formula (I have a pretraining phase, so the first iteration doesn’t need training):

max_choose_size = min(len(unlabelled_dataset), purchase_budget)
n_loops = max(1, min(1 + (compute_budget - 50) // 220, int_ceil(max_choose_size, 290)))
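
A runnable version of the formula above, for trying out different budgets. `int_ceil` is not defined in the post, so it is assumed here to be integer ceiling division; the wrapper function and the example budgets are made up, while the constants 50, 220, and 290 are taken from the formula.

```python
def int_ceil(numerator, denominator):
    """Ceiling division on integers (assumed meaning of the original helper)."""
    return -(-numerator // denominator)


def number_of_loops(compute_budget, purchase_budget, n_unlabelled):
    """Number of purchase iterations allowed by both time and batch size."""
    max_choose_size = min(n_unlabelled, purchase_budget)
    by_time = 1 + (compute_budget - 50) // 220  # first loop needs no training
    by_size = int_ceil(max_choose_size, 290)    # roughly 290 samples per loop
    return max(1, min(by_time, by_size))


# Hypothetical budgets: plenty of time, short on time, and almost no time.
loops_roomy = number_of_loops(3000, 3000, 10000)
loops_tight = number_of_loops(500, 3000, 10000)
loops_minimal = number_of_loops(10, 3000, 10000)
```

With a roomy compute budget the batch size is the limiting factor, with a tight one the time is, and the outer `max(1, ...)` guarantees at least one purchase round even when the time term goes to zero or below.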

For training I used efficientnet_b3, 5 epochs with

CosineAnnealingLR(optimizer, T_max=5, eta_min=1e-5)

and the following augmentations

return A.Compose([

    A.OneOf([A.GaussianBlur(), A.MotionBlur()], p=0.5),