My solution for the challenge


Thank you to the hosts for organizing this interesting competition, and special thanks to @dipam :slight_smile:

My source code for the competition:

I used the CLIP ViT-H model with Product-10k, H&M, Shopee, and the Amazon dataset @bartosz_ludwiczuk provided. The training hyperparameters are mostly taken from the GUIE 4th-place solution.

I feel like most of us used the same datasets and models, but nevertheless I am interested in others' solutions. Maybe we can use this thread to post them.


I would say that your solution is quite similar to what I had at some point during this competition.

  • I was using the same repo as a baseline, meaning the 4th-place solution from Universal Embeddings
  • also ViT-H (not sure if you checked the weights from this repo, but just using them I could get ~0.56 in Round 1, without any training)

Also, what about post-processing: did you use any technique like database-side augmentation?
These post-processing techniques boosted my score in Round 1 from ~0.64 to 0.67.
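For anyone unfamiliar with database-side augmentation: the common form replaces each gallery embedding with the average of itself and its nearest neighbours, so that matching items reinforce each other. A minimal NumPy sketch (function name and `k` are illustrative, not the exact setup used here):

```python
import numpy as np

def database_side_augmentation(gallery: np.ndarray, k: int = 3) -> np.ndarray:
    """Replace each L2-normalised gallery vector with the mean of
    itself and its top-k nearest neighbours (cosine similarity)."""
    sims = gallery @ gallery.T                    # pairwise cosine sims (vectors pre-normalised)
    topk = np.argsort(-sims, axis=1)[:, : k + 1]  # self + k neighbours
    augmented = gallery[topk].mean(axis=1)        # average each vector with its neighbours
    # re-normalise so cosine similarity stays meaningful afterwards
    return augmented / np.linalg.norm(augmented, axis=1, keepdims=True)
```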

I'm also going to describe my solution in a blog post (to cover my whole journey); then we can compare our solutions :)


I also used query expansion with PageRank-inspired thresholding, which gave a 0.03 boost to the score.
I want to ask how you fit the ViT-H model into the 10-minute inference limit. I tried using it through timm, JIT-compiled, but could never fit it in time.
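For context, a basic form of query expansion averages each query with its highest-scoring gallery matches, keeping only neighbours above a similarity threshold. A hedged NumPy sketch (the threshold and uniform weighting are illustrative; the PageRank-inspired thresholding mentioned above is not reproduced here):

```python
import numpy as np

def query_expansion(queries: np.ndarray, gallery: np.ndarray,
                    k: int = 5, threshold: float = 0.7) -> np.ndarray:
    """Expand each L2-normalised query with gallery neighbours whose
    cosine similarity exceeds `threshold`."""
    sims = queries @ gallery.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    expanded = []
    for qi, q in enumerate(queries):
        neigh = [gallery[j] for j in topk[qi] if sims[qi, j] >= threshold]
        v = np.mean([q] + neigh, axis=0)       # query plus accepted neighbours
        expanded.append(v / np.linalg.norm(v)) # re-normalise
    return np.stack(expanded)
```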

About ViT-H: based on your code you were training with ViT-H, so I understand that you used it for inference as well, am I right?

About using ViT-H for inference, it was not a big deal for me; it just worked without any issue. The code for ViT-H looks like this (I used jit.trace, as it is faster than scripting: TorchScript: Tracing vs. Scripting - Yuxin's Blog):

```python
self.model_scripted = torch.jit.load(model_path).eval().to(device=device_type)

gallery_dataset = SubmissionDataset(
    root=self.dataset_path, annotation_file=self.gallery_csv_path,
    transforms=get_val_aug(self.input_size)  # gallery transform name assumed, mirroring the query branch
)
query_dataset = SubmissionDataset(
    root=self.dataset_path, annotation_file=self.queries_csv_path,
    transforms=get_val_aug_query(self.input_size), with_bbox=True
)

datasets = ConcatDataset([gallery_dataset, query_dataset])
combine_loader = DataLoader(
    datasets, batch_size=self.batch_size,
    shuffle=False, pin_memory=True, num_workers=self.inference_cfg.num_workers
)

print('Calculating embeddings')
embeddings = []
with torch.cuda.amp.autocast():
    with torch.no_grad():
        for i, images in tqdm(enumerate(combine_loader), total=len(combine_loader)):
            images = images.to(device_type)
            outputs = self.model_scripted(images).cpu().numpy()
            embeddings.append(outputs)
```

Forgot to mention: I was using PyTorch 2.0, and it was a game changer for both training and inference time.
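For readers who haven't tried it, much of the PyTorch 2.0 speed-up comes from `torch.compile`, which wraps an existing module with a graph-compiling optimizer in one line. A minimal sketch (the model and input are placeholders, not the competition model):

```python
import torch

# any nn.Module works; a tiny linear layer stands in for the real network
model = torch.nn.Linear(16, 8).eval()
compiled = torch.compile(model)  # PyTorch >= 2.0; default backend is inductor

with torch.no_grad():
    out = compiled(torch.randn(2, 16))  # first call triggers compilation
```

The first call is slow while the graph compiles; subsequent calls reuse the compiled kernel, which is where the training/inference gains come from.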


Unfortunately, I couldn't do any post- or pre-processing during inference because ViT-H was too slow. I tried to speed up my models by pruning and exporting them to ONNX, and also by using JIT compilation, but neither method gave a noticeable speed-up. It's possible that I did something wrong, or that the improvements were only visible on CPU rather than GPU.

I was planning to train a CLIP model with the Amazon dataset, since it has product descriptions, but I didn't have time to experiment with it.

We attempted to incorporate multiple external datasets into our experiments, spending considerable time trying to train our ViT-H model jointly with Product10k and other datasets, as well as training on other datasets and fine-tuning on Product10k. Surprisingly, despite our efforts, our current leaderboard score was achieved using only the Product10k dataset; every other dataset decreased our score.

To improve our results, we used re-ranking as post-processing, which gave us a marginal improvement of approximately 0.01. Additionally, we experimented with ConvNeXt and ViT-G models, which boosted our local score by about 0.03. However, even with TensorRT, these models could not finish inference within 10 minutes.
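The exact re-ranking scheme isn't described above; one common retrieval variant blends the direct query-gallery similarity with a neighbour-overlap (Jaccard) term, in the spirit of k-reciprocal re-ranking. A simplified NumPy sketch (function name, `k`, and the blending weight are all illustrative assumptions):

```python
import numpy as np

def rerank(q_sims: np.ndarray, g_sims: np.ndarray,
           k: int = 5, lam: float = 0.5) -> np.ndarray:
    """Blend direct similarity with neighbour-overlap similarity.
    q_sims: (Q, G) query-gallery cosine sims; g_sims: (G, G) gallery-gallery sims."""
    q_top = np.argsort(-q_sims, axis=1)[:, :k]  # each query's top-k gallery items
    g_top = np.argsort(-g_sims, axis=1)[:, :k]  # each gallery item's top-k neighbours
    Q, G = q_sims.shape
    jac = np.zeros((Q, G))
    for qi in range(Q):
        qset = set(q_top[qi])
        for gi in range(G):
            gset = set(g_top[gi])
            jac[qi, gi] = len(qset & gset) / len(qset | gset)  # neighbour overlap
    return lam * q_sims + (1 - lam) * jac  # re-ranked scores
```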

@bartosz_ludwiczuk, Congratulations on achieving second place! I’m looking forward to reading about your winning solution.

Did you convert the model to TensorRT before or during inference? Our final solution uses ConvNeXt-XXLarge with an image size of 256 and an embedding dimension of 2048.
To do this, we used NVIDIA's Docker image, with half precision and re-ranking on the GPU.

We tried TRT before inference on our server. We also used re-ranking on GPU, but maybe ours took longer…
Even with ViT-H + re-ranking our solution took almost 10 min; in some cases it failed and in others it ran successfully, depending on the hardware.