Dataset insights, findings and validation strategy:

The dataset has several “properties”: I used some of them in my pipeline and they gave me a lift on the leaderboard, but unfortunately they didn’t propel me into the top 3 at the end of the competition.

The competition description mentions that the dataset consists of sets/flights of 5 (or fewer) frames at different zooms, so the first step when building the validation set is to keep images from the same set entirely in either validation or training, in order to avoid a leak between the two. Even so, there was still quite a gap between the validation metric and the leaderboard metric, so I had to dig further.
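
A minimal sketch of such a grouped split, assuming you already have a flight/set id for every image; where that id comes from depends on how the data is shipped, so the `samples` structure below is an assumption:

```python
# Leakage-free split: all frames of a flight/set stay on the same side.
# `samples` is assumed to be a list of (image_path, set_id) pairs.
from sklearn.model_selection import GroupKFold

def split_by_set(samples, n_splits=5):
    paths = [p for p, _ in samples]
    groups = [s for _, s in samples]
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, val_idx in gkf.split(paths, groups=groups):
        train = [paths[i] for i in train_idx]
        val = [paths[i] for i in val_idx]
        yield train, val
```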

For the segmentation competition, in order to get a better understanding of the classes and in an attempt to balance validation / training in terms of classes – especially humans, animals, snow and wire – I started by writing a script to count the number of pixels per class per image. I opened the resulting table in Excel and was surprised by the extreme and the very low values.
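
A minimal sketch of that counting script, assuming the masks are single-channel PNGs whose pixel values are class ids; the mask directory is a placeholder:

```python
# Count pixels per class per mask and dump the table to a CSV.
import csv
from pathlib import Path

import numpy as np
from PIL import Image

MASK_DIR = Path("train/masks")   # hypothetical location of the segmentation masks
NUM_CLASSES = 256                # covers any class id, including 255/unknown

with open("class_pixel_counts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image"] + [f"class_{c}" for c in range(NUM_CLASSES)])
    for mask_path in sorted(MASK_DIR.glob("*.png")):
        mask = np.array(Image.open(mask_path))          # assumed single-channel
        counts = np.bincount(mask.ravel(), minlength=NUM_CLASSES)
        writer.writerow([mask_path.name] + counts.tolist())
```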

Many images have fewer than 50 pixels allocated to a particular class, an area that is impossible for the human eye to annotate when you open the images in an annotation tool. It is easy to find 30+ mistakes in the annotations with that method: for example, there is an image where about a quarter of the picture is annotated as garden furniture.
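
A quick way to flag those rows from the CSV produced above (the 50-pixel threshold is the one mentioned here; tune it to taste):

```python
# Flag images where a class covers fewer than ~50 pixels: such annotations are
# usually impossible to draw on purpose and are worth a manual look.
import pandas as pd

df = pd.read_csv("class_pixel_counts.csv", index_col="image")
suspicious = (df > 0) & (df < 50)
for image, row in df[suspicious.any(axis=1)].iterrows():
    tiny = [c for c in df.columns if 0 < row[c] < 50]
    print(image, tiny)
```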

During that process, and by looking at the actual images on screen, I realized that on top of a really large number of annotation mistakes there were some duplicates, but at first I didn’t realize the extent of it.

Now, if you look at the image names, you will notice that each name is composed of two strings xxxx and yyyy separated by a dash: xxxx-yyyy. By sorting the image names of the same set alphabetically, it appears that the first image is always the widest view and the last image is always the narrowest view, so every set of 5 images has a wide view, 3 intermediate views and a really narrow view (zoom).
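
A sketch of that ordering, assuming a `set_to_images` dict mapping each set id to its (up to) 5 file names:

```python
# Recover the view order inside a set by sorting the file names:
# alphabetical order == widest ... narrowest.
def order_views(set_to_images):
    ordered = {}
    for set_id, names in set_to_images.items():
        names = sorted(names)
        ordered[set_id] = {
            "widest": names[0],
            "narrowest": names[-1],
            "intermediate": names[1:-1],
        }
    return ordered
```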

There are (at least) two ways to use that property:
A. Separate the dataset into 5 view classes as mentioned above, run a classifier to predict the view class at inference time, then run a segmentation model dedicated to each view class: I tried it, but the results didn’t beat a single model, most likely because of the small number of pictures in each view class.

B. Create a dataset made of the largest views only (about 358 pictures) and use it to find the duplicates. Using a script found on GitHub (link here), it appears that there is a large number of duplicates in the dataset, so you can in fact reduce the dataset to 139 sets of 5 images instead of the initial 358 sets. Some scenes, such as the garden with the wine, have up to 7 duplicates (!), and I only looked for duplicates without rotating the images. A hashing sketch of the idea follows the example below.

Example of duplicated sets:
['1c34ab84dd6247f08c9a1b01cf5fba19-1621369308700015711.png',
 'e96b49a584a040209fc4e0b9ba7cbb3f-1621370692100010852.png',
 '0fbb3bd92b1a4cccbbfd10d5f345b273-1621371787200010506.png',
 'db86f9a4c2fd4e60bcf24b057d08c305-1621364836400015192.png',
 '019c9f9128464ab59a58f2e95f83d55a-1631898858600005621.png',
 '77cd0ba0d82d4775b377e3e3f095fae0-1628011861000005388.png',
 'da4cc49f4507413981793b8479abd873-1628011281100006464.png']
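
This is not the GitHub script I actually used, just a minimal sketch of the idea with the `imagehash` package; the folder of widest frames is a placeholder, and grouping by exact hash equality is a simplification (the real script compares hash distances):

```python
# Group near-identical "largest view" frames by perceptual hash.
from pathlib import Path

import imagehash
from PIL import Image

LARGEST_VIEWS_DIR = Path("largest_views")   # hypothetical folder of widest frames

hashes = {}
for path in sorted(LARGEST_VIEWS_DIR.glob("*.png")):
    h = imagehash.phash(Image.open(path))
    hashes.setdefault(str(h), []).append(path.name)

# Any hash shared by more than one image is a candidate duplicate group.
duplicate_groups = [names for names in hashes.values() if len(names) > 1]
for group in duplicate_groups:
    print(group)
```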

The funny thing is that, by looking carefully, you will find at least two sets of images where the exact same roof has solar panels in one set and none in the other, with different lighting and a different rotation.

Given this finding, the validation / training sets need to be split according to the duplicate groups so that there isn’t any leakage: having the same images in both validation and training distorts the metric. I also chose to rebalance the dataset by replicating the images with no or few duplicates, and towards the end of the competition I managed to reduce the gap between the local validation and the leaderboard to about 4% on the segmentation competition and 6% on the depth one.
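
A sketch of that de-leaked split plus rebalancing, assuming a `duplicate_group` mapping built from the duplicate detection above; the oversampling factor is a made-up placeholder, not the one I actually used:

```python
# Split by duplicate group (no scene appears on both sides), then oversample
# training sets whose scene has no or few duplicates.
import random

from sklearn.model_selection import GroupShuffleSplit

def split_and_rebalance(set_ids, duplicate_group, val_fraction=0.2, seed=0):
    groups = [duplicate_group[s] for s in set_ids]
    gss = GroupShuffleSplit(n_splits=1, test_size=val_fraction, random_state=seed)
    train_idx, val_idx = next(gss.split(set_ids, groups=groups))
    train = [set_ids[i] for i in train_idx]
    val = [set_ids[i] for i in val_idx]

    group_size = {g: groups.count(g) for g in set(groups)}
    rebalanced = []
    for s in train:
        # Hypothetical factor: scenes seen only once get an extra copy.
        copies = max(1, 3 - group_size[duplicate_group[s]])
        rebalanced.extend([s] * copies)
    random.shuffle(rebalanced)
    return rebalanced, val
```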

Again, now that the duplicates are known, it is easier to build a multi-fold validation set (in Jeremy Howard style – see his 2017 teaching videos) to find the annotation mistakes: I found 300+ mistakes in the annotations, and most of the corrections consisted of changing the given class to unknown/255. Unfortunately, that only gave me a small boost on the leaderboard (less than 2%). The “dirt” class became my nightmare because it is inconsistently annotated across the dataset, and the wire class is one of the most difficult to annotate because it is hard to differentiate an actual wire from a shadow in a black-and-white picture, but the model improved slightly by using mosaic images of animals, humans and snow.
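
A sketch of the correction step, assuming the manual review produces a hand-built `corrections` dictionary (shown empty here) listing, per image, which class ids to send to unknown/255:

```python
# Remap wrongly annotated classes to 255 so the loss ignores them.
from pathlib import Path

import numpy as np
from PIL import Image

corrections = {
    # "image_name.png": [class ids to send to unknown/255]  -- filled in by hand
}

MASK_DIR = Path("train/masks")             # hypothetical paths
OUT_DIR = Path("train/masks_corrected")
OUT_DIR.mkdir(exist_ok=True)

for name, bad_classes in corrections.items():
    mask = np.array(Image.open(MASK_DIR / name))
    mask[np.isin(mask, bad_classes)] = 255
    Image.fromarray(mask).save(OUT_DIR / name)
```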

Other properties:

In the image name xxxx-yyyy, it looks like the first 6 digits of the yyyy string correspond to the time at which the images were taken, so you can technically follow the path of the drone image after image. I didn’t find a way to directly use that property given the AIcrowd setup at inference time.
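
For what it’s worth, a sketch of how the frames of a flight can be ordered by that timestamp-like yyyy part of the name (purely exploratory, I never used it at inference):

```python
# Order frames by the numeric yyyy suffix so the drone path can be replayed.
def sort_by_capture_time(names):
    def capture_key(name):
        yyyy = name.rsplit(".", 1)[0].split("-")[1]
        return int(yyyy)
    return sorted(names, key=capture_key)
```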

In the depth masks, there are some outlier values and, for some reason, most of them (they can’t simply be removed) are located on the border of the mask, along the left edge and towards the bottom-left corner: is that a property of the sensor? Duplicating the images with large values in training slightly helped the final score.
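
A sketch of how such outliers can be located in a depth mask before deciding which images to duplicate; the threshold is a placeholder, not the value I actually used:

```python
# Report how many values exceed a threshold and where they sit on average.
import numpy as np

def outlier_stats(depth, threshold=1000.0):
    ys, xs = np.where(depth > threshold)
    if len(xs) == 0:
        return None
    h, w = depth.shape
    return {
        "count": len(xs),
        "mean_x_frac": float(xs.mean()) / w,   # small -> towards the left edge
        "mean_y_frac": float(ys.mean()) / h,   # large -> towards the bottom
    }
```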

In the images, the primeair_pattern is always located towards the middle of the picture, the wires are always towards the edges, and the same white van is always parked somewhere in the pictures. I thought about it, but I didn’t find a way to integrate a correction script into my pipeline to leverage these findings given the limited inference time of 10 s.
