Top teams' solutions

First of all, I'd like to thank the organizers for putting together such a good and interesting competition, and for providing solid baseline code.

I also want to encourage other top teams to share their solutions. It’s through this sharing of knowledge and techniques that we all grow and learn new things, which was one of the main reasons we participated in the first place.

So, let’s dive into our solution:

We began by assembling a dataset of around 300,000 images sourced from Airbnb. These images were processed to extract features that let us distinguish indoor from outdoor scenes, and we kept only the indoor images. For each indoor image, we generated a segmentation mask and a depth map. We also created an empty-room version of each image using an inpainting pipeline.
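The preparation steps above can be sketched roughly as follows. This is a minimal illustration assuming Hugging Face `transformers` pipelines; the checkpoint names and the indoor-keyword heuristic are assumptions, not the team's exact choices.

```python
# Illustrative sketch of the dataset-preparation steps (assumed details,
# not the team's actual code).
INDOOR_KEYWORDS = {"bedroom", "living", "kitchen", "bathroom", "dining", "office", "interior"}

def is_indoor(scene_label: str) -> bool:
    """Keyword heuristic to keep only interior scenes."""
    return any(k in scene_label.lower() for k in INDOOR_KEYWORDS)

def prepare_sample(image):
    """Produce the depth map and segmentation mask for one indoor image."""
    from transformers import pipeline  # imported lazily; the models are large

    depth = pipeline("depth-estimation", model="Intel/dpt-large")(image)["depth"]
    seg = pipeline("image-segmentation", model="nvidia/segformer-b5-finetuned-ade-640-640")(image)
    return depth, seg
```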

With the LLaVA-1.5 model, we generated image descriptions. Once the dataset was ready, we trained two custom ControlNets, one conditioned on depth maps and one on segmentation maps. During inference, we combined the inpainting pipeline with the two custom ControlNets and an IP-Adapter.
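A hedged sketch of that inference setup with `diffusers`: an SD1.5 inpainting pipeline driven by two ControlNets plus an IP-Adapter. The ControlNet repo ids are placeholders for the team's custom weights, and the conditioning scales are guesses, not their tuned values.

```python
# Sketch of an inpainting pipeline with two ControlNets and an IP-Adapter.
def build_pipeline():
    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline

    controlnets = [
        ControlNetModel.from_pretrained("<custom-depth-controlnet>", torch_dtype=torch.float16),  # placeholder id
        ControlNetModel.from_pretrained("<custom-seg-controlnet>", torch_dtype=torch.float16),    # placeholder id
    ]
    pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnets,
        torch_dtype=torch.float16,
    ).to("cuda")
    # The IP-Adapter injects the style of a reference image into generation.
    pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
    return pipe

def call_kwargs(prompt, room, mask, depth_map, seg_map, style_ref):
    """Arguments for one inpainting call; the conditioning scales are assumed."""
    return dict(
        prompt=prompt,
        image=room,
        mask_image=mask,
        control_image=[depth_map, seg_map],
        ip_adapter_image=style_ref,
        controlnet_conditioning_scale=[0.8, 0.6],
    )
```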

For those interested in replicating our work, the full code is available on GitHub, and the submission code is here.

Furthermore, our final best model can be accessed and played with here.

If you find our model space and solution intriguing, we would be grateful for likes on Hugging Face as well as on GitHub. Your support means a lot to us.


:clap: Nice use of an external dataset, and of converting it to this challenge's format. I explored different variations at inference time using prompt engineering. I used a better segmentation model (swin-base-IN21k) and modified the control items to include pillars as well, for better geometry, along with different prompt-engineering techniques. Even though the baseline gave me a better score, it is really inconsistent. In the end I submitted a Realistic Vision model via ComfyUI, which gave stable and consistent results, and given the human evaluations I did expect some randomness in the leaderboard.

I would like to thank the organizers of this challenge. The challenge is new and exciting, but because there are only 40 images in the test dataset, the human evaluations are quite noisy and inconsistent. It was really fun exploring Stable Diffusion models and their adapters, and I want to keep working on this when I have more compute.
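The "modified control items with pillars" idea can be sketched as a filter on the segmentation output: keep only structural classes in the control image, with columns/pillars added. The class list below is an assumption based on ADE20K-style label names, not the author's code.

```python
# Illustrative filter for building a structural segmentation control image.
# The class set is an assumed approximation of ADE20K structural categories.
STRUCTURAL_CLASSES = {"wall", "floor", "ceiling", "window", "door", "column"}

def keep_structural(segments):
    """Filter predicted segments (dicts with a 'label' key) to structural classes."""
    return [s for s in segments if any(c in s["label"].lower() for c in STRUCTURAL_CLASSES)]
```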


Impressive! The contest was really interesting. I was not able to submit, but I was able to generate some pretty pictures :slight_smile: Very, very inconsistent pipeline, though!
Approach: direct SDXL generation with an MLSD ControlNet (conditioning scale 0.7) and a negative prompt of "doors/windows". Used OneFormer to detect floors and cleaned the line segments on the floor. Used the provided segmentation ControlNet to inpaint the windows and doors back in with SD1.5 (plus the MLSD ControlNet and the IP style adapter). Model initialization time was wonky, ranging from 60 s to 190 s on an A10.
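The floor-line-cleaning step above can be sketched as a simple mask operation: erase MLSD line segments wherever the floor mask (e.g. from OneFormer) is set, so the line ControlNet constrains walls and ceiling but leaves the floor free. A minimal sketch, assuming both inputs are single-channel arrays:

```python
# Erase detected line segments on the floor region of an MLSD line map.
import numpy as np

def clean_floor_lines(line_map: np.ndarray, floor_mask: np.ndarray) -> np.ndarray:
    """Zero out the line map wherever the floor mask is nonzero."""
    cleaned = line_map.copy()
    cleaned[floor_mask > 0] = 0
    return cleaned
```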


Now that the official results are released, here is the blog post for the 2nd-place solution: