It’s awesome to see this kind of engagement
Hopefully we’ve addressed some of the concerns above.
R and Python notebooks
I have gone ahead and made sure that the R and Python notebooks implement exactly the same model. So if you open either notebook, copy it, put in your API key and run the whole thing, you will now get exactly the same predictions from both.
R and Python ZIP submissions
We’ve made two changes (but give it an hour to update):
Default models. Like the notebooks, the files in the repo now both implement the same default model (the mean model), so that you can try a first submission quickly; see the sketch after this list.
New test files. We’ve added test.sh/.bat files to make it easier to generate CSVs for your own testing with the zip submissions.
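For anyone wondering what the default “mean model” amounts to, here is a minimal sketch using the function names from the starter template (the actual template code may differ in its details):

fit_model <- function(x_raw, y_raw) {
  # The "model" is just the average observed claim
  list(mean_claim = mean(y_raw))
}

predict_expected_claim <- function(model, x_raw) {
  # Same prediction for every policy
  rep(model$mean_claim, nrow(x_raw))
}

predict_premium <- function(model, x_raw) {
  # Charge the expected claim as-is (no loading margin in this sketch)
  predict_expected_claim(model, x_raw)
}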
Hi @jyotish, I hit the same [bt] (6) error as @jeremiedb earlier in the day. I too was loading an R script that had a dependency on xgboost. The code ran fine locally and was essentially the same as my earlier successful Colab script.
Would you agree that the error looks like an issue in your zip submission environment? Maybe a problem loading or compiling the xgboost package?
I suspect xgboost (or LightGBM and CatBoost) will be popular packages for this competition. Perhaps you could amend the getting-started notebooks to a stage where they at least install and load these libraries (something like the sketch below)?
I ask that because I suspect it will be easier for you to fix than it is for us users to troubleshoot. We don’t have the same experience in your setup or access to the backend messages and infrastructure.
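To make the request concrete, install.R might gain lines along these lines; this is purely a hypothetical sketch, and CatBoost in particular is not on CRAN:

# Hypothetical additions to install.R (assuming the packages are available
# on CRAN; catboost is not and ships its own R installation route)
install.packages(c("xgboost", "lightgbm"), repos = "https://cloud.r-project.org")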
Must add that I greatly appreciate that you are online and responding to queries!
@jyotish: Can we basically have a custom base image where all the common dependencies are already installed?
That would hopefully both reduce the image build times and help us avoid a lot of back-and-forth on dependency-related issues.
We are already preparing a base image for R that has a large number of packages pre-installed. The package installation is taking very long. We will make an announcement, along with the pre-installed package list, as soon as things are ready.
@jyotish I gave the R zip submission another shot following the update, without success: it resulted in the same error message, although running test.bat locally works fine.
The only thing I could note was that I had to add the line set WEEKLY_EVALUATION=false prior to the first call to predict in order to generate the claims.csv file.
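Presumably predict.R branches on that environment variable along these lines; this is just my reading of the template, not the actual code, and the file names other than claims.csv are illustrative:

# Sketch of how predict.R presumably branches on WEEKLY_EVALUATION
# (assumed structure; the real template may differ)
weekly <- tolower(Sys.getenv("WEEKLY_EVALUATION", "true")) == "true"
if (weekly) {
  prices <- predict_premium(trained_model, Xraw)
  write.csv(data.frame(premium = prices), "prices.csv", row.names = FALSE)
} else {
  claims <- predict_expected_claim(trained_model, Xraw)
  write.csv(data.frame(claims = claims), "claims.csv", row.names = FALSE)
}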
Otherwise, I can’t see what could be wrong. Packages are installed in install.R and loaded where indicated in model.R (I even tried loading them within the fit functions, which was needed for the Colab submissions - still no success). Have you tested the template to validate that it works when packages are used? It would be useful to have a working example of scripts that use packages.
@jeremiedb one idea that may be worth trying… I downloaded one of my Colab submissions and noticed that the zip file I got looks very similar to the format the zip submission needs to be in.
Made me wonder: can you successfully submit this zip file? If yes, can you, through inspection, work out what the contents of the zip need to look like to create a successful zip submission?
I have a couple of questions, if you guys haven’t given up yet:
I’m not sure where to put the library() calls.
Also, my preprocessing involves a recipes::recipe() that wrangles the data and creates the dummy variables. I’d need to attach that “trained recipe” to the submission, or re-train it every time from the original “Training.csv”. Is it possible to attach more files to the submission, like a my_trained_recipe.Rdata file?
For the zip submission, I went for the shotgun approach and likely loaded the libraries in too many places, but at least the following does work. I added require()/library() right after the beginning of each of the functions preprocess_X_data, predict_expected_claim, predict_premium and fit_model, for example:
preprocess_X_data <- function(x_raw) {
  # Load dependencies inside the function so they are also attached
  # in the evaluation environment
  require("data.table")
  require("xgboost")
  ...
}
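As a side note on require() versus library(): require() only returns FALSE with a warning when a package is missing, whereas library() throws an error immediately, so library() will surface a forgotten install.R entry more loudly.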
Note that for the zip submission there is no need to include fit_model.R; it really seems that all that is necessary are the files invoked by predict.R (so the model is loaded directly from trained_model.RData).
Indeed. When you submit via the Colab notebook, we are essentially creating a zip file and submitting that zip file to AIcrowd. For both Python and R based submissions, you can see a submission_dir directory and a submission.zip file on your Colab notebook. The submission.zip file is the exact file that is submitted.
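If you want to inspect that archive from R itself, base R is enough (the path is illustrative; point it at wherever Colab left the file):

# List the contents of the generated archive without extracting it;
# utils::zip() can build an equivalent archive from your own files
unzip("submission.zip", list = TRUE)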
We did test the starter kit with a few packages installed. If you are still facing this issue, we would be happy to debug your submission.
Regarding including the packages, one starting point would be to include/load the library/packages where one of their functions is invoked in the subsequent steps.
However, if you are preparing the zip file yourself, you are free to organize your code as you like. If the test.bat file works for you locally, it is expected to work during the evaluation as well. Can you share the submission ID where the submission worked locally for you but failed during evaluation?
You can include any file that you want in your submission.zip file. However, you might want to check this post, Question about data, on including the training data during evaluation.
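To make the trained-recipe idea above concrete, one possible shape is the following; the file and object names are only examples, and saveRDS/readRDS could equally be save()/load() with an .RData file:

# At training time: persist the trained recipe next to the model
# (names here are illustrative)
saveRDS(my_trained_recipe, "my_trained_recipe.rds")

# In the submission code: load it back and apply it to new data
preprocess_X_data <- function(x_raw) {
  require("recipes")
  rec <- readRDS("my_trained_recipe.rds")
  bake(rec, new_data = x_raw)
}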
@jeremiedb Yes, the fit_model code is not invoked during evaluation. But it would be good if you include the code used for your training so that it becomes easier for us to validate the submissions some time later; the fit_model function should be submitted along with your prediction code for that validation.
Thanks @jyotish. Following the above steps, the zip file submission now works.
Something I noticed from downloading my Colab submission is that there seems to be a reversal between claims and prices at the end of the predict.R file:
claims = predict_premium(trained_model, Xraw)
Not sure if this is a problem I introduced or if it is related to the template linked to the submit.sh utility.
On another note, the leaderboard seems to be based on my worst submission (id 110383) instead of the best or most recent ones (110384, 110389).
The important thing in the script is that we invoke the predict_premium function during the weekly leaderboard evaluation and predict_expected_claim for the RMSE leaderboard evaluation. We will fix the variable names.
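In other words, the intended assignments at the end of predict.R would read (a sketch of the corrected mapping):

# Corrected variable names (sketch)
claims <- predict_expected_claim(trained_model, Xraw)  # RMSE leaderboard
prices <- predict_premium(trained_model, Xraw)         # weekly leaderboard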
I thought that was rather cruel when I saw it! If only all the others that are rapidly catching my RMSE score could oblige and do the same. It would take the heat off me to find a better submission!