Tips on creating a successful submission

nigel_carpenter · December 19, 2020, 1:45pm

OK, so after a bit of struggle I’ve just managed to successfully submit a first model that goes a bit beyond the getting started notebooks. For the benefit of those who follow, I thought I’d share some observations, tips and tricks.

# 1 Start with Colab

I’m new to AIcrowd, but experienced in Kaggle. This competition reminds me a bit of the code-based Kaggle competitions which, if you are a beginner to data science, can seem a bit confusing and daunting to get started with.

If that’s how you feel then I’d suggest you start with the colab notebooks. You can make your first submission by just running and submitting the code in the notebook with no changes other than to add your API key following the instructions at the beginning of the notebook.

From there you can make a few simple changes to the fit_model routine to gain confidence. My mistake was to immediately jump in with multiple complex changes and data manipulation. No wonder I then kept tripping over my own coding errors for the next 8 hours!

# 2 If you start with Python don’t give up after the first submission!

[EDIT: This issue has now been fixed. Thanks, admins!]

My first preference is R… so my first submission was essentially the R getting started colab notebook. But as more people joined and made submissions I couldn’t understand why their RMSE was in the 1000s compared to my RMSE of 500 ish.

When I looked at the Python getting started notebook the answer became clear. The R and Python getting started notebooks are not consistent (which I think is a bit of a shame and an oversight which would be good for the organisers to fix). Basically the R notebook submits the observed mean from the training data (something like 114 per policy) whereas the Python notebook submits a prediction of 1000 per policy.
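
To see why the two baselines score so differently, here is a minimal sketch on made-up data. The claim frequency and severity below are assumptions chosen for illustration, not the competition data. Among all constant predictions, the observed mean gives the lowest RMSE, so a constant of 1000 is bound to score worse whenever the mean is far from 1000.

```r
# Synthetic, illustrative data only: most policies have no claim.
set.seed(42)
n_policies <- 10000
has_claim <- rbinom(n_policies, size = 1, prob = 0.1)                   # assumed claim frequency
claim_amount <- has_claim * rgamma(n_policies, shape = 2, scale = 500)  # assumed severity

rmse <- function(predicted, observed) sqrt(mean((predicted - observed)^2))

rmse_mean_model <- rmse(mean(claim_amount), claim_amount)  # baseline that predicts the mean
rmse_constant_1000 <- rmse(1000, claim_amount)             # baseline that predicts 1000

cat("RMSE, mean model:   ", rmse_mean_model, "\n")
cat("RMSE, constant 1000:", rmse_constant_1000, "\n")
```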

# 3 Using additional libraries in R colab

I struggled to correctly load additional libraries. In the end I worked out a solution by looking at the zip file submission method and the Python colab notebook.

So, for R users wishing to use additional libraries like data.table and xgboost: as per the instructions, add install.packages("data.table") to the install_packages function. The part that is unclear is where to add the calls that load the libraries.

I’ve found that I had to add them at the beginning of the fit_model function. For good measure I also added them to predict_expected_claim and preprocess_X_data, which I suspect is unnecessary but does no harm. I’ll come back and update when I confirm.

[EDIT: I’ve tried putting the library calls in the fit model function only but that caused submission errors. So until we get guidance from admins I’m sticking with adding them to all the key functions as above.]

# 4 Using zip file submission approach

I’ve not succeeded with this approach yet. This would be my preferred method and will be my focus now. Again I’ll come back and update when I get it working.

Good luck, everyone!

jyotish · December 19, 2020, 2:11pm

Hello @nigel_carpenter

This is a great post!

We also have code-based submission (zip file) examples for python and R at https://gitlab.aicrowd.com/aicrowd/insurance-pricing-game-starter-kit.

jeremiedb · December 19, 2020, 5:05pm

Regarding the zip file submission with R, the following error messages were returned in the submission page:

"DockerBuildError: The command '/bin/sh -c r -e \"`cat /tmp/install.R`\"' returned a non-zero code: 1"

and

" [bt] (6) /usr"

Although the install.R file in the .zip file only contains:

install.packages("data.table")
install.packages("xgboost")

The zip files had those files:

And running the following works locally:

Rscript predict.R training_data.csv output_predicted_claims.csv output_prices.csv

Any insight on some further considerations for zip file submission?

alfarzan · December 19, 2020, 5:17pm

Thanks @nigel_carpenter!

It’s awesome to see this kind of engagement.
Hopefully we’ve addressed some of the concerns above.

R and Python notebooks

I have gone ahead and made sure that both R and Python notebooks have exactly the same model implemented. So if you go to the notebooks, copy them, put in your API key and run the whole thing, for both you will get exactly the same predictions now.

R and Python ZIP submissions

We’ve made two changes (but give it an hour to update):

1. Default models. Like the notebooks, the files on the repo now both have the same model (the mean model) implemented by default, so that you can try a very first submission quickly.
2. New test files. We’ve added test.sh/.bat files to make generating CSVs for your own testing easier with the zip submissions.

jyotish · December 19, 2020, 5:18pm

Hello @jeremiedb

DockerBuildError means that we were not able to install your dependencies.

The second error is a bit misleading. This is the complete traceback.

Error in doTryCatch(return(expr), name, parentenv, handler) :
  [16:58:16] ./include/xgboost/json.h:65: Invalid cast, from Null to Array
Stack trace:
  [bt] (0) /usr/local/lib/R/site-library/xgboost/libs/xgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x73) [0x7f88839d4df3]
  [bt] (1) /usr/local/lib/R/site-library/xgboost/libs/xgboost.so(xgboost::JsonArray const* xgboost::Cast<xgboost::JsonArray const, xgboost::Value>(xgboost::Value*)+0x25c) [0x7f8883a6a05c]
  [bt] (2) /usr/local/lib/R/site-library/xgboost/libs/xgboost.so(xgboost::RegTree::LoadModel(xgboost::Json const&)+0x3ab) [0x7f88839f94bb]
  [bt] (3) /usr/local/lib/R/site-library/xgboost/libs/xgboost.so(xgboost::gbm::GBTreeModel::LoadModel(xgboost::Json const&)+0x3d9) [0x7f88839f8c89]
  [bt] (4) /usr/local/lib/R/site-library/xgboost/libs/xgboost.so(xgboost::gbm::GBTree::LoadModel(xgboost::Json const&)+0x183) [0x7f88839f8fe3]
  [bt] (5) /usr/local/lib/R/site-library/xgboost/libs/xgboost.so(xgboost::LearnerIO::LoadModel(xgboost::Json const&)+0x4d1) [0x7f8883abea81]
  [bt] (6) /usr

nigel_carpenter · December 19, 2020, 6:04pm

Hi @jyotish I hit the same [bt] (6) error as @jeremiedb earlier in the day. I too was loading an R script that had a dependency on xgboost. The code was running fine locally and was essentially the same as my earlier successful colab script.

Would you agree that the error looks like an issue in your zip submission environment? Maybe a problem loading or compiling the xgboost package?

I suspect xgboost (or lightGBM & CATBOOST) will be popular packages for this competition. Perhaps you could amend the getting started notebooks to a stage where they at least install and load these libraries?

I ask that because I suspect it will be easier for you to fix than it is for us users to troubleshoot. We don’t have the same experience in your setup or access to the backend messages and infrastructure.

Must add that I greatly appreciate that you are online and responding to queries!

mohanty · December 19, 2020, 6:21pm

@jyotish: Can we basically have a custom base image where all the common dependencies are already installed?
It will hopefully both reduce the image build times and help us avoid the back and forth on many such dependency-related issues.

jyotish · December 19, 2020, 6:28pm

@mohanty @nigel_carpenter

We are already preparing a base image for R that has a large number of packages pre-installed. The package installation is taking very long. We will make an announcement as soon as things are ready with the pre-installed package list.

jyotish · December 19, 2020, 6:52pm

@nigel_carpenter

These packages will be pre-installed for R based submissions: Packages available in base environment for R

Let us know if you think we need to add any other package.

mangoloco69 · December 19, 2020, 7:48pm

Maybe I missed something, but how about e.g. data.table, dplyr?
Guessing you intentionally left them out, not seeing them as “base image”?
Thanks

jyotish · December 19, 2020, 7:58pm

Hello @mangoloco69

They were not left out intentionally. We basically installed the packages from https://cran.r-project.org/web/views/MachineLearning.html.

You can still specify the packages you want to install, either in your install.R file if you are creating the zip files, or in the install_packages function if you are using the colab notebook.

dplyr and data.table seem to take only a few seconds to install on top of the base image. Can you try including them from your side?
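
One way to add packages on top of the base image without reinstalling what is already there is sketched below. This is an assumption-laden example, not the starter kit's own code: install_if_missing is a hypothetical helper and the CRAN mirror URL is just a common default. Because install.packages() reports a failed installation as a warning rather than an error, the sketch checks again afterwards and stops explicitly.

```r
# Hypothetical helper for install.R: install only what the environment lacks,
# then stop with an error if anything is still missing.
install_if_missing <- function(pkgs, repos = "https://cloud.r-project.org") {
  is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

  to_install <- pkgs[!vapply(pkgs, is_installed, logical(1))]
  if (length(to_install) > 0) {
    install.packages(to_install, repos = repos)
  }

  # install.packages() only warns on failure, so verify explicitly.
  still_missing <- pkgs[!vapply(pkgs, is_installed, logical(1))]
  if (length(still_missing) > 0) {
    stop("Failed to install: ", paste(still_missing, collapse = ", "), call. = FALSE)
  }
  invisible(to_install)
}

# In install.R you might then call, for example:
# install_if_missing(c("data.table", "dplyr"))
```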

mangoloco69 · December 19, 2020, 8:07pm

I will include them from my side, no problem. Heavy packages are more important from your side, indeed. Thanks

jeremiedb · December 20, 2020, 6:33pm

@jyotish I’ve been unsuccessful giving another shot at the R zip submission following the update at:

It resulted in the same error message, although running the test.bat works fine.
The only thing I noticed was that I had to add the line set WEEKLY_EVALUATION=false before the first call to predict in order to generate the claims.csv file.

Otherwise, I can’t see what could be wrong. Packages are installed in install.R and loaded where indicated in model.R (even tried to load them within the fit functions which was needed in the Colab submissions - still no success). Have you performed a test on the template to validate it works when using packages? It would be useful to get a working example of scripts that use packages.
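
For anyone testing from an R session rather than a batch file, the WEEKLY_EVALUATION switch mentioned above can be set and read as sketched here. The helper name is_weekly_evaluation is hypothetical; the default of "false" when the variable is unset mirrors how the predict.R script quoted later in this thread reads it.

```r
# Read the switch with a default of "false" when the variable is unset.
is_weekly_evaluation <- function() {
  identical(Sys.getenv("WEEKLY_EVALUATION", "false"), "true")
}

Sys.unsetenv("WEEKLY_EVALUATION")
default_mode <- is_weekly_evaluation()         # variable not set at all

Sys.setenv(WEEKLY_EVALUATION = "true")
weekly_mode <- is_weekly_evaluation()          # weekly evaluation requested

Sys.setenv(WEEKLY_EVALUATION = "false")        # same effect as `set WEEKLY_EVALUATION=false`
explicit_false_mode <- is_weekly_evaluation()
```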

nigel_carpenter · December 20, 2020, 7:43pm

@jeremiedb one idea that may be worth trying… I downloaded one of my colab submissions and noticed I got a zip file that seems to look very similar to the format that the zip file submission needs to be in.

Made me wonder if you can then successfully submit this zip file? If yes; can you, through inspection, work out what the contents of the zip need to be like to create a successful zip file submission?

jeremiedb · December 21, 2020, 12:49am

Good hint!
I finally managed to get a R zip submission to work.

I’m unsure which moving part was critical, but having all of the following does seem to work:

config.json

{"language": "r"}

install.R

install_packages <- function() {
  install.packages("data.table")
  install.packages("xgboost")
}

install_packages()

And then have the various functions split into separate files, as listed in the source calls in the predict.R file:

source("fit_model.R")  # Load your code.
source("load_model.R")
source("predict_expected_claim.R")
source("predict_premium.R")
source("preprocess_X_data.R")

# This script accepts optional command-line arguments for (1) the dataset, (2) the claims output file and (3) the prices output file.
output_dir = Sys.getenv('OUTPUTS_DIR', '.')
input_dataset = Sys.getenv('DATASET_PATH', 'training_data.csv')  # The default value.
output_claims_file = paste(output_dir, 'claims.csv', sep = '/')  # The file where the expected claims should be saved.
output_prices_file = paste(output_dir, 'prices.csv', sep = '/')  # The file where the prices should be saved.
model_output_path = 'trained_model.RData'

args = commandArgs(trailingOnly=TRUE)

if(length(args) >= 1){
  input_dataset = args[1]
}
if(length(args) >= 2){
  output_claims_file = args[2]
}
if(length(args) >= 3){
  output_prices_file = args[3]
}

# Load the dataset.
# Remove the claim_amount column if it is in the dataset.
Xraw = read.csv(input_dataset)

if('claim_amount' %in% colnames(Xraw)){
  Xraw = within(Xraw, rm('claim_amount'))
}


# Load the saved model, and run it.
trained_model = load_model(model_output_path)

if(Sys.getenv('WEEKLY_EVALUATION', 'false') == 'true') {
  prices = predict_premium(trained_model, Xraw)
  write.table(x = prices, file = output_prices_file, row.names = FALSE, col.names=FALSE, sep = ",")
} else {
  claims = predict_expected_claim(trained_model, Xraw)
  write.table(x = claims, file = output_claims_file, row.names = FALSE, col.names=FALSE, sep = ",")
}

simon_coulombe · December 21, 2020, 3:34am

I have a couple questions if you guys haven’t given up yet :

I’m not sure where to put the library() calls.

Also, my preprocessing involves a recipes::recipe() that wrangles the data and creates the dummy variables. I’d need to attach that ‘trained recipe’ to the submission, or re-train it every time from the original “Training.csv”. Is it possible to attach more files to the submissions, like a my_trained_recipe.Rdata file?

cheers

jeremiedb · December 21, 2020, 3:59am

For the zip submission, I went for the shotgun approach and likely loaded the libraries in too many places, but at least the following does work. I added require() / library() right after the beginning of each of the functions preprocess_X_data, predict_expected_claim, predict_premium and fit_model, for example:

preprocess_X_data <- function (x_raw){
  require("data.table")
  require("xgboost")
  ...

Note that for the zip submission there is no need to include fit_model.R; it really seems like all that is necessary are the files invoked by predict.R (so the model is loaded directly from trained_model.RData).
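
If repeating the require() calls in every function feels fragile, one alternative for a hand-built zip is a small loader that fails with a readable message when a package is absent. This is only a sketch: load_required and MODEL_PACKAGES are hypothetical names, and whether a helper defined outside the template functions is carried over in a Colab submission has not been confirmed in this thread.

```r
# Hypothetical helper: attach each package, or stop with a readable message.
load_required <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Required package is not installed: ", pkg, call. = FALSE)
    }
    suppressPackageStartupMessages(library(pkg, character.only = TRUE))
  }
  invisible(pkgs)
}

# Packages named in this thread; adjust to your own model.
MODEL_PACKAGES <- c("data.table", "xgboost")

preprocess_X_data <- function(x_raw) {
  load_required(MODEL_PACKAGES)
  x_raw  # placeholder: real preprocessing would go here
}
```

The same first line would go at the top of predict_expected_claim, predict_premium and fit_model.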

jyotish · December 21, 2020, 4:37am

Hello @nigel_carpenter

Indeed. When you try to submit via colab notebook, we are essentially creating a zip file and submitting that zip file to AIcrowd. For both python and R based submissions, you can see a directory submission_dir and submission.zip file on your colab notebook. The submission.zip file is the exact file that is submitted.

jyotish · December 21, 2020, 4:41am

Hello @jeremiedb

We did test the starter kit with a few packages installed. If you are still facing this issue, we would be happy to debug your submission.

Regarding including the packages, one starting point would be to load each library/package at the point where one of its functions is invoked in the subsequent steps.

However, if you are preparing the zip file yourself, you are free to organize your code as you like. If the test.bat file works for you locally, it is expected to work during the evaluation as well. Can you share the submission ID where the submission worked locally for you but failed during evaluation?
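
When preparing the zip by hand, a quick local pre-flight check may catch a file that predict.R sources but that was left out of the folder. The function below is a hypothetical helper, not part of the starter kit, and it only recognises simple source("file.R") calls with a literal file name.

```r
# Hypothetical pre-flight check: return the files that a script sources
# but that do not exist next to it.
missing_sourced_files <- function(script_path) {
  script_lines <- readLines(script_path, warn = FALSE)
  pattern <- "^[[:space:]]*source\\([\"']([^\"']+)[\"']\\)"
  matches <- regmatches(script_lines, regexec(pattern, script_lines))

  # Keep only lines that matched; element 2 is the captured file name.
  sourced <- vapply(Filter(length, matches), function(m) m[2], character(1))
  sourced_paths <- file.path(dirname(script_path), sourced)
  sourced[!file.exists(sourced_paths)]
}

# Example call from the submission folder:
# missing_sourced_files("predict.R")
```

An empty result means every sourced file was found; it says nothing about whether the packages those files need are installed.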

jyotish · December 21, 2020, 4:46am

Hello @simon_coulombe

You can include any file that you want in your submission.zip file. However, you might want to check this post, Question about data, on including the training data during evaluation.
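
As a rough illustration of shipping preprocessing state alongside the model, the sketch below stores both in a single file with saveRDS() and reads them back with readRDS(). The function names, the file name model_bundle.rds and the toy data are all hypothetical. Since saveRDS() serialises arbitrary R objects, a trained recipe should in principle be storable the same way, though that is not verified here.

```r
# Hypothetical sketch: keep the fitted preprocessing object next to the model
# in one file, so prediction never needs the training data.
save_model_bundle <- function(model, preprocessor, path) {
  saveRDS(list(model = model, preprocessor = preprocessor), file = path)
  invisible(path)
}

load_model_bundle <- function(path) {
  bundle <- readRDS(path)
  stopifnot(all(c("model", "preprocessor") %in% names(bundle)))
  bundle
}

# Toy stand-ins: a mean model and the factor levels learned at training time.
training <- data.frame(
  fuel = c("petrol", "diesel", "petrol"),
  claim_amount = c(0, 300, 0),
  stringsAsFactors = FALSE
)
toy_model <- list(mean_claim = mean(training$claim_amount))
toy_preprocessor <- list(fuel_levels = sort(unique(training$fuel)))

bundle_path <- file.path(tempdir(), "model_bundle.rds")  # hypothetical file name
save_model_bundle(toy_model, toy_preprocessor, bundle_path)
restored <- load_model_bundle(bundle_path)
```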

@jeremiedb Yes, the fit_model is not needed for evaluation ~~and is optional to submit. But it would be good if you can include the code used for your training so that it becomes easier for us to validate the submissions some time later.~~
But the fit_model function should be submitted along with your prediction code for our validation.