It’s awesome to see this kind of engagement
Hopefully we’ve addressed some of the concerns above.
R and Python notebooks
I have gone ahead and made sure that the R and Python notebooks implement exactly the same model. So if you open either notebook, copy it, put in your API key and run the whole thing, you will now get exactly the same predictions from both.
R and Python ZIP submissions
We’ve made two changes (but give it an hour to update):
Default models. Like the notebooks, the files in the repo now both implement the same default model (the mean model), so that you can try a first submission quickly; see the sketch after this list.
New test files. We’ve added test.sh/.bat files to make it easier to generate CSVs for your own testing with the zip submissions.
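For anyone wondering what the default “mean model” amounts to, here is a minimal sketch using the function names from the starter template (the actual template code may differ in its details):

fit_model <- function(x_raw, y_raw) {
  # The "model" is just the average observed claim
  list(mean_claim = mean(y_raw))
}

predict_expected_claim <- function(model, x_raw) {
  # Same prediction for every policy
  rep(model$mean_claim, nrow(x_raw))
}

predict_premium <- function(model, x_raw) {
  # Charge the expected claim as-is (no loading margin in this sketch)
  predict_expected_claim(model, x_raw)
}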
Hi @jyotish, I hit the same [bt] (6) error as @jeremiedb earlier in the day. I too was loading an R script that had a dependency on xgboost. The code ran fine locally and was essentially the same as my earlier successful Colab script.
Would you agree that the error looks like an issue in your zip submission environment? Maybe a problem loading or compiling the xgboost package?
I suspect xgboost (or LightGBM and CatBoost) will be popular packages for this competition. Perhaps you could amend the getting-started notebooks to a stage where they at least install and load these libraries (something like the sketch below)?
I ask that because I suspect it will be easier for you to fix than it is for us users to troubleshoot. We don’t have the same experience in your setup or access to the backend messages and infrastructure.
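To make the request concrete, install.R might gain lines along these lines; this is purely a hypothetical sketch, and CatBoost in particular is not on CRAN:

# Hypothetical additions to install.R (assuming the packages are available
# on CRAN; catboost is not and ships its own R installation route)
install.packages(c("xgboost", "lightgbm"), repos = "https://cloud.r-project.org")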
Must add that I greatly appreciate that you are online and responding to queries!
@jyotish: Can we basically have a custom base image where all the common dependencies are already installed?
That would hopefully both reduce the image build times and help us avoid a lot of back-and-forth on dependency-related issues.
We are already preparing a base image for R that has a large number of packages pre-installed. The package installation is taking very long. We will make an announcement, along with the pre-installed package list, as soon as things are ready.
@jyotish I gave the R zip submission another shot following the update, without success: it resulted in the same error message, although running test.bat locally works fine.
The only thing I could note was that I had to add the line set WEEKLY_EVALUATION=false prior to the first call to predict in order to generate the claims.csv file.
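Presumably predict.R branches on that environment variable along these lines; this is just my reading of the template, not the actual code, and the file names other than claims.csv are illustrative:

# Sketch of how predict.R presumably branches on WEEKLY_EVALUATION
# (assumed structure; the real template may differ)
weekly <- tolower(Sys.getenv("WEEKLY_EVALUATION", "true")) == "true"
if (weekly) {
  prices <- predict_premium(trained_model, Xraw)
  write.csv(data.frame(premium = prices), "prices.csv", row.names = FALSE)
} else {
  claims <- predict_expected_claim(trained_model, Xraw)
  write.csv(data.frame(claims = claims), "claims.csv", row.names = FALSE)
}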
Otherwise, I can’t see what could be wrong. Packages are installed in install.R and loaded where indicated in model.R (I even tried loading them within the fit functions, which was needed for the Colab submissions - still no success). Have you tested the template to validate that it works when packages are used? It would be useful to have a working example of scripts that use packages.
@jeremiedb one idea that may be worth trying… I downloaded one of my Colab submissions and noticed that the zip file I got looks very similar to the format the zip submission needs to be in.
Made me wonder: can you successfully submit this zip file? If yes, can you, through inspection, work out what the contents of the zip need to look like to create a successful zip submission?
I have a couple of questions, if you guys haven’t given up yet:
I’m not sure where to put the library() calls.
Also, my preprocessing involves a recipes::recipe() that wrangles the data and creates the dummy variables. I’d need to attach that “trained recipe” to the submission, or re-train it every time from the original “Training.csv”. Is it possible to attach more files to the submission, like a my_trained_recipe.Rdata file?
For the zip submission, I went for the shotgun approach and likely loaded the libraries in too many places, but at least the following does work. I added require()/library() right after the beginning of each of the functions preprocess_X_data, predict_expected_claim, predict_premium and fit_model, for example:
preprocess_X_data <- function(x_raw) {
  # Load dependencies inside the function so they are also attached
  # in the evaluation environment
  require("data.table")
  require("xgboost")
  ...
}
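As a side note on require() versus library(): require() only returns FALSE with a warning when a package is missing, whereas library() throws an error immediately, so library() will surface a forgotten install.R entry more loudly.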
Note that for the zip submission there is no need to include fit_model.R; it really seems that all that is necessary are the files invoked by predict.R (so the model is loaded directly from trained_model.RData).
Indeed. When you submit via the Colab notebook, we are essentially creating a zip file and submitting that zip file to AIcrowd. For both Python and R based submissions, you can see a submission_dir directory and a submission.zip file on your Colab notebook. The submission.zip file is the exact file that is submitted.
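If you want to inspect that archive from R itself, base R is enough (the path is illustrative; point it at wherever Colab left the file):

# List the contents of the generated archive without extracting it;
# utils::zip() can build an equivalent archive from your own files
unzip("submission.zip", list = TRUE)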
We did test the starter kit with a few packages installed. If you are still facing this issue, we would be happy to debug your submission.
Regarding including the packages, one starting point would be to include/load the library/packages where one of their functions is invoked in the subsequent steps.
However, if you are preparing the zip file yourself, you are free to organize your code as you like. If the test.bat file works for you locally, it is expected to work during the evaluation as well. Can you share the submission ID where the submission worked locally for you but failed during evaluation?
You can include any file that you want in your submission.zip file. However, you might want to check this post, Question about data, on including the training data during evaluation.
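To make the trained-recipe idea above concrete, one possible shape is the following; the file and object names are only examples, and saveRDS/readRDS could equally be save()/load() with an .RData file:

# At training time: persist the trained recipe next to the model
# (names here are illustrative)
saveRDS(my_trained_recipe, "my_trained_recipe.rds")

# In the submission code: load it back and apply it to new data
preprocess_X_data <- function(x_raw) {
  require("recipes")
  rec <- readRDS("my_trained_recipe.rds")
  bake(rec, new_data = x_raw)
}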
@jeremiedb Yes, the fit_model code is not invoked during evaluation. But it would be good if you include the code used for your training so that it becomes easier for us to validate the submissions some time later; the fit_model function should be submitted along with your prediction code for that validation.
Thanks @jyotish. Following the above steps, the zip file submission now works.
Something I noticed from downloading my Colab submission is that there seems to be a reversal between claims and prices at the end of the predict.R file:
claims = predict_premium(trained_model, Xraw)
Not sure if this is a problem I introduced or if it is related to the template linked to the submit.sh utility.
On another note, the leaderboard seems to be based on my worst submission (id 110383) instead of the best or most recent ones (110384, 110389).
The important thing in the script is that we invoke the predict_premium function during the weekly leaderboard evaluation and predict_expected_claim for the RMSE leaderboard evaluation. We will fix the variable names.
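In other words, the intended assignments at the end of predict.R would read (a sketch of the corrected mapping):

# Corrected variable names (sketch)
claims <- predict_expected_claim(trained_model, Xraw)  # RMSE leaderboard
prices <- predict_premium(trained_model, Xraw)         # weekly leaderboard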
I thought that was rather cruel when I saw it! If only all the others that are rapidly catching my RMSE score could oblige and do the same. It would take the heat off me to find a better submission!