🚀 R Starter Kit

jyotish · December 18, 2020, 11:48am

Notebook

How to use this notebook 📝

Copy the notebook. This is a shared template and any edits you make here will not be saved. You should copy it into your own drive folder. For this, click the "File" menu (top-left), then "Save a Copy in Drive". You can edit your copy however you like.
Link it to your AICrowd account. In order to submit your code to AICrowd, you need to provide your account's API key (see "Configure static variables" for details).
Stick to the function definitions. The submission to AICrowd will look for the pre-defined function names:
- install_packages
- fit_model
- save_model
- load_model
- predict_expected_claim
- predict_premium
- preprocess_X_data
  
  Anything else you write outside of these functions will not be part of the final submission (including constants and utility functions), so make sure everything is defined within them, except for:

Your pricing model 🕵️

In this notebook, you can play with the data, and define and train your pricing model. You can then directly submit it to the AICrowd server, with some magic code at the end.

Prepare the notebook 🛠

In [ ]:

cat(system('curl -sL https://gitlab.aicrowd.com/jyotish/pricing-game-notebook-scripts/raw/r-functions/r/setup.sh > setup.sh && bash setup.sh', intern=TRUE), sep='\n')
source("aicrowd_helpers.R")

⚙️ Installing AIcrowd utilities...
✅ Installed AIcrowd utilities
💾 Downloading training data...
✅ Downloaded training data

Configure static variables 📎

In order to submit using this notebook, you must visit this URL https://aicrowd.com/participants/me and copy your API key.

Then set AICROWD_API_KEY to that key.

In [ ]:

TRAINING_DATA_PATH = 'training.csv'
MODEL_OUTPUT_PATH = 'trained_model.RData'  # Alter if not using .RData files
AICROWD_API_KEY = ''  # You can get the key from https://aicrowd.com/participants/me
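
If you would rather not paste the key into a notebook that you might share, one option is to read it from an environment variable. The following is only a sketch: the helper name read_api_key and the variable name DEMO_AICROWD_KEY are made up for illustration, and the placeholder value is not a real key.

```r
# Read an API key from an environment variable instead of hard-coding it.
# Assumption: read_api_key and DEMO_AICROWD_KEY are hypothetical names chosen
# for this example; they are not part of the AIcrowd helper scripts.
read_api_key <- function(variable) {
  key <- Sys.getenv(variable, unset = "")
  if (!nzchar(key)) {
    stop("API key not found: set ", variable, " before running this cell.")
  }
  key
}

# Demonstration with a placeholder value.
Sys.setenv(DEMO_AICROWD_KEY = "placeholder-key")
demo_key <- read_api_key("DEMO_AICROWD_KEY")
```

The resulting string can then be assigned to AICROWD_API_KEY in the cell above.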

Download dataset files 💾

In [ ]:

download_aicrowd_dataset(AICROWD_API_KEY)

Packages 🗃

Install and require here all the packages you need to define your model.

Note: Installing packages the first time might take some time.

In [ ]:

install_packages <- function() {
  # install.packages("caret")
  # install.packages("rpart")
}
install_packages()

In [ ]:

global_imports <- function() {
  # require("caret")
  # require("rpart")
}
global_imports()

NULL
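
As a rough illustration of how the two functions can be filled in, the sketch below installs only the packages that are missing and then attaches each one. The package list is a placeholder ("stats" and "utils" ship with R, so nothing is downloaded here); replace it with the packages your model needs. The list is repeated inside each function because constants defined outside the pre-defined functions are not part of the submission.

```r
install_packages <- function() {
  # Placeholder list: swap in the packages your model needs, e.g. "rpart".
  pkgs <- c("stats", "utils")
  # requireNamespace() returns FALSE (rather than failing) if a package is absent.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!installed]
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

global_imports <- function() {
  pkgs <- c("stats", "utils")
  for (pkg in pkgs) {
    # library() stops with an error if a package is missing,
    # whereas require() only returns FALSE with a warning.
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

install_packages()
global_imports()
```

Using library() instead of require() is a design choice: a missing package then fails loudly at load time instead of surfacing later as a "could not find function" error.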

Loading the data 📲

In [ ]:

# Load the dataset.
train_data = read.csv(TRAINING_DATA_PATH)

# Create a model, train it, then save it.
Xdata = within(train_data, rm('claim_amount'))
ydata = train_data['claim_amount']

What does the data look like? 🔍

In [ ]:

as.matrix(head(Xdata, 4))

In [ ]:

as.matrix(head(ydata, 4))
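
Beyond the first few rows, it can help to check how many contracts have no claim at all, since most entries of claim_amount are 0. A small sketch on made-up numbers (toy_ydata stands in for ydata; the values are invented for illustration):

```r
# Toy stand-in for ydata; the real values come from training.csv.
toy_ydata <- data.frame(claim_amount = c(0, 0, 0, 250, 0, 0, 1200, 0, 0, 0))
toy_claims <- toy_ydata$claim_amount

# Fraction of contracts without any claim.
share_without_claim <- mean(toy_claims == 0)

# Distribution of the claims that did occur.
positive_claims <- toy_claims[toy_claims > 0]
print(share_without_claim)
print(summary(positive_claims))
```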

Training the model 🚀

Start by defining the first function: fit_model. This function takes the training data as arguments and outputs a "model" object that you define as you wish. For instance, this could be an array of parameter values.

You may want to define the function preprocess_X_data that prepares and cleans your predictor variables for the training and prediction.

Define your data preprocessing

You can add any class or function in this cell for preprocessing. Just make sure that you use the functions here in the fit_model, predict_expected_claim and predict_premium functions if necessary.

In [ ]:

preprocess_X_data <- function (x_raw){
	# Data preprocessing function: given X_raw, clean the data for training or prediction.

	# Parameters
	# ----------
	# X_raw : Dataframe, with the columns described in the data dictionary.
	# 	Each row is a different contract. This data has not been processed.

	# Returns
	# -------
	# A cleaned / preprocessed version of the dataset

  # YOUR CODE HERE ------------------------------------------------------
  
  
  # ---------------------------------------------------------------------
  return(x_raw) # change this to return the cleaned data
}
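
One possible way to fill in the placeholder is sketched below: missing numeric values are replaced by the column median, and text columns are converted to factors. The column names driver_age and vehicle_type are hypothetical; use the names from the data dictionary. Note that the medians are recomputed on whatever data is passed in, so a more careful version would store the training medians in the model object and reuse them at prediction time.

```r
preprocess_X_data <- function(x_raw) {
  x_clean <- x_raw
  for (col in names(x_clean)) {
    values <- x_clean[[col]]
    if (is.numeric(values)) {
      # Replace missing numeric values with the median of the observed ones.
      values[is.na(values)] <- median(values, na.rm = TRUE)
    } else if (is.character(values)) {
      # Treat text columns as categorical variables.
      values <- factor(values)
    }
    x_clean[[col]] <- values
  }
  return(x_clean)
}

# Toy data with hypothetical column names.
toy_raw <- data.frame(
  driver_age = c(25, NA, 40, 31),
  vehicle_type = c("car", "van", "car", "car"),
  stringsAsFactors = FALSE
)
toy_clean <- preprocess_X_data(toy_raw)
```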

Define the training logic

In [ ]:

fit_model <- function (x_raw, y_raw){
	# Model training function: given training data (X_raw, y_raw), train this pricing model.

	# Parameters
	# ----------
	# X_raw : Dataframe, with the columns described in the data dictionary.
	# 	Each row is a different contract. This data has not been processed.
	# y_raw : an array, with the value of the claims, in the same order as contracts in X_raw.
	# 	A one dimensional array, with values either 0 (most entries) or >0.

	# Returns
	# -------
	# trained_model : the fitted model object, defined however you wish.

	
  # This function trains your models and returns the trained model.
  
  # YOUR CODE HERE ------------------------------------------------------

  # x_clean = preprocess_X_data(x_raw)  # preprocess your data before fitting

  trained_model = lm(unlist(y_raw) ~ 1) # toy linear model fitted on the y_raw argument
  
  # ---------------------------------------------------------------------
  # The result trained_model is something that you will save in the next section
  return(trained_model)
}

In [ ]:

model = fit_model(Xdata, ydata)
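
To go one step beyond the intercept-only toy model, the sketch below regresses the claim amount on a single predictor. The function name fit_model_sketch and the column driver_age are hypothetical; in a submission the body would go inside fit_model and use columns from the data dictionary. Keep in mind that a plain linear model can return negative predictions, whereas the predicted claims must be positive.

```r
fit_model_sketch <- function(x_raw, y_raw) {
  # Combine predictors and target so lm() can refer to columns by name.
  train <- data.frame(x_raw, claim_amount = unname(unlist(y_raw)))
  # `driver_age` is a hypothetical column used only for this illustration.
  lm(claim_amount ~ driver_age, data = train)
}

# Made-up data in the same shape as Xdata and ydata.
toy_X <- data.frame(driver_age = c(22, 35, 47, 58, 63, 29))
toy_y <- data.frame(claim_amount = c(900, 0, 0, 150, 0, 400))

toy_model <- fit_model_sketch(toy_X, toy_y)
toy_predictions <- predict(toy_model, newdata = toy_X)
```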

Saving your model

You can save your model to a file here, so you don't need to retrain it every time.

In [ ]:

save_model <- function(model, output_path){
  # Saves this trained model to a file.

  # This is used to save the model after training, so that it can be used for prediction later.

  # Do not touch this unless necessary (if you need specific features). If you do, do not
  #  forget to update the load_model method to be compatible.
	
  # Save in `trained_model.RData`.

  save(model, file=output_path)
}

In [ ]:

save_model(model, MODEL_OUTPUT_PATH)

If you need to load it from file, you can use this code:

In [ ]:

load_model <- function(model_path){ 
 # Load a saved trained model from the file `trained_model.RData`.

 #    This is called by the server to evaluate your submission on hidden data.
 #    Only modify this *if* you modified save_model.

  load(model_path)
  return(model)
}

In [ ]:

model = load_model(MODEL_OUTPUT_PATH)
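
A quick way to check that save_model and load_model stay compatible is a round trip: save a model, load it back, and compare predictions. The sketch below repeats the two functions so that it runs on its own, and uses R's built-in cars dataset and a temporary file rather than the competition data. It works because save() stores the object under its variable name, model, which is the name load_model returns after load().

```r
save_model <- function(model, output_path) {
  save(model, file = output_path)
}

load_model <- function(model_path) {
  # load() restores the object under the name it was saved with: `model`.
  load(model_path)
  return(model)
}

# Round trip on a small model fitted to the built-in `cars` dataset.
toy_model <- lm(dist ~ speed, data = cars)
toy_path <- tempfile(fileext = ".RData")
save_model(toy_model, toy_path)
restored_model <- load_model(toy_path)

round_trip_ok <- isTRUE(all.equal(
  predict(toy_model, newdata = cars),
  predict(restored_model, newdata = cars)
))
```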

Predicting the claims 💵

The second function, predict_expected_claim, takes your trained model and a dataframe of contracts, and outputs a prediction for the (expected) claim incurred by each contract. This expected claim can be seen as the probability of an accident multiplied by the cost of that accident.

This is the function used to compute the RMSE leaderboard, where the model best able to predict claims wins.

In [ ]:

predict_expected_claim <- function(model, x_raw){
	# Model prediction function: predicts the average claim based on the pricing model.

	# This function estimates the expected claim made by a contract (typically, as the product
	# of the probability of having a claim multiplied by the average cost of a claim if it occurs),
	# for each contract in the dataset X_raw.

	# This is the function used in the RMSE leaderboard, and hence the output should be as close
	# as possible to the expected cost of a contract.

	# Parameters
	# ----------
	# X_raw : Dataframe, with the columns described in the data dictionary.
	# 	Each row is a different contract. This data has not been processed.

	# Returns
	# -------
	# avg_claims: a one-dimensional array of the same length as X_raw, with one
	#     average claim per contract (in same order). These average claims must be POSITIVE (>0).


  # YOUR CODE HERE ------------------------------------------------------

  # x_clean = preprocess_X_data(x_raw)  # preprocess your data before predicting
  expected_claims = predict(model, newdata = x_raw)  # tweak this to work with your model

  return(expected_claims)  
}

In [ ]:

claims <- predict_expected_claim(model, Xdata)
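
The decomposition mentioned above, probability of a claim multiplied by the average cost of a claim, can be written out in its simplest form as follows. This is a sketch on made-up numbers with hypothetical function names (fit_frequency_severity, predict_frequency_severity); it ignores the predictors entirely and gives every contract the same expected claim.

```r
fit_frequency_severity <- function(y_raw) {
  y <- unname(unlist(y_raw))
  list(
    claim_probability = mean(y > 0),   # share of contracts with a claim
    mean_severity = mean(y[y > 0])     # average size of a claim, given one occurred
  )
}

predict_frequency_severity <- function(model, x_raw) {
  # Expected claim = probability of a claim * average cost of a claim.
  rep(model$claim_probability * model$mean_severity, nrow(x_raw))
}

# Made-up data: 2 claims out of 10 contracts, average claim of 500.
toy_X <- data.frame(driver_age = c(21, 34, 45, 52, 60, 27, 38, 49, 55, 66))
toy_y <- data.frame(claim_amount = c(0, 0, 0, 400, 0, 0, 0, 600, 0, 0))

toy_fs_model <- fit_frequency_severity(toy_y)
toy_expected <- predict_frequency_severity(toy_fs_model, toy_X)
```

A real model would let both components depend on the predictors, for example with one model for claim occurrence and another for claim size.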

Pricing contracts 💰

The third and final function, predict_premium, takes your trained model and a dataframe of contracts, and outputs a price for each of these contracts. You are free to set these prices however you want! These prices will then be used in competition with other models: contracts will choose the model offering the lowest price, and this model will have to pay the cost if an accident occurs.

This is the function used to compute the profit leaderboard: your model will participate in many markets of size 10, populated by other participants' models, and we compute the average profit of your model over all the markets it participated in.

In [ ]:

predict_premium <- function(model, x_raw){
  # Model prediction function: predicts premiums based on the pricing model.

  # This function outputs the prices that will be offered to the contracts in X_raw.
  # premium will typically depend on the average claim predicted in 
  # predict_expected_claim, and will add some pricing strategy on top.

  # This is the function used in the average profit leaderboard. Prices output here will
  # be used in competition with other models, so feel free to use a pricing strategy.

  # Parameters
  # ----------
  # X_raw : Dataframe, with the columns described in the data dictionary.
  # 	Each row is a different contract. This data has not been processed.

  # Returns
  # -------
  # prices: a one-dimensional array of the same length as X_raw, with one
  #     price per contract (in same order). These prices must be POSITIVE (>0).


  # YOUR CODE HERE ------------------------------------------------------

  # x_clean = preprocess_X_data(x_raw)  # preprocess your data before predicting

  return(predict_expected_claim(model, x_raw))
}

In [ ]:

prices <- predict_premium(model, Xdata)
as.matrix(head(prices))

A matrix: 6 × 1 of type dbl
1	114.1812
2	114.1812
3	114.1812
4	114.1812
5	114.1812
6	114.1812
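
Returning the expected claim as the price leaves no margin. A very simple pricing strategy, sketched below, adds a proportional loading and enforces a minimum price so that every price stays positive. The helper name apply_loading, the 15% loading and the minimum price of 1 are arbitrary choices for illustration; in a submission this logic would have to live inside predict_premium.

```r
apply_loading <- function(expected_claims, loading = 0.15, minimum_price = 1) {
  # Add a proportional margin on top of the expected claim.
  prices <- expected_claims * (1 + loading)
  # Enforce a floor so that no price is zero or negative.
  pmax(prices, minimum_price)
}

toy_expected_claims <- c(100, 0, 200)
toy_prices <- apply_loading(toy_expected_claims)
```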

Profit on training data

In order for your model to be considered in the profit competition, it needs to make nonnegative profit over its training set. You can check that your model satisfies this condition below:

In [ ]:

print(paste('Income:', sum(prices)))
print(paste('Losses:', sum(ydata)))

if (sum(prices) < sum(ydata)) {
    print('Your model loses money on the training data! It does not satisfy market rule 1: Non-negative training profit.')
    print('This model will be disqualified from the weekly profit leaderboard, but can be submitted for educational purposes to the RMSE leaderboard.')
} else {
    print('Your model passes the non-negative training profit test!')
}

[1] "Income: 26057988.08"
[1] "Losses: 26057988.08"
[1] "Your model is invalid: it loses money on its training data!"
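
In the output above, income and losses are equal to the displayed precision, so the result hinges on how a tie is treated. The sketch below computes the training profit explicitly and reports a tie separately; the helper names and the tolerance of 1e-6 are assumptions made for this example, since sums of floating-point numbers may differ in their last digits.

```r
training_profit <- function(prices, claims) {
  # Income from premiums minus the claims paid out.
  sum(prices) - sum(claims)
}

classify_profit <- function(profit, tolerance = 1e-6) {
  # The tolerance is an arbitrary choice to absorb rounding error.
  if (profit > tolerance) {
    "profit"
  } else if (profit < -tolerance) {
    "loss"
  } else {
    "break even"
  }
}

# Made-up numbers: four contracts priced at 120, one claim of 400.
toy_prices <- c(120, 120, 120, 120)
toy_claims <- c(0, 0, 400, 0)
toy_profit <- training_profit(toy_prices, toy_claims)
toy_status <- classify_profit(toy_profit)
```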

Ready? Submit to AIcrowd 🚀

If you are satisfied with your code, run the code below to send your code to the AICrowd servers for evaluation! This requires the variable trained_model to be defined by your previous code.

Make sure you have included all packages needed to run your code in the "Packages" section.

In [ ]:

aicrowd_submit(AICROWD_API_KEY)

Warning message in system("curl -sL https://gitlab.aicrowd.com/jyotish/pricing-game-notebook-scripts/raw/master/r/submit.sh > submit.sh && bash submit.sh", :
“running command 'curl -sL https://gitlab.aicrowd.com/jyotish/pricing-game-notebook-scripts/raw/master/r/submit.sh > submit.sh && bash submit.sh' had status 1”

🚀 Preparing to submit...
⚙️ Collecting the submission code...
💾 Preparing the submission zip file...
🚫 Failed to login to aicrowd 😢


jeremiedb · December 19, 2020, 4:56am

Would it be possible to provide some further details on how to proceed to a proper submission?
Despite having passed the tests and getting a message acknowledging a proper submission, the leaderboard evaluation returns errors with little indication of where the error is coming from.

For example, I first wanted to use the data.table library. Although the following works properly in the Colab notebook, it resulted in the error message "Error in data.table(x_raw) : could not find function \"data.table\"" in the submission evaluation section.

install_packages <- function() {
  install.packages("data.table")
  require("data.table")
}

Looking at the submissions, it seems like others also faced issues with package installation/usage, so I think some further examples would be welcome.

Also, opting for the zip submission instead of Colab resulted in "DockerPushError: An image does not exist locally with the tag: aicrowd/imperial-pricing-game". The submission experience has been a little rough so far on the R side!

simon_coulombe · December 21, 2020, 2:16am

I’m getting the following error when running the default notebook for R:

[1] “Income: 26057988.08”
[1] “Losses: 26057988.08”
[1] “Your model is invalid: it loses money on its training data!”

jyotish · December 21, 2020, 4:48am

Hello @simon_coulombe

It’s not an error. You can submit the model and you will get a score. The message means that the performance of the model is not good.

simon_coulombe · December 21, 2020, 5:56pm

thanks for replying!
I’m just surprised that it says the model is losing money on the training data since the income is equal to the losses. The rules do state that you are not allowed to submit a model that loses money on the training data so I was afraid that might be an issue down the road.

alfarzan · December 21, 2020, 6:10pm

Hi @simon_coulombe

Yes indeed, this is a good point. Two things to consider for now:

We are currently implementing a warning system upon submission on the AIcrowd website so that you are warned if your submission is not profitable on training data. If it is not, the model will not be entered into the profit leaderboard, so it’s important to know in advance. Stay tuned for this update within the next day!
The rule requires positive profits, so a profit of zero is technically invalid. Hence the warning you received.