Hey all!
I thought I’d start a thread to share ideas and approaches. I believe it’s been very hard to keep quiet for a few of us.
I just added ideas as I had them and I didn't work much on model selection, because that felt like work. @alfarzan might say I went with a kitchen sink approach.
My model is very similar to the starterpack I shared earlier. Here is the final version repository. Honestly this won’t be complicated as I spent a lot of time thinking but not much time working :). It’s a single model, xgboost, tweedie.
Did you try GAMs? Poisson * gamma? Logistic * gamma?
Feature engineering
Here are some feature engineering ideas I liked:
a) Create a “town_id” variable by concatenating the population and surface area. Then you can treat it like any other categorical variable with many possible values, using target encoding or something else.
b) Create a “vh_current_value” variable by depreciating cars 20% every year.
c) I tried to make a “known perfect bonus-malus values” indicator. According to the dictionary, you started at a precise value and decreased by a specific amount every claim-free year. If you had a claim, your bonus-malus increased by a value that set you off the “perfect track” forever. This would have been useful for new business. In practice, having a claim only increased the bonus-malus value by the equivalent of 1 year, so you landed back on the track and it didn't work.
I didn't create any variable dependent on the previous years, because (to this day) I still don't know if we get years 1-5 for “new business” for the final leaderboard or just year 5 data. I assume it's only year 5, so I didn't want to be too dependent on counts (like increases of the “bonus-malus” variable, or city/car changes).
library(dplyr)    # %>%, if_else(), case_when(), contains()
library(recipes)  # recipe(), step_*(), all_nominal()

train_my_recipe <- function(.data) {
  my_first_recipe <-
    recipes::recipe(
      claim_amount ~ .,
      .data[0, ] # zero-row template: the recipe only needs column names and types
    ) %>%
    recipes::step_mutate(
      light_slow = if_else(vh_weight < 400 & vh_speed < 130, 1, 0, NA_real_),
      light_fast = if_else(vh_weight < 400 & vh_speed > 200, 1, 0, NA_real_),
      town_id = paste(population, 10 * town_surface_area, sep = "_"),
      age_when_licensed = drv_age1 - drv_age_lic1,
      pop_density = population / town_surface_area,
      young_man_drv1 = as.integer((drv_age1 <= 24 & drv_sex1 == "M")),
      fast_young_man_drv1 = as.integer((drv_age1 <= 30 & drv_sex1 == "M" & vh_speed >= 200)),
      young_man_drv2 = as.integer((drv_age2 <= 24 & drv_sex2 == "M")),
      # no_known_claim_values = as.integer(pol_no_claims_discount %in% no_known_claim_values),
      year = if_else(year <= 4, year, 4), # replace year 5 with a 4
      vh_current_value = vh_value * 0.8^(vh_age - 1), # depreciate 20% per year
      vh_time_left = pmax(20 - vh_age, 0),
      # ordinal encodings of coverage level and payment frequency
      pol_coverage_int = case_when(
        pol_coverage == "Min" ~ 1,
        pol_coverage == "Med1" ~ 2,
        pol_coverage == "Med2" ~ 3,
        pol_coverage == "Max" ~ 4
      ),
      pol_pay_freq_int = case_when(
        pol_pay_freq == "Monthly" ~ 1,
        pol_pay_freq == "Quarterly" ~ 2,
        pol_pay_freq == "Biannual" ~ 3,
        pol_pay_freq == "Yearly" ~ 4
      )
    ) %>%
    # pool levels seen in less than 0.5% of rows into "other"
    recipes::step_other(recipes::all_nominal(), threshold = 0.005) %>%
    recipes::step_string2factor(recipes::all_nominal()) %>%
    # 2-way interactions
    recipes::step_interact(~ pol_coverage_int:vh_current_value) %>%
    recipes::step_interact(~ pol_coverage_int:vh_time_left) %>%
    recipes::step_interact(~ pol_coverage_int:pol_no_claims_discount) %>%
    recipes::step_interact(~ vh_current_value:vh_time_left) %>%
    recipes::step_interact(~ vh_current_value:pol_no_claims_discount) %>%
    recipes::step_interact(~ vh_time_left:pol_no_claims_discount) %>%
    # 3-way interaction
    recipes::step_interact(~ pol_coverage_int:vh_current_value:vh_age) %>%
    # remove id
    recipes::step_rm(contains("id_policy")) %>%
    # recipes::step_novel(all_nominal()) %>%
    recipes::step_dummy(recipes::all_nominal(), one_hot = TRUE)
  prepped_first_recipe <- recipes::prep(my_first_recipe, .data, retain = FALSE)
  return(prepped_first_recipe)
}
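As a rough illustration of idea c) (the commented-out `no_known_claim_values` line in the recipe), here is a minimal base R sketch of what a “perfect track” indicator could look like. The starting value, yearly decrease and floor are made-up placeholders, not the real values from the data dictionary, and both helper names are hypothetical.

```r
# Values a policyholder passes through if they never have a claim.
# ASSUMPTION: start, step and floor_value are placeholders, not the real rules.
perfect_track_values <- function(start = 1, step = 0.05, floor_value = 0.5,
                                 n_years = 20) {
  pmax(start - step * seq(0, n_years), floor_value)
}

# TRUE when a bonus-malus value sits on the claim-free track.
# A tolerance is used instead of %in% to avoid floating point mismatches.
is_on_perfect_track <- function(x, track = perfect_track_values(), tol = 1e-8) {
  vapply(x, function(v) any(abs(v - track) < tol), logical(1))
}

# Example with made-up bonus-malus values
is_on_perfect_track(c(1, 0.95, 0.93, 0.5))
```

With these placeholder rules, 0.93 is the only value flagged as off the track, since it cannot be reached by claim-free years alone.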
Feature importance table is here
Exporting test data?
I considered using the decimals in the price to export the test data. For example, 103.12125$ could mean I charged a 12% profit margin, “1” means man and “25” means “25 years old”. We were never provided with enough information (and that's a good thing, because that's a nasty hack hehe).
Random pricing
I used the same claims model for weeks 1 to 10. For pricing, I started with a 20% profit margin and a minimum price of 25$. Then I started using a random profit margin between 1 and 100%. The % was determined by the first 2 decimals of the predicted claims. If the predicted claim was 92.12$, then I would apply a 12% profit margin to reach a price of 103.17$. I would then replace the first 2 decimals with the profit margin so that I remember what I charged (thus 103.12$). The idea was that if we received feedback at the policy level I could know after a week what my market share would be at any possible profit margin. This was useless, because we never got market feedback at the policy level.
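Here is a small base R sketch of that encoding trick, not my actual submission code. The function names are hypothetical, and mapping predicted claims ending in .00 to a 1% margin is an assumption, since the description above does not cover that case.

```r
# Price a quote with a margin read from the decimals of the predicted claims,
# then store that margin in the decimals of the price.
encode_price <- function(predicted_claims) {
  cents <- round(predicted_claims * 100)
  # first two decimals become the margin in %; ASSUMPTION: .00 is treated as 1%
  margin_pct <- pmax(cents %% 100, 1)
  raw_price <- predicted_claims * (1 + margin_pct / 100)
  # overwrite the decimals of the price with the margin that was applied
  floor(raw_price) + margin_pct / 100
}

# Recover the margin (in %) from a price produced by encode_price()
decode_margin <- function(price) {
  round(price * 100) %% 100
}

price <- encode_price(92.12) # 12% margin, price ends in .12
decode_margin(price)
```

Overwriting the decimals shifts the price by less than 1$, so the margin actually charged is only approximately the encoded one.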
The random profit margin model earned a podium twice, which I found very funny.
Random pricing did have a use. Looking at the conversion_rate and the average profit of the often/sometimes/never sold quotes for the weeks using random profit margins, I have a feeling that around 20% was a good number. For example, if you look at my “financials by conversion rate” table in the week 8 feedback, then you see that the higher the profit margin, the less often I sell a quote (obviously). You also see that my highest profit per policy was for the policies sold “sometimes”, and the average profit margin for that group was 21%.
Financials By Conversion Rate, week 8
## # A tibble: 4 x 6
## `Policies won` claim_frequency premiums conversion_rate `profit per poli…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 often: 34.0 - 100.… 0.09 92.9 0.74 -24.2
## 2 sometimes: 1.4 - 3… 0.1 132. 0.11 1.9
## 3 rarely: 0.1 - 1.4% 0.1 143. 0.01 0.19
## 4 never: 0.11 221. 0 0
## # … with 1 more variable: profit_margin <dbl>
In the final weeks I tried a random profit margin between 20 and 45%, the idea being to try to get a few “very profitable policies” by trying to sell at 30+% profit margin, while also ensuring that I sell at least a few policies and remain on the leaderboard thanks to the 20-30% profit margin on half the quotes. That didn't work very well.
My model has been profitable for the first half of the competition with a very small market share, until people caught on that you needed to have high profit margins.
Weekly feedbacks
Here are the links to my weekly feedbacks:
week 10 (20-45% profit margin)
week 9 (20-45% profit margin)
week 8 (1-100% profit margin)
week 7 (20% profit margin)
week 6 (1-100% profit margin)
week 5 (20% profit margin)
week 4 (20% profit)
week 3
week 2
week 1 (20% profit)
For the final week, I saved the number of claims each policy_id has had during the 4 years. Each number has a markup: 1 claim is about 25%, 3 claims is about 120% … and with 4 claims I just don't want you (multiply claims by 10000 to get price). New business is charged 5% more than claims-free renewals because I don't know how many claims they have had.
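A minimal base R sketch of that final-week rule follows, under several assumptions: the function name and arguments are hypothetical, the markup is applied on top of a claim-free price built from a `base_margin` that you pick yourself, and the 2-claim markup in the example is a made-up placeholder because I only listed the 1-claim and 3-claim values above.

```r
# n_claims: claims over the 4 years, NA for new business (history unknown)
# markup:   named vector of markups for 0 to 3 claims
final_week_price <- function(predicted_claims, n_claims, base_margin, markup) {
  # claim-free renewal price: predicted claims plus the base margin
  base_price <- predicted_claims * (1 + base_margin)
  # markup lookup by claim count; unknown counts give NA and are handled below
  price <- base_price * (1 + unname(markup[as.character(n_claims)]))
  # new business pays 5% more than a claim-free renewal
  new_business <- is.na(n_claims)
  price[new_business] <- base_price[new_business] * 1.05
  # 4 or more claims: price the quote so high that it never sells
  unwanted <- !new_business & n_claims >= 4
  price[unwanted] <- predicted_claims[unwanted] * 10000
  price
}

# The "2" entry and base_margin are placeholders, not values I actually used
markup <- c("0" = 0, "1" = 0.25, "2" = 0.70, "3" = 1.20)
final_week_price(
  predicted_claims = c(100, 100, 100, 100),
  n_claims = c(0, 1, NA, 4),
  base_margin = 0.20,
  markup = markup
)
```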