Random variable(s)

demarsylvain · June 4, 2021, 12:30am

Hi,

I edited my notebook by adding new sections, and one of them is “Random”.

When I build a model, I always add a random variable (drawn from a uniform distribution) to detect useless variables. I consider every variable ranked below the random one to be explaining noise, so I remove them and run a simpler, faster model. Because the model rarely used these variables, predictions and performance are very similar (though always slightly worse), and the model is easier to implement.

Here, the rank is 39!

It means I should remove a lot of variables. I tested it with 2 successive models: one with all variables (LB logloss: 0.6136) and one with only 38 variables (LB logloss: 0.6147). Very similar…
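For anyone who wants to try the random-variable check described above, here is a minimal sketch. It is not the notebook's actual code: the synthetic data, the random forest, the impurity-based importance, and the names `rank_against_random_probe`, `random_probe`, and `validation_logloss` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split


def rank_against_random_probe(X, y, seed=0):
    """Fit on X plus one uniform noise column; return importances, highest first."""
    rng = np.random.default_rng(seed)
    X_probe = X.copy()
    # Hypothetical column name; any unused name works.
    X_probe["random_probe"] = rng.uniform(size=len(X_probe))
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_probe, y)
    # Impurity-based importance is an assumption; the post does not say which ranking is used.
    importances = pd.Series(model.feature_importances_, index=X_probe.columns)
    return importances.sort_values(ascending=False)


def validation_logloss(columns, X_train, y_train, X_valid, y_valid, seed=0):
    """Train on the given columns only and return the validation log loss."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train[columns], y_train)
    return log_loss(y_valid, model.predict_proba(X_valid[columns]), labels=[0, 1])


# Synthetic stand-in for the competition data: 5 informative columns, 15 pure-noise columns.
X_arr, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, n_redundant=0,
    shuffle=False, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

ranking = rank_against_random_probe(X_train, y_train)
probe_rank = ranking.index.get_loc("random_probe") + 1  # 1-based rank of the noise column

# Keep only the variables ranked above the probe.
# This assumes at least one real variable outranks it.
kept = list(ranking.index[: probe_rank - 1])

loss_full = validation_logloss(list(X.columns), X_train, y_train, X_valid, y_valid)
loss_reduced = validation_logloss(kept, X_train, y_train, X_valid, y_valid)

print(f"random probe rank: {probe_rank} of {len(ranking)}")
print(f"log loss, all variables:  {loss_full:.4f}")
print(f"log loss, kept variables: {loss_reduced:.4f}")
```

The probe's rank can move with the seed, so it may be worth repeating the check with a few different seeds before dropping anything.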

My feeling, as already discussed in a different post, is that randomness plays a strong part in our submissions. Results and ranks will (randomly?) change on the 40% hidden dataset.
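To get a rough feel for how much a score can move between the visible and hidden parts of a test set, one option is to simulate the split. The sketch below is only an illustration under strong assumptions: a binary target, made-up calibrated predictions, and a random 60/40 partition. None of it comes from the competition data.

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
n = 5000

# Hypothetical predictions and labels: labels are drawn from the predicted
# probabilities, so the "model" is perfectly calibrated by construction.
p = rng.uniform(0.05, 0.95, size=n)
y = rng.binomial(1, p)

gaps = []
for _ in range(200):
    idx = rng.permutation(n)
    cut = int(0.6 * n)
    public, private = idx[:cut], idx[cut:]  # assumed 60% visible / 40% hidden
    loss_public = log_loss(y[public], p[public], labels=[0, 1])
    loss_private = log_loss(y[private], p[private], labels=[0, 1])
    gaps.append(loss_private - loss_public)

gaps = np.array(gaps)
print(f"mean gap: {gaps.mean():+.4f}")
print(f"std of gap: {gaps.std():.4f}")
```

If the spread of the gap is of the same order as the difference between two submissions, the ordering of those submissions on the hidden part is largely down to chance.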

What do you think?

jyot_makadiya · June 4, 2021, 5:37am

Hello @demarsylvain,
TLDR: My experiments were somewhat different from yours, but the conclusion is similar: the data contains some noise/random variables that play a big role in the predictions.
I first tried raw features with minimal pre-processing, then heavy feature engineering (around 8-10 features were added), and finally experimented with removing unnecessary features. The results were shockingly similar and unintuitive to me: my raw-features model performs almost the same as both the heavily feature-engineered model and the minimal-features model. Yet the same data with slight changes leads to a drastic difference in log loss (kind of a butterfly effect), so I think this is a case of random variables having a major impact on the results. I hope this helps.