Baseline submission for OLNWP


Getting Started Code for OLNWP Educational Challenge

Author - Pulkit Gera

In [0]:
!pip install numpy
!pip install pandas
!pip install scikit-learn

Download data

The first step is to download our train and test data. We will train a regressor on the train data, make predictions on the test data, and submit those predictions.

In [1]:
!rm -rf data
!mkdir data
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-practice-challenges/public/olnwp/v0.1/test.zip
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-practice-challenges/public/olnwp/v0.1/train.zip
!unzip train.zip
!unzip test.zip
!mv train.csv data/train.csv
!mv test.csv data/test.csv
--2020-05-18 00:59:10--  https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_olnwp/data/public/test.zip
Resolving s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)... 130.117.252.16, 130.117.252.12, 130.117.252.13, ...
Connecting to s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)|130.117.252.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2628035 (2.5M) [application/zip]
Saving to: ‘test.zip’

test.zip            100%[===================>]   2.51M  --.-KB/s    in 0.05s   

2020-05-18 00:59:10 (53.6 MB/s) - ‘test.zip’ saved [2628035/2628035]

--2020-05-18 00:59:12--  https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_olnwp/data/public/train.zip
Resolving s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)... 130.117.252.16, 130.117.252.11, 130.117.252.12, ...
Connecting to s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)|130.117.252.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5406140 (5.2M) [application/zip]
Saving to: ‘train.zip’

train.zip           100%[===================>]   5.16M  27.1MB/s    in 0.2s    

2020-05-18 00:59:13 (27.1 MB/s) - ‘train.zip’ saved [5406140/5406140]

Archive:  train.zip
  inflating: train.csv               
Archive:  test.zip
  inflating: test.csv                

Import necessary packages

In [0]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics

Load Data

  • We use the pandas 🐼 library to load our data.
  • Pandas loads the data into dataframes, which make it easy to analyse the data.
  • Learn more about it here 🤓
In [0]:
all_data = pd.read_csv('data/train.csv')

Clean and analyse the data

In [0]:
all_data = all_data.drop('url', axis=1)
all_data.head()
Out[0]:
[Output: the first five rows of the dataframe. It has 60 numeric columns, running from timedelta, n_tokens_title and n_tokens_content through title_subjectivity and abs_title_sentiment_polarity, with the shares target as the last column; the full table is too wide to reproduce legibly here.]

Here we use the describe() function to get an understanding of the data: it shows summary statistics (count, mean, standard deviation, quartiles) for every numeric column. You can use more functions like info() to inspect column types and missing values. A quick missing-value check is also sketched after the output below.

In [0]:
all_data.describe()
#all_data.info()
Out[0]:
[Output: describe() statistics for all 60 numeric columns. Every column has a count of 26561, matching the number of rows. The shares target is heavily right-skewed: mean ≈ 3369, median 1400, 75th percentile 2800, max 690400.]
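Since every column's count equals the number of rows, the data appears complete, but it is worth verifying explicitly. A minimal sketch (not part of the original baseline), assuming the all_data dataframe loaded above:

# Count missing values per column; print only the columns that have any.
nulls = all_data.isnull().sum()
print(nulls[nulls > 0] if nulls.any() else 'No missing values found')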

Split Data into Train and Validation 🔪

  • The next step is to think of a way to test how well our model is performing. We cannot use the given test data, as it does not contain labels for us to verify against.
  • The workaround is to split the given training data into training and validation sets. A validation set gives us an idea of how our model will perform on unforeseen data: we hold back a chunk of data while training, then use it purely for testing. It is also a standard way to fine-tune hyperparameters.
  • There are multiple ways to split a dataset into training and validation sets; two popular ones are k-fold cross-validation and leave-one-out (a k-fold sketch follows this section). 🧐
  • Validation sets also help you detect when your model is overfitting the train dataset.
In [0]:
X = all_data.drop(' shares', axis=1)
y = all_data[' shares']
# Validation testing
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
  • We have decided to split the data with 20% as validation and 80% as training.
  • To learn more about the train_test_split function click here. 🧐
  • This is the simplest way to validate your model: take a random chunk of the train set and set it aside solely for testing the trained model on unseen data. As mentioned in the previous block, you can experiment 🔬 with more sophisticated techniques to make your model better.
  • Note that we separated the labels (the shares column) from the features before splitting, so each split comes with its corresponding labels.
  • With this step we are all set to move on with a prepared dataset.
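As mentioned above, k-fold cross-validation is a more thorough alternative to a single hold-out split. A minimal sketch, assuming the X and y defined above (the choice of 5 folds is arbitrary):

from sklearn.model_selection import KFold

# Train and evaluate a fresh linear model on each of 5 train/validation folds.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mae = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression()
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict(X.iloc[val_idx])
    fold_mae.append(metrics.mean_absolute_error(y.iloc[val_idx], preds))
print('MAE per fold:', fold_mae)
print('Mean MAE:', np.mean(fold_mae))

Averaging the metric over folds gives a more stable estimate of model quality than a single 80/20 split.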

TRAINING PHASE 🏋️

Define the Model and Train

Define the Model

  • We have fixed our data and now we are ready to train our model.

  • There are a ton of regressors to choose from, some being Linear Regression, Random Forests, Decision Trees, etc. 🧐

  • Remember that there are no hard-laid rules here. You can mix and match regressors; it is advisable to read up on the numerous techniques and choose the best fit for your solution. Experimentation is the key.

  • A good model does not depend solely on the regressor but also on the features you choose. So make sure to analyse and understand your data well and move forward with a clear view of the problem at hand. You can gain important insight from here. 🧐

In [0]:
regressor = LinearRegression()  

# from sklearn import tree
# regressor = tree.DecisionTreeRegressor()
  • We have used Linear Regression as the model here, with its default parameters.
  • One can set more parameters and increase the performance. To see the list of parameters visit here.
  • Do keep in mind there exist sophisticated techniques for everything; the key, as quoted earlier, is to search for them and experiment to fit your implementation.
  • A Decision Tree example is also given (commented out above). Check out the Decision Tree's parameters here. A further alternative is sketched below.
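Following the same commented-out pattern, a hedged sketch of one more regressor you could swap in; the hyperparameter values here are arbitrary starting points, not tuned for this challenge:

# from sklearn.ensemble import RandomForestRegressor
# regressor = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)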

Train the Model

In [0]:
regressor.fit(X_train, y_train)
Out[0]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Check which variables have the most impact

We now take the time to identify the columns that have the most impact. This can be used to remove columns with negligible impact and improve our model. A sketch for ranking all coefficients follows the output below.

In [0]:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])  
coeff_df.head()
Out[0]:
Coefficient
timedelta 1.371829
n_tokens_title 134.279025
n_tokens_content 0.321616
n_unique_tokens 4477.371557
n_non_stop_words -2579.368312
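head() only shows the first five coefficients in column order. A minimal sketch for ranking features by coefficient magnitude instead, assuming the coeff_df built above (the abs_coef column name is my own):

# Rank features by the absolute size of their linear-regression coefficient.
coeff_df['abs_coef'] = coeff_df['Coefficient'].abs()
print(coeff_df.sort_values('abs_coef', ascending=False).head(10))

Keep in mind that raw coefficients are scale-dependent, so standardizing the features first makes such comparisons more meaningful.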

Validation Phase 🤔

Wondering how well your model learned? Let's check.

Predict on Validation

Now we predict using our trained model on the validation set we created, and evaluate the model on unseen data.

In [0]:
y_pred = regressor.predict(X_val)

Evaluate the Performance

  • We have used basic metrics to quantify the performance of our model.
  • This is a crucial step: reason about the metrics and use them as hints to improve aspects of your model.
  • Do read up on the meaning and use of different metrics. There exist more metrics and measures; you should learn to use them correctly with respect to the solution, dataset and other factors.
  • MAE and RMSE are the metrics for this challenge (a small evaluation helper is sketched after the output below).
In [0]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred)))
Mean Absolute Error: 3174.901688002103
Mean Squared Error: 168520453.62617025
Root Mean Squared Error: 12981.542806083191
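Since MAE and RMSE are the challenge metrics, it can help to wrap them in a small helper for repeated use, e.g. when comparing regressors. A sketch; the evaluate name is my own:

def evaluate(y_true, y_pred):
    """Return the two challenge metrics: (MAE, RMSE)."""
    mae = metrics.mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(metrics.mean_squared_error(y_true, y_pred))
    return mae, rmse

print('MAE: %.2f, RMSE: %.2f' % evaluate(y_val, y_pred))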

Testing Phase 😅

We are almost done. We trained and validated on the training data. Now it's time to predict on the test set and make a submission.

Load Test Set

Load the test data on which final submission is to be made.

In [0]:
test_data = pd.read_csv('data/test.csv')
In [0]:
test_data = test_data.drop('url', axis=1)
test_data.head()
Out[0]:
[Output: the first five rows of the test dataframe. It has the same 59 feature columns as the train set, without the shares target.]

Predict on test set

Time for the moment of truth! Predict on the test set and prepare the submission.

In [0]:
y_test = regressor.predict(test_data)

Since the target (shares) is an integer count, convert the predictions to integers. An optional clipping step is also sketched below.

In [0]:
y_inttest = y_test.astype(int)
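One optional post-processing step (an assumption on my part, not part of the official baseline): linear regression can produce negative predictions, but the shares column is never below 1 in the train set, so clipping may slightly improve the scores:

# Clip predictions to a floor of 1, since observed share counts are always >= 1.
y_inttest = np.clip(y_inttest, 1, None)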

Save the prediction to csv

In [0]:
df = pd.DataFrame(y_inttest,columns=[' shares'])
df.to_csv('submission.csv',index=False)

🚧 Note :

  • Do take a look at the submission format.
  • The submission file should contain a header.
  • Follow all submission guidelines strictly to avoid inconvenience.

To download the CSV generated in Colab, run the cell below.

In [0]:
try:
  from google.colab import files
  files.download('submission.csv')
except ImportError:
  print("Only for Colab")

Well done! 👍 We are all set to make a submission and see your name on the leaderboard. Let's navigate to the challenge page and make one.