Baseline submission for OLNWP


Getting Started Code for OLNWP Educational Challenge

Author - Pulkit Gera

In [0]:
!pip install numpy
!pip install pandas
!pip install scikit-learn

Download data

The first step is to download our train and test data. We will train a regressor on the train data, make predictions on the test data, and submit those predictions.

In [1]:
!rm -rf data
!mkdir data
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-practice-challenges/public/olnwp/v0.1/test.zip
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-practice-challenges/public/olnwp/v0.1/train.zip
!unzip train.zip
!unzip test.zip
!mv train.csv data/train.csv
!mv test.csv data/test.csv
--2020-05-18 00:59:10--  https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_olnwp/data/public/test.zip
Resolving s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)... 130.117.252.16, 130.117.252.12, 130.117.252.13, ...
Connecting to s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)|130.117.252.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2628035 (2.5M) [application/zip]
Saving to: ‘test.zip’

test.zip            100%[===================>]   2.51M  --.-KB/s    in 0.05s   

2020-05-18 00:59:10 (53.6 MB/s) - ‘test.zip’ saved [2628035/2628035]

--2020-05-18 00:59:12--  https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_olnwp/data/public/train.zip
Resolving s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)... 130.117.252.16, 130.117.252.11, 130.117.252.12, ...
Connecting to s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)|130.117.252.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5406140 (5.2M) [application/zip]
Saving to: ‘train.zip’

train.zip           100%[===================>]   5.16M  27.1MB/s    in 0.2s    

2020-05-18 00:59:13 (27.1 MB/s) - ‘train.zip’ saved [5406140/5406140]

Archive:  train.zip
  inflating: train.csv               
Archive:  test.zip
  inflating: test.csv                

Import necessary packages

In [0]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics

Load Data

  • We use the pandas 🐼 library to load our data.
  • Pandas loads the data into dataframes, which make it easy to analyse the data.
  • Learn more about it here 🤓
In [0]:
all_data = pd.read_csv('data/train.csv')

Clean and analyse the data

In [0]:
all_data = all_data.drop('url', axis=1)
all_data.head()
Out[0]:
[Output: the first five rows of the dataframe. It has 60 numeric columns, running from timedelta, n_tokens_title and n_tokens_content through title_subjectivity and abs_title_sentiment_polarity, with the shares target as the last column; the full table is too wide to reproduce legibly here.]

Here we use the describe() function to get an understanding of the data: it shows summary statistics (count, mean, standard deviation, quartiles) for every numeric column. You can use more functions like info() to inspect column types and missing values. A quick missing-value check is also sketched after the output below.

In [0]:
all_data.describe()
#all_data.info()
Out[0]:
[Output: describe() statistics for all 60 numeric columns. Every column has a count of 26561, matching the number of rows. The shares target is heavily right-skewed: mean ≈ 3369, median 1400, 75th percentile 2800, max 690400.]
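Since every column's count equals the number of rows, the data appears complete, but it is worth verifying explicitly. A minimal sketch (not part of the original baseline), assuming the all_data dataframe loaded above:

# Count missing values per column; print only the columns that have any.
nulls = all_data.isnull().sum()
print(nulls[nulls > 0] if nulls.any() else 'No missing values found')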

Split Data into Train and Validation 🔪

  • The next step is to think of a way to test how well our model is performing. We cannot use the given test data, as it does not contain labels for us to verify against.
  • The workaround is to split the given training data into training and validation sets. A validation set gives us an idea of how our model will perform on unforeseen data: we hold back a chunk of data while training, then use it purely for testing. It is also a standard way to fine-tune hyperparameters.
  • There are multiple ways to split a dataset into training and validation sets; two popular ones are k-fold cross-validation and leave-one-out (a k-fold sketch follows this section). 🧐
  • Validation sets also help you detect when your model is overfitting the train dataset.
In [0]:
X = all_data.drop(' shares', axis=1)
y = all_data[' shares']
# Validation testing
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
  • We have decided to split the data with 20% as validation and 80% as training.
  • To learn more about the train_test_split function click here. 🧐
  • This is the simplest way to validate your model: take a random chunk of the train set and set it aside solely for testing the trained model on unseen data. As mentioned in the previous block, you can experiment 🔬 with more sophisticated techniques to make your model better.
  • Note that we separated the labels (the shares column) from the features before splitting, so each split comes with its corresponding labels.
  • With this step we are all set to move on with a prepared dataset.
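As mentioned above, k-fold cross-validation is a more thorough alternative to a single hold-out split. A minimal sketch, assuming the X and y defined above (the choice of 5 folds is arbitrary):

from sklearn.model_selection import KFold

# Train and evaluate a fresh linear model on each of 5 train/validation folds.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mae = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression()
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict(X.iloc[val_idx])
    fold_mae.append(metrics.mean_absolute_error(y.iloc[val_idx], preds))
print('MAE per fold:', fold_mae)
print('Mean MAE:', np.mean(fold_mae))

Averaging the metric over folds gives a more stable estimate of model quality than a single 80/20 split.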

TRAINING PHASE 🏋️

Define the Model and Train

Define the Model

  • We have fixed our data and now we are ready to train our model.

  • There are a ton of regressors to choose from, some being Linear Regression, Random Forests, Decision Trees, etc. 🧐

  • Remember that there are no hard-laid rules here. You can mix and match regressors; it is advisable to read up on the numerous techniques and choose the best fit for your solution. Experimentation is the key.

  • A good model does not depend solely on the regressor but also on the features you choose. So make sure to analyse and understand your data well and move forward with a clear view of the problem at hand. You can gain important insight from here. 🧐

In [0]:
regressor = LinearRegression()  

# from sklearn import tree
# regressor = tree.DecisionTreeRegressor()
  • We have used Linear Regression as the model here, with its default parameters.
  • One can set more parameters and increase the performance. To see the list of parameters visit here.
  • Do keep in mind there exist sophisticated techniques for everything; the key, as quoted earlier, is to search for them and experiment to fit your implementation.
  • A Decision Tree example is also given (commented out above). Check out the Decision Tree's parameters here. A further alternative is sketched below.
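Following the same commented-out pattern, a hedged sketch of one more regressor you could swap in; the hyperparameter values here are arbitrary starting points, not tuned for this challenge:

# from sklearn.ensemble import RandomForestRegressor
# regressor = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)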

Train the Model

In [0]:
regressor.fit(X_train, y_train)
Out[0]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Check which variables have the most impact

We now take the time to identify the columns that have the most impact. This can be used to remove columns with negligible impact and improve our model. A sketch for ranking all coefficients follows the output below.

In [0]:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])  
coeff_df.head()
Out[0]:
Coefficient
timedelta 1.371829
n_tokens_title 134.279025
n_tokens_content 0.321616
n_unique_tokens 4477.371557
n_non_stop_words -2579.368312
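head() only shows the first five coefficients in column order. A minimal sketch for ranking features by coefficient magnitude instead, assuming the coeff_df built above (the abs_coef column name is my own):

# Rank features by the absolute size of their linear-regression coefficient.
coeff_df['abs_coef'] = coeff_df['Coefficient'].abs()
print(coeff_df.sort_values('abs_coef', ascending=False).head(10))

Keep in mind that raw coefficients are scale-dependent, so standardizing the features first makes such comparisons more meaningful.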

Validation Phase 🤔

Wondering how well your model learned? Let's check.

Predict on Validation

Now we predict using our trained model on the validation set we created, and evaluate the model on unseen data.

In [0]:
y_pred = regressor.predict(X_val)

Evaluate the Performance

  • We have used basic metrics to quantify the performance of our model.
  • This is a crucial step: reason about the metrics and use them as hints to improve aspects of your model.
  • Do read up on the meaning and use of different metrics. There exist more metrics and measures; you should learn to use them correctly with respect to the solution, dataset and other factors.
  • MAE and RMSE are the metrics for this challenge (a small evaluation helper is sketched after the output below).
In [0]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred)))
Mean Absolute Error: 3174.901688002103
Mean Squared Error: 168520453.62617025
Root Mean Squared Error: 12981.542806083191
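Since MAE and RMSE are the challenge metrics, it can help to wrap them in a small helper for repeated use, e.g. when comparing regressors. A sketch; the evaluate name is my own:

def evaluate(y_true, y_pred):
    """Return the two challenge metrics: (MAE, RMSE)."""
    mae = metrics.mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(metrics.mean_squared_error(y_true, y_pred))
    return mae, rmse

print('MAE: %.2f, RMSE: %.2f' % evaluate(y_val, y_pred))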

Testing Phase 😅

We are almost done. We trained and validated on the training data. Now it's time to predict on the test set and make a submission.

Load Test Set

Load the test data on which final submission is to be made.

In [0]:
test_data = pd.read_csv('data/test.csv')
In [0]:
test_data = test_data.drop('url', axis=1)
test_data.head()
Out[0]:
[Output: the first five rows of the test dataframe. It has the same 59 feature columns as the train set, without the shares target.]

Predict on test set

Time for the moment of truth! Predict on the test set and prepare the submission.

In [0]:
y_test = regressor.predict(test_data)

Since the target (shares) is an integer count, convert the predictions to integers. An optional clipping step is also sketched below.

In [0]:
y_inttest = y_test.astype(int)
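One optional post-processing step (an assumption on my part, not part of the official baseline): linear regression can produce negative predictions, but the shares column is never below 1 in the train set, so clipping may slightly improve the scores:

# Clip predictions to a floor of 1, since observed share counts are always >= 1.
y_inttest = np.clip(y_inttest, 1, None)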

Save the prediction to csv

In [0]:
df = pd.DataFrame(y_inttest,columns=[' shares'])
df.to_csv('submission.csv',index=False)

🚧 Note :

  • Do take a look at the submission format.
  • The submission file should contain a header.
  • Follow all submission guidelines strictly to avoid inconvenience.

To download the CSV generated in Colab, run the cell below.

In [0]:
try:
  from google.colab import files
  files.download('submission.csv')
except ImportError:
  print("Only for Colab")

Well done! 👍 We are all set to make a submission and see your name on the leaderboard. Let's navigate to the challenge page and make one.