!pip install numpy
!pip install pandas
!pip install scikit-learn
Import necessary packages¶
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
Download data¶
The first step is to download our train and test data. We will train a regression model on the train data, make predictions on the test data, and then submit those predictions.
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_spcrt/data/public/test.csv
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_spcrt/data/public/train.csv
train_data = pd.read_csv('train.csv')
Clean and analyse the data¶
train_data.head()
|   | number_of_elements | mean_atomic_mass | wtd_mean_atomic_mass | gmean_atomic_mass | wtd_gmean_atomic_mass | entropy_atomic_mass | wtd_entropy_atomic_mass | range_atomic_mass | wtd_range_atomic_mass | std_atomic_mass | ... | wtd_mean_Valence | gmean_Valence | wtd_gmean_Valence | entropy_Valence | wtd_entropy_Valence | range_Valence | wtd_range_Valence | std_Valence | wtd_std_Valence | critical_temp |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 86.299100 | 65.789610 | 64.984139 | 49.765400 | 0.836621 | 1.013759 | 146.88130 | 20.950610 | 63.713516 | ... | 3.500000 | 3.301927 | 3.464102 | 1.088900 | 0.971342 | 1 | 1.400000 | 0.471405 | 0.500000 | 4.50 |
1 | 5 | 72.952854 | 56.414763 | 59.186241 | 35.639703 | 1.445795 | 1.041520 | 122.90607 | 35.383159 | 40.250192 | ... | 2.257143 | 2.168944 | 2.219783 | 1.594167 | 1.087480 | 1 | 1.131429 | 0.400000 | 0.437059 | 7.60 |
2 | 6 | 82.318112 | 99.033554 | 53.069787 | 71.259834 | 1.427749 | 1.324091 | 192.98100 | 40.196140 | 70.933858 | ... | 4.300000 | 3.203101 | 3.772087 | 1.647214 | 1.510613 | 5 | 1.580000 | 1.950783 | 1.791647 | 3.01 |
3 | 4 | 57.444449 | 60.476650 | 56.067907 | 58.936797 | 1.362775 | 1.128041 | 34.84360 | 27.021980 | 12.367487 | ... | 3.650000 | 3.309751 | 3.442623 | 1.333736 | 1.089489 | 3 | 1.800000 | 1.118034 | 1.194780 | 14.10 |
4 | 4 | 76.517718 | 56.808817 | 59.310096 | 35.773432 | 1.197273 | 0.981880 | 122.90607 | 34.833160 | 44.289459 | ... | 2.264286 | 2.213364 | 2.226222 | 1.368922 | 1.048834 | 1 | 1.100000 | 0.433013 | 0.440952 | 36.80 |
5 rows × 82 columns
Here we use the describe() function to get an understanding of the data. It shows summary statistics (count, mean, std, quartiles, etc.) for every numeric column. You can also use functions like info() to get more useful details.
train_data.describe()
|   | number_of_elements | mean_atomic_mass | wtd_mean_atomic_mass | gmean_atomic_mass | wtd_gmean_atomic_mass | entropy_atomic_mass | wtd_entropy_atomic_mass | range_atomic_mass | wtd_range_atomic_mass | std_atomic_mass | ... | wtd_mean_Valence | gmean_Valence | wtd_gmean_Valence | entropy_Valence | wtd_entropy_Valence | range_Valence | wtd_range_Valence | std_Valence | wtd_std_Valence | critical_temp |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | ... | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 |
mean | 4.116527 | 87.495853 | 72.915281 | 71.193951 | 58.444208 | 1.165612 | 1.064409 | 115.732133 | 33.213727 | 44.442844 | ... | 3.152312 | 3.056546 | 3.054714 | 1.296028 | 1.054028 | 2.044708 | 1.481685 | 0.841078 | 0.676041 | 34.492796 |
std | 1.439625 | 29.586564 | 33.320437 | 30.920472 | 36.470563 | 0.365019 | 0.401233 | 54.718595 | 26.886071 | 20.068666 | ... | 1.189356 | 1.043451 | 1.172383 | 0.392761 | 0.380274 | 1.242861 | 0.976455 | 0.485247 | 0.455984 | 34.307997 |
min | 1.000000 | 6.941000 | 6.941000 | 5.685033 | 3.193745 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000210 |
25% | 3.000000 | 72.451240 | 52.177725 | 58.001648 | 35.258590 | 0.969858 | 0.777619 | 78.353150 | 16.830450 | 32.890369 | ... | 2.118056 | 2.279705 | 2.092115 | 1.060857 | 0.778998 | 1.000000 | 0.920286 | 0.471405 | 0.308515 | 5.400000 |
50% | 4.000000 | 84.841880 | 60.786693 | 66.361592 | 39.898482 | 1.199541 | 1.146366 | 122.906070 | 26.658401 | 45.129500 | ... | 2.618182 | 2.615321 | 2.433589 | 1.368922 | 1.165410 | 2.000000 | 1.062667 | 0.800000 | 0.500000 | 20.000000 |
75% | 5.000000 | 100.351275 | 85.994130 | 78.019689 | 73.097796 | 1.444537 | 1.360442 | 155.006000 | 38.360375 | 59.663892 | ... | 4.030000 | 3.741657 | 3.920517 | 1.589027 | 1.331926 | 3.000000 | 1.920000 | 1.200000 | 1.021023 | 63.000000 |
max | 9.000000 | 208.980400 | 208.980400 | 208.980400 | 208.980400 | 1.983797 | 1.958203 | 207.972460 | 205.589910 | 101.019700 | ... | 7.000000 | 7.000000 | 7.000000 | 2.141963 | 1.949739 | 6.000000 | 6.992200 | 3.000000 | 3.000000 | 185.000000 |
8 rows × 82 columns
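Besides describe(), a couple of other quick checks are useful before modeling. Here is a minimal sketch using standard pandas calls (not part of the original baseline):

# Column dtypes, non-null counts, and memory usage
train_data.info()

# Total number of missing values across the whole frame (0 means nothing to impute)
print(train_data.isnull().sum().sum())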
Split Data into Train and Validation¶
Now we want to see how well our model is performing, but we don't have the test data labels to check against. What do we do? We split our dataset into a train set and a validation set. The idea is to evaluate the model on the validation set to get a sense of how well it generalizes, and to make sure we don't overfit to the train data. There are many ways to do validation, such as k-fold cross-validation and leave-one-out; a k-fold sketch follows the split below.
X = train_data.drop('critical_temp', axis=1)
y = train_data['critical_temp']
# Hold out 20% of the train data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
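As a sketch of one alternative mentioned above, k-fold cross-validation can be done with scikit-learn's cross_val_score. This is optional and not used in the rest of this notebook:

# Optional: 5-fold cross-validation gives a more robust performance estimate
# from sklearn.model_selection import cross_val_score
# scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='neg_mean_absolute_error')
# print('CV MAE:', -scores.mean())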
Define the Model and Train¶
Now we come to the juicy part. We have prepared our data, and now we train a model. The model will learn a function mapping inputs to outputs by looking at the training examples. There are a ton of models to choose from, such as Linear Regression, Random Forests, and Decision Trees.
Tip: A good result doesn't depend solely on the model but also on the features (columns) you choose, so make sure to play with your data and keep only what's important.
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Alternative: a Decision Tree regressor (uncomment to try)
# from sklearn import tree
# clf = tree.DecisionTreeRegressor()
# clf = clf.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
We have used Linear Regression as the model here, keeping its default parameters. One can set more parameters to tune performance; to see the list of parameters, visit here.
We have also included a Decision Tree example (commented out above). Check out the Decision Tree's parameters here.
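For instance, a Decision Tree with a few of its commonly tuned parameters set might look like this (the values below are illustrative, not tuned for this dataset):

# from sklearn.tree import DecisionTreeRegressor
# clf = DecisionTreeRegressor(max_depth=10, min_samples_leaf=5, random_state=42)
# clf = clf.fit(X_train, y_train)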
Check which variables have the most impact¶
We now take this time to identify the columns that have the most impact on the target. This can be used to remove columns with negligible impact and thereby improve our model.
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df.head()
|   | Coefficient |
---|---|
number_of_elements | -4.202422 |
mean_atomic_mass | 0.833105 |
wtd_mean_atomic_mass | -0.881193 |
gmean_atomic_mass | -0.510610 |
wtd_gmean_atomic_mass | 0.642180 |
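head() only shows the first five features. To rank all of them, one option (a sketch, not part of the original baseline) is to sort the coefficients by absolute value:

# Rank features by the magnitude of their linear coefficient
coeff_df['abs_coef'] = coeff_df['Coefficient'].abs()
print(coeff_df.sort_values('abs_coef', ascending=False).head(10))

Keep in mind that raw coefficients are only comparable when features are on similar scales; standardizing the features (e.g. with sklearn.preprocessing.StandardScaler) before fitting would make this ranking more trustworthy.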
Predict on Validation¶
Now we run our trained model on the validation set and evaluate its predictions.
y_pred = regressor.predict(X_val)
# Put actual and predicted values side by side for inspection
df = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred})
df1 = df.head(25)  # first 25 rows, handy for a quick visual comparison
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred)))
Mean Absolute Error: 13.42086725495139
Mean Squared Error: 323.28465055058496
Root Mean Squared Error: 17.98011820179681
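Another common sanity check for regression is the R² score (1.0 is a perfect fit); for example:

print('R2 Score:', metrics.r2_score(y_val, y_pred))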
Load Test Set¶
Now load the test data.
test_data = pd.read_csv('test.csv')
test_data.head()
|   | number_of_elements | mean_atomic_mass | wtd_mean_atomic_mass | gmean_atomic_mass | wtd_gmean_atomic_mass | entropy_atomic_mass | wtd_entropy_atomic_mass | range_atomic_mass | wtd_range_atomic_mass | std_atomic_mass | ... | mean_Valence | wtd_mean_Valence | gmean_Valence | wtd_gmean_Valence | entropy_Valence | wtd_entropy_Valence | range_Valence | wtd_range_Valence | std_Valence | wtd_std_Valence |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 82.768190 | 87.837285 | 82.144935 | 87.360109 | 0.685627 | 0.509575 | 20.27638 | 51.522285 | 10.138190 | ... | 4.50 | 4.750000 | 4.472136 | 4.728708 | 0.686962 | 0.514653 | 1 | 2.750000 | 0.500000 | 0.433013 |
1 | 4 | 76.444563 | 81.456750 | 59.356672 | 68.229617 | 1.199541 | 1.108189 | 121.32760 | 36.950657 | 43.823354 | ... | 2.25 | 2.142857 | 2.213364 | 2.119268 | 1.368922 | 1.309526 | 1 | 0.571429 | 0.433013 | 0.349927 |
2 | 5 | 88.936744 | 51.090431 | 70.358975 | 34.783991 | 1.445824 | 1.525092 | 122.90607 | 10.438667 | 46.482335 | ... | 2.40 | 2.114679 | 2.352158 | 2.095193 | 1.589027 | 1.314189 | 1 | 0.967890 | 0.489898 | 0.318634 |
3 | 4 | 76.517718 | 56.149432 | 59.310096 | 35.562124 | 1.197273 | 1.042132 | 122.90607 | 31.920690 | 44.289459 | ... | 2.25 | 2.251429 | 2.213364 | 2.214646 | 1.368922 | 1.078855 | 1 | 1.074286 | 0.433013 | 0.433834 |
4 | 3 | 104.608490 | 89.558979 | 101.719818 | 88.481210 | 1.070258 | 0.944284 | 59.94547 | 33.541423 | 25.225148 | ... | 5.00 | 5.811245 | 4.762203 | 5.743954 | 1.054920 | 0.803990 | 3 | 3.024096 | 1.414214 | 0.728448 |
5 rows × 81 columns
Predict on test set¶
Time for the moment of truth! We predict on the test set and prepare our submission.
y_test = regressor.predict(test_data)
Save it in correct format¶
df = pd.DataFrame(y_test, columns=['critical_temp'])
df.to_csv('submission.csv', index=False)
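Before submitting, it is worth a quick check that the file has the expected shape; for example:

# The submission should have one 'critical_temp' column and one row per test sample
print(pd.read_csv('submission.csv').shape)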
To download the generated CSV in Colab, run the command below¶
from google.colab import files
files.download('submission.csv')
To participate in the challenge, click here.