Baseline Submission

Baseline Submission for the Challenge SPCRT


Import necessary packages

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics

Download Dataset

In [ ]:
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_spcrt/data/public/test.csv
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_spcrt/data/public/train.csv

Load Data

In [2]:
train_data = pd.read_csv('train.csv')

Inspect and analyse the data

In [4]:
train_data.head()
Out[4]:
number_of_elements mean_atomic_mass wtd_mean_atomic_mass gmean_atomic_mass wtd_gmean_atomic_mass entropy_atomic_mass wtd_entropy_atomic_mass range_atomic_mass wtd_range_atomic_mass std_atomic_mass ... wtd_mean_Valence gmean_Valence wtd_gmean_Valence entropy_Valence wtd_entropy_Valence range_Valence wtd_range_Valence std_Valence wtd_std_Valence critical_temp
0 3 86.299100 65.789610 64.984139 49.765400 0.836621 1.013759 146.88130 20.950610 63.713516 ... 3.500000 3.301927 3.464102 1.088900 0.971342 1 1.400000 0.471405 0.500000 4.50
1 5 72.952854 56.414763 59.186241 35.639703 1.445795 1.041520 122.90607 35.383159 40.250192 ... 2.257143 2.168944 2.219783 1.594167 1.087480 1 1.131429 0.400000 0.437059 7.60
2 6 82.318112 99.033554 53.069787 71.259834 1.427749 1.324091 192.98100 40.196140 70.933858 ... 4.300000 3.203101 3.772087 1.647214 1.510613 5 1.580000 1.950783 1.791647 3.01
3 4 57.444449 60.476650 56.067907 58.936797 1.362775 1.128041 34.84360 27.021980 12.367487 ... 3.650000 3.309751 3.442623 1.333736 1.089489 3 1.800000 1.118034 1.194780 14.10
4 4 76.517718 56.808817 59.310096 35.773432 1.197273 0.981880 122.90607 34.833160 44.289459 ... 2.264286 2.213364 2.226222 1.368922 1.048834 1 1.100000 0.433013 0.440952 36.80

5 rows × 82 columns

In [5]:
train_data.describe()
Out[5]:
number_of_elements mean_atomic_mass wtd_mean_atomic_mass gmean_atomic_mass wtd_gmean_atomic_mass entropy_atomic_mass wtd_entropy_atomic_mass range_atomic_mass wtd_range_atomic_mass std_atomic_mass ... wtd_mean_Valence gmean_Valence wtd_gmean_Valence entropy_Valence wtd_entropy_Valence range_Valence wtd_range_Valence std_Valence wtd_std_Valence critical_temp
count 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 ... 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000 18073.000000
mean 4.116527 87.495853 72.915281 71.193951 58.444208 1.165612 1.064409 115.732133 33.213727 44.442844 ... 3.152312 3.056546 3.054714 1.296028 1.054028 2.044708 1.481685 0.841078 0.676041 34.492796
std 1.439625 29.586564 33.320437 30.920472 36.470563 0.365019 0.401233 54.718595 26.886071 20.068666 ... 1.189356 1.043451 1.172383 0.392761 0.380274 1.242861 0.976455 0.485247 0.455984 34.307997
min 1.000000 6.941000 6.941000 5.685033 3.193745 0.000000 0.000000 0.000000 0.000000 0.000000 ... 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000210
25% 3.000000 72.451240 52.177725 58.001648 35.258590 0.969858 0.777619 78.353150 16.830450 32.890369 ... 2.118056 2.279705 2.092115 1.060857 0.778998 1.000000 0.920286 0.471405 0.308515 5.400000
50% 4.000000 84.841880 60.786693 66.361592 39.898482 1.199541 1.146366 122.906070 26.658401 45.129500 ... 2.618182 2.615321 2.433589 1.368922 1.165410 2.000000 1.062667 0.800000 0.500000 20.000000
75% 5.000000 100.351275 85.994130 78.019689 73.097796 1.444537 1.360442 155.006000 38.360375 59.663892 ... 4.030000 3.741657 3.920517 1.589027 1.331926 3.000000 1.920000 1.200000 1.021023 63.000000
max 9.000000 208.980400 208.980400 208.980400 208.980400 1.983797 1.958203 207.972460 205.589910 101.019700 ... 7.000000 7.000000 7.000000 2.141963 1.949739 6.000000 6.992200 3.000000 3.000000 185.000000

8 rows × 82 columns
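
The summary above reports the same count (18073) for every column shown, which suggests there are no missing values in those columns, but the truncated view hides most of the 82 columns. A minimal sketch of a per-column check follows. The helper name `summarize_quality` and the demo frame (including `constant_feature`) are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return, per column, the number of missing values and of distinct values."""
    return pd.DataFrame({
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })

# Tiny synthetic frame standing in for train_data (values are invented).
demo = pd.DataFrame({
    "number_of_elements": [3, 5, 6, 4],
    "mean_atomic_mass": [86.3, np.nan, 82.3, 57.4],
    "constant_feature": [1.0, 1.0, 1.0, 1.0],  # hypothetical column
})
report = summarize_quality(demo)
print(report)
```

In the notebook this would be called as `summarize_quality(train_data)`. Columns with `n_missing > 0` need imputing or dropping before fitting, and columns with `n_unique == 1` carry no information for a linear model.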

Split Data for Train and Validation

In [6]:
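# Features: every column except the target
X = train_data.drop(columns='critical_temp')
# Target: the critical temperature
y = train_data['critical_temp']
# Hold out 20% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)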

Define the Regressor and Train

In [7]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[7]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Inspect the fitted coefficients

In [8]:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])  
coeff_df.head()
Out[8]:
Coefficient
number_of_elements -4.202422
mean_atomic_mass 0.833105
wtd_mean_atomic_mass -0.881193
gmean_atomic_mass -0.510610
wtd_gmean_atomic_mass 0.642180
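
Raw coefficients are only comparable when the features share a scale, and `head()` shows just the first five rows rather than the largest ones. One common way to rank features, sketched here, is to standardize the inputs before fitting and then sort by absolute coefficient. This is not what the notebook does; the helper name `standardized_coefficients` and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def standardized_coefficients(X: pd.DataFrame, y) -> pd.Series:
    """Fit a linear model on z-scored features; return coefficients sorted by magnitude."""
    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X, y)
    coefs = pd.Series(model.named_steps["linearregression"].coef_, index=X.columns)
    order = coefs.abs().sort_values(ascending=False).index
    return coefs.reindex(order)

# Synthetic example: two features on very different scales.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "small_scale": rng.normal(0.0, 1.0, 200),
    "large_scale": rng.normal(0.0, 1000.0, 200),
})
y_demo = 2.0 * X_demo["small_scale"] + 0.01 * X_demo["large_scale"] + rng.normal(0.0, 0.1, 200)

# Raw coefficients make small_scale look dominant (about 2 versus about 0.01).
raw = pd.Series(LinearRegression().fit(X_demo, y_demo).coef_, index=X_demo.columns)
# After standardizing, large_scale ranks first because it drives more of the variation in y.
ranked = standardized_coefficients(X_demo, y_demo)
print(raw)
print(ranked)
```

In the notebook the equivalent call would be `standardized_coefficients(X_train, y_train)`. Many of the 81 features are strongly related to each other (for example the mean and weighted-mean variants), so even standardized coefficients should be read with caution.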

Predict on validation

In [9]:
y_pred = regressor.predict(X_val)
In [11]:
df = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred})
df1 = df.head(25)

Evaluate the Performance

In [12]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred)))
Mean Absolute Error: 13.42086725495139
Mean Squared Error: 323.28465055058496
Root Mean Squared Error: 17.98011820179681
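
For reference, the three metrics above can be written out directly in NumPy, and the same function can score a trivial predictor that always outputs the mean. The function name `regression_errors` and the demo arrays are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def regression_errors(y_true, y_pred) -> dict:
    """Compute MAE, MSE and RMSE from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    return {
        "mae": np.mean(np.abs(residuals)),
        "mse": mse,
        "rmse": np.sqrt(mse),
    }

# Invented values; the residuals are about -1, 1, -2, 2.
y_true_demo = np.array([4.5, 7.6, 3.0, 14.1])
y_pred_demo = np.array([5.5, 6.6, 5.0, 12.1])
errors = regression_errors(y_true_demo, y_pred_demo)

# Trivial baseline: always predict the mean of the targets.
mean_prediction = np.full_like(y_true_demo, y_true_demo.mean())
baseline = regression_errors(y_true_demo, mean_prediction)
print(errors)
print(baseline)
```

In the notebook, the baseline would be `regression_errors(y_val, np.full(len(y_val), y_train.mean()))`. Because the target's standard deviation in `describe()` is about 34.3, a mean-only predictor should land near that RMSE, so the linear model's 17.98 is roughly half the baseline error.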

Load Test Set

In [13]:
test_data = pd.read_csv('test.csv')
In [14]:
test_data.head()
Out[14]:
number_of_elements mean_atomic_mass wtd_mean_atomic_mass gmean_atomic_mass wtd_gmean_atomic_mass entropy_atomic_mass wtd_entropy_atomic_mass range_atomic_mass wtd_range_atomic_mass std_atomic_mass ... mean_Valence wtd_mean_Valence gmean_Valence wtd_gmean_Valence entropy_Valence wtd_entropy_Valence range_Valence wtd_range_Valence std_Valence wtd_std_Valence
0 2 82.768190 87.837285 82.144935 87.360109 0.685627 0.509575 20.27638 51.522285 10.138190 ... 4.50 4.750000 4.472136 4.728708 0.686962 0.514653 1 2.750000 0.500000 0.433013
1 4 76.444563 81.456750 59.356672 68.229617 1.199541 1.108189 121.32760 36.950657 43.823354 ... 2.25 2.142857 2.213364 2.119268 1.368922 1.309526 1 0.571429 0.433013 0.349927
2 5 88.936744 51.090431 70.358975 34.783991 1.445824 1.525092 122.90607 10.438667 46.482335 ... 2.40 2.114679 2.352158 2.095193 1.589027 1.314189 1 0.967890 0.489898 0.318634
3 4 76.517718 56.149432 59.310096 35.562124 1.197273 1.042132 122.90607 31.920690 44.289459 ... 2.25 2.251429 2.213364 2.214646 1.368922 1.078855 1 1.074286 0.433013 0.433834
4 3 104.608490 89.558979 101.719818 88.481210 1.070258 0.944284 59.94547 33.541423 25.225148 ... 5.00 5.811245 4.762203 5.743954 1.054920 0.803990 3 3.024096 1.414214 0.728448

5 rows × 81 columns

Predict on test set

In [15]:
y_test = regressor.predict(test_data)

Save it in correct format

In [17]:
df = pd.DataFrame(y_test,columns=['critical_temp'])
df.to_csv('submission.csv',index=False)
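
Before uploading, it can help to read the file back and confirm it has the shape written above: a single `critical_temp` column with one row per test sample. The sketch below also counts negative predictions, which a linear model can produce even though every training target is positive. The helper name `check_submission` and the demo values are assumptions for illustration; the challenge's own validation rules are not stated in this notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def check_submission(path: str, expected_rows: int) -> dict:
    """Read a submission CSV back and report basic format checks."""
    sub = pd.read_csv(path)
    columns_ok = list(sub.columns) == ["critical_temp"]
    values = sub["critical_temp"] if columns_ok else pd.Series(dtype=float)
    return {
        "columns_ok": columns_ok,
        "rows_ok": len(sub) == expected_rows,
        "n_missing": int(values.isna().sum()),
        "n_negative": int((values < 0).sum()),
    }

# Demo with invented predictions written to a temporary directory.
demo_predictions = np.array([12.3, -0.8, 45.0])
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "submission.csv")
    pd.DataFrame(demo_predictions, columns=["critical_temp"]).to_csv(demo_path, index=False)
    result = check_submission(demo_path, expected_rows=len(demo_predictions))
print(result)
```

In the notebook the call would be `check_submission('submission.csv', expected_rows=len(test_data))`. Whether to clip negative predictions (for example with `np.clip(y_test, 0, None)`) is a modelling choice the baseline does not make.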

