!pip install numpy
!pip install pandas
!pip install scikit-learn
Import necessary packages¶
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
Download data¶
The first step is to download our train and test data. We will train a regression model on the train data, make predictions on the test data, and then submit those predictions.
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_spcrt/data/public/test.csv
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_spcrt/data/public/train.csv
train_data = pd.read_csv('train.csv')
Clean and analyse the data¶
train_data.head()
|   | number_of_elements | mean_atomic_mass | wtd_mean_atomic_mass | gmean_atomic_mass | wtd_gmean_atomic_mass | entropy_atomic_mass | wtd_entropy_atomic_mass | range_atomic_mass | wtd_range_atomic_mass | std_atomic_mass | ... | wtd_mean_Valence | gmean_Valence | wtd_gmean_Valence | entropy_Valence | wtd_entropy_Valence | range_Valence | wtd_range_Valence | std_Valence | wtd_std_Valence | critical_temp |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 86.299100 | 65.789610 | 64.984139 | 49.765400 | 0.836621 | 1.013759 | 146.88130 | 20.950610 | 63.713516 | ... | 3.500000 | 3.301927 | 3.464102 | 1.088900 | 0.971342 | 1 | 1.400000 | 0.471405 | 0.500000 | 4.50 |
1 | 5 | 72.952854 | 56.414763 | 59.186241 | 35.639703 | 1.445795 | 1.041520 | 122.90607 | 35.383159 | 40.250192 | ... | 2.257143 | 2.168944 | 2.219783 | 1.594167 | 1.087480 | 1 | 1.131429 | 0.400000 | 0.437059 | 7.60 |
2 | 6 | 82.318112 | 99.033554 | 53.069787 | 71.259834 | 1.427749 | 1.324091 | 192.98100 | 40.196140 | 70.933858 | ... | 4.300000 | 3.203101 | 3.772087 | 1.647214 | 1.510613 | 5 | 1.580000 | 1.950783 | 1.791647 | 3.01 |
3 | 4 | 57.444449 | 60.476650 | 56.067907 | 58.936797 | 1.362775 | 1.128041 | 34.84360 | 27.021980 | 12.367487 | ... | 3.650000 | 3.309751 | 3.442623 | 1.333736 | 1.089489 | 3 | 1.800000 | 1.118034 | 1.194780 | 14.10 |
4 | 4 | 76.517718 | 56.808817 | 59.310096 | 35.773432 | 1.197273 | 0.981880 | 122.90607 | 34.833160 | 44.289459 | ... | 2.264286 | 2.213364 | 2.226222 | 1.368922 | 1.048834 | 1 | 1.100000 | 0.433013 | 0.440952 | 36.80 |
5 rows × 82 columns
Here we use the describe() function to get an understanding of the data. It shows summary statistics (count, mean, std, quartiles, etc.) for every numeric column. You can also use functions like info() to get more useful details.
train_data.describe()
|   | number_of_elements | mean_atomic_mass | wtd_mean_atomic_mass | gmean_atomic_mass | wtd_gmean_atomic_mass | entropy_atomic_mass | wtd_entropy_atomic_mass | range_atomic_mass | wtd_range_atomic_mass | std_atomic_mass | ... | wtd_mean_Valence | gmean_Valence | wtd_gmean_Valence | entropy_Valence | wtd_entropy_Valence | range_Valence | wtd_range_Valence | std_Valence | wtd_std_Valence | critical_temp |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | ... | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 | 18073.000000 |
mean | 4.116527 | 87.495853 | 72.915281 | 71.193951 | 58.444208 | 1.165612 | 1.064409 | 115.732133 | 33.213727 | 44.442844 | ... | 3.152312 | 3.056546 | 3.054714 | 1.296028 | 1.054028 | 2.044708 | 1.481685 | 0.841078 | 0.676041 | 34.492796 |
std | 1.439625 | 29.586564 | 33.320437 | 30.920472 | 36.470563 | 0.365019 | 0.401233 | 54.718595 | 26.886071 | 20.068666 | ... | 1.189356 | 1.043451 | 1.172383 | 0.392761 | 0.380274 | 1.242861 | 0.976455 | 0.485247 | 0.455984 | 34.307997 |
min | 1.000000 | 6.941000 | 6.941000 | 5.685033 | 3.193745 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000210 |
25% | 3.000000 | 72.451240 | 52.177725 | 58.001648 | 35.258590 | 0.969858 | 0.777619 | 78.353150 | 16.830450 | 32.890369 | ... | 2.118056 | 2.279705 | 2.092115 | 1.060857 | 0.778998 | 1.000000 | 0.920286 | 0.471405 | 0.308515 | 5.400000 |
50% | 4.000000 | 84.841880 | 60.786693 | 66.361592 | 39.898482 | 1.199541 | 1.146366 | 122.906070 | 26.658401 | 45.129500 | ... | 2.618182 | 2.615321 | 2.433589 | 1.368922 | 1.165410 | 2.000000 | 1.062667 | 0.800000 | 0.500000 | 20.000000 |
75% | 5.000000 | 100.351275 | 85.994130 | 78.019689 | 73.097796 | 1.444537 | 1.360442 | 155.006000 | 38.360375 | 59.663892 | ... | 4.030000 | 3.741657 | 3.920517 | 1.589027 | 1.331926 | 3.000000 | 1.920000 | 1.200000 | 1.021023 | 63.000000 |
max | 9.000000 | 208.980400 | 208.980400 | 208.980400 | 208.980400 | 1.983797 | 1.958203 | 207.972460 | 205.589910 | 101.019700 | ... | 7.000000 | 7.000000 | 7.000000 | 2.141963 | 1.949739 | 6.000000 | 6.992200 | 3.000000 | 3.000000 | 185.000000 |
8 rows × 82 columns
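Besides describe(), a couple of other quick checks are useful before modeling. Here is a minimal sketch using standard pandas calls (not part of the original baseline):

# Column dtypes, non-null counts, and memory usage
train_data.info()

# Total number of missing values across the whole frame (0 means nothing to impute)
print(train_data.isnull().sum().sum())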
Split Data into Train and Validation¶
Now we want to see how well our model is performing, but we don't have the test data labels to check against. What do we do? We split our dataset into a train set and a validation set. The idea is to evaluate the model on the validation set to get a sense of how well it generalizes, and to make sure we don't overfit to the train data. There are many ways to do validation, such as k-fold cross-validation and leave-one-out; a k-fold sketch follows the split below.
X = train_data.drop('critical_temp', axis=1)
y = train_data['critical_temp']
# Hold out 20% of the train data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
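As a sketch of one alternative mentioned above, k-fold cross-validation can be done with scikit-learn's cross_val_score. This is optional and not used in the rest of this notebook:

# Optional: 5-fold cross-validation gives a more robust performance estimate
# from sklearn.model_selection import cross_val_score
# scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='neg_mean_absolute_error')
# print('CV MAE:', -scores.mean())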
Define the Model and Train¶
Now we come to the juicy part. We have prepared our data, and now we train a model. The model will learn a function mapping inputs to outputs by looking at the training examples. There are a ton of models to choose from, such as Linear Regression, Random Forests, and Decision Trees.
Tip: A good result doesn't depend solely on the model but also on the features (columns) you choose, so make sure to play with your data and keep only what's important.
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Alternative: a Decision Tree regressor (uncomment to try)
# from sklearn import tree
# clf = tree.DecisionTreeRegressor()
# clf = clf.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
We have used Linear Regression as the model here, keeping its default parameters. One can set more parameters to tune performance; to see the list of parameters, visit here.
We have also included a Decision Tree example (commented out above). Check out the Decision Tree's parameters here.
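For instance, a Decision Tree with a few of its commonly tuned parameters set might look like this (the values below are illustrative, not tuned for this dataset):

# from sklearn.tree import DecisionTreeRegressor
# clf = DecisionTreeRegressor(max_depth=10, min_samples_leaf=5, random_state=42)
# clf = clf.fit(X_train, y_train)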
Check which variables have the most impact¶
We now take this time to identify the columns that have the most impact on the target. This can be used to remove columns with negligible impact and thereby improve our model.
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df.head()
|   | Coefficient |
---|---|
number_of_elements | -4.202422 |
mean_atomic_mass | 0.833105 |
wtd_mean_atomic_mass | -0.881193 |
gmean_atomic_mass | -0.510610 |
wtd_gmean_atomic_mass | 0.642180 |
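head() only shows the first five features. To rank all of them, one option (a sketch, not part of the original baseline) is to sort the coefficients by absolute value:

# Rank features by the magnitude of their linear coefficient
coeff_df['abs_coef'] = coeff_df['Coefficient'].abs()
print(coeff_df.sort_values('abs_coef', ascending=False).head(10))

Keep in mind that raw coefficients are only comparable when features are on similar scales; standardizing the features (e.g. with sklearn.preprocessing.StandardScaler) before fitting would make this ranking more trustworthy.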
Predict on Validation¶
Now we run our trained model on the validation set and evaluate its predictions.
y_pred = regressor.predict(X_val)
# Put actual and predicted values side by side for inspection
df = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred})
df1 = df.head(25)  # first 25 rows, handy for a quick visual comparison
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred)))
Mean Absolute Error: 13.42086725495139
Mean Squared Error: 323.28465055058496
Root Mean Squared Error: 17.98011820179681
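Another common sanity check for regression is the R² score (1.0 is a perfect fit); for example:

print('R2 Score:', metrics.r2_score(y_val, y_pred))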
Load Test Set¶
Now load the test data.
test_data = pd.read_csv('test.csv')
test_data.head()
|   | number_of_elements | mean_atomic_mass | wtd_mean_atomic_mass | gmean_atomic_mass | wtd_gmean_atomic_mass | entropy_atomic_mass | wtd_entropy_atomic_mass | range_atomic_mass | wtd_range_atomic_mass | std_atomic_mass | ... | mean_Valence | wtd_mean_Valence | gmean_Valence | wtd_gmean_Valence | entropy_Valence | wtd_entropy_Valence | range_Valence | wtd_range_Valence | std_Valence | wtd_std_Valence |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 82.768190 | 87.837285 | 82.144935 | 87.360109 | 0.685627 | 0.509575 | 20.27638 | 51.522285 | 10.138190 | ... | 4.50 | 4.750000 | 4.472136 | 4.728708 | 0.686962 | 0.514653 | 1 | 2.750000 | 0.500000 | 0.433013 |
1 | 4 | 76.444563 | 81.456750 | 59.356672 | 68.229617 | 1.199541 | 1.108189 | 121.32760 | 36.950657 | 43.823354 | ... | 2.25 | 2.142857 | 2.213364 | 2.119268 | 1.368922 | 1.309526 | 1 | 0.571429 | 0.433013 | 0.349927 |
2 | 5 | 88.936744 | 51.090431 | 70.358975 | 34.783991 | 1.445824 | 1.525092 | 122.90607 | 10.438667 | 46.482335 | ... | 2.40 | 2.114679 | 2.352158 | 2.095193 | 1.589027 | 1.314189 | 1 | 0.967890 | 0.489898 | 0.318634 |
3 | 4 | 76.517718 | 56.149432 | 59.310096 | 35.562124 | 1.197273 | 1.042132 | 122.90607 | 31.920690 | 44.289459 | ... | 2.25 | 2.251429 | 2.213364 | 2.214646 | 1.368922 | 1.078855 | 1 | 1.074286 | 0.433013 | 0.433834 |
4 | 3 | 104.608490 | 89.558979 | 101.719818 | 88.481210 | 1.070258 | 0.944284 | 59.94547 | 33.541423 | 25.225148 | ... | 5.00 | 5.811245 | 4.762203 | 5.743954 | 1.054920 | 0.803990 | 3 | 3.024096 | 1.414214 | 0.728448 |
5 rows × 81 columns
Predict on test set¶
Time for the moment of truth! We predict on the test set and prepare our submission.
y_test = regressor.predict(test_data)
Save it in correct format¶
df = pd.DataFrame(y_test, columns=['critical_temp'])
df.to_csv('submission.csv', index=False)
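Before submitting, it is worth a quick check that the file has the expected shape; for example:

# The submission should have one 'critical_temp' column and one row per test sample
print(pd.read_csv('submission.csv').shape)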
To download the generated CSV in Colab, run the command below¶
from google.colab import files
files.download('submission.csv')
To participate in the challenge, click here.