Baseline Submission for the Challenge YPMSD


Import necessary packages

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics

Download Dataset

In [ ]:
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_sngyr/train.zip
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_ypmsd/data/public/test.csv
!unzip train.zip
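If a download or the unzip step fails silently, the later cells error out with a confusing message. An optional sanity check (a sketch, not part of the original baseline) is:

# Optional sketch: confirm the data files are present before loading them.
import os
for f in ('train.csv', 'test.csv'):
    print(f, 'found' if os.path.exists(f) else 'MISSING')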

Load Data

In [2]:
train_data = pd.read_csv('train.csv')

Clean and analyse the data

In [3]:
train_data.head()
Out[3]:
year timbre_mean_0 timbre_mean_1 timbre_mean_2 timbre_mean_3 timbre_mean_4 timbre_mean_5 timbre_mean_6 timbre_mean_7 timbre_mean_8 ... timbre_cov_68 timbre_cov_69 timbre_cov_70 timbre_cov_71 timbre_cov_72 timbre_cov_73 timbre_cov_74 timbre_cov_75 timbre_cov_76 timbre_cov_77
0 2001 49.94357 21.47114 73.07750 8.74861 -17.40628 -13.09905 -25.01202 -12.23257 7.83089 ... 13.01620 -54.40548 58.99367 15.37344 1.11144 -23.08793 68.40795 -1.82223 -27.46348 2.26327
1 2001 48.73215 18.42930 70.32679 12.94636 -10.32437 -24.83777 8.76630 -0.92019 18.76548 ... 5.66812 -19.68073 33.04964 42.87836 -9.90378 -32.22788 70.49388 12.04941 58.43453 26.92061
2 2001 50.95714 31.85602 55.81851 13.41693 -6.57898 -18.54940 -3.27872 -2.35035 16.07017 ... 3.03800 26.05866 -50.92779 10.93792 -0.07568 43.20130 -115.00698 -0.05859 39.67068 -0.66345
3 2001 48.24750 -1.89837 36.29772 2.58776 0.97170 -26.21683 5.05097 -10.34124 3.55005 ... 34.57337 -171.70734 -16.96705 -46.67617 -12.51516 82.58061 -72.08993 9.90558 199.62971 18.85382
4 2001 50.97020 42.20998 67.09964 8.46791 -15.85279 -16.81409 -12.48207 -9.37636 12.63699 ... 9.92661 -55.95724 64.92712 -17.72522 -1.49237 -7.50035 51.76631 7.88713 55.66926 28.74903

5 rows × 91 columns

In [4]:
train_data.describe()
Out[4]:
year timbre_mean_0 timbre_mean_1 timbre_mean_2 timbre_mean_3 timbre_mean_4 timbre_mean_5 timbre_mean_6 timbre_mean_7 timbre_mean_8 ... timbre_cov_68 timbre_cov_69 timbre_cov_70 timbre_cov_71 timbre_cov_72 timbre_cov_73 timbre_cov_74 timbre_cov_75 timbre_cov_76 timbre_cov_77
count 463715.000000 463715.000000 463715.000000 463715.000000 463715.000000 463715.000000 463715.000000 463715.000000 463715.000000 463715.000000 ... 463715.000000 463715.000000 463715.000000 463715.000000 463715.000000 463715.000000 463715.000000 463715.000000 463715.000000 463715.000000
mean 1998.386095 43.385488 1.261091 8.650195 1.130763 -6.512725 -9.565527 -2.384609 -1.793722 3.714584 ... 15.743361 -73.067753 41.423976 37.780868 0.345259 17.599280 -26.364826 4.444985 19.739307 1.323326
std 10.939767 6.079139 51.613473 35.264750 16.334672 22.855820 12.836758 14.580245 7.961876 10.579241 ... 32.086356 175.376872 121.794610 94.874474 16.153797 114.336522 174.187892 13.320996 184.843503 22.045404
min 1922.000000 1.749000 -337.092500 -301.005060 -154.183580 -181.953370 -81.794290 -188.214000 -72.503850 -126.479040 ... -437.722030 -4402.376440 -1810.689190 -3098.350310 -341.789120 -3168.924570 -4319.992320 -236.039260 -7458.378150 -318.223330
25% 1994.000000 39.957540 -26.153810 -11.441920 -8.515155 -20.636960 -18.468705 -10.776340 -6.461400 -2.303600 ... -1.798085 -139.062035 -20.918635 -4.711470 -6.758160 -31.563615 -101.396245 -2.572830 -59.598030 -8.813335
50% 2002.000000 44.262570 8.371550 10.470520 -0.691610 -5.992740 -11.208850 -2.047850 -1.735440 3.816840 ... 9.161360 -52.878010 28.709870 33.494550 0.828350 15.554490 -21.123570 3.111120 7.586950 0.052840
75% 2006.000000 47.833650 36.143780 29.741165 8.756995 7.749590 -2.422590 6.515710 2.905130 9.950960 ... 26.248290 13.620660 89.419995 77.674700 8.495715 67.743725 52.299850 9.948955 86.203115 9.670740
max 2011.000000 61.970140 384.065730 322.851430 289.527430 262.068870 119.815590 172.402680 105.210280 146.297950 ... 840.973380 4469.454870 3210.701700 1672.647100 260.544900 3662.065650 2833.608950 463.419500 7393.398440 600.766240

8 rows × 91 columns
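The describe() summary shows the value ranges (the target year runs from 1922 to 2011), but it does not show missing values. A short optional check (a sketch, not part of the original baseline) is:

# Sketch: check for missing values and look at the target column on its own.
print('Total missing values:', train_data.isnull().sum().sum())
print(train_data['year'].describe())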

Split Data for Train and Validation

In [5]:
# Features: the 90 timbre columns; target: the release year
X = train_data.drop('year', axis=1)
y = train_data['year']
# Hold out 20% of the data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
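With test_size=0.2 this holds out roughly 20% of the ~464k rows for validation, and fixing random_state makes the split reproducible. A quick shape check (a sketch) confirms it:

# Sketch: verify the 80/20 split; expect roughly 370k training rows and 93k validation rows, 90 features each.
print(X_train.shape, X_val.shape)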

Define the Regressor and Train

In [6]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[6]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Check which variables have the most impact

In [7]:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])  
coeff_df.head()
Out[7]:
Coefficient
timbre_mean_0 0.873376
timbre_mean_1 -0.055835
timbre_mean_2 -0.043576
timbre_mean_3 0.004539
timbre_mean_4 -0.015032
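head() only shows the first five coefficients, which are not necessarily the most influential ones. Using the coeff_df defined above, a sketch for seeing the largest effects is to sort by absolute value; note that raw coefficients are only roughly comparable here because the features are on different scales.

# Sketch: the ten features with the largest absolute coefficients.
coeff_df.reindex(coeff_df['Coefficient'].abs().sort_values(ascending=False).index).head(10)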

Predict on validation

In [8]:
y_pred = regressor.predict(X_val)
In [9]:
df = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred})
df1 = df.head(25)
df1
Out[9]:
Actual Predicted
332595 2004 2002.979558
230573 1989 1996.446079
364530 1987 1995.333451
82857 2002 1998.163320
108108 1971 1998.303355
446568 2005 2000.499458
27815 2004 1995.818434
214974 1997 1999.666288
304899 2006 2005.025704
257881 2007 1998.581968
144054 2004 2001.671307
292186 1993 1991.859040
260055 1984 1989.047164
50427 2007 2001.741216
380270 1975 1994.628844
43122 2003 2004.481078
431264 2000 1999.890466
75602 2009 1998.807686
461034 1987 1989.828458
336805 1996 1995.843295
375889 1999 2001.135930
182008 2008 2008.060799
283427 2002 1998.591879
345613 1955 1988.051633
97235 1999 1988.706992

Evaluate the Performance

In [10]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred)))
Mean Absolute Error: 6.77395050034565
Mean Squared Error: 90.87071514117896
Root Mean Squared Error: 9.53261323778422
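For context, a useful reference point is a naive predictor that always outputs the mean training year; the linear model should beat it. A hedged sketch:

# Sketch: RMSE of always predicting the mean training year, for comparison.
baseline_pred = np.full(len(y_val), y_train.mean())
print('Naive mean-year RMSE:', np.sqrt(metrics.mean_squared_error(y_val, baseline_pred)))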

Load Test Set

In [11]:
test_data = pd.read_csv('test.csv')
In [12]:
test_data.head()
Out[12]:
Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 ... Unnamed: 81 Unnamed: 82 Unnamed: 83 Unnamed: 84 Unnamed: 85 Unnamed: 86 Unnamed: 87 Unnamed: 88 Unnamed: 89 Unnamed: 90
0 45.44200 -30.74976 31.78587 4.63569 -15.14894 0.23370 -11.97968 -9.59708 6.48111 -8.89073 ... -8.84046 -0.15439 137.44210 77.54739 -4.22875 -61.92657 -33.52722 -3.86253 36.42400 7.17309
1 52.67814 -2.88914 43.95268 -1.39209 -14.93379 -15.86877 1.19379 0.31401 -4.44235 -5.78934 ... -5.74356 -42.57910 -2.91103 48.72805 -3.08183 -9.38888 -7.27179 -4.00966 -68.96211 -5.21525
2 45.74235 12.02291 11.03009 -11.60763 11.80054 -11.12389 -5.39058 -1.11981 -7.74086 -3.33421 ... -4.70606 -24.22599 -35.22686 27.77729 15.38934 58.20036 -61.12698 -10.92522 26.75348 -5.78743
3 52.55883 2.87222 27.38848 -5.76235 -15.35766 -15.01592 -5.86893 -0.31447 -5.06922 -4.62734 ... -8.35215 -16.86791 -10.58277 40.10173 -0.54005 -11.54746 -45.35860 -4.55694 -43.17368 -3.33725
4 51.34809 9.02702 25.33757 -6.62537 0.03367 -12.69565 -3.13400 2.98649 -6.71750 -1.85804 ... -6.87366 -20.03371 -66.38940 50.56569 0.27747 67.05657 -55.58846 -7.50859 28.23511 -0.72045

5 rows × 90 columns
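The test file comes without meaningful column names (hence the Unnamed headers), so the model relies purely on column order. A quick check (a sketch) that the shapes line up before predicting:

# Sketch: the test features must match the 90 training feature columns, in order.
assert test_data.shape[1] == X.shape[1], (test_data.shape, X.shape)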

Predict on test set

In [13]:
y_test = regressor.predict(test_data)

Since the target (year) is an integer, convert the predictions to integers

In [14]:
# Truncate the float predictions to integer years
y_inttest = np.asarray([int(i) for i in y_test])
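int() truncates towards zero, so a prediction of 2003.9 becomes 2003. An alternative sketch (an assumption, not part of the original baseline) is to round to the nearest year and clip to the range seen in training:

# Sketch: round to the nearest year and clip to the observed training range instead of truncating.
y_inttest = np.clip(np.rint(y_test), train_data['year'].min(), train_data['year'].max()).astype(int)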

Save the submission in the correct format

In [15]:
# One column named 'year', no index column
df = pd.DataFrame(y_inttest, columns=['year'])
df.to_csv('submission.csv', index=False)
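A final sanity check (a sketch) is that the submission has exactly one prediction per test row and a single 'year' column:

# Sketch: the submission should have one row per test example.
print(df.shape)   # expect (len(test_data), 1)
print(df.head())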

To participate in the challenge, click here