Baseline Submission for DBSRA

A baseline submission for the DBSRA challenge.


In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics

Download data

In [ ]:
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_dbsra/data/public/test.csv
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_dbsra/data/public/train.csv

Load Data

In [2]:
train_data = pd.read_csv('train.csv')

Clean and Analyse Data

In [3]:
# Drop the two identifier columns (the positional axis argument is not accepted by recent pandas).
train_data = train_data.drop(columns=['encounter_id', 'patient_nbr'])
train_data.head()
Out[3]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty ... citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted
0 AfricanAmerican Female [70-80) ? 1 1 7 2 ? InternalMedicine ... No Steady No No No No No No Yes 1
1 Caucasian Female [90-100) ? 3 1 1 8 SP Pulmonology ... No Down No No No No No Ch Yes 1
2 Caucasian Female [80-90) ? 1 2 7 1 MC Osteopath ... No Steady No No No No No No Yes 0
3 Caucasian Male [60-70) ? 3 1 6 6 MC Radiologist ... No Steady No No No No No Ch Yes 0
4 ? Female [70-80) ? 1 3 6 3 UN InternalMedicine ... No No No No No No No No No 0

5 rows × 48 columns

Most of the columns are categorical, so we have to convert them to integers. The most basic way to do this is an ordinal mapping. Note: the question-mark placeholders have not been replaced with other values, so they are treated as a category of their own and receive an integer code in the mapping as well.
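As a small illustration of the ordinal mapping described above, the sketch below runs LabelEncoder on a made-up column (the values are assumptions for illustration, not taken from train.csv). It shows that the encoder numbers the sorted unique values, so the '?' placeholder simply becomes one more integer code.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column (made-up values) containing the '?' placeholder.
race = ['Caucasian', '?', 'AfricanAmerican', 'Caucasian', 'Asian']

encoder = LabelEncoder()
codes = encoder.fit_transform(race)

# classes_ holds the sorted unique values; each value's position is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Because the codes follow alphabetical order, they carry no meaningful ranking; a linear model will still treat them as ordered numbers, which is one reason this is only a baseline.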

In [5]:
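labelencoder = LabelEncoder()
n_train_data = train_data  # same object as train_data, so it is encoded in place
for col in train_data.columns:
    s = train_data[col]
    if s.dtype == 'O':  # object dtype marks the string (categorical) columns
        # Refit the encoder on this column and replace the strings with integer codes.
        s = labelencoder.fit_transform(s)
        n_train_data[col] = s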
n_train_data.head()
Out[5]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty ... citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted
0 1 0 7 1 1 1 7 2 0 19 ... 0 2 1 0 0 0 0 1 1 1
1 3 0 9 1 3 1 1 8 15 51 ... 0 0 1 0 0 0 0 0 1 1
2 3 0 8 1 1 2 7 1 8 30 ... 0 2 1 0 0 0 0 1 1 0
3 3 1 6 1 3 1 6 6 8 52 ... 0 2 1 0 0 0 0 0 1 0
4 0 0 7 1 1 3 6 3 16 19 ... 0 1 1 0 0 0 0 1 0 0

5 rows × 48 columns

Split Data into Train and Validation

In [6]:
X = n_train_data.drop(columns='readmitted')
y = n_train_data['readmitted']
# Hold out 20% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
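To make the effect of test_size and random_state concrete, here is a minimal sketch on made-up data (the toy arrays are assumptions, not part of the challenge data). With ten rows, test_size=0.2 holds out two of them, and repeating the call with the same random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features, three classes.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 1, 2, 0, 1, 0, 2, 1])

split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_tr, X_va, y_tr, y_va = split_a
print(X_tr.shape, X_va.shape)  # 8 training rows, 2 validation rows

# The same seed yields the same validation rows on every call.
same_split = np.array_equal(split_a[1], split_b[1])
print(same_split)
```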

Define the Classifier and Train

In [7]:
classifier = LogisticRegression()
classifier.fit(X_train,y_train)
/home/gera/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/home/gera/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
Out[7]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Predict on Validation

In [8]:
y_pred = classifier.predict(X_val)
In [9]:
df = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred})
df1 = df.head(25)
df1
Out[9]:
Actual Predicted
26342 1 0
59142 1 0
57537 1 0
58128 0 0
29821 1 0
62897 0 0
43572 0 0
62329 2 0
44309 0 0
20882 0 0
49075 0 0
20668 0 0
76856 1 0
32858 1 1
74292 1 0
80549 1 0
8588 1 0
57768 1 0
10658 1 0
51569 0 0
59914 1 0
32874 0 0
54656 1 0
77456 0 0
35300 0 0

Evaluate the Performance

In [10]:
print('F1 score:', metrics.f1_score(y_val, y_pred, average='micro'))
F1 score: 0.5688110513843131
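A note on reading this number: for single-label multiclass predictions, the micro-averaged F1 score equals plain accuracy, because every misclassified row counts as one false positive and one false negative. The sketch below checks this on made-up labels (the two lists are assumptions for illustration) and computes the macro average for comparison.

```python
from sklearn import metrics

# Made-up labels for three classes (0, 1, 2), for illustration only.
y_true = [0, 1, 1, 2, 0, 1, 0, 2]
y_hat = [0, 0, 1, 0, 0, 0, 0, 2]

micro_f1 = metrics.f1_score(y_true, y_hat, average='micro')
accuracy = metrics.accuracy_score(y_true, y_hat)
# Macro averaging weights each class equally, so it reacts more to rare classes.
macro_f1 = metrics.f1_score(y_true, y_hat, average='macro')

print('micro F1:', micro_f1)
print('accuracy:', accuracy)
print('macro F1:', macro_f1)
```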

Load Test Set

In [11]:
test_data = pd.read_csv('test.csv')
In [12]:
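test_data = test_data.drop(columns=['encounter_id', 'patient_nbr'])
n_test_data = test_data  # same object as test_data, so it is encoded in place
for col in test_data.columns:
    s = test_data[col]
    if s.dtype == 'O':
        # Caution: fit_transform refits the encoder on the test column, so the codes
        # match the training codes only if both columns contain the same set of values.
        s = labelencoder.fit_transform(s)
        n_test_data[col] = s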
n_test_data.head()
Out[12]:
race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty ... examide citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed
0 3 0 7 1 1 1 6 11 15 16 ... 0 0 2 1 0 0 0 0 1 1
1 3 1 5 1 1 1 1 1 6 0 ... 0 0 1 1 0 0 0 0 1 1
2 3 0 6 1 3 6 1 4 6 0 ... 0 0 1 1 0 0 0 0 1 1
3 3 1 3 1 2 1 1 12 4 10 ... 0 0 1 1 0 0 0 0 1 1
4 1 0 6 1 1 2 7 1 0 0 ... 0 0 1 1 0 0 0 0 1 1

5 rows × 47 columns
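The test cell above refits the encoder on each test column. One possible alternative, sketched here on made-up frames, is to fit one encoder per column on the training data and reuse it on the test data. The names train_df, test_df and encoders are hypothetical and not part of the original notebook; this is an illustration under those assumptions, not the challenge's reference approach.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frames standing in for train.csv and test.csv.
train_df = pd.DataFrame({'gender': ['Female', 'Male', 'Female'],
                         'insulin': ['No', 'Steady', 'Down']})
test_df = pd.DataFrame({'gender': ['Male', 'Male'],
                        'insulin': ['Steady', 'No']})

encoders = {}  # hypothetical helper: column name -> encoder fitted on train
train_enc = train_df.copy()
test_enc = test_df.copy()
for col in train_df.select_dtypes(exclude='number').columns:
    enc = LabelEncoder()
    train_enc[col] = enc.fit_transform(train_df[col])
    # transform() reuses the training mapping; it raises ValueError if the
    # test column contains a value that never appeared in the training column.
    test_enc[col] = enc.transform(test_df[col])
    encoders[col] = enc

# For comparison: refitting on the test column alone assigns 'Male' a different code.
refit_codes = LabelEncoder().fit_transform(test_df['gender'])

print(test_enc['gender'].tolist())
print(refit_codes.tolist())
```

In the toy data, 'Male' is encoded as 1 by the training encoder but as 0 when the encoder is refit on a test column that contains no 'Female' rows, which is the kind of mismatch this pattern avoids.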

Predict Test Set

In [13]:
y_test = classifier.predict(test_data)
In [14]:
df = pd.DataFrame(y_test,columns=['readmitted'])
df.to_csv('submission.csv',index=False)

To participate in the challenge click here
