Getting Started Code for CPTCHA Challenge on AIcrowd¶
Author : Sanjay Pokkali¶
Download Necessary Packages 📚¶
In [ ]:
!apt update
!apt install -y tesseract-ocr
!apt install -y libtesseract-dev
In [ ]:
!pip install numpy
!pip install pandas
!pip install pytesseract
!pip install scikit-learn
!pip install textdistance
Download Data¶
The first step is to download our train and test data. We will train a model on the train data, make predictions on the test data, and submit those predictions.
In [ ]:
!rm -rf data
!mkdir data
!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/train.tar.gz
!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/test.tar.gz
!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/train_info.csv
!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/test_info.csv
!mkdir data/train
!mkdir data/test
!tar -C data/ -xvzf train.tar.gz
!tar -C data/ -xvzf test.tar.gz
!mv train_info.csv data/train_info.csv
!mv test_info.csv data/test_info.csv
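As a quick sanity check (a minimal sketch, assuming the directory layout created by the commands above), we can count the extracted images:
In [ ]:
import os
# Count the extracted images; both folders were created by the commands above
print(len(os.listdir("data/train")), "training images")
print(len(os.listdir("data/test")), "test images")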
Import packages¶
In [ ]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from PIL import Image
import pytesseract
import os
import cv2
import matplotlib.pyplot as plt
import textdistance
%matplotlib inline
In [ ]:
train_info_path = "data/train_info.csv"
test_info_path = "data/test_info.csv"
train_images_path = "data/train/"
test_images_path = "data/test/"
train_info = pd.read_csv(train_info_path)
test_info = pd.read_csv(test_info_path)
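It is worth a quick look at the metadata before moving on. A minimal sketch; the `filename` and `label` columns are the ones used throughout this notebook:
In [ ]:
# Peek at the first few rows and the dataset sizes
print(train_info.head())
print("Train samples:", train_info.shape[0], "| Test samples:", test_info.shape[0])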
Visualize the images👀¶
In [ ]:
def plot_image(img_path):
    # cv2 loads images in BGR order; convert to RGB so matplotlib shows true colours
    img = cv2.cvtColor(cv2.imread(img_path), cv2.COLOR_BGR2RGB)
    # print("Shape of the captcha ", img.shape)
    plt.imshow(img)
In [ ]:
fig = plt.figure(figsize=(20, 20))
columns = 3
rows = 3
for i in range(1, columns * rows + 1):
    img = train_images_path + train_info['filename'][i]
    fig.add_subplot(rows, columns, i)
    plot_image(img)
plt.show()
Split Data into Train and Validation 🔪¶
- The next step is to decide how we will measure our model's performance. We cannot use the given test data for this, since it does not contain labels for us to verify against.
- The workaround is to split the given training data into training and validation sets. A validation set gives us an idea of how our model will perform on unseen data: we hold back a chunk of the data while training and use it purely for testing. It is also a standard way to fine-tune hyperparameters.
- There are multiple ways to split a dataset into training and validation sets; two popular ones are k-fold cross-validation and leave-one-out (see the sketch after this list). 🧐
- Validation sets also help you catch your model overfitting the training data.
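As a sketch of the k-fold idea mentioned above (the fold count of 5 is an arbitrary illustrative choice), scikit-learn's KFold generates the index splits:
In [ ]:
from sklearn.model_selection import KFold

# Illustrative only: each iteration yields disjoint train/validation index arrays
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(train_info)):
    print(f"Fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")
For this notebook, though, we will stick with a single hold-out split: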
In [ ]:
X_train, X_val = train_test_split(train_info, test_size=0.2, random_state=42)
- We have decided to split the data 80% training / 20% validation.
- To learn more, see the scikit-learn documentation for the train_test_split function. 🧐
- This is, of course, the simplest way to validate your model: take a random chunk of the train set and set it aside solely for testing the trained model on unseen data. As mentioned in the previous block, you can experiment 🔬 with more sophisticated techniques to make your model better.
- Now that we have our data split into train and validation sets, we need to separate the corresponding labels from the data (one way to do this is sketched below).
- With this step done, we are all set to move on with a prepared dataset.
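One minimal way to do that separation, assuming the `label` column of the metadata holds the ground-truth captcha text:
In [ ]:
# Split each DataFrame into input filenames and target labels
train_files, y_train = X_train['filename'].values, X_train['label'].values
val_files, y_val = X_val['filename'].values, X_val['label'].values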
TRAINING PHASE 🏋️¶
We will use PyTesseract, an Optical Character Recognition (OCR) library, to recognize the characters in the test captchas directly and make a submission in this notebook. But first, let's see how it performs on the train set.
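Out of the box, tesseract assumes a full page of text, while each captcha here is a single line. Constraining the page segmentation mode is one knob worth experimenting with; a minimal sketch (whether `--psm 7`, tesseract's single-text-line mode, actually helps on this dataset is something to verify):
In [ ]:
# Illustrative: run tesseract in single-line mode on one training image
sample_path = train_images_path + train_info['filename'][0]
print(repr(pytesseract.image_to_string(Image.open(sample_path), config='--psm 7')))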
Using PyTesseract on Training Set¶
In [ ]:
labels = []
all_filenames = []
for index, rows in train_info.iterrows():
    i = rows['filename']
    img_path = train_images_path + i
    label = pytesseract.image_to_string(Image.open(img_path))
    # Removing garbage characters (form feed and newlines added by tesseract)
    label = label.replace("\x0c", "")
    label = label.replace("\n", "")
    labels.append(label)
    all_filenames.append(i)
    print(f'{index + 1}/{train_info.shape[0]}\r', end="")
labels = np.asarray(labels)
all_filenames = np.asarray(all_filenames)
submission = pd.DataFrame()
submission['filename'] = all_filenames
submission['label'] = labels
Evaluate the Performance¶
For evaluation, the mean normalised Levenshtein similarity score over all samples is used to measure the quality of the predictions.
In [ ]:
def cal_lshtein_score(s_true, s_pred):
    # Guard against NaN: when read back from a CSV, empty predictions
    # become float NaN, which would break the distance computation
    if isinstance(s_pred, float):
        return 0
    score = textdistance.levenshtein.normalized_similarity(s_true, s_pred)
    return score
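A quick sanity check of the metric: two five-character strings differing in one character have edit distance 1, so their normalised similarity is 1 - 1/5 = 0.8.
In [ ]:
print(cal_lshtein_score("hello", "hello"))  # perfect match -> 1.0
print(cal_lshtein_score("hello", "hallo"))  # one of five characters differs -> 0.8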
In [ ]:
lst_scores = []
for idx in range(0, len(train_info)):
    lst_scores.append(cal_lshtein_score(train_info['label'][idx], submission['label'][idx]))
mean_lst_score = np.mean(lst_scores)
print("The mean normalised Levenshtein similarity score is", mean_lst_score)
Using PyTesseract on Test Set¶
In [ ]:
labels = []
all_filenames = []
for index, rows in test_info.iterrows():
    i = rows['filename']
    img_path = test_images_path + i
    label = pytesseract.image_to_string(Image.open(img_path))
    # Removing garbage characters (form feed and newlines added by tesseract)
    label = label.replace("\x0c", "")
    label = label.replace("\n", "")
    labels.append(label)
    all_filenames.append(i)
    print(f'{index + 1}/{test_info.shape[0]}\r', end="")
labels = np.asarray(labels)
all_filenames = np.asarray(all_filenames)
submission_df = pd.DataFrame()
submission_df['filename'] = all_filenames
submission_df['label'] = labels
Save the predictions to CSV¶
In [ ]:
submission_df.to_csv('submission.csv', index=False)
🚧 Note :¶
- Do take a look at the submission format.
- The submission file should contain a header.
- Follow all submission guidelines strictly to avoid inconvenience.
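One way to double-check the file before submitting (a minimal sketch, re-reading what was just written):
In [ ]:
# Confirm the header and row count of the saved submission
check = pd.read_csv('submission.csv')
print(check.columns.tolist())  # expect ['filename', 'label']
print(len(check), "rows")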
To download the generated CSV in Colab, run the cell below¶
In [ ]:
try:
    from google.colab import files
    files.download('submission.csv')
except ImportError:
    print("Option only available in Google Colab")