Getting Started Code for CPTCHA Challenge on AIcrowd¶
Author : Sanjay Pokkali¶
Download Necessary Packages 📚¶
In [ ]:
!apt update
!apt install -y tesseract-ocr
!apt install -y libtesseract-dev
In [ ]:
!pip install numpy
!pip install pandas
!pip install pytesseract
!pip install scikit-learn
!pip install textdistance
Download Data¶
The first step is to download our train and test data. We will train a model on the train data, make predictions on the test data, and submit those predictions.
In [ ]:
!rm -rf data
!mkdir data
!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/train.tar.gz
!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/test.tar.gz
!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/train_info.csv
!wget https://datasets.aicrowd.com/default/aicrowd-practice-challenges/public/cptcha/v0.1/test_info.csv
!mkdir data/train
!mkdir data/test
!tar -C data/ -xvzf train.tar.gz
!tar -C data/ -xvzf test.tar.gz
!mv train_info.csv data/train_info.csv
!mv test_info.csv data/test_info.csv
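As a quick sanity check (a minimal sketch, assuming the directory layout created by the commands above), we can count the extracted images:
In [ ]:
import os
# Count the extracted images; both folders were created by the commands above
print(len(os.listdir("data/train")), "training images")
print(len(os.listdir("data/test")), "test images")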
Import packages¶
In [ ]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from PIL import Image
import pytesseract
import os
import cv2
import matplotlib.pyplot as plt
import textdistance
%matplotlib inline
In [ ]:
train_info_path = "data/train_info.csv"
test_info_path = "data/test_info.csv"
train_images_path = "data/train/"
test_images_path = "data/test/"
train_info = pd.read_csv(train_info_path)
test_info = pd.read_csv(test_info_path)
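It is worth a quick look at the metadata before moving on. A minimal sketch; the `filename` and `label` columns are the ones used throughout this notebook:
In [ ]:
# Peek at the first few rows and the dataset sizes
print(train_info.head())
print("Train samples:", train_info.shape[0], "| Test samples:", test_info.shape[0])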
Visualize the images👀¶
In [ ]:
def plot_image(img_path):
    # cv2 loads images in BGR order; convert to RGB so matplotlib shows true colours
    img = cv2.cvtColor(cv2.imread(img_path), cv2.COLOR_BGR2RGB)
    # print("Shape of the captcha ", img.shape)
    plt.imshow(img)
In [ ]:
fig = plt.figure(figsize=(20, 20))
columns = 3
rows = 3
for i in range(1, columns * rows + 1):
    img = train_images_path + train_info['filename'][i]
    fig.add_subplot(rows, columns, i)
    plot_image(img)
plt.show()
Split Data into Train and Validation 🔪¶
- The next step is to decide how we will measure our model's performance. We cannot use the given test data for this, since it does not contain labels for us to verify against.
- The workaround is to split the given training data into training and validation sets. A validation set gives us an idea of how our model will perform on unseen data: we hold back a chunk of the data while training and use it purely for testing. It is also a standard way to fine-tune hyperparameters.
- There are multiple ways to split a dataset into training and validation sets; two popular ones are k-fold cross-validation and leave-one-out (see the sketch after this list). 🧐
- Validation sets also help you catch your model overfitting the training data.
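As a sketch of the k-fold idea mentioned above (the fold count of 5 is an arbitrary illustrative choice), scikit-learn's KFold generates the index splits:
In [ ]:
from sklearn.model_selection import KFold

# Illustrative only: each iteration yields disjoint train/validation index arrays
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(train_info)):
    print(f"Fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")
For this notebook, though, we will stick with a single hold-out split: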
In [ ]:
X_train, X_val = train_test_split(train_info, test_size=0.2, random_state=42)
- We have decided to split the data 80% training / 20% validation.
- To learn more, see the scikit-learn documentation for the train_test_split function. 🧐
- This is, of course, the simplest way to validate your model: take a random chunk of the train set and set it aside solely for testing the trained model on unseen data. As mentioned in the previous block, you can experiment 🔬 with more sophisticated techniques to make your model better.
- Now that we have our data split into train and validation sets, we need to separate the corresponding labels from the data (one way to do this is sketched below).
- With this step done, we are all set to move on with a prepared dataset.
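One minimal way to do that separation, assuming the `label` column of the metadata holds the ground-truth captcha text:
In [ ]:
# Split each DataFrame into input filenames and target labels
train_files, y_train = X_train['filename'].values, X_train['label'].values
val_files, y_val = X_val['filename'].values, X_val['label'].values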
TRAINING PHASE 🏋️¶
We will use PyTesseract, an Optical Character Recognition (OCR) library, to recognize the characters in the test captchas directly and make a submission in this notebook. But first, let's see how it performs on the train set.
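Out of the box, tesseract assumes a full page of text, while each captcha here is a single line. Constraining the page segmentation mode is one knob worth experimenting with; a minimal sketch (whether `--psm 7`, tesseract's single-text-line mode, actually helps on this dataset is something to verify):
In [ ]:
# Illustrative: run tesseract in single-line mode on one training image
sample_path = train_images_path + train_info['filename'][0]
print(repr(pytesseract.image_to_string(Image.open(sample_path), config='--psm 7')))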
Using PyTesseract on Training Set¶
In [ ]:
labels = []
all_filenames = []
for index, rows in train_info.iterrows():
    i = rows['filename']
    img_path = train_images_path + i
    label = pytesseract.image_to_string(Image.open(img_path))
    # Removing garbage characters (form feed and newlines added by tesseract)
    label = label.replace("\x0c", "")
    label = label.replace("\n", "")
    labels.append(label)
    all_filenames.append(i)
    print(f'{index + 1}/{train_info.shape[0]}\r', end="")
labels = np.asarray(labels)
all_filenames = np.asarray(all_filenames)
submission = pd.DataFrame()
submission['filename'] = all_filenames
submission['label'] = labels
Evaluate the Performance¶
For evaluation, the mean normalised Levenshtein similarity score over all samples is used to measure the quality of the predictions.
In [ ]:
def cal_lshtein_score(s_true, s_pred):
    # Guard against NaN: when read back from a CSV, empty predictions
    # become float NaN, which would break the distance computation
    if isinstance(s_pred, float):
        return 0
    score = textdistance.levenshtein.normalized_similarity(s_true, s_pred)
    return score
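A quick sanity check of the metric: two five-character strings differing in one character have edit distance 1, so their normalised similarity is 1 - 1/5 = 0.8.
In [ ]:
print(cal_lshtein_score("hello", "hello"))  # perfect match -> 1.0
print(cal_lshtein_score("hello", "hallo"))  # one of five characters differs -> 0.8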
In [ ]:
lst_scores = []
for idx in range(0, len(train_info)):
    lst_scores.append(cal_lshtein_score(train_info['label'][idx], submission['label'][idx]))
mean_lst_score = np.mean(lst_scores)
print("The mean normalised Levenshtein similarity score is", mean_lst_score)
Using PyTesseract on Test Set¶
In [ ]:
labels = []
all_filenames = []
for index, rows in test_info.iterrows():
    i = rows['filename']
    img_path = test_images_path + i
    label = pytesseract.image_to_string(Image.open(img_path))
    # Removing garbage characters (form feed and newlines added by tesseract)
    label = label.replace("\x0c", "")
    label = label.replace("\n", "")
    labels.append(label)
    all_filenames.append(i)
    print(f'{index + 1}/{test_info.shape[0]}\r', end="")
labels = np.asarray(labels)
all_filenames = np.asarray(all_filenames)
submission_df = pd.DataFrame()
submission_df['filename'] = all_filenames
submission_df['label'] = labels
Save the predictions to CSV¶
In [ ]:
submission_df.to_csv('submission.csv', index=False)
🚧 Note :¶
- Do take a look at the submission format.
- The submission file should contain a header.
- Follow all submission guidelines strictly to avoid inconvenience.
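One way to double-check the file before submitting (a minimal sketch, re-reading what was just written):
In [ ]:
# Confirm the header and row count of the saved submission
check = pd.read_csv('submission.csv')
print(check.columns.tolist())  # expect ['filename', 'label']
print(len(check), "rows")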
To download the generated CSV in Colab, run the cell below¶
In [ ]:
try:
    from google.colab import files
    files.download('submission.csv')
except ImportError:
    print("Option only available in Google Colab")