This is my first blog post. I would like to start by stating my motivation for writing it. The main reason is that people I respect in the deep learning (DL) community have all advocated for blogging as part of the learning process. I am hoping to articulate my learning and understanding through these posts, and I am open to anyone correcting my understanding as well as adding to it. My blog will mostly be about DL and fastai. One of my goals this year is to be active in the fastai community so I can learn from the amazing people and the conversations that take place there.

In this blog, we will go through methods/techniques that help in image classification tasks. These are techniques I have been learning, and as much as I can I will reference where I learnt them so anyone can go to the source. The examples/code use the fastai library.

Image classification is possibly the first task one encounters when learning DL. It is a computer vision task where a model assigns a class to an image. For example, a cat-or-dog classifier predicts whether an image contains a cat or a dog.

The types of image classification tasks

  1. binary image classification - a task in which the model has to predict between two classes (eg. cat or dog)
  2. multi-class image classification - a task in which the model has to predict between n classes (eg. cat, dog, horse or bear)
  3. multi-label image classification - a task in which the model has to predict between n classes, and each prediction can contain one or more labels (eg. cat and dog)

Throughout this blog we will make use of the Plant Pathology dataset from Kaggle to understand how the different techniques can be applied.

So first, let's understand our dataset.

from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
!pip uninstall fastai -q -y
!pip install fastai --upgrade -q
from fastai.vision.all import *
from sklearn.model_selection import StratifiedKFold
SEED=101
set_seed(SEED)
path = Path('/content/drive/MyDrive/colab_notebooks/fastai/plant_pathology/data')

train = pd.read_csv(path/'train.csv')
train.head(3)
  image_id  healthy  multiple_diseases  rust  scab
0  Train_0        0                  0     0     1
1  Train_1        0                  1     0     0
2  Train_2        1                  0     0     0

Let's look at the data. As can be seen above, our train.csv contains the image_id and the labels. There are four classes - healthy, multiple_diseases, rust and scab.

train['labels'] = train.iloc[:, 1:].idxmax(1)
train['labels'].value_counts(), len(train)
(rust                 622
 scab                 592
 healthy              516
 multiple_diseases     91
 Name: labels, dtype: int64, 1821)

In total, there are 1,821 training images. Except for the multiple_diseases class, all classes have a similar number of training examples. One of the problems with this dataset is the relatively low number of multiple_diseases examples. Later, we will see how we can use oversampling to help with this.

Now let's start with the first technique.

1. k-fold cross-validation

Oftentimes, training data is scarce and you might want to use all of it for training. However, with a simple train-validation split, some percentage of the data is held out for validation and becomes unavailable for training the model. This is where k-fold cross-validation is useful. How does it work?

  1. Split the data into k folds
  2. Train k models, using a different fold as the validation set each time
  3. During inference, make predictions with all k models and average the results

This way, not only do we ensemble k models during inference, we also use all the data in the training process.

Let's see how this is done.

N_FOLDS = 3
train['fold'] = -1

strat_kfold = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)  #random_state for reproducibility
for i, (_, test_index) in enumerate(strat_kfold.split(train.image_id.values, train['labels'].values)):
    train.iloc[test_index, -1] = i
    
train['fold'] = train['fold'].astype('int')
train.head(5)
  image_id  healthy  multiple_diseases  rust  scab             labels  fold
0  Train_0        0                  0     0     1               scab     2
1  Train_1        0                  1     0     0  multiple_diseases     2
2  Train_2        1                  0     0     0            healthy     0
3  Train_3        0                  0     1     0               rust     1
4  Train_4        1                  0     0     0            healthy     2

We have 3 folds (i.e. 3 different validation sets).

train['fold'].value_counts()
2    607
1    607
0    607
Name: fold, dtype: int64
train.groupby(['fold', 'labels']).size()
fold  labels           
0     healthy              172
      multiple_diseases     30
      rust                 207
      scab                 198
1     healthy              172
      multiple_diseases     31
      rust                 207
      scab                 197
2     healthy              172
      multiple_diseases     30
      rust                 208
      scab                 197
dtype: int64

So we have created three validation sets, each with 607 samples. And because we used stratified k-fold, the class distribution in each validation set is about the same. Now we are ready to train our k models.
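The training itself is not shown here, but a minimal sketch of the train-and-ensemble loop could look like the code below. The get_dls function is defined later in this post; resnet18 and 5 epochs are arbitrary choices, and test_items is a hypothetical stand-in for the list of test images.

#hypothetical sketch of the k-fold train-and-ensemble loop
all_preds = []
for fold in range(N_FOLDS):
    dls = get_dls(fold, train, 224)               #this fold is the validation set
    learn = cnn_learner(dls, resnet18, metrics=accuracy)
    learn.fine_tune(5)                            #train this fold's model
    test_dl = dls.test_dl(test_items)             #same transforms as training
    preds, _ = learn.get_preds(dl=test_dl)
    all_preds.append(preds)
ensemble_preds = torch.stack(all_preds).mean(0)   #average the k models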

2. Oversampling and undersampling

As we saw earlier, the multiple_diseases class has only 91 samples, while the other classes average around 500+. This might disadvantage the multiple_diseases class, as the model might learn to predict it less often to improve the metric.

In such scenarios, oversampling can be used. Oversampling is nothing but duplicating the training examples of a class to increase its count.

Let's see how this is done.

def oversampling(df, fold, col2os='multiple_diseases', n_copies=3):

    train_df_no_val = df[df['fold'] != fold]     #training set
    train_df_just_val = df[df['fold'] == fold]   #validation set

    #we only want to oversample the multiple_diseases class in the training set,
    #so the validation rows are concatenated back unchanged
    train_df_bal = pd.concat(
                    [train_df_no_val[train_df_no_val['labels'] != col2os], train_df_just_val] +
                    [train_df_no_val[train_df_no_val['labels'] == col2os]] * n_copies
                    ).sample(frac=1.0, random_state=SEED).reset_index(drop=True)

    return train_df_bal
train_os = oversampling(train, 0)

train_os, where we have oversampled the multiple_diseases class, has more data.

len(train), len(train_os)
(1821, 1943)
train_fold0 = train[train['fold'] != 0]
train_os_fold0 = train_os[train_os['fold'] != 0]
(print('train without oversampling', '\n\n', 
       train_fold0['labels'].value_counts(), 
       '\n\n', 'train with oversampling', '\n\n', 
       train_os_fold0['labels'].value_counts(), sep=""))
train without oversampling

rust                 415
scab                 394
healthy              344
multiple_diseases     61
Name: labels, dtype: int64

train with oversampling

rust                 415
scab                 394
healthy              344
multiple_diseases    183
Name: labels, dtype: int64

As we can see, we have 3x the multiple_diseases samples after oversampling, while the counts of the other classes stay the same. Oversampling, as well as its counterpart undersampling, can be useful for balancing the class sizes in a dataset. This allows the model to be trained with less bias towards any particular class.
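Undersampling is not used in this post, but for completeness, a minimal sketch might look like this: drop rows from the larger classes so every class is capped at a chosen size. The function name, the cap, and the use of sample are my own illustrative choices.

#hypothetical sketch of undersampling: cap every class at n_max rows
#by randomly dropping examples from the larger classes
def undersampling(df, n_max, label_col='labels'):
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(min(len(g), n_max), random_state=SEED))
              .reset_index(drop=True))

train_us = undersampling(train, n_max=91)   #cap at the multiple_diseases count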

3. Techniques from fastbook Chapter 7

Fastbook is an amazing resource for learning DL and it is my go-to resource. Chapter 7 introduces advanced techniques for training an image classification model. Let's see what these techniques are.

  1. Normalization
  2. Data augmentation including MixUp (CutMix)
  3. Progressive resizing
  4. Test time augmentation

Normalization

We know that having the mean and std of our input data around 0 and 1 respectively helps the model train more efficiently and helps generalization. Hence, normalization is almost a default technique these days.

Generally, when we train an image classifier we start with transfer learning. The pretrained models have typically been trained on the ImageNet dataset. Hence, we normalize our data using the mean and std of ImageNet.

If we are training from scratch, it is recommended to calculate the mean and std of our own dataset for the 3 channels and use those to normalize the data. Also, during inference, the test data should be normalized using the same stats that were used during training.

Doing this in fastai is very easy. Let's take a look.

Let's write a function to make our dataloader

def get_dls(fold, df, img_sz=224):

  datablock = DataBlock(
              blocks=(ImageBlock, CategoryBlock()),
              getters=[
              ColReader('image_id', pref=path/'images', suff='.jpg'),
              ColReader('labels')
              ],
              splitter=IndexSplitter(df.loc[df.fold==fold].index),
              item_tfms=Resize(img_sz),
              )
  
  return datablock.dataloaders(source=df, bs=32)
dls = get_dls(0, train_os)
x, y = dls.one_batch()
x.mean(dim=[0,2,3]),x.std(dim=[0,2,3])
(TensorImage([0.3879, 0.5049, 0.2897], device='cuda:0'),
 TensorImage([0.1893, 0.1820, 0.1703], device='cuda:0'))

Our mean and std are nowhere near 0 and 1.

Below is how we could calculate the mean and std of our dataset.

Note: this only uses the training set. A more rigorous approach would be to use all the images to calculate the mean and std.

m, s = np.zeros(3), np.zeros(3)
count = 0
for x, y in dls.train:                       #iterate over training batches
    m += x.mean(dim=[0,2,3]).cpu().numpy()   #per-channel mean of this batch
    s += x.std(dim=[0,2,3]).cpu().numpy()    #per-channel std of this batch
    count += 1
m/count, s/count
(array([0.39644337, 0.51481164, 0.30797708]),
 array([0.19386025, 0.17980762, 0.17882748]))

Let's modify our get_dls function slightly by adding the batch_tfms argument. We pass Normalize, using imagenet_stats, to normalize the data. Let's first see what imagenet_stats is, and then see how this changes our mean and std.

def get_dls(fold, df, img_sz):

  datablock = DataBlock(
              blocks=(ImageBlock, CategoryBlock()),
              getters=[
              ColReader('image_id', pref=path/'images', suff='.jpg'),
              ColReader('labels')
              ],
              splitter=IndexSplitter(df.loc[df.fold==fold].index),
              item_tfms=Resize(img_sz),
              batch_tfms=[Normalize.from_stats(*imagenet_stats)]
              )
  
  return datablock.dataloaders(source=df, bs=32)

As can be seen below, imagenet_stats is a tuple containing the mean and std for the three channels. If you have the stats for your own dataset, you can pass them similarly to normalize the data.

imagenet_stats
([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
dls = get_dls(0, train_os, 224)
x, y = dls.one_batch()
x.mean(dim=[0,2,3]),x.std(dim=[0,2,3])
(TensorImage([-0.3794,  0.2188, -0.4551], device='cuda:0'),
 TensorImage([0.9271, 0.8399, 0.8564], device='cuda:0'))

The mean and std after normalization are much closer to 0 and 1.

Data augmentation, MixUp and CutMix

Data augmentation is a well-known technique for improving image classification. Fastai provides many data augmentation tools, and they can be easily applied while creating the dataloaders, just like we did with normalization earlier. Data augmentation can be passed to either the item_tfms or the batch_tfms argument of the DataBlock. The difference is that the former runs on the CPU, one item at a time, while the latter runs on the GPU, a batch at a time. Hence, batch_tfms is the preferred way to carry out most of the augmentation.

Data augmentation allows us to enlarge our dataset without collecting new data: it synthetically manipulates existing images to create new training examples.

Let's use this image to see some examples of the many data augmentations that come with fastai.

img = PILImage(PILImage.create((path/'images').ls()[SEED]).resize((600,400)))
img

FlipItem flips the image with the given probability p.

_,axs = subplots(2, 4)
for ax in axs.flatten():
    show_image(FlipItem(p=0.5)(img, split_idx=0), ctx=ax)

Another transform is DihedralItem; let's see what it does.

_,axs = subplots(2, 4)
for ax in axs.flatten():
    show_image(DihedralItem(p=1.)(img, split_idx=0), ctx=ax)

RandomCrop crops the image to the given size.

_,axs = subplots(2, 4)
for ax in axs.flatten():
    show_image(RandomCrop(224)(img, split_idx=0), ctx=ax)

And aug_transforms, which is a "Utility func to easily create a list of flip, rotate, zoom, warp, lighting transforms."

timg = TensorImage(array(img)).permute(2,0,1).float()/255.
def _batch_ex(bs): return TensorImage(timg[None].expand(bs, *timg.shape).clone())

Each output image is different, even though the input image was the same.

tfms = aug_transforms(pad_mode='zeros', mult=2, min_scale=0.5)
y = _batch_ex(9)
for t in tfms: y = t(y, split_idx=0)
_,axs = plt.subplots(2,3, figsize=(12,10))
for i,ax in enumerate(axs.flatten()): show_image(y[i], ctx=ax)

MixUp

MixUp is a data augmentation technique that was introduced in 2018 in this paper. So what happens during a MixUp?

  1. for a training image image_1, a second image image_2 is randomly selected
  2. new_image is created following this formula, where alpha is a constant between 0.0 and 1.0 that is used to mix the two images:
     new_image = alpha * image_1 + (1-alpha) * image_2
  3. similarly, the targets of image_1 and image_2 are blended to create new_target:
     new_target = alpha * target_1 + (1-alpha) * target_2

For example, let's assume we are training a 4-class model, the one-hot encoding for image_1 is [0., 0., 1., 0.] and for image_2 is [0., 0., 0., 1.], and alpha is 0.3. The target for our new_image is [0., 0., 0.3, 0.7]:

new_target = 0.3 * [0., 0., 1., 0.]  + (1-0.3) * [0., 0., 0., 1.]

  4. With this, we now have a completely new image for training.
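As a quick check of the arithmetic, here is the target blend as runnable code (using numpy arrays, since plain Python lists cannot be scaled by a float):

alpha = 0.3
target_1 = np.array([0., 0., 1., 0.])        #one-hot target of image_1
target_2 = np.array([0., 0., 0., 1.])        #one-hot target of image_2
alpha * target_1 + (1 - alpha) * target_2
array([0. , 0. , 0.3, 0.7])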

Let's use the code from the fastai docs to see how our MixUp images look.

mixup = MixUp(1.)

with Learner(dls, nn.Linear(3,4), loss_func=CrossEntropyLossFlat(), cbs=mixup) as learn:
    learn.epoch,learn.training = 0,True
    learn.dl = dls.train
    b = dls.one_batch()
    learn._split(b)
    learn('before_batch')

_,axs = plt.subplots(3,3, figsize=(9,9))
dls.show_batch(b=(mixup.x,mixup.y), ctxs=axs.flatten())

As can be seen, some of the images above look a bit smeared/odd - that is MixUp at work. Using MixUp with fastai is relatively easy: it is passed as a callback when we create a Learner. It can also be passed via cbs in fit_one_cycle.

learn.fit_one_cycle(3, cbs=MixUp(1.0))

Note: the alpha we pass when creating MixUp parameterizes a Beta distribution, from which one mixing coefficient per image in the batch is sampled, so the mix varies from one image to another. Below is an example of sampling from that distribution.

torch.distributions.beta.Beta(tensor(1.), tensor(1.)).sample((10,))
tensor([0.4794, 0.3758, 0.1914, 0.6586, 0.6198, 0.5889, 0.1123, 0.9081, 0.2395,
        0.4103])

CutMix

Although CutMix is not covered in the book, it has been added to the fastai library. CutMix is similar to MixUp, but instead of blending two images together, it crops a portion of image_2 and pastes it onto image_1. In the paper, CutMix was shown to work better than MixUp.


Source: CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

Let's see some examples of CutMix in action

cutmix = CutMix(1.)
with Learner(dls, nn.Linear(3,4), loss_func=CrossEntropyLossFlat(), cbs=cutmix) as learn:
    learn.epoch,learn.training = 0,True
    learn.dl = dls.train
    b = dls.one_batch()
    learn._split(b)
    learn('before_batch')

_,axs = plt.subplots(3,3, figsize=(9,9))
dls.show_batch(b=(cutmix.x,cutmix.y), ctxs=axs.flatten())

Progressive resizing

As stated in the book, progressive resizing means gradually using larger and larger images as training proceeds. The technique is akin to transfer learning: the model first learns on small images, and as we increase the image size it carries forward what it learnt previously while picking up additional detail from the larger images.
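A minimal sketch of progressive resizing using our get_dls function, similar to how the book does it (the sizes and epoch counts below are arbitrary choices):

#sketch of progressive resizing: train on small images, then swap in
#larger images and keep training the same model
learn = cnn_learner(get_dls(0, train_os, 128), resnet18, metrics=accuracy)
learn.fine_tune(4)                        #learn on 128px images first

learn.dls = get_dls(0, train_os, 224)     #swap in larger 224px images
learn.fine_tune(4)                        #continue training at 224px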

Test time augmentation

As taken from the fastbook, "test time augmentation (TTA): During inference or validation, creating multiple versions of each image, using data augmentation, and then taking the average or maximum of the predictions for each augmented version of the image."
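In fastai, TTA is a one-liner. A minimal sketch, assuming a trained learn (by default, tta predicts on the validation set, averaging over several augmented versions of each image):

preds, targs = learn.tta()    #augmented predictions on the validation set
accuracy(preds, targs)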

Other things

Different architectures

Generally, varying sizes of ResNet are the first models to try to establish a baseline. After that, one could explore other architectures such as EfficientNet or the recently released Vision Transformer.

Transfer Learning

In most cases, transfer learning works really well hence it could be the first thing to try for most classification tasks.
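As a quick illustration, a transfer-learning baseline in fastai takes only a couple of lines (resnet34 and 5 epochs are arbitrary choices here):

#cnn_learner downloads ImageNet-pretrained weights by default;
#fine_tune trains the new head first, then unfreezes the whole network
learn = cnn_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(5)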

4. Techniques I learned from Zach's walkwithfastai imagewoof lecture

I highly recommend the walkwithfastai course; it is also my go-to resource for fastai. In this particular notebook, Zach introduces different techniques that seem to work really well for image classification tasks. Please check the notebook for references and details. Zach also has a lecture using this notebook here.

The notebook introduces the following techniques; a sketch combining several of them follows the list.

  1. xresnet - an architecture based on the "Bag of Tricks for Image Classification with Convolutional Neural Networks" paper
  2. Mish - a new activation function
  3. ranger - a new optimizer that combines RAdam and Lookahead
  4. Self-attention
  5. MaxBlurPool
  6. a different LR scheduler that uses flat + anneal scheduling
  7. Label Smoothing Cross Entropy
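As a rough illustration, here is how several of these pieces can be combined in fastai. This is a minimal sketch, not Zach's exact code: MaxBlurPool comes from the walkwithfastai code rather than core fastai, so it is left out, and the learning rate and epoch count are arbitrary.

#sketch: xresnet50 with Mish activation and self-attention, the ranger
#optimizer, label smoothing loss, and a flat-then-anneal LR schedule
model = xresnet50(n_out=dls.c, act_cls=Mish, sa=True, pretrained=False)
learn = Learner(dls, model,
                opt_func=ranger,                        #RAdam + Lookahead
                loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)
learn.fit_flat_cos(5, lr=4e-3)                          #flat, then cosine anneal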

5. Softlabeling and progressive label correction

Softlabeling

I came across softlabeling through Isaac Flath's amazing blog. I think the blog is the best place to get started on softlabeling.

In supervised learning, the labels, which are created by humans, can be erroneous, which makes them 'noisy'. This was the case in the Plant Pathology competition, and the winner used a similar method (softlabeling) in the winning solution.

How do we deal with such noisy labels? One way is to punish the model less for incorrectly predicting a potentially noisy label. The steps are as follows.

  1. Create a k-fold split
  2. Train k classifiers, each on a different combination of folds, for sufficient epochs using the noisy labels
  3. Use each classifier to predict on its held-out fold and save the predictions
  4. Upon completing the above, you will have two labels for every example - the noisy label that came with the data, label_ori, and the one predicted by the classifiers, label_pred
  5. Finally, train your actual classifier, and this time, when label_ori and label_pred differ, blend the labels using a hyperparameter a

For example, let's assume we are training a 4-class model, the one-hot encoding for label_ori is [0., 0., 1., 0.] and for label_pred is [0., 0., 0., 1.], and a is 0.5. Our new_label would be [0., 0., 0.5, 0.5]. This way, the model is punished less for predicting the 'wrong' class, since the disagreement could be due to a noisy label.

a = 0.5
label_ori = np.array([0., 0., 1., 0.])
label_pred = np.array([0., 0., 0., 1.])
new_label = a * label_ori + (1 - a) * label_pred
new_label                                 #array([0. , 0. , 0.5, 0.5])
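Collecting label_pred is essentially the k-fold loop from earlier, with each model's out-of-fold predictions written back to the dataframe. A hypothetical sketch (the architecture and epoch count are again arbitrary choices):

#hypothetical sketch: gather out-of-fold predictions to use as label_pred
train['label_pred'] = ''
for fold in range(N_FOLDS):
    dls = get_dls(fold, train, 224)
    learn = cnn_learner(dls, resnet18, metrics=accuracy)
    learn.fine_tune(5)
    preds, _ = learn.get_preds()              #predictions on the held-out fold
    val_idx = train.index[train['fold'] == fold]
    train.loc[val_idx, 'label_pred'] = [dls.vocab[int(i)] for i in preds.argmax(-1)]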

Progressive Label Correction

I came across this technique in this wonderful Kaggle Notebook by Kerem Turgutlu. It is an implementation of this paper.

Again, this method works in cases where there are noisy labels in the dataset. This is my understanding of how it is implemented.

  1. During model training, we let the model train normally for a warm_up period. In the above implementation, the warm_up period was 20% of the total iterations.
  2. Once training goes past the warm_up iterations, Progressive Label Correction (PLC) kicks in
  3. In PLC, after each iteration, the mislabeled indexes are identified
  4. Then, for those mislabeled indexes, we calculate the probability of the max predicted class (predicted_probas - the class the model predicted) as well as the probability of the actual target class (actual_probas - the class the model should have predicted)
  5. we check if the absolute difference between predicted_probas and actual_probas is above a threshold theta (a hyperparameter we set)
  6. if the difference is higher, then for those mislabeled indexes we change ('correct') the label y to the one predicted by the model
  7. we repeat steps 3 to 6 while progressively lowering theta using a scheduler function (a linear scheduler was used in the above notebook).

Let's take a look at what the ProgressiveLabelCorrection callback in the notebook does.

dls = get_dls(0, train_os, 128)
learn = cnn_learner(dls, resnet18, pretrained=True)

Let's assume our learner has been trained for the warm_up iterations, and see how PLC is applied using this one batch.

learn.fit_one_cycle(1)
learn.one_batch(5, learn.dls.one_batch())
epoch train_loss valid_loss time
0 1.573150 1.102266 02:34

Here, after an iteration, we compare the predicted class to the target (y) to get the mislabelled indexes.

preds_max = learn.pred.argmax(-1)
mislabeled_idxs = preds_max != learn.y

#so we have 32 samples in each iteration (which is the batch_size), of which 8 are mislabelled
mislabeled_idxs, len(mislabeled_idxs), mislabeled_idxs.float().sum()
(TensorCategory([False,  True, False,  True, False,  True, False, False, False, False,
          True, False, False, False, False, False, False, False,  True, False,
         False, False, False, False, False,  True, False,  True,  True, False,
         False, False], device='cuda:0'),
 32,
 TensorCategory(8., device='cuda:0'))

Then we index into the mislabelled items and calculate their class probabilities.

We also index into the mislabelled targets.

mislabeled_probas = learn.pred[mislabeled_idxs].softmax(-1)
mislabeled_targs = learn.y[mislabeled_idxs]

Then we pick the probability of the predicted class

predicted_probas = mislabeled_probas.max(-1).values
predicted_probas
tensor([0.7182, 0.6797, 0.6050, 0.6181, 0.4840, 0.4588, 0.5954, 0.7079],
       device='cuda:0', grad_fn=<MaxBackward0>)

We also store the predicted class of the mislabelled items.

predicted_targs = mislabeled_probas.max(-1).indices
predicted_targs
tensor([2, 1, 0, 3, 0, 0, 1, 2], device='cuda:0')

Here we pick the probability the model assigned to the actual target class of each mislabelled item.

eye = torch.eye(dls.c).to('cuda')
actual_probas = mislabeled_probas[eye[mislabeled_targs].bool()]
actual_probas
tensor([0.1144, 0.2731, 0.2278, 0.2835, 0.3380, 0.2353, 0.3692, 0.1910],
       device='cuda:0', grad_fn=<IndexBackward>)

This is an important step: we check whether the absolute difference between predicted_probas and actual_probas is above the hyperparameter theta.

theta = 0.3 
msk = torch.abs(predicted_probas - actual_probas) > theta
#there are 5 items that meet the condition
msk
tensor([ True,  True,  True,  True, False, False, False,  True],
       device='cuda:0')

We now gather the new targets predicted by the model for the 5 items that meet the condition.

new_targs = learn.dls.tfms[1][1].vocab[predicted_targs[msk]]
new_targs
(#5) ['rust','multiple_diseases','healthy','scab','rust']

Now that we have the new_targs, we update the labels for these indexes in the training set. The theta used is progressively lowered, so as training progresses we 'correct' targets even when the difference between the probability of the predicted class and the probability of the actual target class is small. In other words, as training progresses we increasingly trust the model's predictions over the labels that came with the data.
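For the theta schedule, here is a minimal sketch of a linear scheduler (the function name and the start/end values are illustrative assumptions, not the notebook's exact numbers):

#hypothetical linear schedule: theta decays from theta_start to theta_end
#as training progresses, so label corrections become more aggressive over time
def linear_theta(iteration, total_iters, theta_start=0.3, theta_end=0.05):
    frac = min(iteration / total_iters, 1.0)
    return theta_start + frac * (theta_end - theta_start)

linear_theta(500, 1000)    #halfway through training -> 0.175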

That's the end of this post. Please feel free to reach out to me at @arshyma (Twitter) or marshath@gmail.com. Thank you :)