Image Classification Techniques
Techniques that can be used to improve image classification:
- 1. k-fold cross-validation
- 2. Oversampling and undersampling
- 3. Techniques from fastbook Chapter 7
- 4. Techniques I learned from Zach's walkwithfastai imagewoof lecture
- 5. Soft labeling and progressive label correction
This is going to be my first blog. I would like to start by stating my motivation for starting it. The main reason is that people I respect in the deep learning (DL) community have all advocated for blogging as part of the learning process. Hence, I am hoping to articulate my learning and understanding through these blogs. It also means I am open to anyone correcting my understanding as well as adding to it. My blog is mostly going to be about DL and fastai. One of my goals this year is to be around the fastai community so I can learn from the amazing people and the conversations that take place there.
In this blog, we will go through methods/techniques that help in image classification tasks. These are techniques I have been learning, and wherever I can I will reference where I learnt them from so anyone can go to the source. The code examples use the fastai library.
Image classification is possibly the first task one would encounter when learning DL. Image classification is a computer vision task where a model classifies an image. For example, a cat or dog classifier classifies whether an image is a cat or a dog.
The types of image classification tasks:
- binary image classification - a task in which the model has to predict between two classes (eg. cat or dog)
- multi-class image classification - a classification task in which the model has to predict between n-classes (eg. cat, dog, horse or bear)
- multi-label image classification - a classification task in which the model predicts between n classes and each image can carry one or more labels (eg. cat and dog)
Throughout this blog we will make use of the Plant Pathology dataset from Kaggle to understand how the different techniques can be applied.
So first, let's understand our dataset.
from google.colab import drive
drive.mount('/content/drive')
!pip uninstall fastai -q -y
!pip install fastai --upgrade -q
from fastai.vision.all import *
from sklearn.model_selection import StratifiedKFold
SEED=101
set_seed(SEED)
path = Path('/content/drive/MyDrive/colab_notebooks/fastai/plant_pathology/data')
train = pd.read_csv(path/'train.csv')
train.head(3)
Let's look at the data. As can be seen above, our train.csv contains the image_id and the labels. There are four classes - healthy, multiple_diseases, rust and scab.
train['labels'] = train.iloc[:, 1:].idxmax(1)
train['labels'].value_counts(), len(train)
In total, there are 1,821 train images. Except for the multiple_diseases class, all classes have a similar number of training examples. One of the problems with this dataset is the relatively low number of multiple_diseases examples. Later, we will see how we can use oversampling to help with this.
Now let's start with the first technique.
1. k-fold cross-validation
Oftentimes, training data is scarce and you want to use all of it for training, but in a simple train-validation split some percentage of the data is held out for validation, and that data becomes unavailable for training our model. This is where k-fold cross-validation can be useful. How does it work?
- Create k folds of validation data
- Train k models, using a different validation set each time
- During inference, make predictions with all k models and average the results
This way, not only do we ensemble k models during inference, we also use all of the data in the training process.
Let's see how this is done.
N_FOLDS = 3
train['fold'] = -1
strat_kfold = StratifiedKFold(n_splits=N_FOLDS, shuffle=True)
for i, (_, test_index) in enumerate(strat_kfold.split(train.image_id.values, train['labels'].values)):
train.iloc[test_index, -1] = i
train['fold'] = train['fold'].astype('int')
train.head(5)
We have 3 folds (or 3 different validation sets)
train['fold'].value_counts()
train.groupby(['fold', 'labels']).size()
So we have created three validation sets, with each set having 607 samples. Also, because we used stratified k-fold splitting, the class distribution in each validation set is about the same. Now we are ready to proceed with training our k models.
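As a rough sketch of what training and ensembling the k models could look like: the architecture, epoch count and the test_items list below are illustrative assumptions, not code from this post (the later sections use a DataBlock-based get_dls instead).

# Illustrative sketch: train one model per fold and average their predictions.
# `test_items` is a hypothetical list of test image paths.
all_preds = []
for fold in range(N_FOLDS):
    df = train.copy()
    df['is_valid'] = df['fold'] == fold                       # hold out this fold for validation
    dls = ImageDataLoaders.from_df(df, path=path/'images', fn_col='image_id', suff='.jpg',
                                   label_col='labels', valid_col='is_valid',
                                   item_tfms=Resize(224), bs=32)
    learn = cnn_learner(dls, resnet34, metrics=accuracy)
    learn.fine_tune(5)
    preds, _ = learn.get_preds(dl=dls.test_dl(test_items))    # predict on the test set
    all_preds.append(preds)

final_preds = torch.stack(all_preds).mean(0)                  # average the k models' predictions

Each model trains on a different two-thirds of the data, so together the ensemble has effectively seen every sample.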
2. Oversampling and undersampling
As we saw earlier, the multiple_diseases class has only 90 samples, compared to the other classes which average around 500+. This might disadvantage the multiple_diseases class, as the model might learn to predict it less often to improve the metric.
In such scenarios, oversampling can be used. Oversampling is nothing but duplicating the training samples of a certain class to increase its count.
Let's see how this is done.
def oversampling(df, fold, col2os='multiple_diseases', n_times=3):
    train_df_no_val = df[df['fold'] != fold]    # training set
    train_df_just_val = df[df['fold'] == fold]  # validation set
    # we only want to oversample the multiple_diseases class in the training set,
    # not in the validation set
    train_df_bal = pd.concat(
        [train_df_no_val[train_df_no_val['labels'] != col2os], train_df_just_val] +
        [train_df_no_val[train_df_no_val['labels'] == col2os]] * n_times
    ).sample(frac=1.0, random_state=SEED).reset_index(drop=True)
    return train_df_bal
train_os = oversampling(train, 0)
We have more data in train_os, where we have oversampled the multiple_diseases class.
len(train), len(train_os)
train_fold0 = train[train['fold'] != 0]
train_os_fold0 = train_os[train_os['fold'] != 0]
(print('train without oversampling', '\n\n',
train_fold0['labels'].value_counts(),
'\n\n', 'train with oversampling', '\n\n',
train_os_fold0['labels'].value_counts(), sep=""))
As we can see, we have 3x the multiple_diseases samples after oversampling, while the sample counts of the other classes stay the same. Oversampling, as well as its counterpart undersampling, can be useful for balancing the sample sizes of the different classes in the dataset. This allows the model to be trained with less bias towards any one class.
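For completeness, here is a hedged sketch of the undersampling counterpart, analogous to the oversampling function above; the n_keep value and the idea of capping every class at a fixed count are illustrative assumptions, not something used later in this post.

# Illustrative sketch: randomly keep at most `n_keep` samples per class in the
# training folds, leaving the validation fold untouched.
def undersampling(df, fold, n_keep=200):
    train_part = df[df['fold'] != fold]                        # training set
    valid_part = df[df['fold'] == fold]                        # validation set
    train_part = (train_part.groupby('labels', group_keys=False)
                            .apply(lambda g: g.sample(min(len(g), n_keep), random_state=SEED)))
    return (pd.concat([train_part, valid_part])
              .sample(frac=1.0, random_state=SEED)             # shuffle
              .reset_index(drop=True))

train_us = undersampling(train, 0)
train_us[train_us['fold'] != 0]['labels'].value_counts()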
3. Techniques from fastbook Chapter 7
Fastbook is an amazing resource to learn DL and it is my go-to resource. In Chapter 7, advanced techniques for training an image classification model are introduced. Let's see what these techniques are.
- Normalization
- Data augmentation including MixUp (CutMix)
- Progressive resizing
- Test time augmentation
Normalization
We know that having input data with a mean around 0 and a std around 1 helps the model train more efficiently and generalize better. Hence, normalization is almost a default technique these days.
Generally, when we train an image classifier we start with transfer learning. These pretrained models have usually been trained on the ImageNet dataset. Hence, we normalize our data using the mean and std of the ImageNet dataset.
If we are training from scratch, it's recommended to calculate the per-channel mean and std of the dataset and use those to normalize the data. Also, during inference, the test data should be normalized using whatever stats were used during training.
Doing this in fastai is very easy. Let's take a look.
Let's write a function to make our dataloader
def get_dls(fold, df, img_sz=224):
datablock = DataBlock(
blocks=(ImageBlock, CategoryBlock()),
getters=[
ColReader('image_id', pref=path/'images', suff='.jpg'),
ColReader('labels')
],
splitter=IndexSplitter(df.loc[df.fold==fold].index),
item_tfms=Resize(img_sz),
)
return datablock.dataloaders(source=df, bs=32)
dls = get_dls(0, train_os)
x, y = dls.one_batch()
x.mean(dim=[0,2,3]),x.std(dim=[0,2,3])
Our mean and std are nowhere near 0 and 1.
Below is how we could calculate the mean and std of our dataset.
Note: this only uses the train dataloader; a more thorough way would be to calculate the mean and std over all the images.
m, s = np.zeros(3), np.zeros(3)
count = 0
for x, y in dls.train:                       # loop over the training batches
    m += np.array(x.mean(dim=[0,2,3]).cpu())
    s += np.array(x.std(dim=[0,2,3]).cpu())
    count += 1
m/count, s/count
Let's modify our get_dls function slightly: in the batch_tfms argument we add Normalize, using imagenet_stats to normalize the data. Let's first look at what imagenet_stats is, and then see how this changes our mean and std.
def get_dls(fold, df, img_sz):
datablock = DataBlock(
blocks=(ImageBlock, CategoryBlock()),
getters=[
ColReader('image_id', pref=path/'images', suff='.jpg'),
ColReader('labels')
],
splitter=IndexSplitter(df.loc[df.fold==fold].index),
item_tfms=Resize(img_sz),
batch_tfms=[Normalize.from_stats(*imagenet_stats)]
)
return datablock.dataloaders(source=df, bs=32)
As can be seen below, imagenet_stats is a tuple that holds the mean and std for the three channels. If you have the stats for your own dataset, you can pass them in the same way to normalize the data.
imagenet_stats
dls = get_dls(0, train_os, 224)
x, y = dls.one_batch()
x.mean(dim=[0,2,3]),x.std(dim=[0,2,3])
The mean and std following normalization is much closer to 0 and 1.
Data augmentation
Data augmentation is a well known technique to improve image classification. Fastai provides many data augmentation tools and they can be applied easily while creating the dataloaders, just like how we normalized earlier. Data augmentations can be passed as an argument in either item_tfms or batch_tfms when creating our DataBlock. The difference between the two is that the former runs on the CPU (item by item) while the latter runs on the GPU (a batch at a time). Hence, batch_tfms is the preferred place to carry out most of the augmentation.
Data augmentation essentially allows us to enlarge our dataset without collecting new data: it synthetically manipulates existing images to create new training examples.
Let's use this image to see some examples of the many data augmentations that come with fastai.
img = PILImage(PILImage.create((path/'images').ls()[SEED]).resize((600,400)))
img
FlipItem flips the image with the given probability p
_,axs = subplots(2, 4)
for ax in axs.flatten():
show_image(FlipItem(p=0.5)(img, split_idx=0), ctx=ax)
Another transform is DihedralItem; let's see what it does
_,axs = subplots(2, 4)
for ax in axs.flatten():
show_image(DihedralItem(p=1.)(img, split_idx=0), ctx=ax)
RandomCrop crops the image to a given size
_,axs = subplots(2, 4)
for ax in axs.flatten():
show_image(RandomCrop(224)(img, split_idx=0), ctx=ax)
And aug_transforms, which is a "utility func to easily create a list of flip, rotate, zoom, warp, lighting transforms."
timg = TensorImage(array(img)).permute(2,0,1).float()/255.
def _batch_ex(bs): return TensorImage(timg[None].expand(bs, *timg.shape).clone())
Each image is different although the input image was the same
tfms = aug_transforms(pad_mode='zeros', mult=2, min_scale=0.5)
y = _batch_ex(9)
for t in tfms: y = t(y, split_idx=0)
_,axs = plt.subplots(2,3, figsize=(12,10))
for i,ax in enumerate(axs.flatten()): show_image(y[i], ctx=ax)
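In practice, rather than applying these transforms by hand, they would usually be passed to the DataBlock. Below is a hedged sketch of what that could look like for our dataset; the specific choices (aug_transforms with vertical flips) are illustrative, not the settings used elsewhere in this post.

# Illustrative sketch: augmentations wired into the DataBlock.
# item_tfms runs on the CPU (item by item); batch_tfms runs on the GPU (per batch).
def get_dls_aug(fold, df, img_sz):
    datablock = DataBlock(
        blocks=(ImageBlock, CategoryBlock()),
        getters=[
            ColReader('image_id', pref=path/'images', suff='.jpg'),
            ColReader('labels')
        ],
        splitter=IndexSplitter(df.loc[df.fold==fold].index),
        item_tfms=Resize(img_sz),                              # CPU, per item
        batch_tfms=[*aug_transforms(flip_vert=True),           # GPU, per batch
                    Normalize.from_stats(*imagenet_stats)]
    )
    return datablock.dataloaders(source=df, bs=32)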
MixUp
MixUp is a data augmentation technique that was introduced in 2018 in this paper. So what happens during MixUp?
- For a training image image_1, a second image image_2 is randomly selected.
- A new_image is created following this formula, where alpha is a constant between 0.0 and 1.0 that is used to mix the two images:

new_image = alpha * image_1 + (1-alpha) * image_2

- Similarly, the targets of image_1 and image_2 are blended to create new_target.

For example, let's assume we are training a 4-class model, the one-hot encoding for image_1 is [0., 0., 1., 0.] and for image_2 is [0., 0., 0., 1.], and alpha is 0.3. The target for our new_image is [0., 0., 0.3, 0.7]:

new_target = 0.3 * [0., 0., 1., 0.] + (1-0.3) * [0., 0., 0., 1.]

With this, we now have a completely new image for training.
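To make the formula concrete, here is a tiny sketch of the blend itself, using dummy tensors; this is not fastai's implementation, which is shown next.

import torch

alpha = 0.3                                    # mixing coefficient (fixed here for illustration)
image_1, image_2 = torch.rand(3, 224, 224), torch.rand(3, 224, 224)   # dummy images
target_1 = torch.tensor([0., 0., 1., 0.])      # one-hot targets for a 4-class problem
target_2 = torch.tensor([0., 0., 0., 1.])

new_image  = alpha * image_1 + (1 - alpha) * image_2
new_target = alpha * target_1 + (1 - alpha) * target_2
new_target                                     # tensor([0.0000, 0.0000, 0.3000, 0.7000])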
Let's use the code from the fastai docs to see how our MixUp images look.
mixup = MixUp(1.)
with Learner(dls, nn.Linear(3,4), loss_func=CrossEntropyLossFlat(), cbs=mixup) as learn:
learn.epoch,learn.training = 0,True
learn.dl = dls.train
b = dls.one_batch()
learn._split(b)
learn('before_batch')
_,axs = plt.subplots(3,3, figsize=(9,9))
dls.show_batch(b=(mixup.x,mixup.y), ctxs=axs.flatten())
As can be seen, some of our images above look a bit smeared/odd - that is because of MixUp. Using MixUp with fastai is relatively easy: it is passed as a callback when we create a Learner, as above. It can also be passed via cbs in fit_one_cycle.
learn.fit_one_cycle(3, cbs=MixUp(1.0))
Note: the alpha we pass when we create MixUp parameterizes a Beta distribution, from which one mixing value is sampled per image in the batch, so the mixing ratio varies from one image to another. Below is an example of sampling such a distribution.
torch.distributions.beta.Beta(tensor(1.), tensor(1.)).sample((10,))
CutMix
Although CutMix was not covered in the book, it has been added to the fastai library. CutMix is similar to MixUp but instead of blending images together, CutMix works by cropping a portion of image_2 and placing it in image_1. CutMix has been shown to work better than MixUp.
Source: CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features
Let's see some examples of CutMix in action
cutmix = CutMix(1.)
with Learner(dls, nn.Linear(3,4), loss_func=CrossEntropyLossFlat(), cbs=cutmix) as learn:
learn.epoch,learn.training = 0,True
learn.dl = dls.train
b = dls.one_batch()
learn._split(b)
learn('before_batch')
_,axs = plt.subplots(3,3, figsize=(9,9))
dls.show_batch(b=(cutmix.x,cutmix.y), ctxs=axs.flatten())
Progressive resizing
As stated in the book, progressive resizing gradually uses larger and larger images as we continue our training. This technique is akin to transfer learning. Our model learns on smaller images and as we increase the image size it carries forward what it had learnt in previous training as well as picks up something additional from the larger images.
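As a hedged sketch of how this could look with the get_dls function from earlier (the image sizes and epoch counts below are illustrative assumptions):

# Illustrative sketch of progressive resizing: train on small images first,
# then swap in larger-image dataloaders and keep training the same model.
dls_small = get_dls(0, train_os, 128)
learn = cnn_learner(dls_small, resnet34, metrics=accuracy)
learn.fine_tune(4)

learn.dls = get_dls(0, train_os, 224)          # same data, larger images
learn.fine_tune(4)                             # the weights learned at 128px carry over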
Other things
Different architectures
Generally, ResNets of varying sizes would be the first models to try to establish a baseline, after which one could explore other architectures such as EfficientNet or the recently released Vision Transformer.
Transfer Learning
In most cases, transfer learning works really well, so it could be the first thing to try for most classification tasks.
4. Techniques I learned from Zach's walkwithfastai imagewoof lecture
I highly recommend the walkwithfastai course. It is also my go-to resource for fastai. In this particular notebook, Zach introduces different techniques that seem to work really well for image classification tasks. Please check the notebook for references and details. Zach also has a lecture using this notebook here.
The notebook introduces the following techniques; a rough sketch of how several of them can be combined follows the list.
- xresnet - an architecture based on the "Bag of Tricks for Image Classification with Convolutional Neural Networks" paper
- Mish - a new activation function
- ranger - a new optimizer that combines RAdam and Lookahead
- Self-attention
- MaxBlurPool
- a different LR scheduler - that uses flatten+anneal scheduling
- Label Smoothing Cross Entropy
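The notebook and lecture have the full details, but here is a rough sketch (with assumed hyperparameters) of how several of these pieces can be combined with plain fastai; MaxBlurPool needs the custom implementation from the notebook, so it is left out here.

# Illustrative sketch, not Zach's exact notebook code:
# xresnet50 with Mish activations and self-attention, the ranger optimizer,
# label smoothing, and a flat-then-anneal schedule via fit_flat_cos.
model = xresnet50(n_out=dls.c, act_cls=Mish, sa=True)
learn = Learner(dls, model,
                opt_func=ranger,                               # RAdam + Lookahead
                loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)
learn.fit_flat_cos(5, 3e-3)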
5. Soft labeling and progressive label correction
Soft labeling
I came across soft labeling through Isaac Flath's amazing blog. I think that blog is the best place to get started on soft labeling.
In supervised learning, labels, which are created by humans, can be erroneous. This leads to 'noisy' labels. This was the case in the Plant Pathology competition, and the winner used a similar method (soft labeling) in the winning solution.
How do we deal with such noisy labels? One way is to punish the model less for incorrectly predicting a noisy label. The steps are as follows:
- Create a k-fold cross-validation split
- Train k classifiers, using a different fold for validation each time, for a sufficient number of epochs using the noisy labels
- Use each classifier to predict on its validation fold and save the predictions
- Upon completion of the above step, you will have two labels - the noisy label that came with the data (label_ori) and the one predicted by the classifiers (label_pred)
- Finally, train your actual classifier, and this time, when label_ori and label_pred differ, adjust the labels using a hyperparameter a
For example, let's assume we are training a 4-class model, the one-hot encoding for label_ori is [0., 0., 1., 0.] and for label_pred is [0., 0., 0., 1.], and a is 0.5. Our new_label would be [0., 0., 0.5, 0.5]. By doing this, the model is punished less for predicting the 'wrong' class, as this could be due to noisy labels.
import numpy as np

a = 0.5
label_ori  = np.array([0., 0., 1., 0.])
label_pred = np.array([0., 0., 0., 1.])
new_label = a * label_ori + (1 - a) * label_pred
new_label   # array([0. , 0. , 0.5, 0.5])
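Scaling this up to the whole training set might look something like the sketch below; oof_preds is a hypothetical (n_samples, n_classes) array of out-of-fold predictions from the k classifiers, row-aligned with train, and the adjustment is only applied where the two labels disagree, as described in the steps above.

import numpy as np

classes    = ['healthy', 'multiple_diseases', 'rust', 'scab']
a          = 0.5
label_ori  = train[classes].values.astype(float)               # original one-hot labels
label_pred = np.eye(len(classes))[oof_preds.argmax(1)]         # predicted labels as one-hot

soft = label_ori.copy()
disagree = label_ori.argmax(1) != oof_preds.argmax(1)          # only adjust where labels differ
soft[disagree] = a * label_ori[disagree] + (1 - a) * label_pred[disagree]

soft_train = train[['image_id']].join(pd.DataFrame(soft, columns=classes))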
Progressive Label Correction
I came across this technique in this wonderful Kaggle Notebook by Kerem Turgutlu. It is an implementation of this paper.
Again, this method works in cases where there are noisy labels in the dataset. This is my understanding of how it is implemented.
- During model training, we let the model train normally for a warm_up period. In the above implementation, the warm_up period was 20% of the total iterations.
- Once training goes past the warm_up iterations, Progressive Label Correction (PLC) kicks in.
- In PLC, after an iteration, mislabeled indexes are identified.
- Then, for the mislabeled indexes, we take the probability of the max predicted class (predicted_probas - the class the model predicted) as well as the probability of the actual target class (actual_probas - the class the model should have predicted).
- We check if the absolute difference between predicted_probas and actual_probas is above a theta value (theta is a hyperparameter we set).
- If the difference is higher, then for those mislabeled indexes we change ('correct') the label y to the one predicted by the model.
- We continue steps 2 to 6 while progressively lowering theta using a scheduler function (a linear scheduler was used in the above notebook).
Let's take a look at what the ProgressiveLabelCorrection callback in the notebook does.
dls = get_dls(0, train_os, 128)
learn = cnn_learner(dls, resnet18, pretrained=True)
Let's assume our learner has been trained past the warm_up iterations and see how PLC is applied using one batch.
learn.fit_one_cycle(1)
learn.one_batch(5, learn.dls.one_batch())
Here, after an iteration, we take the predicted class and compare it to the target (y) to get the mislabelled indexes.
preds_max = learn.pred.argmax(-1)
mislabeled_idxs = preds_max != learn.y
#so we have 32 samples in each iteration (which is the batch_size), of which 8 are mislabelled
mislabeled_idxs, len(mislabeled_idxs), mislabeled_idxs.float().sum()
Then we index into the mislabelled items and calculate their probabilities. We also index into the mislabelled targets.
mislabeled_probas = learn.pred[mislabeled_idxs].softmax(-1)
mislabeled_targs = learn.y[mislabeled_idxs]
Then we pick the probability of the predicted class
predicted_probas = mislabeled_probas.max(-1).values
predicted_probas
We also store the class of the mislabelled items
predicted_targs = mislabeled_probas.max(-1).indices
predicted_targs
Here we pick the predicted probability of the actual target class (i.e. the probability the model assigned to the true target class)
eye = torch.eye(dls.c).to('cuda')
actual_probas = mislabeled_probas[eye[mislabeled_targs].bool()]
actual_probas
This is an important step: we check if the absolute difference between predicted_probas and actual_probas is above a hyperparameter theta.
theta = 0.3
msk = torch.abs(predicted_probas - actual_probas) > theta
#there are 5 items that meet the condition
msk
We now gather the new targets that were predicted by the model for the 5 items that meet the condition.
new_targs = learn.dls.tfms[1][1].vocab[predicted_targs[msk]]
new_targs
Now that we have the new_targs, we update the labels for these indexes in the training set with new_targs. The theta used is progressively lowered, so as training progresses we correct targets even when the difference between the probability of the predicted class and the probability of the actual target class is small. This means that, as training progresses, we increasingly take the model's prediction as the truth instead of the label that came with the data.
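A minimal sketch of what a linear theta scheduler could look like (the parameter names and values are assumptions, not the notebook's exact code):

def linear_theta(iteration, total_iters, warm_up_pct=0.2, theta_start=0.5, theta_end=0.05):
    "Linearly decay theta from theta_start to theta_end once the warm_up period is over."
    warm_up = int(total_iters * warm_up_pct)
    if iteration < warm_up:
        return None                                            # PLC not active yet
    frac = (iteration - warm_up) / max(1, total_iters - warm_up)
    return theta_start + frac * (theta_end - theta_start)

[round(linear_theta(i, 100), 3) for i in (20, 60, 99)]         # theta shrinks as training progresses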
That's the end of the blog. Please feel free to contact me at @arshyma (Twitter) or marshath@gmail.com if there is anything. Thank you :)