Chapter 7: State-of-the-Art Training for CV

This chapter introduces more advanced techniques for training an image classification model and getting state-of-the-art results.
chapter-notes
code
Author

Uzair Tahamid Siam

Published

June 29, 2023

We will look at what normalization is, a powerful data augmentation technique called Mixup, the progressive resizing approach, test time augmentation, and label smoothing. To properly gauge how much these techniques help, we will train a model from scratch on Imagenette, a subset of ImageNet made up of 10 very different categories. We are also using full-size, full-color images: photos of objects of different sizes, in different orientations, in different lighting, and so forth.

These techniques can be crucial for the performance of from-scratch models as well as pretrained models fine-tuned on a very different dataset.

Your Data

Your data plays a large role in the models you build - the quality of it, the size, the format and more.

An important message here is: the dataset you are given is not necessarily the dataset you want. It’s particularly unlikely to be the dataset that you want to do your development and prototyping in. You should aim to have an iteration speed of no more than a couple of minutes—that is, when you come up with a new idea you want to try out, you should be able to train a model and see how it goes within a couple of minutes.

Let’s get the data now.

from fastai.vision.all import *
path = untar_data(URLs.IMAGENETTE)

dblock = DataBlock(
    blocks = (ImageBlock, CategoryBlock),
    get_items = get_image_files,
    get_y = parent_label,
    splitter = RandomSplitter(0.2, seed=42),
    item_tfms=Resize(460),
    batch_tfms=aug_transforms(size=224, min_scale=0.75)
)

dls = dblock.dataloaders(path, bs=64)
dls.show_batch()

Let’s train a quick model to be our baseline.

model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)

This should still perform decently, but when working with models that are trained from scratch, or fine-tuned on a dataset very different from the one used for pretraining, some additional techniques are really important.

Let’s explore some of these techniques.

Normalization

When training a model, it helps if your input data is normalized — has a mean of 0 and a standard deviation of 1. But most images and computer vision libraries use values between 0 and 255 for pixels, or between 0 and 1; in either case, your data is not going to have a mean of 0 and a standard deviation of 1.

We can look at this by grabbing a batch from our DataLoaders.

x,y = dls.one_batch()

# take the mean and stdev along the images, width and height
# this should result in a mean and stdev for 3 channels
x.mean(dim=[0,2,3]),x.std(dim=[0,2,3])
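To make the `dim=[0,2,3]` reduction concrete, here is a minimal pure-Python sketch on a made-up batch of 2 images, 3 channels, and 2x2 pixels. The helper name `channel_stats` and the pixel values are invented for illustration. It pools over images, rows, and columns so that one mean and one standard deviation remain per channel, and it uses the sample standard deviation, which is also what `torch.std` computes by default.

```python
from statistics import mean, stdev

# Hypothetical batch laid out as [image][channel][row][col], values in [0, 1].
batch = [
    [[[0.0, 0.2], [0.4, 0.6]],   # image 0, channel 0
     [[0.5, 0.5], [0.5, 0.5]],   # image 0, channel 1
     [[1.0, 0.8], [0.6, 0.4]]],  # image 0, channel 2
    [[[0.1, 0.3], [0.5, 0.7]],   # image 1, channel 0
     [[0.5, 0.5], [0.5, 0.5]],   # image 1, channel 1
     [[0.9, 0.7], [0.5, 0.3]]],  # image 1, channel 2
]

def channel_stats(batch):
    """Return (means, stds) with one entry per channel, pooled over images, rows and columns."""
    n_channels = len(batch[0])
    means, stds = [], []
    for c in range(n_channels):
        # Collect every pixel of channel c across the whole batch.
        values = [px for img in batch for row in img[c] for px in row]
        means.append(mean(values))
        stds.append(stdev(values))  # sample standard deviation
    return means, stds

means, stds = channel_stats(batch)
print(means, stds)
```

Pixel values scaled to [0, 1] give per-channel means well away from 0 and standard deviations well below 1, which is the situation normalization corrects.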

Normalizing your data is very easy in fastai. Just use the Normalize transform. This acts on a whole mini-batch at once, so you can add it to the batch_tfms section of your data block. You just need to pass in the mean and standard deviation that you want to use. fastai comes with the standard ImageNet mean and standard deviation already defined. In the case that nothing is passed in, fastai will automatically calculate the stats from a single batch of your data.
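The arithmetic behind normalization is just "subtract the mean, divide by the standard deviation". The following is a minimal sketch in plain Python for a single channel, assuming made-up pixel values; `normalize` is a hypothetical helper and not part of fastai, which applies the same idea per channel to a whole mini-batch.

```python
from statistics import mean, pstdev

def normalize(values, mu, sigma):
    """Shift and scale values so data with mean mu and std sigma ends up at mean 0, std 1."""
    return [(v - mu) / sigma for v in values]

# Hypothetical pixel values from one channel, scaled to [0, 1].
channel = [0.1, 0.4, 0.5, 0.7, 0.9]
mu, sigma = mean(channel), pstdev(channel)

normalized = normalize(channel, mu, sigma)
print(mean(normalized), pstdev(normalized))
```

When the statistics come from the data itself, the result has mean 0 and standard deviation 1 up to floating-point error. With fixed statistics such as the ImageNet ones, the result is only approximately standardized.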

Let’s add this transform (using imagenet_stats, as Imagenette is a subset of ImageNet) and take a look at one batch now:

def get_dls(bs, size):
    dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                       get_items=get_image_files,
                       get_y=parent_label,
                       item_tfms=Resize(460),
                       batch_tfms=[*aug_transforms(size=size, min_scale=0.75),Normalize.from_stats(*imagenet_stats)]
    ) 
    return dblock.dataloaders(path, bs=bs)

dls = get_dls(64, 224)
x,y = dls.one_batch()
x.mean(dim=[0,2,3]),x.std(dim=[0,2,3])


model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)

Although it helped only a little here, normalization becomes especially important when using pretrained models. The pretrained model knows how to work with only data of the type that it has seen before. If the average pixel value was 0 in the data it was trained with, but your data has 0 as the minimum possible value of a pixel, then the model is going to be seeing something very different from what is intended!

This means that when you distribute a pretrained model you need to distribute the normalization statistics too, and if you are using a model that someone else has trained, make sure you find out what normalization statistics they used and match them.
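To see how large the mismatch can be, the sketch below computes where the darkest and brightest raw pixel values land after normalizing with the commonly quoted ImageNet statistics. The helper `normalized_range` is hypothetical, and the example assumes pixels scaled to [0, 1].

```python
# Commonly quoted ImageNet per-channel statistics (RGB, pixels scaled to [0, 1]).
imagenet_mean = [0.485, 0.456, 0.406]
imagenet_std = [0.229, 0.224, 0.225]

def normalized_range(mu, sigma, lo=0.0, hi=1.0):
    """Where the darkest and brightest raw pixel values land after normalization."""
    return (lo - mu) / sigma, (hi - mu) / sigma

ranges = [normalized_range(m, s) for m, s in zip(imagenet_mean, imagenet_std)]
for name, (low, high) in zip("RGB", ranges):
    print(f"{name}: raw [0, 1] -> normalized [{low:.2f}, {high:.2f}]")
```

A model trained on normalized inputs expects values spread roughly between -2 and 2.6. Feeding it raw pixels in [0, 1] shows it only a narrow, shifted slice of that range.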

We didn’t have to handle normalization in previous chapters because when using a pretrained model through cnn_learner (renamed vision_learner in more recent fastai releases), the fastai library automatically adds the proper Normalize transform. Here we don’t have a pretrained model, so we need to do it manually.

Let’s look at the next step of improving our model!

Progressive Resizing

All our training up until now has been done at size 224. We could have begun training at a smaller size before going to that. This is called progressive resizing.

Progressive Resizing is the practice of starting training using small images, and ending training using large images. Spending most of the epochs training with small images helps training complete much faster. Completing training using large images makes the final accuracy much higher.
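A rough way to see why small images train faster is to compare pixel counts. This sketch assumes that the compute per image grows roughly in proportion to the number of pixels, which is only an approximation for convolutional networks; `pixels` is a helper made up for the example.

```python
def pixels(size):
    """Number of pixels in a square image with the given side length."""
    return size * size

small, large = 128, 224
ratio = pixels(large) / pixels(small)
print(f"A {large}px image has {ratio:.2f}x the pixels of a {small}px image")
```

Under that assumption, each epoch at size 128 costs roughly a third of an epoch at size 224, so spending most of the epochs at the smaller size saves a lot of time.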

As we have seen, the kinds of features that are learned by convolutional neural networks are not in any way specific to the size of the image—early layers find things like edges and gradients, and later layers may find things like noses and sunsets. So, when we change image size in the middle of training, it doesn’t mean that we have to find totally different parameters for our model. However, there has to be some difference between smaller and larger images.

This is somewhat akin to transfer learning, because we are asking our model to learn something a bit different from what it learned before. As a result, we can use fine_tune to continue training after we resize our images.

Additionally, it is a form of data augmentation that should result in better generalization of your model.

First we train a model on small images, then we swap in larger images, and finally we fine-tune.

dls = get_dls(128, 128)
learn = Learner(dls, xresnet50(), loss_func=CrossEntropyLossFlat(),
                    metrics=accuracy)
learn.fit_one_cycle(4, 3e-3)

# change the dataloaders of the learner to a new dls
learn.dls = get_dls(64, 224)
learn.fine_tune(5, 1e-3)

We should be getting much better performance now, and the initial training on small images was much faster on each epoch.

Some important things to remember:

  • For transfer learning, progressive resizing may actually hurt performance. This is most likely to happen if your pretrained model was trained on a task and dataset quite similar to your transfer learning task, and on similar-sized images, so the weights don’t need to change much. In that case, training on smaller images may damage the pretrained weights.

  • If the images for the transfer learning task are of different shapes and sizes than the images the model was pretrained on, progressive resizing will most likely improve performance.

We have talked about how augmentations are helpful and can help with generalization. But so far we have only talked about them for training. What if we tried augmentations on validation sets? Let’s look at that next.

Test Time Augmentation (TTA)

When we use random cropping, fastai will automatically use center-cropping for the validation set i.e. it will select the largest square in the center of the image.

This can be very bad at times. Think about a multilabel problem where one of the targets is in a corner or a classification problem where a crucial part of the target is not in the center. So, how do we avoid this problem?

One simple option would be to not crop randomly and instead squish or stretch the images. But then the model has to learn to recognize distorted images, which might be harder. A better solution is to crop the original image multiple times, pass all of the crops to the model, and then take the maximum or average of the predictions. In fact, why only do this for cropping? We can do this for all the augmentations! This is called test time augmentation (TTA).

Depending on the dataset, test time augmentation can result in dramatic improvements in accuracy. It does not change the time required to train at all, but will increase the amount of time required for validation or inference by the number of test-time-augmented images requested. By default fastai will use the unaugmented center-crop image and four randomly augmented images.
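The combining step can be sketched in a few lines of plain Python. The probabilities below are made up, and the two helpers are hypothetical. fastai's own `tta` may weight the unaugmented prediction differently, so treat this as an illustration of plain averaging and maximum-taking rather than a copy of the library's behavior.

```python
def average_predictions(predictions):
    """Element-wise mean of several probability vectors for the same image."""
    n = len(predictions)
    return [sum(col) / n for col in zip(*predictions)]

def max_predictions(predictions):
    """Element-wise maximum, the alternative way of combining the views."""
    return [max(col) for col in zip(*predictions)]

# Made-up class probabilities for one image: a center crop plus four augmented views.
views = [
    [0.40, 0.60],  # center crop: the object is partly cut off
    [0.70, 0.30],
    [0.65, 0.35],
    [0.80, 0.20],
    [0.55, 0.45],
]

avg = average_predictions(views)
predicted_class = avg.index(max(avg))
print(avg, predicted_class)
```

In this toy case the center crop alone would have picked class 1, while the average over all five views picks class 0.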

preds,targs = learn.tta()
accuracy(preds, targs).item()

Now that we can appreciate how much augmentations can improve performance, let’s look at a new type of augmentation: Mixup.

Mixup

Mixup is a data augmentation technique that can provide dramatically higher accuracy, especially if you don’t have much data or don’t have a pretrained model on a similar dataset.

Most augmentations are dataset dependent and also limited. For example, if we want to flip an image, should we flip horizontally, vertically, or both? It depends on the dataset. And what if flipping once doesn’t help enough? Can we flip multiple times? Not really; there are only a handful of distinct flips.

Mixup, unlike most augmentations, works on a sliding scale: you can dial it up or down.

Mixup works as follows, for each image:

  1. Select another image from your dataset at random.
  2. Pick a weight at random.
  3. Take a weighted average (using the weight from step 2) of the selected image with your image; this will be your independent variable.
  4. Take a weighted average (with the same weight) of this image’s labels with your image’s labels; this will be your dependent variable.

For this to work, our targets need to be one-hot encoded.

The equation for mixup is simple: \[ \begin{aligned} \tilde{x} = \lambda x_i + (1 - \lambda)x_j \\ \tilde{y} = \lambda y_i + (1 - \lambda)y_j \end{aligned} \] where \(y_i\), \(y_j\) are one-hot label encodings.

Let’s say you have two images - a church and a gas station. We can augment the two by adding 0.3 times the church and 0.7 times the gas station. Now what should our model predict? Well, it should predict 30% church and 70% gas station. Why?

Suppose we have 10 classes, and “church” is represented by the index 2 and “gas station” by the index 7. Then the one-hot encodings are:

[0, 0, 1, 0, 0, 0, 0, 0, 0, 0] and [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

So, the linear combination would be:

\[ 0.3[0, 0, 1, 0, 0, 0, 0, 0, 0, 0] + 0.7[0, 0, 0, 0, 0, 0, 0, 1, 0, 0] = [0, 0, 0.3, 0, 0, 0, 0, 0.7, 0, 0] \]
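The church and gas station example can be reproduced with a small sketch. The "images" here are toy lists of three pixels, and `one_hot` and `mixup` are hypothetical helpers. Drawing the random weight from a Beta distribution is an assumption borrowed from common Mixup practice; the steps above only say the weight is picked at random.

```python
import random

def one_hot(index, n_classes):
    """One-hot encode a class index as a list of floats."""
    return [1.0 if i == index else 0.0 for i in range(n_classes)]

def mixup(x_i, y_i, x_j, y_j, lam):
    """Blend two examples and their one-hot labels with the same weight lam."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y

# Toy "images" flattened to three pixels each.
church, gas_station = [0.2, 0.4, 0.6], [1.0, 0.0, 0.5]
y_church, y_gas = one_hot(2, 10), one_hot(7, 10)

# Fixed weight of 0.3 for the church, matching the worked example.
_, y_mixed = mixup(church, y_church, gas_station, y_gas, lam=0.3)
print(y_mixed)

# During training the weight is random; Beta(0.4, 0.4) is an assumed choice.
lam = random.betavariate(0.4, 0.4)
x_rand, y_rand = mixup(church, y_church, gas_station, y_gas, lam)
```

Whatever weight is drawn, the mixed label still sums to 1, so it remains a valid probability distribution over the classes.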

Mixup is very easy to use in fastai using a Learner callback. Callbacks are what is used inside fastai to inject custom behavior in the training loop (like a learning rate schedule, or training in mixed precision). We will learn more about them later. For now, all you need to know is that you use the cbs parameter to Learner to pass callbacks.

model = xresnet50()
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=MixUp())  # the fastai callback class is spelled MixUp
learn.fit_one_cycle(5, 3e-3)

Mixup data is more difficult to train on. It’s harder to see what’s in each image, and the model has to predict two labels and figure out how they are weighted. The model is also much less likely to overfit, because we are not showing the same images in each epoch but are instead showing random combinations of two images.

One important caveat: Mixup tends to help mainly when you train for a larger number of epochs (>80).
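One way to read "predict two labels and figure out how they are weighted": the cross-entropy against a mixed label is exactly the weighted average of the cross-entropies against the two original labels. The sketch below checks this numerically with made-up logits; `softmax` and `cross_entropy` are written out by hand so the example has no dependencies.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a (possibly soft) target vector."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)

logits = [0.5, 2.0, -1.0]        # made-up model outputs for 3 classes
probs = softmax(logits)
lam = 0.3
y_i, y_j = [0, 1, 0], [0, 0, 1]  # one-hot labels of the two mixed images
y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]

loss_soft = cross_entropy(probs, y_mix)
loss_split = lam * cross_entropy(probs, y_i) + (1 - lam) * cross_entropy(probs, y_j)
print(loss_soft, loss_split)
```

The two numbers agree, so training on the mixed label pushes the model toward both classes in proportion to the mixing weight.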

Mixup can be applied to other types of data as well. In fact, some people have even shown good results by using Mixup on activations inside their models, not just on inputs—this allows Mixup to be used for NLP and other data types too.

There’s one more subtle thing that Mixup deals with for us. The activations we have seen so far, sigmoid and softmax, can never output exactly 0 or exactly 1, so with one-hot targets our loss can never be perfect. With Mixup that is no longer a problem, because our labels will only be exactly 1 or 0 if we happen to mix with another image of the same class. But there’s an issue with relying on Mixup for this. Mixup is “accidentally” making the labels bigger than 0 or smaller than 1; that is to say, we’re not explicitly telling our model that we want to change the labels in this way. If we want the labels to be closer to or further from 0 and 1, we have to change the amount of Mixup, which also changes the amount of data augmentation and might not be what we want.

Our goal, then, is a method that softens the labels directly, without requiring us to apply more or less Mixup augmentation. That is where label smoothing comes in.

Label Smoothing

Theoretically, the loss function for classification has one-hot encoded targets (practically that takes too much memory so we do something more efficient that essentially works like a one-hot encoded vector). This means that the model is trained to return 0 for all categories but one, for which it is trained to return 1. Even if the prediction is 0.999 it is still not good enough and the model will keep calculating gradients and updating the weights leading to overfitting.

This is bad because at inference time it will always say 1 for the predicted category even if it’s not too sure (it will be overconfident), just because it was trained this way.
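The claim that a 0.999 prediction still produces a training signal can be checked directly. This sketch assumes a softmax output trained with cross-entropy, for which the gradient of the loss with respect to the true class's logit is the predicted probability minus 1.

```python
import math

p_correct = 0.999             # predicted probability of the true class
loss = -math.log(p_correct)   # cross-entropy against a one-hot target
grad_logit = p_correct - 1.0  # d(loss)/d(logit of the true class) for softmax + cross-entropy

print(f"loss={loss:.6f}, gradient={grad_logit:.3f}")
```

Both values are small but not zero, so the optimizer keeps pushing the logit of the true class upward for as long as training continues.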

Now imagine the situation where you have a dataset with mislabeled images or images that contain two different types of targets. Your model will train on that mislabeled data, be overconfident, and eventually give you wrong results at inference time.

As a rule of thumb, your data will never be perfect. Therefore, we need to stop asking the model to predict exactly 1 or 0. We do this by replacing our 1s with a number a bit less than 1 and our 0s with a number a bit more than 0. This is called label smoothing.

Label smoothing encourages our model to be less confident, which makes it more robust to mislabeled data and tends to improve generalization at inference time.

So, how does it work?

In practice, we start with one-hot encoded labels then replace all the 0s with \(\frac{\epsilon}{N}\) where \(\epsilon\) is a parameter (usually 0.1 i.e. 10% unsure about our labels) and \(N\) is the number of classes. Since we want the labels to add up to 1, we also replace the 1s with \(1 - \epsilon + \frac{\epsilon}{N}\).

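Writing \(p_0 = \frac{\epsilon}{N}\) for the smoothed value of each 0 and \(p_1\) for the smoothed value of the 1, the requirement that the labels sum to 1 gives:

\[ \begin{aligned} \sum_{i=1}^N p_i &= 1 \\ p_0(N-1) + p_1 &= 1 \\ p_1 &= 1 - p_0N + p_0 \\ p_1 &= 1 - \frac{\epsilon}{N}N + \frac{\epsilon}{N} \\ p_1 &= 1 - \epsilon + \frac{\epsilon}{N} \end{aligned} \]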

For example, if \(N=10\) and we use the default \(\epsilon=0.1\), then the labels for a target at index 3 will be:

\[[0.01, 0.01, 0.01, 0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]\]
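The same vector can be built with a short helper. `smooth_labels` is a hypothetical function written for this example; it simply applies the two replacement rules given above.

```python
def smooth_labels(target_index, n_classes, eps=0.1):
    """Label-smoothed target: eps/N for every class, plus 1 - eps on the true class."""
    off_value = eps / n_classes
    labels = [off_value] * n_classes
    labels[target_index] = 1 - eps + off_value
    return labels

labels = smooth_labels(3, 10)
print([round(v, 2) for v in labels])
```

The entries still sum to 1, so the smoothed target remains a valid probability distribution.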

In practice, we don’t want to one-hot encode the labels, and fortunately we won’t need to (the one-hot encoding is just good to explain label smoothing and visualize it).
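Here is one way the one-hot vector can be avoided: the smoothed loss splits into the usual cross-entropy on the true class, weighted by \(1 - \epsilon\), plus the average negative log-probability over all classes, weighted by \(\epsilon\). The sketch below is an illustration of that identity with made-up logits and hypothetical function names, not a copy of fastai's implementation.

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw scores, using the log-sum-exp trick for stability."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_total for z in logits]

def smoothed_loss_from_index(logits, target_index, eps=0.1):
    """Label-smoothing loss computed from the class index; no one-hot vector is built."""
    log_p = log_softmax(logits)
    nll = -log_p[target_index]          # usual cross-entropy term
    uniform = -sum(log_p) / len(log_p)  # cross-entropy against a uniform target
    return (1 - eps) * nll + eps * uniform

def smoothed_loss_from_one_hot(logits, target_index, eps=0.1):
    """The same loss written with the explicit smoothed label vector, for comparison."""
    n = len(logits)
    labels = [eps / n] * n
    labels[target_index] = 1 - eps + eps / n
    return -sum(t * lp for t, lp in zip(labels, log_softmax(logits)))

logits = [1.5, -0.3, 0.2, 3.0]  # made-up outputs for 4 classes
loss_index = smoothed_loss_from_index(logits, 3)
loss_one_hot = smoothed_loss_from_one_hot(logits, 3)
print(loss_index, loss_one_hot)
```

Both versions give the same value, which is why a loss function only needs the integer class index and \(\epsilon\).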

To use this in practice, we just have to change the loss function in our call to Learner:

model = xresnet50()
learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)

As with Mixup, to get benefits from label smoothing we need to train many epochs.

Conclusion

In CH-5 and CH-6 we learned a lot about transfer learning for CV. Now, we have learned techniques to help us with training models from scratch for CV. All that is left to do now is try these methods out on our own problems.