Deep neural networks benefit greatly from large amounts of training data. State-of-the-art image classifiers and object detectors are all trained on large databases of labelled images, such as ImageNet, COCO or the Open Images Dataset. To further enlarge the amount of data, images are augmented during training, for instance by shifting the image or adjusting the brightness. Other use cases, where labelled data is limited, benefit from these large datasets through transfer learning: taking a pre-trained model and continuing training from there for a specific domain. But what happens if you want to improve your model further and the amount of labelled data is insufficient?
In this post we will discuss a method — presented in this paper — by which a model learns representations of the data by predicting image rotations, rather than the actual labels of the images. This way, the model can first be trained in an unsupervised fashion on the unlabelled data, and then fine-tuned on the smaller set of labelled data.
Unsupervised learning on MNIST
For this tutorial we will look at the Fashion MNIST dataset. It is similar to the MNIST dataset, but rather than handwritten digits it consists of simple fashion items. Fashion MNIST is better suited than MNIST here, because learning representations through rotations works best for objects without rotational symmetry: it is difficult to distinguish 0 and 180 degree rotations of digits such as 0, 6, 8 and 9. For clothes there is no such ambiguity.
Now imagine the situation in which we have a limited set of 1,000 labelled images of clothing, plus nearly 50,000 images without labels. We could simply train a classifier on the 1,000 labelled images, but chances are it will not perform well because of the small data sample. Instead, we can use the large set of unlabelled images to train a neural network that serves as a feature extractor, i.e. it learns representations of clothing items. By changing the output layer, but keeping the weights of the base layers (the feature extractor), we can then adapt this feature extractor to the task of classifying clothing items. What does this look like in practice?
We will use a simple neural network with two convolutional layers as a feature extractor.
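A minimal sketch of such an extractor, assuming Keras; the exact layer sizes here are illustrative and not necessarily those used in the example notebook, apart from the 16-dimensional feature vector discussed later:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_feature_extractor():
    # Two convolutional blocks, followed by a small dense layer that
    # serves as the 16-dimensional representation of an image.
    return models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(16, activation="relu"),  # feature vector
    ], name="feature_extractor")

extractor = build_feature_extractor()
```

The key point is that the final dense layer, not a task-specific output layer, is what we will reuse later.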
Fig 1: Feature extractor
Training on rotations
Next, we take the feature extractor and add an output layer with 4 nodes: one for each of the 0, 90, 180 and 270 degree rotations.
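Attaching the rotation head could look as follows, again assuming Keras; `base` is a stand-in for the feature extractor (only its 16-dimensional output matters here):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Stand-in for the pretrained feature extractor.
base = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Flatten(),
    layers.Dense(16, activation="relu"),
], name="feature_extractor")

# 4-way softmax head: classes 0-3 correspond to 0/90/180/270 degrees.
rotation_model = models.Sequential([
    base,
    layers.Dense(4, activation="softmax", name="rotation_head"),
])
rotation_model.compile(optimizer="adam",
                       loss="sparse_categorical_crossentropy",
                       metrics=["accuracy"])
```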
We will train the model to recognize rotations on the dataset of about 50,000 unlabelled images (for details, see Figure 3 or the implementation in the example notebook).
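The appeal of this setup is that the rotation labels come for free: each unlabelled image is rotated by 0, 90, 180 and 270 degrees and tagged with the corresponding class 0-3. A minimal numpy sketch (images assumed to be 2-D arrays):

```python
import numpy as np

def make_rotation_dataset(images):
    """Return (rotated_images, labels), where label k means k*90 degrees."""
    rotated, labels = [], []
    for img in images:
        for k in range(4):
            rotated.append(np.rot90(img, k))
            labels.append(k)
    return np.array(rotated), np.array(labels)

# Each input image yields four training examples.
imgs = np.random.rand(5, 28, 28)
x, y = make_rotation_dataset(imgs)
```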
The model learns to classify rotations with about 97% accuracy, but how well can this feature extractor distinguish clothing items? To gain some intuition, we used UMAP to reduce the 16-dimensional feature vectors of the test set (10,000 images) to two dimensions and color-coded them by class. There is clearly quite some overlap, but we can also see that specific types of items are grouped (e.g. trousers, ankle boots), meaning the model has learned what these items are to some extent. The overlap, however, also means that the learned representations do not separate the classes cleanly. As we will see below, if we do not fine-tune the feature extractor for the item classification task, it performs relatively poorly.
Finally, we perform supervised learning on the 1,000 labelled images. For reference, we first train a model from scratch to see how it performs. It has the same architecture as the feature extractor (except for the output layer), for an apples-to-apples comparison. This model reaches an accuracy of about 60% on the test set, mainly due to the limited training data.
In addition, we train a second model that makes use of the feature extractor pretrained on the rotation task. In other words, this model already has some notion of what clothing items look like.
First, we train only the output layer on the 1,000 images. In this case we reach an accuracy of about 40%, worse than the model trained from scratch, but remember that we only trained a single layer of the network for this task. As soon as we also open up the layers of the feature extractor for training, i.e. fine-tune them for the specific task at hand — classifying clothing items — we reach an accuracy of about 75%. Quite a bit better than the model that was not pretrained on rotations!
We have shown that by pretraining a feature extractor on a different task, in this case predicting image rotations, a model can learn useful representations of the data at hand. The feature extractor can then be leveraged to improve performance on the actual task, for which only limited labelled data is available; in our case, classifying clothing items with only 1,000 labelled images.