Transfer learning and fine-tuning in Keras and Tensorflow to build an image recognition system and classify any object
Deep Learning
May 23, 2021

In another post, we covered how to use Keras to recognize any of the 1000 object categories in the ImageNet visual recognition challenge. More often than not, however, the categories we are interested in predicting are not on that list.

So what do we do if we want to classify between different models of sunglasses? Or shoes? Or facial expressions? Or different models of cars? Or different types of lung disease in X-ray images?

This post will show you how to use transfer learning and fine-tuning to identify any object categories of your choosing! To recap, here is the blog post series we’ll be following:

  1. Build an image recognition system for 1000 everyday object categories (ImageNet ILSVRC) using Keras and TensorFlow
  2. Build an image recognition system for any customizable object categories using transfer learning and fine-tuning in Keras and TensorFlow
  3. Build a real-time bounding-box object detection system for hundreds of everyday object categories (PASCAL VOC, COCO)
  4. Build a web service for any image recognition or object detection system

Why use transfer learning/fine-tuning?

It’s well known that convolutional networks require significant amounts of data and resources to train. For example, the ImageNet ILSVRC model was trained on 1.2 million images over 2–3 weeks across multiple GPUs.

It has become the norm, not the exception, for researchers and practitioners alike to use transfer learning and fine-tuning, that is, transferring the network weights trained on a previous task like ImageNet to a new task.

And it works astoundingly well! Razavian et al. (2014) showed that by simply using the features extracted with the weights of an ImageNet ILSVRC-trained model, they achieved state-of-the-art or near state-of-the-art performance on a large variety of computer vision tasks.

There are two approaches we can take:

  1. Transfer learning: take a ConvNet that has been pre-trained on ImageNet, remove the last fully-connected layer, then treat the rest of the ConvNet as a fixed feature extractor for the new dataset. Once you extract the features for all images, train a classifier for the new dataset (see the sketch after this list).
  2. Fine-tuning: replace and retrain the classifier on top of the ConvNet, and also fine-tune the weights of the pre-trained network via backpropagation.
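To make approach 1 concrete, here is a minimal sketch of feature extraction with a frozen ConvNet. The random stand-in data and the scikit-learn LogisticRegression classifier are illustrative assumptions, not code from this series:

import numpy as np
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.layers import GlobalAveragePooling2D
from keras.models import Model
from sklearn.linear_model import LogisticRegression

# pre-trained ConvNet with its ImageNet-specific classifier removed
base = InceptionV3(weights='imagenet', include_top=False)
extractor = Model(base.input, GlobalAveragePooling2D()(base.output))

# stand-in data for illustration: 8 random "images" with binary labels;
# in practice these come from your own dataset
X_train = (np.random.rand(8, 299, 299, 3) * 255).astype('float32')
y_train = np.array([0, 1] * 4)

# extract one 2048-d feature vector per image, then fit a simple classifier
features = extractor.predict(preprocess_input(X_train))
clf = LogisticRegression()
clf.fit(features, y_train)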

Which to use?

Two main factors will affect your choice of approach:

  1. Your dataset size
  2. The similarity of your dataset to the pre-trained dataset (typically ImageNet)

[Figure: choosing between transfer learning and fine-tuning, by dataset size (small vs. large) and similarity to the pre-trained dataset]

*You should also experiment with training from scratch.

The figure above and the bullets below offer some general advice on when to choose each approach.

  • Similar & small dataset: avoid overfitting by not fine-tuning the weights on a small dataset, and use extracted features from the highest levels of the ConvNet to leverage dataset similarity.
  • Different & small dataset: avoid overfitting by not fine-tuning the weights on a small dataset, and use extracted features from lower levels of the ConvNet which are more generalizable.
  • Similar & large dataset: with a large dataset we can fine-tune the weights with less of a chance to overfit the training data.
  • Different & large dataset: with a large dataset we again can fine-tune the weights with less of a chance to overfit.

Data augmentation

A powerful and standard tool for increasing the dataset size and model generalizability is data augmentation. Every competition-winning ConvNet employs data augmentation. Essentially, data augmentation is the process of artificially increasing the size of your dataset via label-preserving transformations.

Most deep learning libraries have ready-made functions for typical transformations. For our image recognition system, your task is to decide which transformations make sense for your data (for example, X-ray images should probably not be rotated by more than 45 degrees, since a rotation that large would indicate an error in the image acquisition step).

Example transformations: Pixel color jitter, rotation, shearing, random cropping, horizontal flipping, stretching, lens correction.
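As a concrete illustration of tailoring transformations to the data, here is a deliberately conservative configuration one might use for X-ray images. The parameter values are illustrative assumptions, not recommendations from a radiology reference:

from keras.preprocessing.image import ImageDataGenerator

# conservative augmentation for X-rays: small rotations and shifts only,
# and no flips, since anatomy has a canonical orientation
xray_datagen = ImageDataGenerator(
    rotation_range=5,
    width_shift_range=0.05,
    height_shift_range=0.05,
    zoom_range=0.05,
    horizontal_flip=False
)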

Transfer learning and fine-tuning implementation

Data preparation

[Figure: data preparation pipeline]

We’ll use Kaggle’s Dogs vs. Cats dataset as our example, and set up our data with a training directory and a validation directory in this manner:

[Figure: directory structure for the training and validation data]
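In case the figure doesn’t render: flow_from_directory expects one sub-directory per class, so the layout looks along these lines (the directory names here are illustrative):

data/
  train/
    cats/        cat.0.jpg, cat.1.jpg, ...
    dogs/        dog.0.jpg, dog.1.jpg, ...
  validation/
    cats/
    dogs/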

Implementation

Let’s start by preparing our data generators:

from keras.applications.inception_v3 import preprocess_input
from keras.preprocessing.image import ImageDataGenerator

# augment the training data; the validation data only needs the
# model-specific preprocessing, not augmentation
train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    rotation_range=30,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)
test_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input
)

train_generator = train_datagen.flow_from_directory(
    args.train_dir,
    target_size=(IM_WIDTH, IM_HEIGHT),
    batch_size=batch_size,
)
validation_generator = test_datagen.flow_from_directory(
    args.val_dir,
    target_size=(IM_WIDTH, IM_HEIGHT),
    batch_size=batch_size,
)

The model-specific preprocessing is set with preprocessing_function=preprocess_input, where preprocess_input is imported from the keras.applications.inception_v3 module.
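For InceptionV3, this preprocessing maps pixel values from [0, 255] into the [-1, 1] range the network was trained on. It is roughly equivalent to the following (a paraphrase of the library function, not its exact source):

def inception_preprocess(x):
    # scale [0, 255] pixel values into [-1, 1], as expected by InceptionV3
    x /= 255.
    x -= 0.5
    x *= 2.
    return x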

The rotation, shifting, shearing, zooming, and flipping parameters specify the ranges for their respective data augmentation transformations.

Next, we’ll instantiate the InceptionV3 network from the keras.applications module.

from keras.applications.inception_v3 import InceptionV3

base_model = InceptionV3(weights='imagenet', include_top=False)

We pass include_top=False to leave out the weights of the last fully-connected layer, since that layer is specific to the 1000 ImageNet classes the network was originally trained on. We’ll add and initialize a new last layer:

from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

FC_SIZE = 1024  # size of the new fully-connected layer

def add_new_last_layer(base_model, nb_classes):
  """Add a new last layer to the convnet.
  Args:
    base_model: keras model excluding top
    nb_classes: # of classes
  Returns:
    new keras model with last layer
  """
  x = base_model.output
  x = GlobalAveragePooling2D()(x)  # pool MxNxC feature maps into a C-vector
  x = Dense(FC_SIZE, activation='relu')(x)  # new fully-connected layer
  predictions = Dense(nb_classes, activation='softmax')(x)  # new softmax classifier
  # note: keras 2 renames these arguments to inputs=/outputs=
  model = Model(input=base_model.input, output=predictions)
  return model

GlobalAveragePooling2D averages each of the C feature maps in the MxNxC tensor output, producing a C-dimensional vector, where C is the number of channels.

Then we add a fully-connected Dense layer of size 1024, and a softmax layer on the output, which converts the raw scores into class probabilities that sum to 1.
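To make the shapes concrete: with InceptionV3’s default 299x299 input, the convolutional base outputs an 8x8x2048 tensor, which global average pooling reduces to a 2048-vector before the new Dense layers. A quick sanity check (the printed shapes assume the default input size and a TensorFlow backend):

base_model = InceptionV3(weights='imagenet', include_top=False)
model = add_new_last_layer(base_model, nb_classes=2)
print(base_model.output_shape)  # (None, 8, 8, 2048)
print(model.output_shape)       # (None, 2)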

In this program, I’ll demonstrate both transfer learning and fine-tuning. You can use either or both if you like.

  1. Transfer learning: freeze the entire convolutional base and train only the newly added Dense layers
  2. Fine-tuning: un-freeze the top convolutional layers (the bottom layers stay frozen) and retrain them together with the new Dense layers

Doing both, in that order, makes training more stable and consistent. The reason is that the large gradient updates triggered by the randomly initialized new layers could wreck the learned weights in the convolutional base if it were not frozen. Once the new last layers have stabilized (transfer learning), we move on to retraining more layers (fine-tuning).

Transfer learning

def setup_to_transfer_learn(model, base_model):
  """Freeze all base layers and compile the model."""
  for layer in base_model.layers:
    layer.trainable = False
  model.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
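Note that in Keras, changes to layer.trainable only take effect once the model is compiled, which is why compile() is called after the freeze loop; the same applies to setup_to_finetune below.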

Fine-tune

from keras.optimizers import SGD

# illustrative value: the boundary below the top 2 inception blocks
# in the keras 1.2.2 InceptionV3; adjust for your Keras version
NB_IV3_LAYERS_TO_FREEZE = 172

def setup_to_finetune(model):
  """Freeze the bottom NB_IV3_LAYERS_TO_FREEZE layers and retrain the rest.
  note: NB_IV3_LAYERS_TO_FREEZE is chosen so that the top 2 inception
        blocks of the InceptionV3 architecture are retrained
  Args:
    model: keras model
  """
  for layer in model.layers[:NB_IV3_LAYERS_TO_FREEZE]:
    layer.trainable = False
  for layer in model.layers[NB_IV3_LAYERS_TO_FREEZE:]:
    layer.trainable = True
  # metrics included so the history records accuracy for plotting below
  model.compile(optimizer=SGD(lr=0.0001, momentum=0.9),
                loss='categorical_crossentropy',
                metrics=['accuracy'])

When fine-tuning, it’s important to use a lower learning rate than the one used when training from scratch (here lr=0.0001); otherwise, the optimization can destabilize and the loss diverge.

Training

Now we’re all set for training. Use fit_generator for both transfer learning and fine-tuning.

history = model.fit_generator(
  train_generator,
  samples_per_epoch=nb_train_samples,
  nb_epoch=nb_epoch,
  validation_data=validation_generator,
  nb_val_samples=nb_val_samples,
  class_weight='auto')

model.save(args.output_model_file)
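Putting the pieces together, the full two-phase schedule looks roughly like this (a sketch; nb_classes, the epoch/sample counts, and the generators come from the earlier snippets and the script’s arguments):

base_model = InceptionV3(weights='imagenet', include_top=False)
model = add_new_last_layer(base_model, nb_classes)

# phase 1: transfer learning, training only the new top layers
setup_to_transfer_learn(model, base_model)
history_tl = model.fit_generator(
  train_generator,
  samples_per_epoch=nb_train_samples,
  nb_epoch=nb_epoch,
  validation_data=validation_generator,
  nb_val_samples=nb_val_samples,
  class_weight='auto')

# phase 2: fine-tuning, also retraining the top inception blocks
setup_to_finetune(model)
history_ft = model.fit_generator(
  train_generator,
  samples_per_epoch=nb_train_samples,
  nb_epoch=nb_epoch,
  validation_data=validation_generator,
  nb_val_samples=nb_val_samples,
  class_weight='auto')

model.save(args.output_model_file)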

We’ll use an Amazon EC2 g2.2xlarge instance for training. If you’re unfamiliar with AWS, check out this tutorial (instead of using their prescribed AMI, search the community AMIs for “deep learning”; for this post, I used ami-638c1e03 in US West (Oregon)).

We can plot the training accuracy and loss using the history object:

import matplotlib.pyplot as plt

def plot_training(history):
  """Plot training/validation accuracy and loss from a fit history."""
  acc = history.history['acc']
  val_acc = history.history['val_acc']
  loss = history.history['loss']
  val_loss = history.history['val_loss']
  epochs = range(len(acc))

  plt.plot(epochs, acc, 'r.')
  plt.plot(epochs, val_acc, 'r')
  plt.title('Training and validation accuracy')

  plt.figure()
  plt.plot(epochs, loss, 'r.')
  plt.plot(epochs, val_loss, 'r-')
  plt.title('Training and validation loss')
  plt.show()

Prediction

Now that we have a saved Keras model, we can adapt the predict() function we wrote in the previous blog post to classify either a local image file or an image fetched from a web URL.

python predict.py --image dog.001.jpg --model dc.model
python predict.py --image_url https://goo.gl/Xws7Tp --model dc.model
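Here is a minimal sketch of what such a predict() function can look like; the body is an illustration consistent with this post, not verbatim code from the previous one:

import numpy as np
from keras.models import load_model
from keras.preprocessing import image
from keras.applications.inception_v3 import preprocess_input

def predict(model, img, target_size=(299, 299)):
  """Return class probabilities for a single PIL image."""
  if img.size != target_size:
    img = img.resize(target_size)
  x = image.img_to_array(img)
  x = np.expand_dims(x, axis=0)  # add a batch dimension
  x = preprocess_input(x)
  return model.predict(x)[0]

model = load_model('dc.model')
preds = predict(model, image.load_img('dog.001.jpg'))
print(preds)  # e.g. [p_cat, p_dog], ordered alphabetically by class directory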

We’re done!

As an example, I trained a model on the dogs-vs-cats dataset for 2 epochs, using 24,000 images for training and 1,000 images for validation. Even after only 2 epochs, the performance is quite good:

[Figure: training and validation accuracy after transfer learning]

*model compatible with keras==1.2.2

Examples

python predict.py --image_url https://goo.gl/Xws7Tp --model dc.model

cat.png

python predict.py --image_url https://goo.gl/6TRUol --model dc.model

dog.png

Stay tuned for the next post in the series (subscribe to our newsletter):

  1. Build an image recognition system for 1000 everyday object categories (ImageNet ILSVRC) using Keras and TensorFlow
  2. Build an image recognition system for any customizable object categories using transfer learning and fine-tuning in Keras and TensorFlow
  3. Build a real-time bounding-box object detection system for hundreds of everyday object categories (PASCAL VOC, COCO)
  4. Build a web service for any image recognition or object detection system
