I first built a small neural network to detect the nose tip location. I used the IMM Face Database, which is annotated with 58 facial keypoints per image, and focused on predicting just the nose tip.
Before training, I first had to preprocess the image data. In particular, I wrote a custom PyTorch dataloader that loads each image after downscaling it to 60x80 pixels, converting it to grayscale, and normalizing the pixel values to the range [-0.5, 0.5], which allows for faster and more stable training. I also split the data into a training set and a validation set so that I could accurately track the error during training.
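Below is a minimal sketch of the kind of Dataset I wrote; the class name NoseTipDataset and the use of OpenCV for loading are illustrative, and the keypoints are assumed to already be normalized to [0, 1].

import numpy as np
import cv2
import torch
from torch.utils.data import Dataset

class NoseTipDataset(Dataset):
    """Sketch of the preprocessing: grayscale, 60x80, pixel values in [-0.5, 0.5]."""
    def __init__(self, image_paths, keypoints):
        self.image_paths = image_paths  # list of image file paths
        self.keypoints = keypoints      # (N, 2) nose-tip coords in [0, 1]

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = cv2.imread(self.image_paths[idx], cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, (80, 60))             # dsize is (width, height)
        img = img.astype(np.float32) / 255.0 - 0.5  # normalize to [-0.5, 0.5]
        img = torch.from_numpy(img).unsqueeze(0)    # (1, 60, 80) tensor
        kp = torch.tensor(self.keypoints[idx], dtype=torch.float32)
        return img, kp

The train/validation split can then be made with, e.g., torch.utils.data.random_split before wrapping each part in a DataLoader.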
Sampled images from my dataloader visualized with ground-truth keypoints.
I then had to build the actual model. Here I used a simple CNN, whose architecture is given in the Appendix.
I trained with the Adam optimizer (lr=1e-5) and MSE loss between the actual and predicted landmark locations for 20 epochs (full cycles through the training dataset), evaluating on the validation set at the end of every epoch.
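A sketch of this training loop, assuming the hypothetical dataloaders from above (the train helper itself is illustrative, not my exact code):

import torch
import torch.nn as nn
import torch.optim as optim

def train(model, train_loader, val_loader, epochs=20, lr=1e-5):
    criterion = nn.MSELoss()                           # MSE between true and predicted landmarks
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        for imgs, kps in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(imgs), kps)
            loss.backward()
            optimizer.step()
        model.eval()                                   # validate at the end of every epoch
        with torch.no_grad():
            val_loss = sum(criterion(model(i), k).item() for i, k in val_loader) / len(val_loader)
        print(f"epoch {epoch}: val MSE = {val_loss:.6f}")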
2 facial images for which the network detects the nose correctly. Green=Ground Truth, Red=Predictions
2 facial images for which the network detects the nose incorrectly. I think it fails in these 2 cases because the faces are far from the typical position: one is shifted far to the left, the other is angled heavily to one side. This is in contrast to the correctly detected cases, where the face is centered in the image and facing forwards.
After experimenting with detecting a single nose tip, I transitioned to detecting points on the whole face. Additionally, since we don't actually have many training images, one common technique I employed is data augmentation. This involves applying transformations to the original data to effectively generate more of it, allowing the model to train for more epochs without overfitting as easily. Here, I implemented random rotations (using a rotation matrix to rotate the landmarks along with the image, as sketched below) and randomly jittered the image brightness.
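A sketch of this augmentation; the rotation range and the ±0.1 brightness jitter are illustrative parameters, not necessarily the exact ones I used:

import numpy as np
import cv2

def augment(img, kps, max_deg=15):
    """Rotate image and landmarks together, then jitter brightness.
    img: (H, W) float array in [-0.5, 0.5]; kps: (58, 2) pixel coords (x, y)."""
    h, w = img.shape
    theta = np.random.uniform(-max_deg, max_deg)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), theta, 1.0)       # 2x3 affine rotation matrix
    img = cv2.warpAffine(img, M, (w, h))                          # rotate the image
    kps = np.hstack([kps, np.ones((len(kps), 1))]) @ M.T          # same rotation on the landmarks
    img = np.clip(img + np.random.uniform(-0.1, 0.1), -0.5, 0.5)  # brightness jitter
    return img, kps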
Sampled image from my dataloader visualized with ground-truth keypoints.
Since we're predicting all 58 keypoints now, I used a larger model for this part to capture the increased complexity of the problem. A critical detail was making sure the outputs were 116-dimensional this time (an (x, y) pair for each of the 58 keypoints) instead of just 2-dimensional as in Part 1. Once again I used the Adam optimizer, now with lr=1e-4, reporting the validation loss at the end of each epoch.
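Concretely, the network emits one flat 116-vector per image, which can be reshaped back into 58 (x, y) pairs for visualization and error computation:

preds = model(imgs)            # shape (batch, 116): flat coordinate vector
preds = preds.view(-1, 58, 2)  # shape (batch, 58, 2): one (x, y) pair per keypoint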
2 facial images for which the network detects the keypoints correctly. Green=Ground Truth, Red=Predictions
2 facial images for which the network detects the keypoints incorrectly. I think it fails in these 2 cases because the faces are far from the typical position: one is shifted far to the left and angled heavily to one side, while the other is tilted, with a large smile and hands next to the head, which is very unusual in this dataset. This is in contrast to the correctly detected cases, where the face is centered in the image and facing forwards.
I also visualize some of the learned filters in the convolutional layers of the CNN. The filters for the last layer can be found in the Appendix.
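These visualizations boil down to plotting the weight tensor of a Conv2d layer; a minimal sketch (show_filters is a hypothetical helper, and net stands for the trained model):

import matplotlib.pyplot as plt

def show_filters(conv_layer, ncols=6):
    """Plot each learned filter of a Conv2d layer as a grayscale image."""
    weights = conv_layer.weight.detach().cpu()  # shape (out_ch, in_ch, kH, kW)
    n = weights.shape[0]
    nrows = (n + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(2 * ncols, 2 * nrows))
    for i, ax in enumerate(axes.flat):
        ax.axis("off")
        if i < n:
            ax.imshow(weights[i, 0], cmap="gray")  # first input channel of filter i
    plt.show()

show_filters(net.conv1)  # e.g. the first convolutional layer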
In the last part of this project I train on a much larger (and messier) dataset: the ibug Faces In-the-Wild dataset. Its 6666 images are annotated with a bounding box around the relevant face in each image, as well as 68 facial keypoints. Preprocessing therefore involves computing the keypoints' offsets relative to the bounding box, extracting the portion of the image within the box, resizing that crop to 240x240 pixels, and applying the same data augmentations as in Part 2.
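A sketch of the crop-and-offset step, assuming a bounding box given as pixel values (x, y, w, h); the real ibug annotations need a little more massaging (e.g. clipping boxes that extend past the image):

import numpy as np
import cv2

def crop_and_resize(img, bbox, kps, out_size=240):
    """Crop the face bounding box and express keypoints relative to the crop."""
    x, y, w, h = bbox                              # assumed format: top-left corner + size
    crop = img[y:y + h, x:x + w]
    crop = cv2.resize(crop, (out_size, out_size))  # every face becomes 240x240
    # shift keypoints into crop coordinates, then scale to the resized crop
    kps = (kps - np.array([x, y])) * np.array([out_size / w, out_size / h])
    return crop, kps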
For this part I also used the powerful ResNet as my architecture instead of building a CNN from scratch. This model family comes in multiple sizes; I started with ResNet18 and did a lot of experimentation, but eventually found that ResNet50 performed better. To use these models, I load them in and replace the first and last layers to match the required input and output shapes (e.g. 136 out_features in the last layer to predict 68 points). I also started from pretrained weights, so that I could efficiently fine-tune on the ibug dataset without training for too long.
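With torchvision this swap is only a few lines; a sketch (on newer torchvision versions, pretrained=True is spelled with the weights= argument instead):

import torch.nn as nn
import torchvision.models as models

model = models.resnet50(pretrained=True)  # start from ImageNet weights
# grayscale input: 1 channel instead of 3 (this layer is reinitialized from scratch)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# 68 keypoints -> 136 output coordinates
model.fc = nn.Linear(model.fc.in_features, 136)

Note that replacing conv1 and fc throws away their pretrained weights, but every layer in between keeps them, which is what makes the fine-tuning efficient.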
One last thing I experimented with was Richard Zhang's Making Convolutional Networks Shift-Invariant Again. This simple technique inserts a blurring layer into each downsampling operation (e.g. a strided MaxPool is replaced by a stride-1 Max followed by a strided BlurPool). I found that this improved my results a bit.
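A minimal sketch of such a BlurPool layer, using a 3-tap binomial filter for brevity (the architecture in the Appendix uses Zhang's default 4-tap filter, which is why its padding there is ReflectionPad2d([1, 2, 1, 2])):

import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool(nn.Module):
    """Blur with a fixed binomial filter, then subsample (anti-aliased downsampling)."""
    def __init__(self, channels, stride=2):
        super().__init__()
        self.stride = stride
        self.channels = channels
        a = torch.tensor([1.0, 2.0, 1.0])  # 1D binomial filter
        kernel = torch.outer(a, a)
        kernel = kernel / kernel.sum()     # normalize so overall brightness is preserved
        # one copy of the filter per channel (depthwise convolution)
        self.register_buffer("kernel", kernel.repeat(channels, 1, 1, 1))

    def forward(self, x):
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")
        return F.conv2d(x, self.kernel, stride=self.stride, groups=self.channels)

The swap then mirrors the Appendix printout, e.g. model.maxpool = nn.Sequential(nn.MaxPool2d(2, stride=1), BlurPool(64)).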
In the end, my best model was a pretrained ResNet50 with Zhang's antialiasing, fine-tuned on the given dataset for 20 epochs with a learning rate of 5e-4, then trained for another 20 epochs with a learning rate of 1e-4. This achieved a final Mean Absolute Error of 5.99842 in the Kaggle competition.
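With a training helper like the hypothetical train from Part 1, this schedule amounts to two staged calls:

train(model, train_loader, val_loader, epochs=20, lr=5e-4)  # stage 1: coarse fine-tuning
train(model, train_loader, val_loader, epochs=20, lr=1e-4)  # stage 2: refine at a lower rate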
The model does really well on my face, which was photographed in normal lighting, facing forwards, against a white background, all of which may have made it easier. It also does decently well on Seraphine's face, since it doesn't look too abnormal compared to the images in the training set (though it is inaccurate on one side of the face, perhaps because of bad lighting). The model did, however, badly fail to predict the facial features of Ford's face, particularly the eyes. I think this could be due to the very wrinkly skin above his eyes, the lack of good lighting on the eyes, and the relatively small part of the image that the face takes up.
As previously mentioned, I experimented with Zhang's antialiased max pools and ended up using them in my final model. Qualitatively, they did seem to improve my results. I decided to benchmark on ResNet18 to compare the learning curves of the antialiased (blurred) version and the normal version.
I wasn't able to find any major differences, though the blurred version did seem a bit more stable (e.g. its losses decreased monotonically, while the normal version's losses went back up in some epochs). I suspect the differences were more meaningful in the ResNet50 case, though I did not fully benchmark this.
In the following architectures, I place a ReLU layer after every layer except the last, and a MaxPool2d (2x2) layer after the ReLU of every convolutional layer. These are applied in the forward pass, which is why they don't appear in the module printouts below (see the forward-pass sketch after the first architecture).
Net(
  (conv1): Conv2d(1, 12, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(12, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv3): Conv2d(24, 36, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (fc1): Linear(in_features=2520, out_features=500, bias=True)
  (fc2): Linear(in_features=500, out_features=2, bias=True)
)
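A sketch of the corresponding Part 1 forward pass (illustrative; note the flattened size works out to 36 * 7 * 10 = 2520 for 60x80 inputs, matching fc1 above):

import torch.nn.functional as F

def forward(self, x):
    # ReLU after every layer except the last; 2x2 max-pool after each conv's ReLU
    x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # (batch, 12, 30, 40)
    x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # (batch, 24, 15, 20)
    x = F.max_pool2d(F.relu(self.conv3(x)), 2)  # (batch, 36, 7, 10)
    x = x.flatten(1)                            # (batch, 2520)
    x = F.relu(self.fc1(x))
    return self.fc2(x)                          # (batch, 2): the nose tip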
Net(
  (conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(6, 12, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv3): Conv2d(12, 18, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(18, 24, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv5): Conv2d(24, 30, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (fc1): Linear(in_features=4950, out_features=2048, bias=True)
  (fc2): Linear(in_features=2048, out_features=512, bias=True)
  (fc3): Linear(in_features=512, out_features=116, bias=True)
)
Learned filters:
ResNet(
  (conv1): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): Sequential(
    (0): MaxPool2d(kernel_size=2, stride=1, padding=0, dilation=1, ceil_mode=False)
    (1): BlurPool(
      (pad): ReflectionPad2d([1, 2, 1, 2])
    )
  )
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): Bottleneck(
      (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (2): Bottleneck(
      (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
  )
  (layer2): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Sequential(
        (0): BlurPool(
          (pad): ReflectionPad2d([1, 2, 1, 2])
        )
        (1): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      )
      (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): BlurPool(
          (pad): ReflectionPad2d([1, 2, 1, 2])
        )
        (1): Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): Bottleneck(
      (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (2): Bottleneck(
      (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (3): Bottleneck(
      (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
  )
  (layer3): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Sequential(
        (0): BlurPool(
          (pad): ReflectionPad2d([1, 2, 1, 2])
        )
        (1): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      )
      (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): BlurPool(
          (pad): ReflectionPad2d([1, 2, 1, 2])
        )
        (1): Conv2d(512, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (2): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (3): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (4): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (5): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
  )
  (layer4): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Sequential(
        (0): BlurPool(
          (pad): ReflectionPad2d([1, 2, 1, 2])
        )
        (1): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
      )
      (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): BlurPool(
          (pad): ReflectionPad2d([1, 2, 1, 2])
        )
        (1): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): Bottleneck(
      (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (2): Bottleneck(
      (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=2048, out_features=136, bias=True)
)