CNNs (Convolutional Neural Networks) have emerged as the new state of the art for image recognition, and their success comes from their ability to learn from large quantities of labelled data.
How do we label the data?
Annotations range from image-level labels, through points and bounding boxes, to the most complex: pixel-level annotations.
The cost of data annotation grows with the complexity of the labels.
Manual Annotation Time [Bilen CVPR 18]
Weak supervision comes to the rescue!
Weak supervision can be described as cheaper (lower-degree) annotation at train time than the output required at test time.
The problem of weak supervision is important because:
- Image understanding aims at learning a growing body of complex visual concepts: every day there are more objects to be detected and more classes to be recognized.
- CNN training is data-hungry and image labelling is tedious, so weak supervision (WS) can significantly reduce the cost of data annotation for tasks such as image segmentation, image captioning, and object detection.
For this paper, weakly supervised detection (WSD) is the problem of learning object detectors using only image-level labels.
The motivation behind the method proposed by Hakan Bilen and Andrea Vedaldi is that:
- CNNs should contain meaningful representations of the data
- There exists evidence that CNNs learn object and object parts in image classification [Zhou ICLR 15]
- Image level labels are plentiful
Image labeled as "Man"
The authors are not the first to address the problem of weak supervision; an earlier approach [Wang ECCV 14]:
- Uses a pre-trained CNN to describe image regions
- But comprises several components beyond the CNN and requires significant fine-tuning
Related work on this problem comprises three lines of research:
- The first formulates WSD as multiple instance learning (MIL), where an image is interpreted as a bag of regions.
- The second identifies similarities between image parts [Song et al. ICML 14].
- The last is CNN-based; for example, [Cinbis TPAMI 17] combines multi-fold MIL with CNN features.
So Bilen and Vedaldi propose a novel end-to-end method for weakly supervised object detection (WSOD) built on pre-trained CNNs, aptly named the weakly supervised deep detection network (WSDDN).
The authors' proposed network, WSDDN
Let’s start explaining the method.
The whole study can be partitioned into three parts:
- A CNN pre-trained on a large-scale image classification task, such as the ImageNet ILSVRC 2012 dataset [Russakovsky IJCV 15] (no bounding-box annotations)
- Construct WSDDN as an architectural modification of this CNN
- Train / Fine-tune the WSDDN on a target dataset using only image-level annotations
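The last step fine-tunes the network as a multi-label image classifier, since only image-level labels are available; the network's per-class image scores lie in (0, 1). Below is a minimal numpy sketch of a binary log-loss of this form, assuming labels are encoded as +1 (class present) or -1 (class absent); the function name is illustrative, not from the authors' code:

```python
import numpy as np

def image_level_loss(image_scores, labels):
    """Binary log-loss over classes for one image.

    image_scores: per-class scores in (0, 1)
    labels:       +1 if the class is present in the image, -1 otherwise
    """
    # y = +1 keeps the score s; y = -1 flips it to 1 - s
    p = labels * (image_scores - 0.5) + 0.5
    return -np.log(p).sum()
```

Note that the loss only needs the image-level scores, which is exactly what makes training possible without any bounding-box annotations.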
Modifications to pre-trained CNN
- Replace the last pooling layer with a spatial pyramid pooling layer [He ECCV 14, Lazebnik CVPR 06]
- Add a branch parallel to the classification one, containing a fully connected layer followed by a soft-max layer
- Combine the classification and detection streams by an element-wise product of the two feature vectors
Architecture of WSDDN
Spatial Pyramid Pooling (SPP)
The last pooling layer is replaced with a spatial pyramid pooling layer [He ECCV 14, Lazebnik CVPR 06].
- Region proposals come at different scales; SPP pools each one to a fixed size compatible with the first fully connected layer
Spatial Pyramid Pooling
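To make this concrete, here is a minimal numpy sketch of spatial pyramid max-pooling over a single region's feature map. The pyramid levels and function name here are illustrative; the actual layer runs inside the CNN on the convolutional feature map:

```python
import numpy as np

def spp_pool(feat, levels=(1, 2, 4)):
    """Spatial pyramid max-pooling for one region's feature map.

    feat: (C, H, W) array, with H and W at least as large as max(levels).
    Returns a fixed-length vector of size C * sum(n*n for n in levels),
    regardless of the region's spatial size.
    """
    C, H, W = feat.shape
    out = []
    for n in levels:
        # split H and W into n roughly equal bins and max-pool each cell
        hs = np.linspace(0, H, n + 1).astype(int)
        ws = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feat[:, hs[i]:max(hs[i + 1], hs[i] + 1),
                               ws[j]:max(ws[j + 1], ws[j] + 1)]
                out.append(cell.max(axis=(1, 2)))
    return np.concatenate(out)
```

Because every region yields the same-length vector, regions of any size can feed the same fully connected layer.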
Given an image x, candidate object regions R are obtained by a region proposal mechanism; two methods are used:
- Selective Search Windows (SSW) [Sande ICCV 11]
- Edge Boxes (EB) [Zitnick ECCV 14]
Two Stream Architecture
The object detection task is divided into two sub-tasks with a two-stream architecture:
- Classification stream: scores each region for every class
- Detection stream: picks the most promising regions in an image for a given class
Two streams, one for classification the other for detection
The two streams are then combined by an element-wise product, and a summation over the R regions yields the image-level prediction score. Since both streams are soft-max normalized (over classes and over regions, respectively), each per-class image score lies in the range (0, 1).
Element-wise product and summation
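A minimal numpy sketch of this combination step, assuming each stream outputs a raw score matrix of shape (R regions × C classes); the function names are illustrative:

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable soft-max along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def combine_streams(cls_scores, det_scores):
    """cls_scores, det_scores: raw (R, C) outputs of the two streams."""
    sigma_cls = softmax(cls_scores, axis=1)   # per region, over classes
    sigma_det = softmax(det_scores, axis=0)   # per class, over regions
    region_scores = sigma_cls * sigma_det     # element-wise product
    image_scores = region_scores.sum(axis=0)  # sum over regions -> (C,)
    return region_scores, image_scores
```

The two normalizations play different roles: the classification soft-max compares classes within a region, while the detection soft-max makes regions compete for the same class, which is what lets the network localize without box supervision.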
The method is evaluated with three pre-trained CNN models, as in [Girshick ICCV 15]:
- S (Small): VGG-CNN-F, which is similar to AlexNet with a reduced number of convolutional filters [Chatfield BMVC 14]
- M (Medium): VGG-CNN-M-1024, with the same depth as S but a smaller stride in the first convolutional layer
- L (Large) : VGG-VD16 [Simonyan ICLR 15]
These models are modified into WSDDNs, then trained on the PASCAL VOC datasets [Everingham IJCV 10].
Weakly supervised learning on PASCAL VOC 2007; performance at test time is shown below:
Weak supervision levels compared to full supervision levels
We can see that fully supervised detection performance is still far ahead.
- It is an end-to-end learning with no custom deep learning layers
- State-of-the-art results are obtained with AlexNet (62% of the supervised counterpart)
- It does not work well with deeper networks, because they focus on smaller regions (an object part, e.g. a person's face, is detected instead of the whole object)
The paper can be found at
“Weakly Supervised Deep Detection Networks”
The author has also given a tutorial on “Weakly Supervised Object Detection” at CVPR 2018:
“Tutorial on WSOD”